Add design doc for model deployment pipeline #21
Conversation
Add sketch of a deployment design doc
* Use the [`configure-aws-credentials`](https://github.com/aws-actions/configure-aws-credentials) action to authenticate with AWS
* Run Terraform to make sure an AWS Batch job queue and job definition exist for the PR
* Build and push a new Docker image to ECR
* Use the AWS CLI to [submit a job](https://docs.aws.amazon.com/cli/latest/reference/batch/submit-job.html) to the Batch queue
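To make this concrete, here's a rough sketch of what such a workflow could look like in GitHub Actions. The workflow name, secrets, region, Terraform directory, ECR registry, and Batch resource names below are all assumptions rather than actual configuration:

```yaml
# Hypothetical .github/workflows/deploy-model.yaml -- all names, ARNs, and paths
# below are placeholders, not the repo's actual configuration
name: deploy-model

on:
  pull_request:
  workflow_dispatch:  # for manual runs the PR number would need to be an input

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # required for OIDC auth via configure-aws-credentials
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Authenticate with AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_IAM_ROLE_ARN }}
          aws-region: us-east-1

      - name: Ensure Batch job queue and job definition exist for this PR
        working-directory: terraform
        run: |
          terraform init
          terraform apply -auto-approve -var "pr_number=${{ github.event.number }}"

      - name: Build and push Docker image to ECR
        env:
          ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
        run: |
          aws ecr get-login-password --region us-east-1 |
            docker login --username AWS --password-stdin "$ECR_REGISTRY"
          docker build -t "$ECR_REGISTRY/model:${{ github.sha }}" .
          docker push "$ECR_REGISTRY/model:${{ github.sha }}"

      - name: Submit a job to the Batch queue
        run: |
          aws batch submit-job \
            --job-name "model-run-pr-${{ github.event.number }}" \
            --job-queue "model-queue-pr-${{ github.event.number }}" \
            --job-definition "model-job-def-pr-${{ github.event.number }}"
```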
There's some further complexity added here by DVC. We currently use DVC in two ways:

- To do Data Versioning on the input data for the model. This hashes the data in `input/` and pushes it to a dedicated S3 bucket using content-addressable storage. This uses the commands `dvc push` and `dvc pull`. Note that intermediate steps (model outputs, performance stats, etc.) are NOT currently cached, though they could be.
- To run Data Pipelines using `dvc repro`. This runs the scripts in `pipeline/` in the correct order and hashes their output in `dvc.lock`. This means if a run fails for some reason or upstream dependencies in DVC's DAG change, you can "resume" a run using local intermediate files.
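For context, `dvc repro` is driven by the stage definitions in `dvc.yaml`; a simplified, purely illustrative sketch (these stage names and file paths are made up, not the real pipeline definition) looks like this:

```yaml
# Illustrative dvc.yaml -- stage names, scripts, and paths are placeholders
stages:
  train:
    cmd: Rscript pipeline/01-train.R
    deps:
      - pipeline/01-train.R
      - input/training_data.parquet
    outs:
      - output/workflow_fit.zip
  assess:
    cmd: Rscript pipeline/02-assess.R
    deps:
      - pipeline/02-assess.R
      - output/workflow_fit.zip
    outs:
      - output/performance_stats.parquet
```

`dvc repro` walks this DAG in dependency order and records the hashes of each stage's deps/outs in `dvc.lock`, which is what makes resuming a partial run possible.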
So, some questions stemming from these two use cases:

- How will we use DVC within the batch job? This might be as simple as calling `dvc pull && dvc repro` to run the full pipeline for each job.
- How will we use DVC's data versioning for the input data? IMO we should avoid/prohibit building the input data inside a Batch job. But this means we need to pre-build the data, upload it to the S3 cache, then update `dvc.lock` to reflect the new location, and stop any PRs that don't have the data built.
- Should we take advantage of DVC's versioning to cache intermediate objects from each pipeline stage? Caching each stage means that a run with no changes affecting `pipeline/01-train.R` will completely skip that stage, instead pulling from the cache. This could save us a lot of time but will use more storage and will add complexity.
There's also a separate question about whether or not to expand our usage of DVC. It supports Metrics and Experiments with native git integration. I think it's worth investigating these features to see if they add any value for our use case.
Good questions @dfsnow! Some responses below; I'll also update the doc to reflect these newer ideas.
> - How will we use DVC within the batch job? This might be as simple as calling `dvc pull && dvc repro` to run the full pipeline for each job.
That's my thinking as well, and it seems to align with the example workflow that DVC/CML provides showing how to use their tools in GitHub Actions.
> - How will we use DVC's data versioning for the input data? IMO we should avoid/prohibit building the input data inside a Batch job. But this means we need to pre-build the data, upload it to the S3 cache, then update `dvc.lock` to reflect the new location, and stop any PRs that don't have the data built.
Based on the Data Versioning documentation, it seems like the recommended workflow is:

1. Build/edit data locally
2. Run `dvc push` to push these changes to remote data storage and update `.dvc` metadata files in the local repo
3. Run `git add && git commit && git push` to update `.dvc` metadata files in version control
4. Run `dvc pull` in any downstream consumers of the changed data, e.g. on a CI run or on a server running the model
I don't love this pattern; in particular, I wish there were an easy way to build/edit the data in a CI workflow, but as far as I can tell that would require configuring the workflow to commit the `.dvc` metadata to the repo (I've followed this kind of pattern in the past and it works but it's fragile). But it does seem to me that this is how the tools are designed to be used.

This leads me to a related question -- why don't we currently follow step 3 (push `.dvc` metadata files to version control)? Is it just because we expect all model runs to build their own data?
> - Should we take advantage of DVC's versioning to cache intermediate objects from each pipeline stage? Caching each stage means that a run with no changes affecting `pipeline/01-train.R` will completely skip that stage, instead pulling from the cache. This could save us a lot of time but will use more storage and will add complexity.
I think we should definitely try this! Increased storage use seems fine to me since storage is so much cheaper than compute, so I expect caching intermediate outputs would end up saving us money. It doesn't seem like DVC wants to support this use case, but pushing and pulling the cache from S3 seems like it should work in theory. The added complexity does worry me a bit, though, so I think we should think of this as a nice-to-have rather than a core requirement.
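As a very rough sketch of that idea, assuming DVC's run-cache push/pull options work as documented (the remote name here is made up, and the flags should be double-checked against our DVC version), the job could wrap the pipeline run like so:

```yaml
# Hypothetical step inside the CI workflow / job definition -- remote name and
# flags are assumptions to verify before relying on this
- name: Run pipeline, reusing cached stage outputs where possible
  run: |
    dvc pull --run-cache -r s3-cache   # restore previously cached stage outputs
    dvc repro                          # only re-runs stages whose deps changed
    dvc push --run-cache -r s3-cache   # push new stage outputs back to the S3 cache
```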
> There's also a separate question about whether or not to expand our usage of DVC. It supports Metrics and Experiments with native git integration. I think it's worth investigating these features to see if they add any value for our use case.
Metrics and Experiments both look cool, but they seem to just be an alternative implementation of functionality that we've already built into the pipeline, so I think we should only refactor to use them if we can identify a constraint of our implementation that would be improved by switching to DVC's implementation.
One feature that I do think we should test out with an eye toward adoption is CML runners. This would be an abstraction layer that factors out the awscli commands currently described in the doc, and if it works as advertised, it should let us get up and running with this stack much faster than we would otherwise. But I think a lot will depend on how easy it is to debug, so I'll plan to test it before we decide to commit to it.
I codified some of these thoughts in the design doc in c0ffa5a, but I'll leave this thread open in case we want to continue discussing here 👍🏻
> That's my thinking as well, and it seems to align with the example workflow that DVC/CML provides showing how to use their tools in GitHub Actions.
Cool. I'm totally down for a simple workflow that:

- Spins up a runner on a PR (after manual approval)
- Runs `dvc pull` to fetch manually constructed data
- Runs `dvc repro` to run the full pipeline
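Here's a rough sketch of what that workflow could look like with a CML-launched runner, assuming `iterative/setup-cml` and `cml runner launch` work as documented; the instance type, labels, secrets, and the protected-environment approval gate are all placeholder assumptions to validate:

```yaml
# Hypothetical workflow using a CML-launched runner -- all names are placeholders
name: run-model

on:
  pull_request:

jobs:
  launch-runner:
    runs-on: ubuntu-latest
    environment: model-runs  # protected environment acts as the manual-approval gate
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v1
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_IAM_ROLE_ARN }}
          aws-region: us-east-1
      - name: Launch an ephemeral EC2 runner
        env:
          REPO_TOKEN: ${{ secrets.GH_PAT_FOR_CML }}
        run: |
          cml runner launch \
            --cloud=aws \
            --cloud-region=us-east-1 \
            --cloud-type=m5.2xlarge \
            --labels=cml-runner

  run-pipeline:
    needs: launch-runner
    runs-on: [self-hosted, cml-runner]
    steps:
      - uses: actions/checkout@v4
      - name: Fetch manually constructed input data
        run: dvc pull
      - name: Run the full pipeline
        run: dvc repro
```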
> Based on the Data Versioning documentation, it seems like the recommended workflow is:
>
> 1. Build/edit data locally
> 2. Run `dvc push` to push these changes to remote data storage and update `.dvc` metadata files in the local repo
> 3. Run `git add && git commit && git push` to update `.dvc` metadata files in version control
> 4. Run `dvc pull` in any downstream consumers of the changed data, e.g. on a CI run or on a server running the model
>
> I don't love this pattern; in particular, I wish there were an easy way to build/edit the data in a CI workflow, but as far as I can tell that would require configuring the workflow to commit the `.dvc` metadata to the repo (I've followed this kind of pattern in the past and it works but it's fragile). But it does seem to me that this is how the tools are designed to be used.
>
> This leads me to a related question -- why don't we currently follow step 3 (push `.dvc` metadata files to version control)? Is it just because we expect all model runs to build their own data?
I think this workflow is actually fine. The ingest stage of the pipeline (`pipeline/00-ingest.R`) takes forever to run, so I'd rather manually run it and push the updated metadata/results than re-build data inside CI.

As for your question re: metadata, we do push the metadata to GitHub. It's in the `dvc.lock` file. IIRC, DVC has two patterns for metadata storage: storing everything in a single `dvc.lock` or storing per-stage/output files in `.dvc/`. I don't remember why we chose the first pattern.
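For concreteness, the two patterns look roughly like the sketches below; the stage names, paths, hashes, and sizes are invented for illustration:

```yaml
# Pattern 1 (what we use): a single dvc.lock at the repo root, one entry per stage
schema: '2.0'
stages:
  train:
    cmd: Rscript pipeline/01-train.R
    deps:
      - path: input/training_data.parquet
        md5: 3f2a1c0d9e8b7a6f5e4d3c2b1a0f9e8d
        size: 104857600
    outs:
      - path: output/workflow_fit.zip
        md5: 0a1b2c3d4e5f60718293a4b5c6d7e8f9
        size: 52428800
```

versus small per-output metadata files:

```yaml
# Pattern 2: a per-output metadata file, e.g. input/training_data.parquet.dvc
outs:
  - md5: 3f2a1c0d9e8b7a6f5e4d3c2b1a0f9e8d
    size: 104857600
    path: training_data.parquet
```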
> - Should we take advantage of DVC's versioning to cache intermediate objects from each pipeline stage? Caching each stage means that a run with no changes affecting `pipeline/01-train.R` will completely skip that stage, instead pulling from the cache. This could save us a lot of time but will use more storage and will add complexity.
>
> I think we should definitely try this! Increased storage use seems fine to me since storage is so much cheaper than compute, so I expect caching intermediate outputs would end up saving us money. It doesn't seem like DVC wants to support this use case, but pushing and pulling the cache from S3 seems like it should work in theory. The added complexity does worry me a bit, though, so I think we should think of this as a nice-to-have rather than a core requirement.
I actually think this wouldn't work because it would require us to push the updated `dvc.lock` back to GitHub in order to keep track of the cache state (or store it elsewhere). IMO, let's start with the simplest possible setup (as I described above).
> There's also a separate question about whether or not to expand our usage of DVC. It supports Metrics and Experiments with native git integration. I think it's worth investigating these features to see if they add any value for our use case.
>
> Metrics and Experiments both look cool, but they seem to just be an alternative implementation of functionality that we've already built into the pipeline, so I think we should only refactor to use them if we can identify a constraint of our implementation that would be improved by switching to DVC's implementation.
>
> One feature that I do think we should test out with an eye toward adoption is CML runners. This would be an abstraction layer that factors out the awscli commands currently described in the doc, and if it works as advertised, it should let us get up and running with this stack much faster than we would otherwise. But I think a lot will depend on how easy it is to debug, so I'll plan to test it before we decide to commit to it.
I think this is correct. Let's ignore experiments and metrics for now and test out the CML runners, since I think they basically do exactly what we need.
Another note for the design doc: this year we want to use the following pattern:
- Use an in-repo Quarto doc to analyze/diagnose/display single model performance. This will be created for each run and will be sent as a link along with the SNS notification at the end of a run.
- Use Tableau for cross-model comparison, using the same dashboards as previous years.
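Here's one possible sketch of the end-of-run reporting steps, assuming the report gets rendered and uploaded somewhere linkable at the end of each run; the report path, S3 bucket, and SNS topic below are placeholders:

```yaml
# Hypothetical end-of-run steps -- report doc, bucket, and topic ARN are placeholders
- name: Render single-model performance report
  run: quarto render reports/model_qc/model_qc.qmd --to html

- name: Upload report and send SNS notification with a link
  env:
    SNS_TOPIC_ARN: ${{ secrets.SNS_TOPIC_ARN }}
  run: |
    aws s3 cp reports/model_qc/model_qc.html \
      "s3://example-model-reports/${{ github.sha }}/model_qc.html"
    aws sns publish \
      --topic-arn "$SNS_TOPIC_ARN" \
      --subject "Model run finished" \
      --message "Performance report: https://example-model-reports.s3.amazonaws.com/${{ github.sha }}/model_qc.html"
```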
Ah, this all makes sense @dfsnow! I made some modifications in c0ffa5a to remove the caching task and clarify the output task. A couple follow-up questions:

- Is `reports/model_qc/model_qc.qmd` the Quarto doc that we want to generate for model performance, or do we need to create a new one?
- Does any work need to be done on the Tableau cross-model dashboards, or can we expect that those will hook into the pipeline in the same way?
- My plan is to make a new one, but it will likely include outputs from that doc.
- Nope. As long as the main outputs to the Athena `model.*` tables don't change, Tableau doesn't need to change either.
The changes described in this doc are now codified in this issue, so I'm going to close this.
This PR adds a draft of a design doc proposing we build a new deployment pipeline to allow us to run models on AWS Batch, triggered by a GitHub Actions workflow either manually or via a PR.
The design as currently proposed is simple enough that I don't think it's worth pulling this PR in; rather, I'm opening it for the purposes of discussion. Once the design is approved and we begin the implementation work, we can make sure the final design is documented in this repo's README.