Add design doc for model deployment pipeline #21

Status: Closed. jeancochrane wants to merge 5 commits into master from jeancochrane/19-draft-design-doc-for-2024-modeling-infrastructure.
Commits:

* c0ffa5a: Create deployment-design.md (jeancochrane)
* 5cfc7ca: Flesh out deployment-design.md (jeancochrane)
* 16665fc: Clean up deployment-design.md (jeancochrane)
* b77fe02: Update deployment-design.md based on feedback from PR review (jeancochrane)
* 7d8302a: Clarify caching and output tasks in deployment-design.md (jeancochrane)
# Design doc: Model deployment pipeline

This doc represents a proposal for a simple CI/CD pipeline that we can use to deploy residential models and run experiments on them.

## Background

At a high level, our **existing process** for experimenting with changes to our models looks like this:

* Models run on an on-prem server
* Data scientists trigger model runs by SSHing into the server and running R scripts from cloned copies of this repo
* Model inputs and parameters are hashed for reproducibility using DVC for [data versioning](https://dvc.org/doc/start/data-management/data-versioning)
* R scripts are run in the correct sequence using DVC for [data pipelines](https://dvc.org/doc/start/data-management/data-pipelines)
* Data scientists commit corresponding changes to model code after their experiment runs prove successful
This process has the advantage of being simple, easy to maintain, and cheap; as a result, it has been useful to our team during the recent past, when we only had a few data scientists on staff and they needed to focus most of their effort on building a new model from the ground up. However, some of its **limitations** are becoming apparent as our team scales up and begins to expand our focus:

* Our on-prem server only has enough resources to run one model at a time, so only one data scientist may be running modeling experiments at a given time
* Further, our server has no notion of a job queue, so a data scientist who is waiting to run a model must notice that a previous run has completed and initiate their run manually
* Our on-prem server does not have a GPU, so it can't make use of GPU-accelerated libraries like XGBoost
* Model runs are decoupled from changes to code and data in version control, so data scientists have to remember to commit their changes correctly
* Results of model runs are not easily accessible to PR reviewers

The design described below aims to remove these limitations while retaining as much simplicity, maintainability, and affordability as possible.

## Requirements

At a high level, a model deployment pipeline should:

* Integrate with our existing cloud infrastructure (GitHub and AWS)
* Trigger model runs from pull request branches
* Require code authors to approve model runs before they are initiated
* Run the model on ephemeral, cheap, and isolated cloud infrastructure
* Run multiple model runs simultaneously on separate hardware
* Report model statistics back to the pull request that triggered a run
## Design

### Running the model

Here is a rough sketch of a new model deployment pipeline (a workflow skeleton is sketched after this list):

* Define a new workflow, `run-model.yaml`, that runs on:
  * Every commit to every pull request
  * The `workflow_dispatch` event
* Set up the workflow so that it deploys to the `staging` environment and requires [manual approval](https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#using-environments-to-manually-trigger-workflow-jobs)
* Define a job, `build-docker-image`, to build and push a new Docker image for the model code to GitHub Container Registry
  * Cache the build using `renv.lock` as the key
* Define a job to run the model (implementation details in the following sections)
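To make the shape of this workflow concrete, here is a minimal sketch of what `run-model.yaml` might look like. This is an assumption-laden illustration rather than a final implementation: the action versions, image tag, permissions, and the use of the GitHub Actions layer cache (instead of keying a cache directly on `renv.lock`) are all placeholder choices.

```yaml
# .github/workflows/run-model.yaml (hypothetical sketch)
name: run-model

on:
  pull_request:
  workflow_dispatch:

jobs:
  build-docker-image:
    runs-on: ubuntu-latest
    # Targeting the `staging` environment is what gates the workflow behind
    # manual approval, assuming the environment is configured with required
    # reviewers in the repository settings
    environment: staging
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # Placeholder tag; in practice we might tag by branch or PR number
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          # Layer caching via the GitHub Actions cache; a stable R dependency
          # layer driven by renv.lock is what makes this cache effective
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # A second job runs the model itself; see the two options below for how it
  # could be implemented
```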
See the following sections for two options for how we can run the model itself.

#### Option 1: Use CML self-hosted runners

CML's [self-hosted runners](https://cml.dev/doc/self-hosted-runners#allocating-cloud-compute-resources-with-cml) claim to provide an abstraction layer on top of GitHub Actions and AWS EC2 that would allow us to launch a spot EC2 instance and use it as a GitHub Actions runner. If it works as advertised and is easy to debug, it would allow us to spin up infrastructure for running the model with very little custom code.

The steps involved here include (see the sketch after this list):

* Define a job, `launch-runner`, to start an AWS spot EC2 instance using [`cml runner`](https://cml.dev/doc/self-hosted-runners#allocating-cloud-compute-resources-with-cml)
  * Set sensible defaults for the [instance options](https://cml.dev/doc/ref/runner#options), but allow them to be overridden via workflow inputs
* Define a job, `run-model`, to run the model on the EC2 instance created by CML
  * Set the `runs-on` key for the job to point at the runner
    * This will cause steps defined in the job to run on the remote runner
  * Run the model using `dvc pull` and `dvc repro`
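As a rough illustration of Option 1, the two jobs might look like the sketch below (continuing the `jobs:` block of `run-model.yaml`). The instance type, region, runner label, and secret names are assumptions; the `cml runner` flags come from the CML documentation linked above.

```yaml
  launch-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v2
      - name: Launch an EC2 spot instance as a self-hosted runner
        env:
          # Personal access token with repo scope so CML can register the runner
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner \
            --cloud=aws \
            --cloud-region=us-east-1 \
            --cloud-type=m5.2xlarge \
            --cloud-spot \
            --labels=cml-model-runner

  run-model:
    needs: launch-runner
    # Pointing runs-on at the CML label routes these steps to the EC2 runner
    runs-on: [self-hosted, cml-model-runner]
    steps:
      - uses: actions/checkout@v4
      - name: Run the model
        # Assumes credentials for the DVC S3 remote are available on the runner
        run: |
          dvc pull
          dvc repro
```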
#### Option 2: Write custom code to run model jobs on AWS Batch

If CML does not work as advertised, we can always implement our own version of its functionality. Define the following steps in a `run-model` job (sketched after this list):

* Run Terraform to make sure an AWS Batch job queue and job definition exist for the PR
  * The job definition should define the code that will be used to run the model itself, e.g. `dvc pull` and `dvc repro`
* Use the AWS CLI to [submit a job](https://docs.aws.amazon.com/cli/latest/reference/batch/submit-job.html) to the Batch queue
* Use the AWS CLI to [poll the job status](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/batch/describe-jobs.html) until it has a terminal status (`SUCCEEDED` or `FAILED`)
  * Once the job has at least a `RUNNING` status, use the `logStreamName` attribute of the job's container details to print a link to its logs
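A sketch of what this job could look like, assuming hypothetical queue and job definition names (`model-run-queue`, `model-run-job-definition`) and a placeholder IAM role for the workflow to assume:

```yaml
  run-model:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # needed for OIDC-based AWS authentication
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Placeholder role ARN and region
          role-to-assume: arn:aws:iam::123456789012:role/gh-actions-model-runner
          aws-region: us-east-1

      - name: Submit the Batch job
        id: submit
        run: |
          JOB_ID=$(aws batch submit-job \
            --job-name "model-run-${GITHUB_SHA::8}" \
            --job-queue "model-run-queue" \
            --job-definition "model-run-job-definition" \
            --query 'jobId' --output text)
          echo "job-id=$JOB_ID" >> "$GITHUB_OUTPUT"

      - name: Wait for the Batch job to reach a terminal status
        run: |
          JOB_ID="${{ steps.submit.outputs.job-id }}"
          PRINTED_LOGS=""
          while true; do
            STATUS=$(aws batch describe-jobs --jobs "$JOB_ID" \
              --query 'jobs[0].status' --output text)
            echo "Current status: $STATUS"
            # Once the job is running, its CloudWatch log stream name is available
            if [ "$STATUS" = "RUNNING" ] && [ -z "$PRINTED_LOGS" ]; then
              LOG_STREAM=$(aws batch describe-jobs --jobs "$JOB_ID" \
                --query 'jobs[0].container.logStreamName' --output text)
              echo "CloudWatch log stream: $LOG_STREAM"
              PRINTED_LOGS="yes"
            fi
            if [ "$STATUS" = "SUCCEEDED" ]; then exit 0; fi
            if [ "$STATUS" = "FAILED" ]; then exit 1; fi
            sleep 30
          done
```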
### Reporting model performance

We would like to move toward the following pattern for evaluating model performance:

1. Use a Quarto doc stored in the repo to analyze/diagnose/display single-model performance. This will be created for each model run and will be sent as a link in the SNS notification at the end of a run.
2. Use Tableau for cross-model comparison, using the same dashboards as previous years.

Step 1 will require us to update `05-finalize.R` to generate the Quarto doc, upload it to S3, and adjust the SNS message body to include a link to it.
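Although this logic will ultimately live in `05-finalize.R`, the three operations are simple enough to prototype as a workflow step first. The report path, bucket name, and SNS topic ARN below are placeholders, and the commands only illustrate what the script needs to do:

```yaml
      # Hypothetical prototype of the finalize/report step
      - name: Render and publish the performance report
        run: |
          # 1. Render the Quarto doc for this run (placeholder path)
          quarto render reports/performance.qmd

          # 2. Upload the rendered report to S3 (placeholder bucket/key)
          aws s3 cp reports/performance.html \
            "s3://example-model-reports/${GITHUB_SHA}/performance.html"

          # 3. Include a link to the report in the SNS notification (placeholder ARN)
          aws sns publish \
            --topic-arn "arn:aws:sns:us-east-1:123456789012:model-runs" \
            --message "Model run finished. Report: https://example-model-reports.s3.amazonaws.com/${GITHUB_SHA}/performance.html"
```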
### Caching intermediate data

Caching intermediate data would allow us to only run model stages whose code, data, or dependencies have changed since the last model run. This has the potential to reduce the amount of compute we use and speed up experimentation.

Remote caching is currently [not natively supported by DVC](https://github.com/iterative/dvc/issues/5665#issuecomment-811087810), but it would be theoretically possible if we pulled the [cache directory](https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) from S3 storage before each run and pushed it back after successful runs (sketched below). However, this pattern would also require updating the `dvc.lock` file on every run, so it's likely too fragile to be worth implementing right now.
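For reference, the pull/push pattern described above would amount to something like the following steps wrapped around the model run. The bucket name is a placeholder, and keeping `dvc.lock` consistent across runs remains the unsolved part:

```yaml
      # Hypothetical cache sync steps; not adopted in this design
      - name: Restore the DVC cache directory from S3
        run: aws s3 sync s3://example-dvc-cache-bucket/cache/ .dvc/cache/ --quiet

      # ... dvc pull && dvc repro runs here ...

      - name: Push the DVC cache directory back to S3
        if: success()
        run: aws s3 sync .dvc/cache/ s3://example-dvc-cache-bucket/cache/ --quiet
```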
### Metrics and experiments

While DVC now offers built-in tools for [metrics](https://dvc.org/doc/start/data-management/metrics-parameters-plots) and [experiments](https://dvc.org/doc/start/experiments), these features do not yet seem to offer any functionality above and beyond what we have already built into the modeling scripts. We should wait to switch to them until we can identify a limitation of our current scripts that would be resolved faster by adopting DVC for metrics or experimentation.

## Tasks

We will create GitHub issues for the following tasks:

* Add Docker image definition for the model
* Add GitHub workflow to deploy and run the model on commits to PRs
  * Note that there is a prototype of a similar workflow [on GitLab](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/master/.gitlab-ci.yml?ref_type=heads) that may be useful
  * Spike the CML self-hosted runner solution first, and move on to the custom solution if CML runners don't work as advertised
* Update the Finalize step to generate a Quarto doc and link to it in SNS notification
**dfsnow:** There's some further complexity added here by DVC. We currently use DVC in two ways:

1. Data versioning: DVC hashes the data in `input/` and pushes it to a dedicated S3 bucket using content-addressable storage. This uses the commands `dvc push` and `dvc pull`. Note that intermediate steps (model outputs, performance stats, etc.) are NOT currently cached, though they could be.
2. Data pipelines: `dvc repro`. This runs the scripts in `pipeline/` in the correct order and hashes their output in `dvc.lock`. This means if a run fails for some reason or upstream dependencies in DVC's DAG change, you can "resume" a run using local intermediate files.

So, some questions stemming from these two use cases:

* Do we simply use `dvc pull && dvc repro` to run the full pipeline for each job?
* […] `dvc.lock` to reflect the new location, and stop any PRs that don't have the data built.
* Do we cache intermediate outputs? If so, a run where nothing changes in e.g. `pipeline/01-train.R` will completely skip that stage, instead pulling from the cache. This could save us a lot of time but will use more storage and will add complexity.

There's also a separate question about whether or not to expand our usage of DVC. It supports Metrics and Experiments with native git integration. I think it's worth investigating these features to see if they add any value for our use case.
**jeancochrane:** Good questions @dfsnow! Some responses below; I'll also update the doc to reflect these newer ideas.

That's my thinking as well, and it seems to align with the example workflow that DVC/CML provides showing how to use their tools in GitHub Actions.

Based on the Data Versioning documentation, it seems like the recommended workflow is:

1. Make changes to the data locally
2. `dvc push` to push these changes to remote data storage and update `.dvc` metadata files in the local repo
3. `git add && git commit && git push` to update `.dvc` metadata files in version control
4. `dvc pull` in any downstream consumers of the changed data, e.g. on a CI run or on a server running the model

I don't love this pattern; in particular, I wish there were an easy way to build/edit the data in a CI workflow, but as far as I can tell that would require configuring the workflow to commit the `.dvc` metadata to the repo (I've followed this kind of pattern in the past and it works but it's fragile). But it does seem to me that this is how the tools are designed to be used.

This leads me to a related question -- why don't we currently follow step 3 (push `.dvc` metadata files to version control)? Is it just because we expect all model runs to build their own data?

I think we should definitely try this! Increased storage use seems fine to me since storage is so much cheaper than compute, so I expect caching intermediate outputs would end up saving us money. It doesn't seem like DVC wants to support this use case, but pushing and pulling the cache from S3 seems like it should work in theory. The added complexity does worry me a bit, though, so I think we should think of this as a nice-to-have rather than a core requirement.

Metrics and Experiments both look cool, but they seem to just be an alternative implementation of functionality that we've already built into the pipeline, so I think we should only refactor to use them if we can identify a constraint of our implementation that would be improved by switching to DVC's implementation.

One feature that I do think we should test out with an eye toward adoption is CML runners. This would be an abstraction layer that factors out the awscli commands currently described in the doc, and if it works as advertised, it should let us get up and running with this stack much faster than we would otherwise. But I think a lot will depend on how easy it is to debug, so I'll plan to test it before we decide to commit to it.
**jeancochrane:** I codified some of these thoughts in the design doc in c0ffa5a, but I'll leave this thread open in case we want to continue discussing here 👍🏻
**dfsnow:** Cool. I'm totally down for a simple workflow that:

* `dvc pull` to fetch manually constructed data
* `dvc repro` to run the full pipeline

I think this workflow is actually fine. The ingest stage of the pipeline (`pipeline/00-ingest.R`) takes forever to run, so I'd rather manually run it and push the updated metadata/results than re-build data inside CI.

As for your question re: metadata, we do push the metadata to GitHub. It's in the `dvc.lock` file. IIRC, DVC has two patterns for metadata storage: storing everything in a single `dvc.lock` or storing per-stage/output files in `.dvc/`. I don't remember why we chose the first pattern.

I actually think this wouldn't work because it would require us to push the updated `dvc.lock` back to GitHub in order to keep track of the cache state (or store it elsewhere). IMO, let's start with the simplest possible setup (as I described above).

I think this is correct. Let's ignore experiments and metrics for now and test out the CML runners, since I think they basically do exactly what we need.

Another note for the design doc: this year we want to use the following pattern: […]
**jeancochrane:** Ah, this all makes sense @dfsnow! I made some modifications in c0ffa5a to remove the caching task and clarify the output task. A couple follow-up questions:

* Is `reports/model_qc/model_qc.qmd` the Quarto doc that we want to generate for model performance, or do we need to create a new one?
* […]
**dfsnow:** @jeancochrane […] If the `model.*` tables don't change, Tableau doesn't need to change either.