Clarify caching and output tasks in deployment-design.md
jeancochrane authored Oct 19, 2023
1 parent b77fe02 commit 7d8302a
Showing 1 changed file with 11 additions and 6 deletions.
17 changes: 11 additions & 6 deletions docs/deployment-design.md
@@ -17,7 +17,6 @@ This process has the advantage of being simple, easy to maintain, and cheap; as
* Our on-prem server only has enough resources to run one model at a time, so only one data scientist may be running modeling experiments at a given time
* Further, our server has no notion of a job queue, so a data scientist who is waiting to run a model must notice that a previous run has completed and initiate their run manually
* Our on-prem server does not have a GPU, so it can't make use of GPU-accelerated libraries like XGBoost
* Our DVC configuration has disabled caching of intermediate outputs, so model runs must always begin from the start of the pipeline, saving on storage but increasing execution time as a result
* Model runs are decoupled from changes to code and data in version control, so data scientists have to remember to commit their changes correctly
* Results of model runs are not easily accessible to PR reviewers

@@ -47,7 +46,6 @@ Here is a rough sketch of a new model deployment pipeline:
* Define a job, `build-docker-image`, to build and push a new Docker image for the model code to GitHub Container Registry
* Cache the build using `renv.lock` as the key
* Define a job to run the model (implementation details in the following sections)
* Print a link to S3 model evaluation outputs that will be visible in the GitHub Actions UI

See the following sections for two options for how we can run the model itself.

@@ -73,14 +71,21 @@ If CML does not work as advertised, we can always implement our own version of it
* Use the AWS CLI to [submit a job](https://docs.aws.amazon.com/cli/latest/reference/batch/submit-job.html) to the Batch queue
* Use the AWS CLI to [poll the job status](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/batch/describe-jobs.html) until it has a terminal status (`SUCCEEDED` or `FAILED`)
* Once the job has at least a `RUNNING` status, use the container's `logStreamName` attribute to print a link to its logs (see the sketch below)
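
A minimal sketch of that submit-and-poll loop, written with boto3 for illustration (the workflow could equally shell out to `aws batch submit-job` / `aws batch describe-jobs`); the queue and job definition names are placeholder assumptions:

```python
# Hypothetical sketch of submitting a Batch job and polling until a terminal status.
import sys
import time

import boto3

batch = boto3.client("batch")

job = batch.submit_job(
    jobName="model-run",
    jobQueue="model-queue",         # placeholder queue name
    jobDefinition="model-job-def",  # placeholder job definition name
)
job_id = job["jobId"]

printed_logs = False
while True:
    detail = batch.describe_jobs(jobs=[job_id])["jobs"][0]
    status = detail["status"]

    # Once the job is RUNNING, its container detail exposes the CloudWatch log
    # stream; /aws/batch/job is the default log group for Batch jobs.
    if not printed_logs and detail.get("container", {}).get("logStreamName"):
        stream = detail["container"]["logStreamName"]
        print(f"Logs: log group /aws/batch/job, stream {stream}")
        printed_logs = True

    if status in ("SUCCEEDED", "FAILED"):
        print(f"Job {job_id} finished with status {status}")
        sys.exit(0 if status == "SUCCEEDED" else 1)

    time.sleep(30)
```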

### Reporting model performance

We would like to move toward the following pattern for evaluating model performance:

1. Use a Quarto doc stored in the repo to analyze, diagnose, and display single-model performance. This doc will be generated for each model run and sent as a link in the SNS notification at the end of the run.
2. Use Tableau for cross-model comparison, using the same dashboards as previous years.

Step 1 will require us to update `05-finalize.R` to generate the Quarto doc, upload it to S3, and adjust the SNS message body to include a link to it.
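
A rough sketch of those three steps, shown in Python purely for illustration (the actual change would live in `05-finalize.R`, e.g. via quarto and paws); the report path, bucket, and SNS topic ARN are placeholder assumptions:

```python
# Hypothetical sketch: render the Quarto report, upload it to S3, and link it in SNS.
import subprocess

import boto3

run_id = "2023-10-19-example"  # placeholder run identifier

# 1. Render the per-run performance report with Quarto
subprocess.run(["quarto", "render", "reports/performance.qmd", "--to", "html"], check=True)

# 2. Upload the rendered report to S3
bucket = "example-model-outputs"  # placeholder bucket name
key = f"reports/{run_id}/performance.html"
boto3.client("s3").upload_file("reports/performance.html", bucket, key)

# 3. Include a link to the report in the SNS notification body
report_url = f"https://{bucket}.s3.amazonaws.com/{key}"
boto3.client("sns").publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:model-runs",  # placeholder ARN
    Subject=f"Model run {run_id} finished",
    Message=f"Performance report: {report_url}",
)
```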

### Caching intermediate data

Caching intermediate data would allow us to only run model stages whose code, data, or dependencies have changed since the last model run. This has the potential to reduce the amount of compute we use and speed up experimentation.

Remote caching is currently [not natively supported by DVC](https://github.com/iterative/dvc/issues/5665#issuecomment-811087810), but it should be possible if we pull the [cache directory](https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) from S3 storage before each run and push it after successful runs. All branches should have their own cache, but PR branches should inherit the main branch cache before their first successful run. Note that to enable this caching, we would also need to remove the `cache: false` attribute that is currently set on all of our intermediate outputs. We would also need to start saving `.dvc` metadata files under version control so that model runs know which version of the data to pull.

We should consider this step to be an iterative improvement on the pipeline MVP, since it's not required to run the model and it may introduce more complexity than the reduction in runtime is worth.
Remote caching is currently [not natively supported by DVC](https://github.com/iterative/dvc/issues/5665#issuecomment-811087810), but it would be theoretically possible if we pulled the [cache directory](https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) from S3 storage before each run and pushed it back after successful runs. However, this pattern would also require updating the `dvc.lock` file on every run, so it's likely too fragile to be worth implementing right now.
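
For reference only, a hypothetical sketch of what that pull/push pattern could look like in CI, assuming DVC's default `.dvc/cache` location; the S3 prefix is a placeholder and per-branch cache handling is omitted:

```python
# Hypothetical sketch of pulling the DVC cache before a run and pushing it back after.
import subprocess

CACHE_REMOTE = "s3://example-bucket/dvc-cache/main"  # placeholder per-branch prefix
LOCAL_CACHE = ".dvc/cache"                           # DVC's default local cache dir


def sync(src: str, dest: str) -> None:
    subprocess.run(["aws", "s3", "sync", src, dest], check=True)


# Pull the shared cache so unchanged stages can be skipped
sync(CACHE_REMOTE, LOCAL_CACHE)

# `dvc repro` only re-executes stages whose code, data, or params changed
result = subprocess.run(["dvc", "repro"])

# Push the cache back only after a successful run
if result.returncode == 0:
    sync(LOCAL_CACHE, CACHE_REMOTE)
```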

### Metrics and experiments

@@ -94,4 +99,4 @@ We will create GitHub issues for the following tasks:
* Add GitHub workflow to deploy and run the model on commits to PRs
* Note that there is a prototype of a similar workflow [on GitLab](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/master/.gitlab-ci.yml?ref_type=heads) that may be useful
* Spike the CML self-hosted runner solution first, and move on to the custom solution if CML runners don't work as advertised
* Time permitting: Add workflow cache for intermediate data
* Update the Finalize step to generate a Quarto doc and link to it in SNS notification
