Generate and publish a quarto doc with performance results on each model run #62

Merged

Commits (29)

b16a3e6
Generate and upload model performance report in finalize pipeline step
jeancochrane Nov 21, 2023
186fe1e
Merge branch 'master' into 24-infra-updates-generate-and-publish-a-qu…
jeancochrane Nov 21, 2023
331b241
Include .html files in model_get_s3_artifacts_for_run
jeancochrane Nov 21, 2023
0144f28
Refactor repo to support reports/renv.lock lockfile
jeancochrane Nov 22, 2023
af287d2
Remove unnecessary changes to renv/activate.R
jeancochrane Nov 24, 2023
b616399
Fix missing column in performance report row of misc/file_dict.csv
jeancochrane Nov 24, 2023
25f2850
Update README with instructions on updating R dependencies
jeancochrane Nov 24, 2023
9f1c1ca
Add quarto to DESCRIPTION dependencies
jeancochrane Nov 24, 2023
aedd58b
Move reports/renv.lock -> renv/profiles/reporting/renv.lock
jeancochrane Nov 24, 2023
7ae4f6b
Properly style R/helpers.R
jeancochrane Nov 24, 2023
fd6538b
Install Quarto in Dockerfile
jeancochrane Nov 24, 2023
7b39d2f
Use the correct path to performance.qmd in 05-finalize.R step
jeancochrane Nov 27, 2023
65948dd
Move performance.qmd to the top level of the `reports/` subdir
jeancochrane Nov 29, 2023
dab451c
Factor out report generation into 05-report.R pipeline stage
jeancochrane Nov 29, 2023
d41f6b9
Presign the Quarto report URL in 05-finalize.R
jeancochrane Nov 29, 2023
626d34d
Temporarily adjust Dockerfile CMD to test paws
jeancochrane Nov 29, 2023
1eba9ca
Revert "Temporarily adjust Dockerfile CMD to test paws"
jeancochrane Nov 30, 2023
dde61a6
Revert "Presign the Quarto report URL in 05-finalize.R"
jeancochrane Nov 30, 2023
24cf223
Factor S3/SNS operations out into new 06-upload.R stage
jeancochrane Nov 30, 2023
6772f45
Fix typo in README.Rmd and regenerate README
jeancochrane Nov 30, 2023
ec1c35f
Fix mixed up deps/outputs between finalize and upload stages
jeancochrane Nov 30, 2023
ae750d5
Add missing run_id variable to upload pipeline stage
jeancochrane Dec 1, 2023
3140824
Partition Quarto performance report S3 uploads by year
jeancochrane Dec 1, 2023
25c8d91
Strip everything after the first period in README feature table notes
jeancochrane Dec 1, 2023
7458bab
Clean up some typos in README
jeancochrane Dec 1, 2023
5da06da
Generalize `Updating R dependencies` section of the README
jeancochrane Dec 1, 2023
31dc99d
Generate tictoc timings for finalize pipeline stage
jeancochrane Dec 1, 2023
18cfbce
Rerender README.md
jeancochrane Dec 1, 2023
049a642
Merge branch 'master' into 24-infra-updates-generate-and-publish-a-qu…
jeancochrane Dec 1, 2023

Files changed

1 change: 1 addition & 0 deletions DESCRIPTION
@@ -0,0 +1 @@
Config/renv/profiles/reporting/dependencies: quarto, leaflet, plotly, sf
10 changes: 9 additions & 1 deletion Dockerfile
@@ -8,7 +8,13 @@ ENV RENV_PATHS_LIBRARY renv/library
RUN apt-get update && apt-get install --no-install-recommends -y \
libcurl4-openssl-dev libssl-dev libxml2-dev libgit2-dev git \
libudunits2-dev python3-dev python3-pip libgdal-dev libgeos-dev \
libproj-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev pandoc
libproj-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev pandoc \
curl gdebi-core

# Install Quarto
RUN curl -o quarto-linux-amd64.deb -L \
https://github.com/quarto-dev/quarto-cli/releases/download/v1.3.450/quarto-1.3.450-linux-amd64.deb
RUN gdebi -n quarto-linux-amd64.deb

# Install pipenv for Python dependencies
RUN pip install pipenv
@@ -26,11 +32,13 @@ RUN pipenv install --system --deploy

# Copy R bootstrap files into the image
COPY renv.lock .
COPY renv/profiles/reporting/renv.lock reporting-renv.lock
COPY .Rprofile .
COPY renv/ renv/

# Install R dependencies
RUN Rscript -e 'renv::restore()'
RUN Rscript -e 'renv::restore(lockfile = "reporting-renv.lock")'

# Copy the directory into the container
ADD ./ model-res-avm/
5 changes: 4 additions & 1 deletion R/helpers.R
@@ -36,7 +36,10 @@ model_get_s3_artifacts_for_run <- function(run_id, year) {
bucket <- strsplit(s3_objs[1], "/")[[1]][3]

# First get anything partitioned only by year
s3_objs_limited <- grep(".parquet$|.zip$|.rds$", s3_objs, value = TRUE) %>%
s3_objs_limited <- grep(
".parquet$|.zip$|.rds|.html$", s3_objs,
jeancochrane marked this conversation as resolved.
Show resolved Hide resolved
value = TRUE
) %>%
unname()

# Next get the prefix of anything partitioned by year and run_id
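
As a quick check of the updated filter: the new `.html` alternative picks up the Quarto performance report alongside the existing artifact types. A sketch with illustrative object keys (not taken from this PR):

```r
# Illustrative S3 keys; only the file extensions matter here
keys <- c(
  "s3://bucket/model/year=2023/metadata.parquet",
  "s3://bucket/model/year=2023/workflow.zip",
  "s3://bucket/model/year=2023/timing.rds",
  "s3://bucket/model/year=2023/performance.html"
)
grep(".parquet$|.zip$|.rds$|.html$", keys, value = TRUE)
#> All four keys match; before this change, the .html report was excluded
```
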
58 changes: 49 additions & 9 deletions README.Rmd
@@ -59,6 +59,7 @@ graph LR
evaluate("Evaluate")
interpret("Interpret")
finalize("Finalize")
upload("Upload")
export("Export")

ingest --> train
@@ -67,8 +68,9 @@ graph LR
assess --> evaluate
evaluate --> finalize
interpret --> finalize
finalize --> aws
finalize --> upload
finalize --> export
upload --> aws
aws --> ingest
aws --> export
```
@@ -87,9 +89,11 @@ All inputs and outputs are stored on AWS S3 using a unique run identifier. Each

4. **Interpret**: Calculate SHAP values for all the estimated values from the assess stage. These are the _per feature_ contribution to the predicted value for an _individual observation_ (usually a single PIN). Also calculate the aggregate feature importance for the entire model. The primary output of this stage is a data frame of the contributions of each feature for each property.

5. **Finalize**: Add metadata and then upload all output objects to AWS (S3). All model outputs for every model run are stored in perpetuity in S3. Each run's performance can be visualized using the CCAO's internal Tableau dashboards.
5. **Finalize**: Save run timings and metadata, and render a Quarto document containing a model performance report to `reports/performance.html` (a sketch of the render call follows this list).

6. **Export**: Export assessed values to Desk Review spreadsheets for Valuations, as well as a delimited text format for upload to the system of record (iasWorld). NOTE: This stage is only run when a final model is selected. It is not run automatically or as part of the main pipeline.
6. **Upload**: Upload all output objects to AWS (S3). All model outputs for every model run are stored in perpetuity in S3. Each run's performance can be visualized using the CCAO's internal Tableau dashboards. NOTE: This stage is only run internally, since it requires access to the CCAO Data AWS account.

7. **Export**: Export assessed values to Desk Review spreadsheets for Valuations, as well as a delimited text format for upload to the system of record (iasWorld). NOTE: This stage is only run when a final model is selected. It is not run automatically or as part of the main pipeline.
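
A minimal sketch of the render call in the new finalize stage, assuming the `quarto` R package that this PR adds to the reporting profile (the exact invocation in [pipeline/05-finalize.R](pipeline/05-finalize.R) may differ):

```r
# Render the Quarto performance report to reports/performance.html
quarto::quarto_render(
  input = "reports/performance.qmd",
  output_file = "performance.html"
)
```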

## Choices Made

@@ -250,7 +254,10 @@ dbt_manifest <- fromJSON(
get_column_description <- function(colname, dag_nodes, hardcoded_descriptions) {
# Retrieve the description for a column `colname` either from a set of
# dbt DAG nodes (`dag_nodes`) or a set of hardcoded descriptions
# (`hardcoded_descriptions`)
# (`hardcoded_descriptions`). Column descriptions that come from dbt DAG nodes
# will be truncated starting from the first period to reflect the fact that
# we use periods in our dbt documentation to separate high-level column
# summaries from their detailed notes
#
# Prefer the hardcoded descriptions, if they exist
if (colname %in% hardcoded_descriptions$column) {
@@ -267,7 +274,11 @@ get_column_description <- function(colname, dag_nodes, hardcoded_descriptions) {
if (column_name == colname) {
description <- node$columns[[column_name]]$description
if (!is.null(description) && trimws(description) != "") {
return(gsub("\n", " ", description))
# Strip everything after the first period, since we use the first
# period as a delimiter separating a column's high-level summary from
# its detailed notes in our dbt docs
summary_description <- strsplit(description, ".", fixed = TRUE)[[1]][1]
return(gsub("\n", " ", summary_description))
}
}
}
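
The truncation behaves as follows (the description string is illustrative, not taken from the dbt docs):

```r
desc <- "Median sale price. Based on arms-length sales within the last year."
strsplit(desc, ".", fixed = TRUE)[[1]][1]
#> [1] "Median sale price"
```
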
@@ -464,7 +475,7 @@ This repository represents a significant departure from the old [residential mod

### [`assessment-year-2022`](https://github.com/ccao-data/model-res-avm/tree/2022-assessment-year)

* Moved previously separate processes into this repository and improved their integration with the overall modeling process. For example, the [etl_res_data](https://gitlab.com/ccao-data-science---modeling/processes/etl_res_data) process was moved to [pipeline/00-ingest.R](pipeline/00-ingest.R), while the process to [finalize model values](https://gitlab.com/ccao-data-science---modeling/processes/finalize_model_values) was moved to [pipeline/06-export.R](pipeline/06-export.R).
* Moved previously separate processes into this repository and improved their integration with the overall modeling process. For example, the [etl_res_data](https://gitlab.com/ccao-data-science---modeling/processes/etl_res_data) process was moved to [pipeline/00-ingest.R](pipeline/00-ingest.R), while the process to [finalize model values](https://gitlab.com/ccao-data-science---modeling/processes/finalize_model_values) was moved to [pipeline/07-export.R](pipeline/07-export.R).
* Added [DVC](https://dvc.org/) support/integration. This repository uses DVC in 2 ways:
1. All input data in [`input/`](input/) is versioned, tracked, and stored using DVC. Previous input data sets are stored in perpetuity on S3.
2. [DVC pipelines](https://dvc.org/doc/user-guide/project-structure/pipelines-files) are used to sequentially run R pipeline scripts and track/cache inputs and outputs.
@@ -487,6 +498,13 @@ This repository represents a significant departure from the old [residential mod
* Dropped explicit spatial lag generation in the ingest stage.
* Lots of other bugfixes and minor improvements.

### Upcoming

* Infrastructure improvements
* Added [`build-and-run-model`](https://github.com/ccao-data/model-res-avm/actions/workflows/build-and-run-model.yaml) workflow to run the model using GitHub Actions and AWS Batch.
* Added [`delete-model-run`](https://github.com/ccao-data/model-res-avm/actions/workflows/delete-model-runs.yaml) workflow to delete test run artifacts in S3 using GitHub Actions.
* Updated the [pipeline/05-finalize.R](pipeline/05-finalize.R) step to render a performance report using Quarto, and factored S3/SNS operations out into [pipeline/06-upload.R](pipeline/06-upload.R).

# Ongoing Issues

The CCAO faces a number of ongoing issues which make modeling difficult. Some of these issues are in the process of being solved; others are less tractable. We list them here for the sake of transparency and to provide a sense of the challenges we face.
@@ -609,12 +627,15 @@ The code in this repository is written primarily in [R](https://www.r-project.or

If you're on Windows, you'll also need to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) in order to build the necessary packages. You may also want to (optionally) install [DVC](https://dvc.org/doc/install) to pull data and run pipelines.

We also publish a Docker image containing model code and all of the dependencies necessary to run it. If you're comfortable using Docker, you can skip the installation steps below and instead pull the image from `ghcr.io/ccao-data/model-res-avm:master` to run the latest version of the model.

## Installation

1. Clone this repository using git, or simply download it using the button at the top of the page.
2. Set your working directory to the local folder containing this repository's files, either using R's `setwd()` command or (preferably) using RStudio's [projects](https://support.posit.co/hc/en-us/articles/200526207-Using-Projects).
3. Install `renv`, R's package manager, by running `install.packages("renv")`.
4. Install all R package dependencies using `renv` by running `renv::restore()`. This step may take a while. Linux users will likely need to install dependencies (via apt, yum, etc.) to build from source.
5. The `finalize` step of the model pipeline requires some additional dependencies for generating a model performance report. Install them by running `renv::restore(lockfile = "renv/profiles/reporting/renv.lock")` in addition to the core dependencies installed in step 4 (see the sketch after this list). If these dependencies are missing, the report will fail to generate: the pipeline stage will write the error message to the report file at `reports/performance.html`, but the pipeline will continue to execute in spite of the failure.
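
A minimal sketch of a full installation from a fresh R session, assuming the repository root is your working directory:

```r
install.packages("renv")

# Core model pipeline dependencies (renv.lock)
renv::restore()

# Additional reporting dependencies used by the finalize step
renv::restore(lockfile = "renv/profiles/reporting/renv.lock")
```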

For installation issues, particularly related to package installation and dependencies, see [Troubleshooting](#troubleshooting).

@@ -625,8 +646,8 @@ For installation issues, particularly related to package installation and depend
To use this repository, simply open the [pipeline/](./pipeline) directory and run the R scripts in order. Non-CCAO users can skip the following stages:

* [`pipeline/00-ingest.R`](pipeline/00-ingest.R) - Requires access to CCAO internal AWS services to pull data. See [Getting Data](#getting-data) if you are a member of the public.
* [`pipeline/05-finalize.R`](pipeline/05-finalize.R) - Requires access to CCAO internal AWS services to upload model results.
* [`pipeline/06-export.R`](pipeline/06-export.R) - Only required for CCAO internal processes.
* [`pipeline/06-upload.R`](pipeline/06-upload.R) - Requires access to CCAO internal AWS services to upload model results.
* [`pipeline/07-export.R`](pipeline/07-export.R) - Only required for CCAO internal processes.

#### Using DVC

@@ -667,7 +688,7 @@ Each R script has a set of associated parameters (tracked via `dvc.yaml`). DVC w

## Output

The full model pipeline produces a large number of outputs. A full list of these outputs and their purpose can be found in [`misc/file_dict.csv`](misc/file_dict.csv). For public users, all outputs are saved in the [`output/`](output/) directory, where they can be further used/examined after a model run. For CCAO employees, all outputs are uploaded to S3 via the [finalize stage](pipeline/05-finalize.R). Uploaded Parquet files are converted into the following Athena tables:
The full model pipeline produces a large number of outputs. A full list of these outputs and their purpose can be found in [`misc/file_dict.csv`](misc/file_dict.csv). For public users, all outputs are saved in the [`output/`](output/) directory, where they can be further used/examined after a model run. For CCAO employees, all outputs are uploaded to S3 via the [upload stage](pipeline/06-upload.R). Uploaded Parquet files are converted into the following Athena tables:

#### Athena Tables

@@ -743,6 +764,25 @@ Both [Tidymodels](https://tune.tidymodels.org/articles/extras/optimizations.html
* The number of threads is set via the [num_threads](https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_threads) parameter, which is passed to the model using the `set_args()` function from `parsnip` (see the sketch after this list). By default, `num_threads` is equal to the full number of physical cores available. More (or faster) cores will decrease total training time.
* This repository uses the CPU version of LightGBM included with the [LightGBM R package](https://lightgbm.readthedocs.io/en/latest/R/index.html). If you'd like to use the GPU version you'll need to [build it yourself](https://lightgbm.readthedocs.io/en/latest/R/index.html#installing-a-gpu-enabled-build) or wait for the [upcoming CUDA release](https://github.com/microsoft/LightGBM/issues/5153).
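
A sketch of how `num_threads` can be passed through `parsnip`. The `bonsai` extension (which registers the `"lightgbm"` engine) is an assumption here; the repository may use its own engine wrapper:

```r
library(parsnip)
library(bonsai)

# Use all physical cores for LightGBM training
lgbm_spec <- boost_tree(mode = "regression") %>%
  set_engine("lightgbm") %>%
  set_args(num_threads = parallel::detectCores(logical = FALSE))
```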

## Updating R dependencies

We use multiple renv lockfiles to manage R dependencies:

1. **`renv.lock`** is the canonical list of dependencies that are used by the **core model pipeline**. Any dependencies that are required to run the model itself should be defined in this lockfile.
2. **`renv/profiles/reporting/renv.lock`** is the canonical list of dependencies that are used to **generate a model performance report** in the `finalize` step of the pipeline. Any dependencies that are required to generate that report or others like it should be defined in this lockfile.

Our goal in maintaining multiple lockfiles is to keep the list of dependencies required to run the model as short as possible. This choice adds overhead to the process of updating R dependencies, but it makes the model more maintainable over the long term.

The process for **updating core model pipeline dependencies** is straightforward: running `renv::install("<dependency_name>")` and `renv::snapshot()` will ensure that the dependency gets added or updated in `renv.lock`, as long as it is imported somewhere in the model pipeline via a `library(<dependency_name>)` call.
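
A minimal sketch of that workflow (the package name `glmnet` is illustrative, not taken from this repository):

```r
# Install or update the package in the project library
renv::install("glmnet")

# Record the new version in renv.lock
renv::snapshot()
```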

The process for updating **dependencies for other lockfiles** is more complex, since it requires the use of a separate profile when running renv commands. Determine the name of the profile you'd like to update (`<profile_name>` in the code that follows) and run the following commands (a sketch using the `reporting` profile follows this list):

1. Run `renv::activate(profile = "<profile_name>")` to set the renv profile to `<profile_name>`
2. Make sure that the dependency is defined in the `DESCRIPTION` file under the `Config/renv/profiles/<profile_name>/dependencies` key
3. Run `renv::install("<dependency_name>")` to add or update the dependency as necessary
4. Run `renv::snapshot(type = "explicit")` to update the profile's lockfile with the dependencies defined in the `DESCRIPTION` file
5. Run `renv::activate()` if you would like to switch back to the default renv profile
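
A sketch of the same workflow for the `reporting` profile, using `plotly` (already listed under `Config/renv/profiles/reporting/dependencies` in `DESCRIPTION`) as the example dependency:

```r
renv::activate(profile = "reporting")  # 1. switch to the reporting profile
# 2. confirm the package is listed under
#    Config/renv/profiles/reporting/dependencies in DESCRIPTION
renv::install("plotly")                # 3. add or update the dependency
renv::snapshot(type = "explicit")      # 4. write renv/profiles/reporting/renv.lock
renv::activate()                       # 5. switch back to the default profile
```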

## Troubleshooting

The dependencies for this repository are numerous and not all of them may install correctly. Here are some common install issues (as seen in the R console) as well as their respective resolutions: