Generate and publish a quarto doc with performance results on each model run #62

Merged

Commits (29)

b16a3e6
Generate and upload model performance report in finalize pipeline step
jeancochrane Nov 21, 2023
186fe1e
Merge branch 'master' into 24-infra-updates-generate-and-publish-a-qu…
jeancochrane Nov 21, 2023
331b241
Include .html files in model_get_s3_artifacts_for_run
jeancochrane Nov 21, 2023
0144f28
Refactor repo to support reports/renv.lock lockfile
jeancochrane Nov 22, 2023
af287d2
Remove unnecessary changes to renv/activate.R
jeancochrane Nov 24, 2023
b616399
Fix missing column in performance report row of misc/file_dict.csv
jeancochrane Nov 24, 2023
25f2850
Update README with instructions on updating R dependencies
jeancochrane Nov 24, 2023
9f1c1ca
Add quarto to DESCRIPTION dependencies
jeancochrane Nov 24, 2023
aedd58b
Move reports/renv.lock -> renv/profiles/reporting/renv.lock
jeancochrane Nov 24, 2023
7ae4f6b
Properly style R/helpers.R
jeancochrane Nov 24, 2023
fd6538b
Install Quarto in Dockerfile
jeancochrane Nov 24, 2023
7b39d2f
Use the correct path to performance.qmd in 05-finalize.R step
jeancochrane Nov 27, 2023
65948dd
Move performance.qmd to the top level of the `reports/` subdir
jeancochrane Nov 29, 2023
dab451c
Factor out report generation into 05-report.R pipeline stage
jeancochrane Nov 29, 2023
d41f6b9
Presign the Quarto report URL in 05-finalize.R
jeancochrane Nov 29, 2023
626d34d
Temporarily adjust Dockerfile CMD to test paws
jeancochrane Nov 29, 2023
1eba9ca
Revert "Temporarily adjust Dockerfile CMD to test paws"
jeancochrane Nov 30, 2023
dde61a6
Revert "Presign the Quarto report URL in 05-finalize.R"
jeancochrane Nov 30, 2023
24cf223
Factor S3/SNS operations out into new 06-upload.R stage
jeancochrane Nov 30, 2023
6772f45
Fix typo in README.Rmd and regenerate README
jeancochrane Nov 30, 2023
ec1c35f
Fix mixed up deps/outputs between finalize and upload stages
jeancochrane Nov 30, 2023
ae750d5
Add missing run_id variable to upload pipeline stage
jeancochrane Dec 1, 2023
3140824
Partition Quarto performance report S3 uploads by year
jeancochrane Dec 1, 2023
25c8d91
Strip everything after the first period in README feature table notes
jeancochrane Dec 1, 2023
7458bab
Clean up some typos in README
jeancochrane Dec 1, 2023
5da06da
Generalize `Updating R dependencies` section of the README
jeancochrane Dec 1, 2023
31dc99d
Generate tictoc timings for finalize pipeline stage
jeancochrane Dec 1, 2023
18cfbce
Rerender README.md
jeancochrane Dec 1, 2023
049a642
Merge branch 'master' into 24-infra-updates-generate-and-publish-a-qu…
jeancochrane Dec 1, 2023

Files changed

1 change: 1 addition & 0 deletions DESCRIPTION
@@ -0,0 +1 @@
Config/renv/profiles/reporting/dependencies: quarto, leaflet, plotly, sf
10 changes: 9 additions & 1 deletion Dockerfile
@@ -8,7 +8,13 @@ ENV RENV_PATHS_LIBRARY renv/library
RUN apt-get update && apt-get install --no-install-recommends -y \
libcurl4-openssl-dev libssl-dev libxml2-dev libgit2-dev git \
libudunits2-dev python3-dev python3-pip libgdal-dev libgeos-dev \
libproj-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev pandoc
libproj-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev pandoc \
curl gdebi-core

# Install Quarto
RUN curl -o quarto-linux-amd64.deb -L \
https://github.com/quarto-dev/quarto-cli/releases/download/v1.3.450/quarto-1.3.450-linux-amd64.deb
RUN gdebi -n quarto-linux-amd64.deb

# Install pipenv for Python dependencies
RUN pip install pipenv
@@ -26,11 +32,13 @@ RUN pipenv install --system --deploy

# Copy R bootstrap files into the image
COPY renv.lock .
COPY renv/profiles/reporting/renv.lock reporting-renv.lock
COPY .Rprofile .
COPY renv/ renv/

# Install R dependencies
RUN Rscript -e 'renv::restore()'
RUN Rscript -e 'renv::restore(lockfile = "reporting-renv.lock")'

# Copy the directory into the container
ADD ./ model-res-avm/
5 changes: 4 additions & 1 deletion R/helpers.R
@@ -36,7 +36,10 @@ model_get_s3_artifacts_for_run <- function(run_id, year) {
bucket <- strsplit(s3_objs[1], "/")[[1]][3]

# First get anything partitioned only by year
s3_objs_limited <- grep(".parquet$|.zip$|.rds$", s3_objs, value = TRUE) %>%
s3_objs_limited <- grep(
".parquet$|.zip$|.rds|.html$", s3_objs,
jeancochrane marked this conversation as resolved.
Show resolved Hide resolved
value = TRUE
) %>%
unname()

# Next get the prefix of anything partitioned by year and run_id
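
As a quick check of the updated filter: the new `.html` alternative picks up the Quarto performance report alongside the existing artifact types. A sketch with illustrative object keys (not taken from this PR):

```r
# Illustrative S3 keys; only the file extensions matter here
keys <- c(
  "s3://bucket/model/year=2023/metadata.parquet",
  "s3://bucket/model/year=2023/workflow.zip",
  "s3://bucket/model/year=2023/timing.rds",
  "s3://bucket/model/year=2023/performance.html"
)
grep(".parquet$|.zip$|.rds$|.html$", keys, value = TRUE)
#> All four keys match; before this change, the .html report was excluded
```
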
58 changes: 49 additions & 9 deletions README.Rmd
@@ -59,6 +59,7 @@ graph LR
evaluate("Evaluate")
interpret("Interpret")
finalize("Finalize")
upload("Upload")
export("Export")

ingest --> train
@@ -67,8 +68,9 @@ graph LR
assess --> evaluate
evaluate --> finalize
interpret --> finalize
finalize --> aws
finalize --> upload
finalize --> export
upload --> aws
aws --> ingest
aws --> export
```
@@ -87,9 +89,11 @@ All inputs and outputs are stored on AWS S3 using a unique run identifier. Each

4. **Interpret**: Calculate SHAP values for all the estimated values from the assess stage. These are the _per feature_ contribution to the predicted value for an _individual observation_ (usually a single PIN). Also calculate the aggregate feature importance for the entire model. The primary output of this stage is a data frame of the contributions of each feature for each property.

5. **Finalize**: Add metadata and then upload all output objects to AWS (S3). All model outputs for every model run are stored in perpetuity in S3. Each run's performance can be visualized using the CCAO's internal Tableau dashboards.
5. **Finalize**: Save run timings and metadata, and render a Quarto document containing a model performance report to `reports/performance.html` (a sketch of the render call follows this list).

6. **Export**: Export assessed values to Desk Review spreadsheets for Valuations, as well as a delimited text format for upload to the system of record (iasWorld). NOTE: This stage is only run when a final model is selected. It is not run automatically or as part of the main pipeline.
6. **Upload**: Upload all output objects to AWS (S3). All model outputs for every model run are stored in perpetuity in S3. Each run's performance can be visualized using the CCAO's internal Tableau dashboards. NOTE: This stage is only run internally, since it requires access to the CCAO Data AWS account.

7. **Export**: Export assessed values to Desk Review spreadsheets for Valuations, as well as a delimited text format for upload to the system of record (iasWorld). NOTE: This stage is only run when a final model is selected. It is not run automatically or as part of the main pipeline.
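
A minimal sketch of the render call in the new finalize stage, assuming the `quarto` R package that this PR adds to the reporting profile (the exact invocation in [pipeline/05-finalize.R](pipeline/05-finalize.R) may differ):

```r
# Render the Quarto performance report to reports/performance.html
quarto::quarto_render(
  input = "reports/performance.qmd",
  output_file = "performance.html"
)
```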

## Choices Made

@@ -250,7 +254,10 @@ dbt_manifest <- fromJSON(
get_column_description <- function(colname, dag_nodes, hardcoded_descriptions) {
# Retrieve the description for a column `colname` either from a set of
# dbt DAG nodes (`dag_nodes`) or a set of hardcoded descriptions
# (`hardcoded_descriptions`)
# (`hardcoded_descriptions`). Column descriptions that come from dbt DAG nodes
# will be truncated starting from the first period to reflect the fact that
# we use periods in our dbt documentation to separate high-level column
# summaries from their detailed notes
#
# Prefer the hardcoded descriptions, if they exist
if (colname %in% hardcoded_descriptions$column) {
@@ -267,7 +274,11 @@ get_column_description <- function(colname, dag_nodes, hardcoded_descriptions) {
if (column_name == colname) {
description <- node$columns[[column_name]]$description
if (!is.null(description) && trimws(description) != "") {
return(gsub("\n", " ", description))
# Strip everything after the first period, since we use the first
# period as a delimiter separating a column's high-level summary from
# its detailed notes in our dbt docs
summary_description <- strsplit(description, ".", fixed = TRUE)[[1]][1]
return(gsub("\n", " ", summary_description))
}
}
}
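
The truncation behaves as follows (the description string is illustrative, not taken from the dbt docs):

```r
desc <- "Median sale price. Based on arms-length sales within the last year."
strsplit(desc, ".", fixed = TRUE)[[1]][1]
#> [1] "Median sale price"
```
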
@@ -464,7 +475,7 @@ This repository represents a significant departure from the old [residential mod

### [`assessment-year-2022`](https://github.com/ccao-data/model-res-avm/tree/2022-assessment-year)

* Moved previously separate processes into this repository and improved their integration with the overall modeling process. For example, the [etl_res_data](https://gitlab.com/ccao-data-science---modeling/processes/etl_res_data) process was moved to [pipeline/00-ingest.R](pipeline/00-ingest.R), while the process to [finalize model values](https://gitlab.com/ccao-data-science---modeling/processes/finalize_model_values) was moved to [pipeline/06-export.R](pipeline/06-export.R).
* Moved previously separate processes into this repository and improved their integration with the overall modeling process. For example, the [etl_res_data](https://gitlab.com/ccao-data-science---modeling/processes/etl_res_data) process was moved to [pipeline/00-ingest.R](pipeline/00-ingest.R), while the process to [finalize model values](https://gitlab.com/ccao-data-science---modeling/processes/finalize_model_values) was moved to [pipeline/07-export.R](pipeline/07-export.R).
* Added [DVC](https://dvc.org/) support/integration. This repository uses DVC in 2 ways:
1. All input data in [`input/`](input/) is versioned, tracked, and stored using DVC. Previous input data sets are stored in perpetuity on S3.
2. [DVC pipelines](https://dvc.org/doc/user-guide/project-structure/pipelines-files) are used to sequentially run R pipeline scripts and track/cache inputs and outputs.
@@ -487,6 +498,13 @@ This repository represents a significant departure from the old [residential mod
* Dropped explicit spatial lag generation in the ingest stage.
* Lots of other bugfixes and minor improvements.

### Upcoming

* Infrastructure improvements
* Added [`build-and-run-model`](https://github.com/ccao-data/model-res-avm/actions/workflows/build-and-run-model.yaml) workflow to run the model using GitHub Actions and AWS Batch.
* Added [`delete-model-run`](https://github.com/ccao-data/model-res-avm/actions/workflows/delete-model-runs.yaml) workflow to delete test run artifacts in S3 using GitHub Actions.
* Updated the [pipeline/05-finalize.R](pipeline/05-finalize.R) step to render a performance report using Quarto, and factored S3/SNS operations out into [pipeline/06-upload.R](pipeline/06-upload.R).

# Ongoing Issues

The CCAO faces a number of ongoing issues which make modeling difficult. Some of these issues are in the process of being solved; others are less tractable. We list them here for the sake of transparency and to provide a sense of the challenges we face.
@@ -609,12 +627,15 @@ The code in this repository is written primarily in [R](https://www.r-project.or

If you're on Windows, you'll also need to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) in order to build the necessary packages. You may also want to (optionally) install [DVC](https://dvc.org/doc/install) to pull data and run pipelines.

We also publish a Docker image containing model code and all of the dependencies necessary to run it. If you're comfortable using Docker, you can skip the installation steps below and instead pull the image from `ghcr.io/ccao-data/model-res-avm:master` to run the latest version of the model.

## Installation

1. Clone this repository using git, or simply download it using the button at the top of the page.
2. Set your working directory to the local folder containing this repository's files, either using R's `setwd()` command or (preferably) using RStudio's [projects](https://support.posit.co/hc/en-us/articles/200526207-Using-Projects).
3. Install `renv`, R's package manager, by running `install.packages("renv")`.
4. Install all R package dependencies using `renv` by running `renv::restore()`. This step may take a while. Linux users will likely need to install dependencies (via apt, yum, etc.) to build from source.
5. The `finalize` step of the model pipeline requires some additional dependencies for generating a model performance report. Install them by running `renv::restore(lockfile = "renv/profiles/reporting/renv.lock")` in addition to the core dependencies installed in step 4 (see the sketch after this list). If these dependencies are missing, the report will fail to generate: the pipeline stage will write the error message to the report file at `reports/performance.html`, but the pipeline will continue to execute in spite of the failure.
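
A minimal sketch of a full installation from a fresh R session, assuming the repository root is your working directory:

```r
install.packages("renv")

# Core model pipeline dependencies (renv.lock)
renv::restore()

# Additional reporting dependencies used by the finalize step
renv::restore(lockfile = "renv/profiles/reporting/renv.lock")
```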

For installation issues, particularly related to package installation and dependencies, see [Troubleshooting](#troubleshooting).

@@ -625,8 +646,8 @@ For installation issues, particularly related to package installation and depend
To use this repository, simply open the [pipeline/](./pipeline) directory and run the R scripts in order. Non-CCAO users can skip the following stages:

* [`pipeline/00-ingest.R`](pipeline/00-ingest.R) - Requires access to CCAO internal AWS services to pull data. See [Getting Data](#getting-data) if you are a member of the public.
* [`pipeline/05-finalize.R`](pipeline/05-finalize.R) - Requires access to CCAO internal AWS services to upload model results.
* [`pipeline/06-export.R`](pipeline/06-export.R) - Only required for CCAO internal processes.
* [`pipeline/06-upload.R`](pipeline/06-upload.R) - Requires access to CCAO internal AWS services to upload model results.
* [`pipeline/07-export.R`](pipeline/07-export.R) - Only required for CCAO internal processes.

#### Using DVC

@@ -667,7 +688,7 @@ Each R script has a set of associated parameters (tracked via `dvc.yaml`). DVC w

## Output

The full model pipeline produces a large number of outputs. A full list of these outputs and their purpose can be found in [`misc/file_dict.csv`](misc/file_dict.csv). For public users, all outputs are saved in the [`output/`](output/) directory, where they can be further used/examined after a model run. For CCAO employees, all outputs are uploaded to S3 via the [finalize stage](pipeline/05-finalize.R). Uploaded Parquet files are converted into the following Athena tables:
The full model pipeline produces a large number of outputs. A full list of these outputs and their purpose can be found in [`misc/file_dict.csv`](misc/file_dict.csv). For public users, all outputs are saved in the [`output/`](output/) directory, where they can be further used/examined after a model run. For CCAO employees, all outputs are uploaded to S3 via the [upload stage](pipeline/06-upload.R). Uploaded Parquet files are converted into the following Athena tables:

#### Athena Tables

@@ -743,6 +764,25 @@ Both [Tidymodels](https://tune.tidymodels.org/articles/extras/optimizations.html
* The number of threads is set via the [num_threads](https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_threads) parameter, which is passed to the model using the `set_args()` function from `parsnip` (see the sketch after this list). By default, `num_threads` is equal to the full number of physical cores available. More (or faster) cores will decrease total training time.
* This repository uses the CPU version of LightGBM included with the [LightGBM R package](https://lightgbm.readthedocs.io/en/latest/R/index.html). If you'd like to use the GPU version you'll need to [build it yourself](https://lightgbm.readthedocs.io/en/latest/R/index.html#installing-a-gpu-enabled-build) or wait for the [upcoming CUDA release](https://github.com/microsoft/LightGBM/issues/5153).
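
A sketch of how `num_threads` can be passed through `parsnip`. The `bonsai` extension (which registers the `"lightgbm"` engine) is an assumption here; the repository may use its own engine wrapper:

```r
library(parsnip)
library(bonsai)

# Use all physical cores for LightGBM training
lgbm_spec <- boost_tree(mode = "regression") %>%
  set_engine("lightgbm") %>%
  set_args(num_threads = parallel::detectCores(logical = FALSE))
```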

## Updating R dependencies

We use multiple renv lockfiles to manage R dependencies:

1. **`renv.lock`** is the canonical list of dependencies that are used by the **core model pipeline**. Any dependencies that are required to run the model itself should be defined in this lockfile.
2. **`renv/profiles/reporting/renv.lock`** is the canonical list of dependencies that are used to **generate a model performance report** in the `finalize` step of the pipeline. Any dependencies that are required to generate that report or others like it should be defined in this lockfile.

Our goal in maintaining multiple lockfiles is to keep the list of dependencies required to run the model as short as possible. This choice adds overhead to the process of updating R dependencies, but it makes the model more maintainable over the long term.

The process for **updating core model pipeline dependencies** is straightforward: running `renv::install("<dependency_name>")` and `renv::snapshot()` will ensure that the dependency gets added or updated in `renv.lock`, as long as it is imported somewhere in the model pipeline via a `library(<dependency_name>)` call.
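
A minimal sketch of that workflow (the package name `glmnet` is illustrative, not taken from this repository):

```r
# Install or update the package in the project library
renv::install("glmnet")

# Record the new version in renv.lock
renv::snapshot()
```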

The process for updating **dependencies for other lockfiles** is more complex, since it requires the use of a separate profile when running renv commands. Determine the name of the profile you'd like to update (`<profile_name>` in the code that follows) and run the following commands (a sketch using the `reporting` profile follows this list):

1. Run `renv::activate(profile = "<profile_name>")` to set the renv profile to `<profile_name>`
2. Make sure that the dependency is defined in the `DESCRIPTION` file under the `Config/renv/profiles/<profile_name>/dependencies` key
3. Run `renv::install("<dependency_name>")` to add or update the dependency as necessary
4. Run `renv::snapshot(type = "explicit")` to update the profile's lockfile with the dependencies defined in the `DESCRIPTION` file
5. Run `renv::activate()` if you would like to switch back to the default renv profile
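
A sketch of the same workflow for the `reporting` profile, using `plotly` (already listed under `Config/renv/profiles/reporting/dependencies` in `DESCRIPTION`) as the example dependency:

```r
renv::activate(profile = "reporting")  # 1. switch to the reporting profile
# 2. confirm the package is listed under
#    Config/renv/profiles/reporting/dependencies in DESCRIPTION
renv::install("plotly")                # 3. add or update the dependency
renv::snapshot(type = "explicit")      # 4. write renv/profiles/reporting/renv.lock
renv::activate()                       # 5. switch back to the default profile
```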

## Troubleshooting

The dependencies for this repository are numerous and not all of them may install correctly. Here are some common install issues (as seen in the R console) as well as their respective resolutions: