Skip to content

Commit

Permalink
Add Dockerfile and build-and-run-model workflow for CI model runs (#9)
Browse files Browse the repository at this point in the history
* Simplify Pipenv dependencies

* Add Dockerfile

* Add build-and-run-model workflow

* Temporarily pin build-and-run-model to development version of build-and-run-batch-job

* Temporarily restrict dvc repro to the train step, for testing

* Revert "Temporarily restrict dvc repro to the train step, for testing"

This reverts commit 299b40b.

* Unfreeze train step in dvc.yaml

* Fix typo in loc_cook_municipality_name feature in params.yaml

* Update README to reflect change in municipality name feature

* Lint README.Rmd

* Add print statements to finalize step to debug S3 failures

* Add aws.ec2metadata packaage so that ECS can authenticate with S3

* Revert "Add print statements to finalize step to debug S3 failures"

This reverts commit a5a64f5.

* Pin build-and-run-model to main branch of ccao-data/actions

* Rename loc_cook_municipality_name back to loc_tax_municipality_name

* Revert "Pin build-and-run-model to main branch of ccao-data/actions"

This reverts commit 82c3081.

* Add ingest dependencies to renv.lock

* Revert "Add ingest dependencies to renv.lock"

This reverts commit 33f83a5.

* Revert "Rename loc_cook_municipality_name back to loc_tax_municipality_name"

This reverts commit 82d6155.

* Revert "Revert "Pin build-and-run-model to main branch of ccao-data/actions""

This reverts commit e801405.

* Revert "Revert "Revert "Pin build-and-run-model to main branch of ccao-data/actions"""

This reverts commit a852f22.

* Revert "Revert "Revert "Revert "Pin build-and-run-model to main branch of ccao-data/actions""""

This reverts commit 3f2114e.
  • Loading branch information
jeancochrane authored Nov 14, 2023
1 parent fc33def commit 6cdc38f
Show file tree
Hide file tree
Showing 13 changed files with 1,242 additions and 979 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/build-and-run-model.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Workflow that builds a Docker image containing the model code,
# pushes it to the GitHub Container Registry, and then optionally uses
# that container image to run the model using an AWS Batch job.
#
# Images are built on every commit to a PR or main branch in order to ensure
# that the build continues to work properly, but Batch jobs are gated behind
# a `deploy` environment that requires manual approval from a
# @ccao-data/core-team member.

name: build-and-run-model

on:
pull_request:
types: [opened, reopened, synchronize, closed]
workflow_dispatch:
push:
branches: [master]

jobs:
build-and-run-model:
permissions:
# contents:read and id-token:write permissions are needed to interact
# with GitHub's OIDC Token endpoint so that we can authenticate with AWS
contents: read
id-token: write
# While packages:write is usually not required for workflows, it is
# required in order to allow the reusable called workflow to push to
# GitHub Container Registry
packages: write
uses: ccao-data/actions/.github/workflows/build-and-run-batch-job.yaml
with:
vcpu: "16.0"
memory: "65536"
role-duration-seconds: 14400 # Worst-case time for a full model run
secrets:
AWS_IAM_ROLE_TO_ASSUME_ARN: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}
AWS_ACCOUNT_ID: ${{ secrets.AWS_ACCOUNT_ID }}
45 changes: 45 additions & 0 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
FROM rocker/r-ver:4.3.1

# Use PPM for binary installs
ENV RENV_CONFIG_REPOS_OVERRIDE "https://packagemanager.posit.co/cran/__linux__/jammy/latest"
ENV RENV_PATHS_LIBRARY renv/library

# Install system dependencies
RUN apt-get update && apt-get install --no-install-recommends -y \
libcurl4-openssl-dev libssl-dev libxml2-dev libgit2-dev git \
libudunits2-dev python3-dev python3-pip libgdal-dev libgeos-dev \
libproj-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev pandoc

# Install pipenv for Python dependencies
RUN pip install pipenv

# Copy pipenv files into the image. The reason this is a separate step from
# the later step that adds files from the working directory is because we want
# to avoid having to reinstall dependencies every time a file in the directory
# changes, as Docker will bust the cache of every layer following a layer that
# needs to change
COPY Pipfile .
COPY Pipfile.lock .

# Install Python dependencies
RUN pipenv install --system --deploy

# Copy R bootstrap files into the image
COPY renv.lock .
COPY .Rprofile .
COPY renv/ renv/

# Install R dependencies
RUN Rscript -e 'renv::restore()'

# Copy the directory into the container
ADD ./ model-condo-avm/

# Copy R dependencies into the app directory
RUN rm -Rf model-condo-avm/renv
RUN mv renv model-condo-avm/

# Set the working directory to the app dir
WORKDIR model-condo-avm/

CMD dvc pull && dvc repro
6 changes: 1 addition & 5 deletions Pipfile
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,7 @@ verify_ssl = true
name = "pypi"

[packages]
numpy = "*"
pandas = "*"
scipy = "*"
scikit-learn = "*"
dvc = "*"
"dvc[s3]" = "*"

[dev-packages]

Expand Down
2,087 changes: 1,131 additions & 956 deletions Pipfile.lock

Large diffs are not rendered by default.

16 changes: 7 additions & 9 deletions README.Rmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: "Table of Contents"
output:
output:
github_document:
toc: true
toc_depth: 3
Expand Down Expand Up @@ -46,7 +46,7 @@ The repository itself contains the [code](./pipeline) and [data](./input) for th

## Differences Compared to the Residential Model

The Cook County Assessor's Office ***does not track characteristic data for condominiums***. Like most assessors nationwide, our office staff cannot enter buildings to observe property characteristics. For condos, this means we cannot observe amenities, quality, or any other interior characteristics.
The Cook County Assessor's Office ***does not track characteristic data for condominiums***. Like most assessors nationwide, our office staff cannot enter buildings to observe property characteristics. For condos, this means we cannot observe amenities, quality, or any other interior characteristics.

The only information our office has about individual condominium units is their age, location, sale date/price, and percentage of ownership. This makes modeling condos particularly challenging, as the number of usable features is quite small. Fortunately, condos have two qualities which make modeling a bit easier:

Expand Down Expand Up @@ -106,11 +106,9 @@ ccao::vars_dict %>%
)
) %>%
mutate(`Unique to Condo Model` = ifelse(
var_name_model != "loc_tax_municipality_name" & (
var_name_model %in% condo_unique_preds |
`Feature Name` %in%
c("Condominium Building Year Built", "Condominium % Ownership")
),
var_name_model %in% condo_unique_preds |
`Feature Name` %in%
c("Condominium Building Year Built", "Condominium % Ownership"),
"X", ""
)) %>%
arrange(desc(`Unique to Condo Model`), Category) %>%
Expand All @@ -124,7 +122,7 @@ For the most part, condos are valued the same way as single- and multi-family re

However, because the CCAO has so [little information about individual units](#differences-compared-to-the-residential-model), we must rely on the [condominium percentage of ownership](#features-used) to differentiate between units in a building. This feature is effectively the proportion of the building's overall value held by a unit. It is created when a condominium declaration is filed with the County (usually by the developer of the building). The critical assumption underlying the condo valuation process is that percentage of ownership correlates with current market value.

Percentage of ownership is used in two ways:
Percentage of ownership is used in two ways:

1. It is used directly as a predictor/feature in the regression model to estimate differing unit values within the same building.
2. It is used to reapportion unit values directly i.e. the value of a unit is ultimately equal to `% of ownership * total building value`.
Expand Down Expand Up @@ -186,7 +184,7 @@ The condo model relies on sales within the same building to calculate [strata](#

Fortunately, buildings without any recent sales are relatively rare, as condos have a higher turnover rate than single and multi-family property. Smaller buildings with low turnover are the most likely to not have recent sales.

### Buildings Without Sales
### Buildings Without Sales

When no sales have occurred in a building in the 5 years prior to assessment, the building's strata features are imputed. The model will look at nearby buildings that have similar unit counts/age and then try to assign an appropriate strata to the target building.

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -152,12 +152,12 @@ the 2023 assessment model.
| Percent Population Mobility, Moved From Within Same County in Past Year | acs5 | numeric | |
| Longitude | loc | numeric | |
| Latitude | loc | numeric | |
| Municipality Name | loc | character | |
| FEMA Special Flood Hazard Area | loc | logical | |
| First Street Factor | loc | numeric | |
| First Street Risk Direction | loc | numeric | |
| School Elementary District GEOID | loc | character | |
| School Secondary District GEOID | loc | character | |
| Municipality Name | loc | character | |
| CMAP Walkability Score (No Transit) | loc | numeric | |
| CMAP Walkability Total Score | loc | numeric | |
| Airport Noise DNL | loc | numeric | |
Expand Down
2 changes: 1 addition & 1 deletion dvc.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ stages:
cache: false
- output/workflow/recipe/model_workflow_recipe.rds:
cache: false
frozen: true

assess:
cmd: Rscript pipeline/02-assess.R
desc: >
Expand Down
4 changes: 2 additions & 2 deletions params.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -178,7 +178,7 @@ model:
"char_full_baths",
"loc_longitude",
"loc_latitude",
"loc_tax_municipality_name",
"loc_cook_municipality_name",
"loc_env_flood_fema_sfha",
"loc_env_flood_fs_factor",
"loc_env_flood_fs_risk_direction",
Expand Down Expand Up @@ -244,7 +244,7 @@ model:
categorical: [
"meta_township_code",
"meta_nbhd_code",
"loc_tax_municipality_name",
"loc_cook_municipality_name",
"loc_school_elementary_district_geoid",
"loc_school_secondary_district_geoid",
"time_sale_quarter_of_year",
Expand Down
2 changes: 1 addition & 1 deletion pipeline/01-train.R
Original file line number Diff line number Diff line change
Expand Up @@ -347,7 +347,7 @@ test %>%
select(
meta_year, meta_pin, meta_pin10, meta_class, meta_card_num, meta_lline_num,
meta_triad_code, meta_township_code, meta_nbhd_code,
loc_tax_municipality_name, loc_ward_num, loc_census_puma_geoid,
loc_cook_municipality_name, loc_ward_num, loc_census_puma_geoid,
loc_census_tract_geoid, loc_school_elementary_district_geoid,
loc_school_secondary_district_geoid, loc_school_unified_district_geoid,
char_unit_sf, char_building_sf, char_full_baths, char_half_baths,
Expand Down
2 changes: 1 addition & 1 deletion pipeline/02-assess.R
Original file line number Diff line number Diff line change
Expand Up @@ -293,7 +293,7 @@ assessment_data_pin <- assessment_data_merged %>%
# Keep locations, prior year values, and indicators
loc_longitude, loc_latitude,
starts_with(c(
"loc_property_", "loc_tax_", "loc_ward_", "loc_chicago_",
"loc_property_", "loc_cook_", "loc_ward_", "loc_chicago_",
"loc_census", "loc_school_", "prior_", "ind_"
)),
meta_pin10_5yr_num_sale,
Expand Down
6 changes: 3 additions & 3 deletions pipeline/03-evaluate.R
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ col_rename_dict <- c(
"geography_id" = "meta_township_code",
"geography_id" = "meta_township_name",
"geography_id" = "meta_nbhd_code",
"geography_id" = "loc_tax_municipality_name",
"geography_id" = "loc_cook_municipality_name",
"geography_id" = "loc_ward_num",
"geography_id" = "loc_census_puma_geoid",
"geography_id" = "loc_census_tract_geoid",
Expand Down Expand Up @@ -80,7 +80,7 @@ if (run_type == "full") {
assessment_data_pin <- read_parquet(paths$output$assessment_pin$local) %>%
select(
meta_pin, meta_class, meta_triad_code, meta_township_code, meta_nbhd_code,
loc_tax_municipality_name, loc_ward_num, loc_census_puma_geoid,
loc_cook_municipality_name, loc_ward_num, loc_census_puma_geoid,
loc_census_tract_geoid, loc_school_elementary_district_geoid,
loc_school_secondary_district_geoid, loc_school_unified_district_geoid,
char_total_bldg_sf, char_unit_sf, prior_far_tot, prior_near_tot,
Expand Down Expand Up @@ -343,7 +343,7 @@ gen_agg_stats_quantile <- function(data, truth, estimate,
# class or no class option for each level
geographies_quosures <- rlang::quos(
meta_township_code,
meta_nbhd_code, loc_tax_municipality_name,
meta_nbhd_code, loc_cook_municipality_name,
loc_ward_num, loc_census_puma_geoid, loc_census_tract_geoid,
loc_school_elementary_district_geoid, loc_school_secondary_district_geoid,
loc_school_unified_district_geoid,
Expand Down
1 change: 1 addition & 0 deletions pipeline/05-finalize.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
suppressPackageStartupMessages({
library(arrow)
library(aws.s3)
library(aws.ec2metadata)
library(ccao)
library(dplyr)
library(here)
Expand Down
11 changes: 11 additions & 0 deletions renv.lock
Original file line number Diff line number Diff line change
Expand Up @@ -172,6 +172,17 @@
],
"Hash": "89761c5b8e6ce66e7176cbb70a892df8"
},
"aws.ec2metadata": {
"Package": "aws.ec2metadata",
"Version": "0.2.0",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
"curl",
"jsonlite"
],
"Hash": "2e630add89f47540d9da2240710e2702"
},
"aws.s3": {
"Package": "aws.s3",
"Version": "0.3.21",
Expand Down

0 comments on commit 6cdc38f

Please sign in to comment.