Use dbt DAG to populate notes in the README feature table
jeancochrane committed Oct 4, 2023
1 parent d1144b8 commit 1c32a60
Showing 2 changed files with 164 additions and 105 deletions.
78 changes: 68 additions & 10 deletions README.Rmd
@@ -217,28 +217,86 @@ Model accuracy for each parameter combination is measured on a validation set us

The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved. Most of them are in use in the model as of `r Sys.Date()`.


 ```{r feature_guide, message=FALSE, results='asis', echo=FALSE}
 library(dplyr)
 library(tidyr)
 library(yaml)
+library(jsonlite)
+library(purrr)
+library(tibble)
+# Some values are derived in the model itself, so they are not documented
+# in the dbt DAG and need to be documented here
+hardcoded_descriptions <- tribble(
+  ~"column", ~"description",
+  "sale_year", "Sale year calculated as the number of years since 0 B.C.E",
+  "sale_day", "Sale day calculated as the number of days since January 1st, 1997",
+  "sale_quarter_of_year", "Character encoding of quarter of year (Q1 - Q4)",
+  "sale_month_of_year", "Character encoding of month of year (Jan - Dec)",
+  "sale_day_of_year", "Numeric encoding of day of year (1 - 365)",
+  "sale_day_of_month", "Numeric encoding of day of month (1 - 31)",
+  "sale_day_of_week", "Numeric encoding of day of week (1 - 7)",
+  "sale_post_covid", "Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020)"
+)
+
+# Load the dbt DAG from our prod docs site
+dbt_manifest <- fromJSON("https://ccao-data.github.io/data-architecture/manifest.json")
+
+get_column_description <- function(colname, dag_nodes, hardcoded_descriptions) {
+  # Retrieve the description for a column `colname` either from a set of
+  # dbt DAG nodes (`dag_nodes`) or a set of hardcoded descriptions
+  # (`hardcoded_descriptions`)
+  #
+  # Prefer the hardcoded descriptions, if they exist
+  if (colname %in% hardcoded_descriptions$column) {
+    return(
+      hardcoded_descriptions[
+        match(colname, hardcoded_descriptions$column),
+      ]$description
+    )
+  }
+  # If no hardcoded description exists, fall back to checking the dbt DAG
+  for (node_name in ls(dag_nodes)) {
+    node <- dag_nodes[[node_name]]
+    for (column_name in ls(node$columns)) {
+      if (column_name == colname) {
+        description <- node$columns[[column_name]]$description
+        if (!is.null(description) && trimws(description) != "") {
+          return(gsub("\n", " ", description))
+        }
+      }
+    }
+  }
+  # No match in either the hardcoded descriptions or the dbt DAG, so fall
+  # back to an empty string
+  return("")
+}
 params <- read_yaml("params.yaml")
-ccao::vars_dict %>%
-  filter(
-    var_is_predictor,
-    var_name_model != "meta_sale_price",
-    var_model_type %in% c("all", "res")
-  ) %>%
+param_tbl <- as_tibble(params$model$predictor$all)
+# Make a vector of column descriptions that we can add to the param tibble
+# as a new column
+param_notes <- param_tbl$value %>%
+  ccao::vars_rename(names_from = "model", names_to = "athena") %>%
+  map(\(x) get_column_description(x, dbt_manifest$nodes, hardcoded_descriptions)) %>%
+  unlist
+param_tbl %>%
+  add_column(description=param_notes) %>%
   inner_join(
-    as_tibble(params$model$predictor$all),
-    by = c("var_name_model" = "value")
+    ccao::vars_dict,
+    by = c("value" = "var_name_model")
   ) %>%
   group_by(var_name_pretty) %>%
   mutate(row = paste0("X", row_number())) %>%
   distinct(
     `Feature Name` = var_name_pretty,
     Category = var_type,
     Type = var_data_type,
-    Notes = var_notes,
+    Notes = description,
     var_value, row
   ) %>%
   mutate(Category = recode(
@@ -253,7 +311,7 @@ ccao::vars_dict %>%
     values_from = var_value
   ) %>%
   unite("Possible Values", starts_with("X"), sep = ", ", na.rm = TRUE) %>%
-  mutate(Notes = replace_na(Notes, "")) %>%
+  mutate(Notes = replace_na(Notes, list(""))) %>%
   arrange(Category) %>%
   relocate(Notes, .after = everything()) %>%
   knitr::kable(format = "markdown")