This directory stores the configuration for building our data catalog using dbt.
- 🖼️ Background: What does the data catalog do?
- 💻 How to develop the catalog
- ➕ How to add a new model
- 🔨 How to rebuild models using GitHub Actions
- 🧪 How to add and run tests and QC reports
- 🐛 Debugging tips
- 📖 Data documentation
- 📝 Design doc for our decision to develop our catalog with dbt
- 🧪 Generic tests we use for testing
The data catalog accomplishes a few main goals:
The models defined in models/
represent the SQL operations that
we run in order to transform raw data into output data for use in statistical
modeling and reporting. These transformations comprise a DAG that we use for
documenting our data and rebuilding it efficiently. We use the dbt run
command to build these
models into views and tables in AWS Athena.
Note
When we talk about "models" in these docs, we generally mean the resources that dbt calls "models", namely the definitions of the tables and views in our Athena warehouse. In contrast, we will use the phrase "statistical models" wherever we mean to discuss the algorithms that we use to predict property values.
The tests defined in the schema.yml
files in the models/
directory
set the specification for our source data and its transformations. These
specs allow us to build confidence in the
integrity of the data that we use and publish. We use the
dbt test
command to run these
tests against our Athena warehouse.
The DAG definition is parsed by dbt and used to build our data
documentation. We use the
dbt docs generate
command to generate these docs.
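Taken together, those are the three core dbt commands that the rest of this document builds on. For example:

# Build models into views and tables in Athena
dbt run

# Run the test suite against our Athena warehouse
dbt test

# Parse the DAG and generate the static documentation site
dbt docs generate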
The workflows, actions, and scripts defined in the .github/
directory work together to perform all of our dbt operations
automatically, and to integrate with the development cycle such that new
commits to the main branch of this repository automatically deploy changes to
our tables, views, tests, and docs. Automated tasks include:
- (Re)building any models that have been added or modified since the last commit to the main branch (the build-and-test-dbt workflow)
- Running tests for any models that have been added or modified since the last commit, including tests that have themselves been added or modified (the build-and-test-dbt workflow)
- Running tests for all models once per day (the test-dbt-models workflow)
- Checking the freshness of our source data once per day (the test-dbt-source-freshness workflow)
- Generating and deploying data documentation to our docs site on every commit to the main branch (the deploy-dbt-docs workflow)
- Cleaning up temporary resources in our Athena warehouse whenever a pull request is merged into the main branch (the cleanup-dbt-resources workflow)
These instructions are for Ubuntu, which is the only platform we've tested.
For background, see the docs on installing dbt with pip. (You don't need to follow these docs in order to install dbt; the steps that follow will take care of that for you.)
- Python3 with uv installed (pre-installed on the CCAO server)
- AWS CLI installed locally
  - You'll also need permissions for Athena, Glue, and S3
- aws-mfa installed locally
Run the following commands in this directory:
uv venv
source .venv/bin/activate
uv python install
uv pip install .
dbt deps
To run dbt commands, make sure you have the virtual environment activated:
source .venv/bin/activate
You must also authenticate with AWS using MFA if you haven't already today:
aws-mfa
We use the dbt build
command to build tables and
views (called models and
seeds in dbt jargon) in our
Athena data warehouse. See the following sections for specific instructions
on how to build development and
production models.
When passed no arguments, dbt build
will default to building
all tables and views in development schemas dedicated to your user. The full
build takes about three hours and thirty minutes, so we don't recommend running
it from scratch.
Instead, start by copying the production dbt state file (also known as the manifest file):
aws s3 cp s3://ccao-dbt-cache-us-east-1/master-cache/manifest.json master-cache/manifest.json
Then, use dbt clone
to
clone the production tables and views into your development environment:
dbt clone --state master-cache
This will copy all production views and tables into a new set of Athena schemas
prefixed with your Unix $USER
name (e.g. z_dev_jecochr_default
for the
default
schema when dbt
is run on Jean's machine).
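If you'd like to confirm that the clone created your development schemas, one option (a sketch, assuming your AWS credentials have Glue read access) is to list the Glue databases that carry your prefix:

# List Athena/Glue databases prefixed with your dev namespace
aws glue get-databases \
  --query "DatabaseList[?starts_with(Name, 'z_dev_${USER}')].Name" \
  --output text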
Once you've copied prod tables and views into your development schemas, you can rebuild specific tables and views using dbt's node selection syntax.
Use --select
to build one specific table/view, or a group of tables/views.
Here are some example commands that use --select
to build a subset of all
tables/views:
# This builds just the vw_pin_universe view
dbt build --select default.vw_pin_universe --resource-types model
# This builds vw_pin_universe as well as vw_pin10_location
dbt build --select default.vw_pin_universe location.vw_pin10_location --resource-types model
# This builds all models and seeds in the default schema
dbt build --select default.* --resource-types model seed
Note
If you are building a Python model,
your model may require external dependencies be available on S3.
To make these dependencies available to your model, run the
build-and-test-dbt
workflow on your branch to deploy any Python dependencies
that you've added to the config.packages
attribute
on your model.
By default, all dbt
commands will run against the dev
environment (called
a target in
dbt jargon), which namespaces the resources it creates by prefixing database
names with z_dev_
and your Unix $USER
name.
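If you're ever unsure which profile and connection your commands will use, dbt debug validates your configuration and tests the warehouse connection (a quick sanity check; it isn't required for normal development, and the model name below is just one of the examples used above):

# Validate profile configuration and test the warehouse connection
dbt debug

# Equivalent to the default; pass --target explicitly if you want to be sure
dbt build --select default.vw_pin_universe --target dev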
You should almost never have to manually build tables and views in our production environment, since this repository is configured to automatically deploy production models using GitHub Actions for continuous integration. However, in the rare case that you need to manually build models in production, see 🔨 How to rebuild models using GitHub Actions.
If you'd like to remove your development resources to keep our Athena data sources tidy, you have two options: Delete all of your development Athena databases, or delete a selection of Athena databases.
To delete all the resources in your local environment (i.e. every Athena
database with a name matching the pattern z_dev_$USER_$SCHEMA
):
../.github/scripts/cleanup_dbt_resources.sh dev
To instead delete a selected database, use the aws glue delete-database
command:
aws glue delete-database --name z_dev_jecochr_default
Note that these two operations will only delete Athena databases, and will leave
intact any parquet files that your queries created in S3. If you would like to
remove those files as well, delete them in the S3 console or using the aws s3 rm
command.
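If you do want to remove those files from the command line, a hedged sketch looks like the following; the bucket and prefix here are placeholders, so check your Athena workgroup's query result location for the real path first:

# Preview what would be deleted (--dryrun makes no changes)
aws s3 rm s3://your-dev-athena-results-bucket/z_dev_jecochr_default/ --recursive --dryrun

# Remove the files once you've confirmed the list
aws s3 rm s3://your-dev-athena-results-bucket/z_dev_jecochr_default/ --recursive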
Run tests for all models:
dbt test
Run all tests for one model:
dbt test --select default.vw_pin_universe
Run only one test:
dbt test --select default_vw_pin_universe_unique_by_14_digit_pin_and_year
Run a test against the prod models:
dbt test --select default_vw_pin_universe_unique_by_14_digit_pin_and_year --target prod
Run tests for dbt macros:
dbt run-operation test_all
Note that we configure dbt's asset-paths
attribute in
order to link to images in our documentation. Some of those images, like the
Mermaid diagram defined in assets/dataflow-diagram.md
, are generated
automatically during the deploy-dbt-docs
deployment workflow. To generate
them locally, make sure you have
mermaid-cli
installed (we
recommend a local
installation) and
run the following command:
for file in assets/*.mmd; do
./node_modules/.bin/mmdc -i "$file" -o "${file/.mmd/.svg}"
done
Then, generate the documentation:
dbt docs generate
This will create a set of static files in the target/
subdirectory that can
be used to serve the docs site.
To serve the docs locally:
dbt docs serve
Then, navigate to http://localhost:8080 to view the site.
To request the addition of a new model, open an issue using the Add a new dbt model issue template. The assignee should follow the checklist in the body of the issue in order to add the model to the DAG.
There are a few subtleties to consider when requesting a new model, outlined below.
We default to SQL models, since they are simple and well-supported, but in some cases we make use of Python models instead. Prefer a Python model if all of the following conditions are true:
- The model requires complex transformations that are simpler to express using pandas than using SQL
- The model only depends on (i.e. joins to) other models materialized as tables, and does not depend on any models materialized as views
- The model's pandas code only imports third-party packages that are either
preinstalled in the Athena PySpark
environment
or that are pure Python (i.e. that do not include any C extensions or code in
other languages)
- The most common packages that we need that are not pure Python are
geospatial analysis packages like
geopandas
If your Python model needs to use a third-party pure Python package that is not preinstalled in the Athena PySpark environment, you can configure the dependency to be automatically deployed to our S3 bucket that stores PySpark dependencies as part of the dbt build workflow on GitHub Actions. Follow these steps to include your dependency:
- Update the config.packages array on your model definition in your model's schema.yml file to add elements for each of the packages you want to install
  - Make sure to provide a specific version for each package so that our builds are deterministic
  - Unlike a typical pip install call, the dependency resolver will not automatically install your dependency's dependencies, so check the dependency's documentation to see if you need to manually specify any other dependencies in order for your dependency to work
# Example -- replace `model.name` with your model name, `dependency_name` with
# your dependency name, and `X.Y.Z` with the version of the dependency you want
# to install
models:
- name: database_name.table_name
config:
packages:
- "dependency_name==X.Y.Z"
- Add an
sc.addPyFile
call to the top of the Python code that represents your model's query definition so that PySpark will make the dependency available in the context of your code
# Example -- replace `dependency_name` with your dependency name and `X.Y.Z`
# with the version of the dependency you want to import
# type: ignore
sc.addPyFile( # noqa: F821
"s3://ccao-athena-dependencies-us-east-1/dependency_name==X.Y.Z.zip"
)
- Call
import dependency_name
as normal in your script to make use of the dependency
# Example -- replace `dependency_name` with your dependency name
import dependency_name
See the reporting.ratio_stats
model for an example of this type of
configuration.
There are a number of different ways of materializing tables in Athena using dbt; see the dbt docs for more detail.
So far our DAG only uses view and table materialization, although we are interested in eventually incorporating incremental materialization as well.
The choice between view and table materialization depends on the runtime and downstream consumption of the model query. There is no hard and fast rule, but as a general guideline, consider materializing a model as a table if queries using the table take longer than 30s to execute, or if the model is consumed by another model that is itself computationally intensive.
Our DAG is configured to materialize models as views by default, so extra configuration is only required for non-view materialization.
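For reference, one way to opt a model into table materialization is to set the materialized config in its schema.yml entry (a minimal sketch; location.tax is just the example table name used elsewhere in these docs):

models:
  - name: location.tax
    config:
      materialized: table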
Models should be namespaced according to the database that the model lives in
(e.g. location.tax
for the tax
table in the location
database). Since
dbt does not yet support namespacing for refs, we include the database as a
prefix in the model name to simulate real namespacing, and we override the
generate_alias_name
macro to strip out this fake namespace when generating
names of tables and views in Athena
(docs).
In addition to database namespacing, views should be named with a vw_
prefix
(e.g. location.vw_pin10_location
) to mark them as a view, while tables do not
require any prefix (e.g. location.tax
).
Finally, for the sake of consistency and ease of interpretation, all tables and views should be named using the singular case e.g. location.tax
rather than location.taxes
.
Models are generally defined in the schema.yml
file within each database
subdirectory. Resources related to each model should be defined inline (with
the exception of columns; see Column descriptions).
For complicated models with many columns or tests, we split schema.yml
files into individual files per model. These files should be contained in a
schema/
subdirectory within each database directory, and should be named
using the fully namespaced model name. For example, the model definition
for iasworld.sales
lives in models/iasworld/schema/iasworld.sales.yml
.
All new models should include, at minimum, a
description
of the model itself. We store these model-level descriptions as docs
blocks
within the docs.md
file of each schema subdirectory.
Descriptions related to models in a schema/
subdirectory should still live
in docs.md
. For example, the description for default.vw_pin_universe
lives
in models/default/docs.md
.
New models should also include descriptions for each column. Since the first few characters of a column description will be shown in the documentation in a dedicated column on the "Columns" table, column descriptions should always start with a sentence that is short and simple. This allows docs readers to scan the "Columns" table and understand what the column represents at a high level.
Column descriptions can live in three separate places with the following hierarchy:
- models/shared_columns.md - Definitions shared across all databases and models
- models/$DATABASE/columns.md - Definitions shared across a single database
- models/$DATABASE/schema.yml OR models/$DATABASE/schema/$DATABASE-$MODEL.yml - Definitions specific to a single model
We use the following pattern to determine where to define each column description:
- If a description is shared by three or more resources across multiple databases, its text should be defined as a docs block in models/shared_columns.md. The docs block identifier for each column should have a shared_column_ prefix.
- If a description is shared by three or more resources across multiple models in the same database, its text should be defined as a docs block in models/$DATABASE/columns.md. The docs block identifier for each column should have a column_ prefix.
- If a description is shared between two or fewer columns, its text should be defined inline in the schema.yml file under the description key for the column.
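For example, a column whose description lives in a shared docs block is referenced from the model's schema.yml like this (the column names and docs block identifiers here are illustrative; the docs block itself would live in models/shared_columns.md or models/$DATABASE/columns.md):

models:
  - name: default.vw_pin_universe
    columns:
      # Description shared across databases, so it references a shared_column_ docs block
      - name: pin
        description: '{{ doc("shared_column_pin") }}'
      # Description used only by this model, so it is defined inline
      - name: triad_name
        description: Short first sentence that summarizes the column at a glance.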
New models should generally be added with accompanying tests to ensure the underlying data and transformations are correct. For more information on testing, see 🧪 How to add and run tests and QC reports.
GitHub Actions can be used to manually rebuild part or all of our dbt DAG. To use this functionality:
- Go to the build-and-test-dbt workflow page
- Click the Run workflow dropdown on the right-hand side of the screen
- Populate the input box following the instructions below
- Click Run workflow, then click the created workflow run to view progress
The workflow input box expects a space-separated list of dbt model names or selectors.
Multiple models can be passed at the same time, as the input box values are
passed directly to dbt build
. Model names must include the database schema name. Some possible inputs include:
- default.vw_pin_sale - Rebuild a single view
- default.vw_pin_sale default.vw_pin_universe - Rebuild two views at once
- +default.vw_pin_history - Rebuild a view and all its upstream dependencies
- location.* - Rebuild all views under the location schema
- path:models - Rebuild the full DAG (:warning: takes a long time!)
For more possible inputs using dbt node selection, see the documentation site.
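If you prefer the command line, the same dispatch can usually be triggered with the GitHub CLI. This is only a sketch: the workflow filename and the input name (models) are assumptions here, so check the workflow definition under .github/workflows/ for the real values before running it:

# Trigger build-and-test-dbt for the current branch, rebuilding one view
# (workflow filename and input name are assumptions; verify them first)
gh workflow run build-and-test-dbt.yaml \
  --ref "$(git branch --show-current)" \
  -f models="default.vw_pin_sale"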
We test the integrity of our raw data and our transformations using a few different types of tests and reports:
Test type | Description | Implemented with | Runs on | Runs when | Outputs | Fails on | Who fixes | Use when |
---|---|---|---|---|---|---|---|---|
iasWorld tests | Test that hard-and-fast assumptions about iasWorld tables are correct. | dbt data tests | GitHub Actions, or command line | Daily after Spark ingest, or on demand | Excel workbook and metadata parquet files | Doesn't fail unless tests encounter an error | Valuations | Test logic exclusively references iasWorld tables, and test failures indicate clear, fixable problems |
Unit tests | Test that the transformation logic inside a model definition produces the correct output on a specific set of hypothetical input data. | dbt unit tests | GitHub Actions | Every PR or commits to main branch that modify models/tests | GitHub Actions failure notification | Any test failure | Data | Test logic is checking transformation behavior and the full universe of possible input values is easy to express in a few examples |
Data integrity tests | Test that hard-and-fast assumptions about non-iasWorld tables, or tables that merge iasWorld and non-iasWorld data, are correct. | dbt data tests | GitHub Actions | Weekly on a schedule, or on PRs and commits to main branch that modify models/tests | GitHub Actions failure notification | Any test failure | Data | Test logic references data outside iasWorld, or the full universe of possible input values cannot be easily expressed in a few examples |
QC reports | Query for suspicious cases that might indicate a problem with our data, but that can't be confirmed automatically. | dbt models | Command line | On demand | One or more Excel workbooks | Doesn't fail unless queries encounter an error | Valuations | Failures do not necessarily indicate data problems, and a stakeholder has agreed to review output |
The distinctions between these types of tests and reports can be subtle. Here is a rough guide to how we think about choosing the right type of test or report for a given task:
- If you can express the test's logic exclusively in the context of an iasWorld table, and if you think an iasWorld data owner could immediately act on the results of the test without having to know any additional context other than attributes of the row in the iasWorld table that failed the test, you should define the test as an iasWorld test.
- If the test is checking that a transformation produces a correct value, and if you can represent the full universe of possible raw values that could produce the transformed value in a few simple examples, you should define the test as a unit test.
- If you can't be sure whether a failure indicates a data problem, and if you have partnered with a stakeholder who is committed to reviewing the failures, you should define the check as a QC report.
- In all other cases, you should define the test as a data integrity test.
Here are some real-world examples of each type of test or report:
- iasWorld tests: Test that the Basement Type column in the iasworld.dweldat table is not null
- Unit tests: Test that the default.vw_pin_appeal view strips non-alphanumeric characters from the class code
- Data integrity tests: Test that the default.vw_pin_condo_char view is unique by PIN and year
- QC reports: Query for condo PINs that are nonlivable but that have characteristics corresponding to a livable unit
The following sections describe how to add and run each of these types of tests and reports.
Our iasWorld data test suite checks that hard-and-fast assumptions about data in our iasWorld system of record are correct.
For help running iasWorld tests, see Running iasWorld tests. For help adding iasWorld tests, see Adding iasWorld tests and Choosing a generic test for your data test.
The iasWorld test suite can be run using the run_iasworld_data_tests
script.
This script runs the tests and reads the metadata for the run to output a number of
different artifacts with information about the tests:
- An Excel workbook with detailed information on each failure to aid in resolving data problems
- Parquet files representing metadata tables that can be uploaded to S3 for aggregate analysis
There are two instances when iasWorld tests typically run:
- Once per day by the test-dbt-models GitHub workflow, which pushes Parquet output to S3 in order to support our analysis of test failures over time
- On demand by a Data team member whenever a Valuations staff member requests a copy of the Excel workbook for a township, usually right before the town closes
Since the first instance is a scheduled job that requires no intervention, the following steps describe how to respond to a request from Valuations staff for a fresh copy of the test failure output before town closing.
Typically, Valuations staff will ask for test output for a specific township. We'll refer to the
township code for this township
using the bash variable $TOWNSHIP_CODE
.
Run the tests locally using the run_iasworld_data_tests
script:
# Make sure you're in the dbt subdirectory with the virtualenv activated
cd dbt
source .venv/bin/activate
# Run the script
python3 scripts/run_iasworld_data_tests.py --township $TOWNSHIP_CODE
Then, check the Excel workbook that the script produced to make sure it's formatted correctly, and send it to Valuations staff for review.
We implement iasWorld tests using dbt tests
to check that hard-and-fast assumptions about our raw iasWorld data are correct.
We prefer adding tests inline in schema.yml
config files using generic
tests,
rather than singular
tests.
There are a few specific modifications a test author needs to make to
ensure that the run_iasworld_data_tests
script can properly run the test and
interpret its results:
- One of either the test or the model that the test is defined on must be tagged with the tag data_test_iasworld
  - Prefer tagging the model, and fall back to tagging the test if for some reason the model cannot be tagged (e.g. if it has some non-QC tests defined on it)
- If you would like to disable a data test but you don't want to remove it altogether, you can tag it or its model with data_test_iasworld_exclude_from_workbook, which will prevent the test (or all of the model's tests, if you tagged the model) from running as part of the select_data_test_iasworld selector
- The test definition must supply a few specific parameters:
  - name must be set and follow the pattern iasworld_<table_name>_<test_description>
  - additional_select_columns must be set to an array of strings representing any extra columns that need to be output by the test for display in the workbook
    - Generics typically select any columns mentioned by other parameters, but if you are unsure which columns will be selected by default (meaning they do not need to be included in additional_select_columns), consult our documentation for the generic test you're using
  - config.where should typically be set to provide a filter expression that restricts tests to unique rows and to rows matching a date range set by the data_test_iasworld_year_start and data_test_iasworld_year_end project variables
  - meta should be set with a few specific string attributes:
    - description (required): A short human-readable description of the test
    - category (optional): A workbook category for the test, required if a category is not defined for the test's generic in the TEST_CATEGORIES constant in the run_iasworld_data_tests script
    - table_name (optional): The name of the table to report in the output workbook, if the workbook should report a different table name than the name of the model that the test is defined on
See the iasworld_pardat_class_in_ccao_class_dict
test
for an example of a test that sets these attributes.
Due to the similarity of parameters defined on iasWorld tests, we make extensive use of YAML anchors and aliases to define symbols for commonly-used values. See here for a brief explanation of the YAML anchor and alias syntax.
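Putting these requirements together, here is an illustrative sketch of an iasWorld test definition. The generic test name, columns, and filter below are hypothetical stand-ins rather than copied from our suite, so treat the iasworld_pardat_class_in_ccao_class_dict test and our generic test documentation as the source of truth:

models:
  - name: iasworld.pardat
    config:
      tags:
        - data_test_iasworld
    columns:
      - name: class
        data_tests:
          # "not_null" stands in for one of our generics that accepts these parameters
          - not_null:
              name: iasworld_pardat_class_not_null
              additional_select_columns: ["parid", "taxyr", "who", "wen"]
              config:
                # The &current_rows anchor lets other tests in this file reuse the filter via *current_rows
                where: &current_rows cur = 'Y' and deactivat is null
              meta:
                description: class should not be null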
Writing a data test in a schema.yml
file requires a generic
test
to define the underlying test logic. Our generic tests are defined
in the tests/generic/
directory. Before writing a test, look at
the documentation for our generics to see if
any of them meet your needs.
If a generic test does not meet your needs but seems like it could be easily extended to meet your needs (say, if it inner joins two tables but you would like to be able to configure it to left join those tables instead) you can modify the macro that defines the generic test as part of your PR to make the change that you need.
If no generic tests meet your needs and none can be easily modified to do so, you have two options:
- Define a new model in the models/qc/ directory that can use a pre-existing generic. This is a good option if, say, you need to join two or more tables in a complex way that is specific to your test and not easily generalizable. With this approach, you can perform that join in the model, and then the generic test doesn't need to know anything about it.
- Write a new generic test. If you decide to take this approach, make sure to read the docs on writing custom generic tests. This is a good option if you think that the logic you need for your test will be easily generalizable to other models and other tests. You'll also need to follow a few extra steps that are specific to our environment:
  - Add a default category for your generic test in the TEST_CATEGORIES constant in the run_iasworld_data_tests script
  - Make sure that your generic test supports the additional_select_columns parameter that most of our generic tests support, making use of the format_additional_select_columns macro to format the parameter when applying it to your SELECT condition
Data integrity tests check hard-and-fast assumptions about all of our data outside of iasWorld. This includes data that come from external sources like seeds or third-party data providers, as well as the views that we define as transformations on top of iasWorld data sources to clean them up and join them to non-iasWorld data.
We implement data integrity tests using dbt
tests. As with our
iasWorld tests, we prefer adding data tests inline in
schema.yml
config files using generic
tests,
rather than singular
tests.
Some examples of data integrity tests include:
- Check that the default.vw_card_res_char view is unique by the columns pin, year, and card (default_vw_card_res_char_unique_by_card_pin_and_year)
- Check that our transformations of condo characteristic data in the default.vw_pin_condo_char view don't ever produce a null value for card (default_vw_pin_condo_char_card_not_null)
- Check that GEOIDs are the correct length in the location.vw_pin10_location view (location_vw_pin10_location_7_digit_ids_are_correct_length)
There are two different times in which data integrity tests run automatically:
- In the build-and-test-dbt GitHub workflow as part of our CI suite when a test's model changes, or when the test itself changes
- Once per week in the test-dbt-models GitHub workflow, to proactively warn the team about any data problems that we need to fix
You can also run the tests locally using the select_data_test_non_iasworld
selector:
dbt test --selector select_data_test_non_iasworld
In contrast to iasWorld tests, data integrity tests do not require any special tags or attributes because we do not process their results in a structured fashion.
See Choosing a generic test for your data test for help with choosing a generic test.
Unit tests help ensure that the transformations we apply on top of our raw data do not introduce errors. We use dbt's unit testing feature to implement these tests. Unit tests currently do not require any special tags or attributes, although this may change in the future as we build out a more extensive suite of tests.
Unit tests run during the build-and-test-dbt
GitHub workflow as part of
our CI suite whenever a PR or commit to the main branch adds or modifies a
model or unit test. As such, unit tests will only run automatically in cases
where it's possible that a code change might accidentally violate the
assumptions of the test.
Run the unit tests locally using the unit_test
resource type:
dbt test --select resource_type:unit_test
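For orientation, a dbt unit test is defined in YAML alongside the model it exercises. The sketch below shows the general shape only; the input source, columns, and values are hypothetical, and a real test must mock every ref() and source() the model uses:

unit_tests:
  - name: default_vw_pin_appeal_strips_non_alphanumeric_class
    model: default.vw_pin_appeal
    given:
      # Hypothetical input; list one entry per ref() or source() in the model
      - input: source('iasworld', 'htpar')
        rows:
          - {parid: "01011000010000", taxyr: "2024", class: "2-99"}
    expect:
      rows:
        - {parid: "01011000010000", taxyr: "2024", class: "299"}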
QC reports help us investigate suspicious data that might indicate a problem, but
that can't be confirmed automatically. We implement QC reports using dedicated
dbt models that are configured with attributes that can be parsed by the
export_models
script and other scripts that
build on it for specific workflows, like the export_qc_town_close_reports
script.
We run QC reports when Valuations staff ask for them, which most often occurs before a major event in the Valuations calendar like the close of a township.
The export_models
script is the foundation for
our QC reports. The script expects certain Python requirements, which can be installed
by running uv pip install .[dbt_tests]
in a virtual
environment.
The script exposes a few options that help to export the right data:
- --select (required unless --selector is set): This option can control which models the script will export, along with the --selector option. This option is equivalent to the dbt --select option, and any valid dbt --select expression will work for this option.
- --selector (required unless --select is set): This is an alternate way to control which models the script will export using a dbt selector. This option is equivalent to the dbt --selector option. One of the --select or --selector options must be set in order to run the script, but both cannot be set or else the script will raise an error.
- --where (optional): This option controls which rows the script will return for the selected model in a similar fashion as a SQL WHERE clause. Any expression that could follow a WHERE keyword in a SQL filter condition will work for this option.
- --rebuild or --no-rebuild (optional): This flag determines whether or not the script will rebuild the selected models using dbt run prior to export. It defaults to false (--no-rebuild) and is most useful in rare cases where the underlying models that comprise the reports have been edited since the last run, typically during the period when a QC report is under active development.
- --target (optional): The name of the dbt target to run queries against. Defaults to dev. You should never have to set this option when running the script locally.
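Putting these options together, a typical invocation looks like this (the tag comes from the QC report tagging conventions described below, and the year and township code are example values):

# Export every model tagged as a town close QC report, filtered to one town and year
python3 scripts/export_models.py \
  --select tag:qc_report_town_close \
  --where "taxyr = '2024' and township_code = '77'"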
Other scripts that build on export_models
for specific workflows tend to
expose similar options. See the documentation for these workflows below for
more details.
We run town close reports using the scripts/export_qc_town_close_reports.py
script, which builds on top of
export_models
. As such, export_qc_town_close_reports
expects the same set
of Python requirements as export_models
, which can be installed in a virtual
environment by running uv pip install .[dbt_tests]
.
The script exposes the following options, many of which are the same as
export_models
:
- --township (optional): One or more space-separated township codes to use for filtering results. If you omit this parameter, the script will default to exporting reports for all towns.
- --year (optional): The year to use for filtering results. Defaults to the current year.
- --target (optional): The name of the dbt target to run queries against. Defaults to dev. You should never have to set this option when running the script locally.
- --rebuild or --no-rebuild (optional): This flag determines whether or not the script will rebuild the selected models using dbt run prior to export. It defaults to false (--no-rebuild) and is most useful in rare cases where the underlying models that comprise the reports have been edited since the last run, typically during the period when a QC report is under active development.
- --print-table-refresh-command (optional): Instructs the script to print a command that can be run on the server to refresh underlying iasWorld tables. Will not export any reports when set. Useful if you want to refresh iasWorld table data before running exports. See Refreshing iasWorld tables prior to running town close QC reports for more details.
- --output-dir (optional): The Unix-formatted path to the directory where the script will store output files. Defaults to ./export/output/.
Assuming a township code defined by $TOWNSHIP_CODE
and a tax year defined by
$TAXYR
, the following command will generate town close reports for the township/year combo:
python3 scripts/export_qc_town_close_reports.py --township "$TOWNSHIP_CODE" --year "$TAXYR"
You can omit the --year
flag and the script will default to the current year of data:
python3 scripts/export_qc_town_close_reports.py --township "$TOWNSHIP_CODE"
Omit all options to generate reports for all towns in the current year:
python3 scripts/export_qc_town_close_reports.py
In both cases, the script will output the reports to the dbt/export/output/
directory, and will print the names of the reports that it exports during execution.
The queries that generate town close reports run against our data warehouse, which
ingests data from iasWorld overnight once daily. Sometimes a Valuations staff member
will request a report during the middle of the workday, and they will need the most
recent data, which will not exist in our warehouse yet. In these cases, you can use
the --print-table-refresh-command
flag to output a command that you can run on the server to
refresh any iasWorld tables that the town close reports rely on. Note that when
you pass --print-table-refresh-command
to the script, it will not export any reports, and
will instead exit immediately after printing the refresh command.
The following command will print a refresh command that can be run on the server
to refresh the iasWorld tables that comprise our town close reports for the
township with code $TOWNSHIP_CODE
and the current year of data:
python3 scripts/export_qc_town_close_reports.py --township "$TOWNSHIP_CODE" --print-table-refresh-command
You should see output like this, which you can run in the context of the
service-spark-iasworld
repository on the server in order to refresh iasWorld tables:
Run the following commands on the Data Team server as the shiny-server user:
cd /path/to/service-spark-iasworld/
docker compose --profile prod up -d
docker exec spark-node-master-prod ./submit.sh
--upload-data --upload-logs --run-glue-crawler
--json-string {"aprval": {"table_name": "iasworld.aprval", "min_year": 2024, "cur": ["Y"], ...
We define the AHSAP change in value QC report using one model, qc.vw_change_in_ahsap_values
,
which we filter for a specific township name (like "Hyde Park") and tax year during export.
Here's an example of how to export that model for a township name defined by $TOWNSHIP_NAME
and a tax year defined by $TAXYR
:
python3 scripts/export_models.py --select qc.vw_change_in_ahsap_values --where "taxyr = '$TAXYR' and township_name = '$TOWNSHIP_NAME'"
The script will output the reports to the dbt/export/output/
directory, and will print the
name of the report that it exports during execution.
Since QC reports are built on top of models, adding a new QC report can be as simple
as adding a new model and exporting it using the export_models
script.
You should default to adding your model to the qc
schema and subdirectory, unless there is
a good reason to define it elsewhere. For details on how to add a model, see
➕ How to add a new model.
There are a number of configuration options that allow you to control the format of your model during export:
- Tagging: Model tags are not required for QC reports, but they are helpful in cases where we need to export more than one report at a time. For example, Valuations typically requests all of the town close QC reports at the same time, so we tag each model with the qc_report_town_close tag such that we can select them all at once when running the export_models script. For consistency, prefer tags that start with the qc_report_* prefix.
- Filtering: Since the export_models script can filter your model using the --where option, you should define your model such that it selects any fields that you want to use for filtering in the SELECT clause. It's common to filter reports by taxyr and one of either township_name or township_code.
- Formatting: You can set a few different optional configs on the meta attribute of your model's schema definition in order to control the format of the output workbook:
  - meta.export_name: The base name that the script will use for the output file, not including the file extension. The script will output the file to dbt/export/output/{meta.export_name}.xlsx. If unset, defaults to the name of the model.
  - meta.export_template: The base name for an Excel file that the script will use as a template to populate with data, not including the file extension. The script will read this file from dbt/export/templates/{meta.export_template}.xlsx. Templates are useful if you want to apply custom headers, column widths, or other column formatting to the output that are not otherwise configurable by the meta.export_format config attribute described below. If unset, the script will search for a template with the same name as the model; if it does not find a template, it will default to a simple layout with filterable columns and striped rows.
  - meta.export_format: An object with the following schema that controls the format of the output workbook:
    - columns (required): A list of one or more columns to format, each of which should be an object with the following schema:
      - index (required): The letter index of the column to be formatted, like A or AB.
      - name (optional): The name of the column as it appears in the header of the workbook. The script does not use this attribute and instead uses index, but we set it in order to make the column config object more readable.
      - horizontal_align (optional): The horizontal alignment to set on the column, one of left or right.
      - number_format (optional): The number format to apply to the column. See the openpyxl source code for a list of options.
Here's an example of a model schema definition that sets all of the different optional and required formatting options for a new QC report:
models:
  - name: qc.vw_qc_report_new
    description: '{{ doc("view_vw_qc_report_new") }}'
    config:
      tags:
        - qc_report_new
      meta:
        export_name: QC Report (New)
        export_template: qc_report_new
        export_format:
          columns:
            - index: B
              name: Percent Change
              horizontal_align: right
              number_format: "0.00%"
In the case of this model, the export_models
script:
- Will export the model if either --select qc.vw_qc_report_new or --select tag:qc_report_new is set
- Will use the template dbt/export/templates/qc_report_new.xlsx to populate data
- Will export the output workbook to dbt/export/output/QC Report (New).xlsx
- Will right-align column B, a column with the name Percent Change
- Will format column B as a percentage with two decimal places
Most of our dbt tests are simple SQL statements that we run against our models in order to confirm that models conform to spec. If a test is failing, you can run or edit the underlying query in order to investigate the failure and determine whether the root cause is a code change we made, new data that was pushed to the system of record, or a misunderstanding about the data specification.
To edit or run the query underlying a test, first run the test in isolation:
dbt test --select <test_name>
Then, navigate to the Recent
queries
tab in Athena. Your test will likely be one of the most recent queries; it
will also start with the string -- /* {"app": "dbt", ...
, which can be
helpful for spotting it in the list of recent queries.
Open the query in the Athena query editor, and edit or run it as necessary to debug the test failure.
To quickly rule out a failure related to a code change, you can switch to the
main branch of this repository (or to an earlier commit where we know tests
passed, if tests are failing on the main branch) and rerun the test against prod
using the --target prod
option. If the test continues to fail in the same
fashion, then we can be confident that the root cause is the data and not the
code change.
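For example, to rerun a failing test against prod from a known-good commit:

# Switch to the main branch (or check out an earlier known-good commit)
git switch main

# Rerun the failing test against the production target
dbt test --select default_vw_pin_universe_unique_by_14_digit_pin_and_year --target prod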
The cleanup-dbt-resources
workflow removes all AWS resources that were created
by GitHub Actions for a pull request once that pull request has been merged into
the main branch of the repository. On rare occasions, this workflow might fail
due to changes to our dbt setup that invalidate the assumptions of the workflow.
There are two ways to clean up a PR's resources manually:
- Using the AWS console: Log in to the AWS console and navigate to the Athena homepage. Select Data sources in the sidebar. Click on the AwsDataCatalog resource. In the Associated databases table, select each data source that matches the database pattern for your pull request (i.e. prefixed with z_ci_ plus the name of your branch) and click the Delete button in the top right-hand corner of the table.
- Using the command line: If the workflow has failed, it most likely means there is a bug in the .github/scripts/cleanup_dbt_resources.sh script (source code). Once you've identified and fixed the bug, confirm it works by running the following command to clean up the resources created by the pull request:
HEAD_REF=$(git branch --show-current) ../.github/scripts/cleanup_dbt_resources.sh ci
If you get this error:
Compilation Error
dbt found two schema.yml entries for the same resource named location.vw_pin10_location. Resources and their associated columns may only be described a single time. To fix this, remove one of the resource entries for location.vw_pin10_location in this file:
- models/location/schema.yml
It usually means that dbt's state has unresolvable conflicts with the current
state of your working directory. To resolve this, run dbt clean
to clear your
dbt state, reinstall dbt dependencies with dbt deps
, and then try rerunning
the command that raised the error.
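In practice, that recovery sequence looks like:

# Clear dbt's cached state (by default, the target/ and dbt_packages/ directories)
dbt clean

# Reinstall dbt package dependencies
dbt deps

# Then rerun the command that raised the error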
When attempting to build models, you may occasionally run into the following error indicating that a model that your selected model depends on does not exist:
Runtime Error in model default.vw_card_res_char (models/default/default.vw_card_res_char.sql)
line 24:10: Table 'awsdatacatalog.z_dev_jecochr_default.vw_pin_land' does not exist
The error may look like this if an entire schema is missing:
Runtime Error in model default.vw_pin_universe (models/default/default.vw_pin_universe.sql)
line 130:11: Schema 'z_dev_jecochr_location' does not exist
To resolve this error, you can prefix your selected model names with a plus sign (+) to instruct dbt to (re)build their dependency models as well. However,
note that this will rebuild all dependency models, even ones that already
exist in your development environment, so if your model depends on another model
that is compute-intensive (basically, anything in the location
or proximity
schemas) you should use the --exclude
option to exclude
these compute-intensive models from being rebuilt:
dbt build --select +model.vw_pin_shared_input --exclude location.* proximity.* --resource-types model seed
If you'd like to know what dbt is doing under the hood, you can use the --log-level
parameter to enable debug
logging when running dbt commands:
dbt --log-level debug build --select model.vw_pin_shared_input