Skip to content

Latest commit

 

History

History
342 lines (258 loc) · 16.2 KB

README.md

File metadata and controls

342 lines (258 loc) · 16.2 KB

Generic tests

This directory stores the custom generic tests that we use to define our test suite.

Available generic tests

test_accepted_range

Asserts that a column's values fall inside an expected range. Any combination of min_value and max_value is allowed, and the range can be inclusive or exclusive.

Parameters:

  • min_value (optional number): Lower bound for the range. Defaults to no lower bound.
  • max_value (optional number): Upper bound for the range. Defaults to no upper bound.
  • inclusive (optional boolean): Whether the range is inclusive. Defaults to true.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_accepted_values

Asserts that a column's values are all present in a canonical list of values. The opposite of test_not_accepted_values.

Parameters:

  • values (required list of any value): Canonical list of allowed values.
  • quote (optional boolean): Whether to single-quote all elements of values, i.e. whether to convert them to strings. Defaults to true.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_column_is_subset_of_external_column

Asserts that a column is a subset of a column in an external relation. Rows that have no match in the external relation's column will be flagged as failures.

Parameters:

  • external_model (required string): The external relation to use for comparison. Use a ref() or source() call to specify this relation so that the DAG can understand the relationship.
  • external_column (required string): The name of the column on external_model to use for comparison.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_column_length

Asserts that all columns in a list of column_names have the correct length. Returns columns with the naming pattern len_<column_name> representing the length of each column for all rows where one of the columns has an incorrect length, e.g. if column_names = ["foo", "bar"] the test will return two additional columns named len_foo and len_bar.

Since this test operates on a list of column_names instead of a scalar column_name, it must be defined on the table level rather than on the column level.

Parameters:

  • column_names (required list of strings): The list of columns to check for proper length.
  • length (required integer): The length that the column values should be.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on and the autogenerated len_<column_name> columns will always be selected regardless of this value.

test_columns_match

Asserts that two or more columns in the same relation have the same value for each row.

Parameters:

  • matching_column_names (required list of strings): The list of columns to check for identical values.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on and the columns defined in matching_column_names will always be selected regardless of this value.

test_count_is_consistent

Asserts that the count of a given column is the same when grouped by another column, for example that the number of distinct township codes is the same across years. Returns the grouping column and a column called count with the count of rows for that group.

Parameters:

  • group_column (required string): The column to use for grouping.

test_expression_is_false

Asserts that a valid SQL expression is false for all rows. In other words, filters for rows where a given expression is true. Often useful for idiosyncratic comparisons across columns that are not easily generalized into generic tests.

Parameters:

  • expression (required string): A valid SQL expression to apply to the column or table.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_expression_is_true

Asserts that a valid SQL expression is true for all rows. In other words, filters for rows where a given expression is false. Often useful for idiosyncratic comparisons across columns that are not easily generalized into generic tests.

Parameters:

  • expression (required string): A valid SQL expression to apply to the column or table.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_is_null

Asserts that a column contains only null values. The opposite of test_not_null.

Parameters:

  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_no_extra_whitespace

Asserts that one or more string columns do not contain extraneous whitespace. Returns all of the columns that are configured in column_names.

Since this test operates on a list of column_names instead of a scalar column_name, it must be defined on the table level rather than on the column level.

Parameters:

  • column_names (required list of strings): The list of columns to check for extra whitespace.
  • allow_interior_space (optional boolean): If true, will only check for leading and trailing whitespace, and otherwise will also check for multiple consecutive spaces in the interior of the string. Defaults to false.

test_not_accepted_values

Asserts that there are no rows that match the given values. The opposite of test_accepted_values.

Parameters:

  • values (required list of any value): Canonical list of disallowed values.
  • quote (optional boolean): Whether to single-quote all elements of values, i.e. whether to convert them to strings. Defaults to true.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_not_null

Asserts that there are no null values present in a column. The opposite of test_is_null.

Parameters:

  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_relationships

Asserts that all of the records in a child table have a corresponding record in a parent table. This property is referred to as "referential integrity".

  • to (required string): The external relation to use for comparison. Use a ref() or source() call to specify this relation so that the DAG can understand the relationship.
  • field (required string): The name of the column on to to use for comparison.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on will always be selected regardless of this value.

test_res_class_matches_pardat

For all residential parcels in a given model, test that there is at least one class code that matches a class code for that parcel in iasworld.pardat.

The test filters for residential parcels by anti-joining the model against comdat using parid and taxyr; as a result, it filters out mixed-use parcels as well.

Parameters:

  • major_class_only (optional boolean): Compare only the first digit of classes. When set to false, compare the first three digits instead. Defaults to false.
  • parid_column_name (optional string): The name of the column on the base model that corresponds to pardat.parid, in case the model uses a different name scheme. Defaults to parid.
  • taxyr_column_name (optional string): The name of the column on the base model that corresponds to pardat.taxyr, in case the model uses a different name scheme. Defaults to taxyr.
  • join_type (optional string): The type of join to use when joining to pardat, e.g. "inner" or "left". Defaults to "left".
  • additional_pardat_filter (optional string): A SQL string representing additional conditions to apply in the WHERE clause of the subquery that selects from pardat to join to the model, e.g. "class != 'EX' AND class != 'RR'". Note that cur = 'Y' and deactivat IS NULL are already set prior to this parameter being applied, hence the "additional" in the param name.
  • additional_select_columns (optional list of dictionaries): Additional columns to select for failure output. The column the test is defined on and pardat.class will always be selected regardless of this value. Columns must be represented as dictionaries with the following attributes:
    • column (required string): The name of the column to select.
    • agg_func (required string): The aggregation function to use for aggregating column values, like max or array_agg. Necessary because results are automatically grouped by parid and taxyr.
    • alias (optional string): The name of the column to use for output. Necessary because aggregation functions as represented by agg_func require aliases in SQL. Defaults to <agg_func>_<column_name>.

test_row_count

Asserts that row counts for a model or column are above a certain value.

Parameters:

  • above (required integer): The minimum row count (inclusive) for the model or column.

test_row_values_match_after_join

Asserts that row values match after joining two tables.

Row values can be a subset of the values in the joined table, e.g. if a PIN in table A has one row with class = "212" and the same PIN in table B has two rows, one with class = "212" and one with class = "211", then table A passes the test.

Parameters:

  • external_model (required string): The name of the model to join to.
  • external_column_name (required string): The name of the column in external_model to join to.
  • join_condition (required string): The ON (or USING) portion of a JOIN clause, represented as a string. Note that in the case where ON is used, columns in the base model should be formatted like model.<column> while columns in the external model should be formatted like external_model.<column>, e.g. ON model.pin = external_model.parid. This is not necessary in the case of a USING expression, since USING does not need to refer to table names for the purposes of namespacing columns.
  • group_by (optional list of strings): The columns from the base model to pass to the GROUP BY function used in the test query. Unlike join_condition, these column names do not have to be prefixed with model.*, since they are assumed to come from the base model for the test and not the external model.
  • join_type (optional string): The type of join to use, e.g. "inner" or "left". Defaults to "inner".
  • column_alias (optional string): An alias to use when selecting the column from the base model for output. An alias is required in this case because the column must be aggregated. Defaults to "model_col".
  • external_column_alias (optional string): An alias to use when selecting the column from the external model for output. Defaults to "external_model_col".
  • additional_select_columns (optional list of dictionaries): Additional columns to select for failure output. model.<column_name>, external_model.<external_column_name>, and the columns specified in the group_by parameter will always be selected regardless of this value. Columns must be represented as dictionaries with the following attributes:
    • column (required string): The name of the column to select.
    • agg_func (required string): The aggregation function to use for aggregating column values, like max or array_agg. Necessary because results are automatically grouped by the columns specified in the group_by parameter.
    • alias (optional string): The name of the column to use for output. Necessary because aggregation functions as represented by agg_func require aliases in SQL. Defaults to <agg_func>_<column_name>.

test_sequential_values

Asserts that a column contains sequential values. Can be used for both numeric values and datetime values.

Parameters:

  • interval (optional integer): The expected gap in units between two sequential values. Defaults to 1.
  • datepart (optional string): When present, indicates that values are datetimes and describes the unit of dates that should be used by interval to establish expected gaps, e.g. "hour" or "day".
  • group_by_columns (optional list of strings): The group of columns to use for partitioning in the window function that is used to lag the base column. Defaults to an empty list.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. The column the test is defined on, the columns in group_by_columns, and value for the column in the preceding value in the sequence (aliased to previous_<column_name>) will always be selected regardless of this value.

test_unique_combination_of_columns

Asserts that the combination of columns always produces unique rows in a relation. For example, the combination of parid and taxyr might produce unique rows even though neither column is unique in isolation.

Since this test operates on a combination_of_columns list instead of a scalar column_name, it must be defined on the table level rather than on the column level.

Parameters:

  • combination_of_columns (required list of strings): One or more columns that are unique as a group.
  • allowed_duplicates (optional integer): The maximum number of duplicates that is considered acceptable for the purposes of uniqueness. Defaults to 1.
  • additional_select_columns (optional list of strings): Additional columns to select for failure output. Regardless of this value, the columns defined by combination_of_columns along with an automatically generated column num_duplicates will always be selected.

test_value_is_present

Asserts that a given expression returns a non-zero number of rows.

Since this test operates on an expression instead of a column_name, it must be defined on the table level rather than on the column level.

Parameters:

  • expression (required string): A valid SQL string representing the expression that should return a non-zero number of rows.