A configurable version of the built-in kedro catalog create CLI. Default dataset types can be configured in the project's settings.py, so that generated catalog entries use these types rather than MemoryDataSets.
pip install kedro-auto-catalog
Configure the project defaults in src/<project_name>/settings.py with this dict.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}
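As a variation on the dict above, a project that prefers CSV files might change the two default keys. This is only a sketch: `pandas.CSVDataSet` is a standard Kedro dataset type, but the assumption that the plugin treats it the same way as the Parquet default is not confirmed by this README.

```python
# src/<project_name>/settings.py
# Hypothetical variation on the example configuration: CSV instead of Parquet.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "csv",            # file suffix for generated filepaths
    "default_type": "pandas.CSVDataSet",   # dataset type for generated entries
}
```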
To auto-create catalog entries for the __default__ pipeline, run this from the command line.
kedro auto-catalog -p __default__
If you want a reminder of what to do, use the --help flag.
❯ kedro auto-catalog --help
Usage: kedro auto-catalog [OPTIONS]
Create Data Catalog YAML configuration with missing datasets.
Add configurable datasets to Data Catalog YAML configuration file for each
dataset in a registered pipeline if it is missing from the `DataCatalog`.
The catalog configuration will be saved to
`<conf_source>/<env>/catalog/<pipeline_name>.yml` file.
Configure the project defaults in `src/<project_name>/settings.py` with this
dict.
Options:
  -e, --env TEXT       Environment to create Data Catalog YAML file in.
                       Defaults to `base`.
  -p, --pipeline TEXT  Name of a pipeline.  [required]
  -h, --help           Show this message and exit.
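The help text says entries are only added for datasets that are missing from the `DataCatalog`. A minimal sketch of that selection step, using plain lists instead of Kedro objects, might look like the following. The helper name `missing_datasets` is hypothetical, and skipping parameter inputs (`parameters` and `params:...`) is an assumption about the plugin's behavior, not something the help text states.

```python
def missing_datasets(pipeline_datasets, catalog_datasets):
    """Hypothetical helper: names a pipeline uses that the catalog lacks."""
    candidates = set(pipeline_datasets) - set(catalog_datasets)
    # Assumption: parameter inputs are not datasets and get no catalog entry.
    return sorted(
        name
        for name in candidates
        if name != "parameters" and not name.startswith("params:")
    )


used = ["companies", "X_train", "y_train", "params:model_options", "parameters"]
already_in_catalog = ["companies"]
print(missing_datasets(used, already_in_catalog))  # ['X_train', 'y_train']
```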
Using the kedro-spaceflights example, running kedro auto-catalog -p __default__ yields the following catalog in conf/base/catalog/__default__.yml
X_test:
  filepath: data/X_test.pq
  type: pandas.ParquetDataSet
X_train:
  filepath: data/X_train.pq
  type: pandas.ParquetDataSet
y_test:
  filepath: data/y_test.parquet
  type: pandas.ParquetDataSet
y_train:
  filepath: data/y_train.parquet
  type: pandas.ParquetDataSet
With the example configuration, where "subdirs" and "layers" are both ["raw", "intermediate", "primary"], any leading subdir or layer name in a dataset name is converted into a directory and a layer. For example, renaming y_test to raw_y_test places y_test.parquet in the data/raw directory and assigns it to the raw layer.
raw_y_test:
  filepath: data/raw/y_test.parquet
  layer: raw
  type: pandas.ParquetDataSet
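The prefix handling described above can be sketched in a few lines of Python. This is an illustration of the naming rule only, not the plugin's actual implementation; the helper name `build_entry` is hypothetical, and treating everything before the first underscore as the candidate prefix is an assumption.

```python
from pathlib import PurePosixPath

AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}


def build_entry(name: str, config: dict) -> dict:
    """Hypothetical helper: map a dataset name to a catalog entry."""
    # Assumption: the candidate prefix is the text before the first underscore.
    prefix, _, rest = name.partition("_")
    path = PurePosixPath(config["directory"])
    filename = name
    entry = {}
    if rest and prefix in config["subdirs"]:
        path = path / prefix   # leading subdir becomes a directory
        filename = rest        # and is dropped from the file name
    if rest and prefix in config["layers"]:
        entry["layer"] = prefix
        filename = rest
    entry["filepath"] = str(path / f"{filename}.{config['default_extension']}")
    entry["type"] = config["default_type"]
    return entry


print(build_entry("raw_y_test", AUTO_CATALOG))
# {'layer': 'raw', 'filepath': 'data/raw/y_test.parquet', 'type': 'pandas.ParquetDataSet'}
```

A name without a recognized prefix, such as `y_test`, stays in the top-level `data` directory and gets no `layer` key.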
kedro-auto-catalog is distributed under the terms of the MIT license.