A batch data processing ELT pipeline for 158 years of Canadian trademark data.

Exploring 158 Years of Canadian Trademark Data

[dashboard preview]

View interactive dashboard

Overview

The oldest trademark ever registered in Canada, according to public records from the Canadian Intellectual Property Office (CIPO), is the word trademark 'IMPERIAL,' owned by Unilever Canada Inc. and registered on July 29, 1865, for a brand of soap. On a quarterly basis, the CIPO releases trademark researcher datasets containing records of trademark applications and registrations.

This project explores a subset of these datasets by processing trademark data from 1865 to 2023 and visualizing it as:

  • Registered trademarks by category over decades and years.
  • Trademark applications and registrations over decades and years.
  • Trademarks registered versus disputed, by ranking and interested party.

Data Stack

[architecture diagram]

This batch ELT pipeline comprises several Google Cloud Platform (GCP) services for ingestion, transformation, and serving. The pipeline, orchestrated by Cloud Composer 2 (managed Airflow), uses Airflow Datasets in each DAG to enable fully automated, data-aware scheduling (a minimal sketch follows below). The source CSV files, containing 13 million records, are automatically downloaded to the Cloud Storage data lake for processing. Dataflow decompresses the files, and Spark jobs on Dataproc apply a schema and convert each file to Parquet before it is loaded into the data warehouse. The BigQuery tables are partitioned and clustered before transformations are applied in dbt. With Cosmos, the entire dbt project is encapsulated in an Airflow task group, which surfaces the dbt data lineage graph in the Airflow UI and provides finer-grained control over model materialization. The visualizations in Looker Studio feature cross-filtering, data drilling (on decade/year), and controls for interactive analysis.
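
To make the data-aware scheduling concrete, here is a minimal sketch of one DAG publishing an Airflow Dataset that a downstream DAG consumes as its schedule. The dataset URI, DAG IDs, and task bodies are hypothetical placeholders, not this repo's actual code.

    # Minimal sketch of Airflow Dataset-driven scheduling (Airflow 2.7).
    # The dataset URI, DAG IDs, and task bodies are hypothetical.
    from datetime import datetime

    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    raw_files = Dataset("gs://example-trademarks-lake/raw/")

    @dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
    def upload_raw_trademark_files():
        @task(outlets=[raw_files])  # marks the Dataset as updated on success
        def upload():
            ...  # download the source CSVs into the data lake
        upload()

    @dag(start_date=datetime(2024, 1, 1), schedule=[raw_files], catchup=False)
    def process_raw_trademark_files():
        @task
        def decompress():
            ...  # runs automatically whenever raw_files is updated
        decompress()

    upload_raw_trademark_files()
    process_raw_trademark_files()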

  • Cloud Composer 2.6.6 (Airflow 2.7.3)
  • Dataproc (managed Spark)
  • BigQuery
  • Dataflow
  • Cloud Storage
  • dbt-core
  • Cosmos
  • Looker Studio
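
For the partitioning and clustering step, the following sketch shows one way to declare a partitioned, clustered BigQuery table with the google-cloud-bigquery client. The project, dataset, table, and column names are illustrative assumptions, not the pipeline's actual schema.

    # Sketch: declaring a partitioned, clustered BigQuery table.
    # Project, dataset, table, and column names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    table = bigquery.Table(
        "my-project.trademarks.application_main",
        schema=[
            bigquery.SchemaField("application_number", "STRING"),
            bigquery.SchemaField("nice_class", "INT64"),
            bigquery.SchemaField("registration_date", "DATE"),
        ],
    )
    # Partition by date and cluster by Nice class so queries that filter
    # on either column scan less data.
    table.time_partitioning = bigquery.TimePartitioning(field="registration_date")
    table.clustering_fields = ["nice_class"]

    client.create_table(table, exists_ok=True)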

Data Models

[dbt lineage graph]

[dbt lineage graph as Airflow task group]
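
The task group shown above is produced by Cosmos. A minimal sketch of embedding a dbt project as an Airflow task group follows; the paths, profile names, and DAG ID are assumptions rather than this repo's configuration.

    # Sketch: rendering a dbt project as an Airflow task group with Cosmos.
    # Paths, profile names, and the DAG ID are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig

    profile_config = ProfileConfig(
        profile_name="trademarks",
        target_name="prod",
        profiles_yml_filepath="/home/airflow/gcs/data/dbt/profiles.yml",
    )

    with DAG("dbt_transformations", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False):
        # Each dbt model becomes its own Airflow task, so the lineage graph
        # is visible in the UI and failed models can be rerun individually.
        DbtTaskGroup(
            group_id="dbt_models",
            project_config=ProjectConfig("/home/airflow/gcs/data/dbt/trademarks"),
            profile_config=profile_config,
        )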

Datasets

| File | Description | Records | Source |
| --- | --- | --- | --- |
| TM_application_main_2024-03-06.csv | Basic information about each trademark application filed, including the primary key (Application Number). | 1,971,623 | Download |
| TM_interested_party_2024-03-06.csv | Detailed information about the interested parties (Applicant, Registrant, Agent, etc.). | 4,604,423 | Download |
| TM_cipo_classification_2024-03-06.csv | The Nice classifications of each trademark. | 6,262,267 | Download |
| TM_opposition_case_2024-03-06.csv | Information on opposition cases, including details of the plaintiff and defendant. | 40,216 | Download |
| cipo_status_codes.csv | Mapping of CIPO status code IDs to descriptions. | 42 | Link |
| wipo_status_codes.csv | Mapping of WIPO (World Intellectual Property Organization) status code IDs to descriptions. | 17 | Link |
| party_type_codes.csv | Mapping of party type code IDs to descriptions. | 7 | Link |
| nice_classification_codes.csv | Mapping of Nice classification of goods and services IDs to descriptions. | 46 | Link |
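
For context on the Spark conversion step, here is a minimal PySpark sketch of reading one of these CSVs with an explicit schema and writing Parquet. Apart from the Application Number primary key, the column names and bucket paths are hypothetical.

    # Sketch: applying an explicit schema to a source CSV and writing Parquet.
    # Bucket paths and most column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import DateType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("tm_application_main").getOrCreate()

    schema = StructType([
        StructField("application_number", StringType(), nullable=False),
        StructField("filing_date", DateType(), nullable=True),
        StructField("cipo_status_code", StringType(), nullable=True),
    ])

    df = (
        spark.read
        .option("header", True)
        .schema(schema)
        .csv("gs://example-lake/raw/TM_application_main_2024-03-06.csv")
    )

    # Parquet keeps the schema and loads efficiently into BigQuery.
    df.write.mode("overwrite").parquet("gs://example-lake/parquet/application_main/")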

Instructions

Note

These instructions have only been tested on macOS. YMMV on other platforms.

✅ Before you begin

  1. You have an active GCP account with billing enabled.
  2. You have installed Terraform.
  3. You have installed the gcloud CLI.
    # macOS install using Homebrew
    brew install --cask google-cloud-sdk
  4. You have GNU Make 3.81 or newer installed.

🌱 Set environment variables

  1. Decide on a project name and set it as the $PROJECT_ID environment variable.
    export PROJECT_ID={{YOUR_PROJECT_NAME}}
  2. Set the $GCP_EMAIL environment variable to the email associated with your active GCP account.
    export GCP_EMAIL={{YOUR_GCP_EMAIL}}
  3. Set the $BILLING_ACCOUNT_ID environment variable to the ACCOUNT_ID value (as returned by gcloud billing accounts list) of the billing account you wish to link to this Google Cloud project.
    export BILLING_ACCOUNT_ID={{ACCOUNT_ID}}
  4. Set $GCP_REGION to your desired GCP region.
    export GCP_REGION={{REGION}}

🔧 Make install

Run the following commands from the root project directory.

  1. Verify the environment variables are correctly set.

    make env-test
  2. Initialize a new GCP project and service account.

    make gcp-up
  3. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the absolute path of the service account key. Terraform will need this to authenticate.

    export GOOGLE_APPLICATION_CREDENTIALS={{full_path_to_keyfile}}

    On macOS:

    export GOOGLE_APPLICATION_CREDENTIALS=$(realpath keys/owner-sa-key.json)
  4. Enable all the Google APIs required by the project.

    make enable-gcp-services
  5. Provision infrastructure. Type yes to approve actions. This step can take 40+ minutes to complete.

    make -f tf.Makefile up

    If this step fails, you can try running:

    make -f tf.Makefile retry
  6. Complete dbt-core setup.

    make dbt-setup

🚀 Initialize Airflow DAGs

  1. Navigate to your Cloud Composer environments in the Google Cloud console.
  2. In the Airflow webserver column, follow the link to access the Airflow UI.
  3. To initialize the data pipeline, start the upload_raw_trademark_files_to_gcs DAG by clicking the ▶️ (Trigger DAG) button in the Actions column. Downstream DAGs are triggered automatically upon successful completion of their upstream DAGs. Execution of the entire pipeline can take over 20 minutes.
  4. Progress of each DAG can be monitored from the Airflow UI.

[DAG runs]

Important

The dbt DAG is a task group composed of 20 tasks and their dependencies. Depending on resource availability, you may need to manually rerun some of them.

💥 Teardown

  1. Deprovision project-related infrastructure. Type yes to approve actions.
    make -f tf.Makefile down
  2. Delete the GCP project. Type Y to confirm.
    make gcp-down

See also