The oldest trademark ever registered in Canada, according to public records from the Canadian Intellectual Property Office (CIPO), is the word trademark 'IMPERIAL,' owned by Unilever Canada Inc. and registered on July 29, 1865, for a brand of soap. On a quarterly basis, the CIPO releases trademark researcher datasets containing records of trademark applications and registrations.
This project explores a subset of these datasets by processing trademark data from 1865 to 2023 and visualizing them as:
- Registered trademarks by category over decades and years.
- Trademark applications and registrations over decades and years.
- Trademarks registered versus disputed by ranking and interested party.
This batch ELT pipeline comprises several Google Cloud Platform (GCP) services for ingestion, transformation, and serving. The pipeline, orchestrated by Cloud Composer 2 (managed Airflow), uses Airflow Datasets in each DAG to enable fully automated, data-aware scheduling. The source CSV files, containing 13 million records, are automatically downloaded to the Cloud Storage data lake for processing. Dataflow decompresses the files, then Spark jobs on Dataproc apply a schema and convert each file to Parquet before it is loaded into the data warehouse. The BigQuery tables are partitioned and clustered before transformations are applied in dbt. With Cosmos, the entire dbt project is encapsulated in an Airflow task group, making the dbt data lineage graph visible in the Airflow UI and providing finer-grained control over model materialization. The visualizations in Looker Studio feature cross-filtering, data drilling (on decade/year), and controls for interactive analysis.
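To make the data-aware scheduling concrete, here is a minimal sketch of how one Airflow DAG can publish a Dataset that a downstream DAG consumes as its schedule. The bucket URI, DAG names, and task bodies are hypothetical placeholders, not code from this repo:

```python
from airflow.datasets import Dataset
from airflow.decorators import dag, task
import pendulum

# Hypothetical Dataset URI for the Parquet staging area in the data lake
PARQUET_STAGE = Dataset("gs://example-trademarks-lake/parquet/")

@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
def convert_to_parquet():
    # Declaring the Dataset as an outlet marks it "updated" when this task succeeds
    @task(outlets=[PARQUET_STAGE])
    def run_spark_jobs():
        ...  # e.g. submit Dataproc jobs that write the Parquet files

    run_spark_jobs()

# schedule=[...] makes this DAG run whenever the upstream Dataset is updated,
# so no cron expression or manual trigger is needed
@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=[PARQUET_STAGE], catchup=False)
def load_to_bigquery():
    @task
    def load_tables():
        ...  # load the Parquet files into partitioned, clustered BigQuery tables

    load_tables()

convert_to_parquet()
load_to_bigquery()
```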
- Cloud Composer 2.6.6 (Airflow 2.7.3)
- Dataproc (managed Spark)
- BigQuery
- Dataflow
- Cloud Storage
- dbt-core
- Cosmos
- Looker Studio
*dbt lineage graph*

*dbt lineage graph as an Airflow task group*
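The task group in the second graph is produced by Cosmos. Below is a minimal sketch of wrapping a dbt project in a `DbtTaskGroup`; the project path, profile name, and profiles.yml location are hypothetical, and the actual Cosmos configuration in this repo will differ:

```python
import pendulum
from airflow.decorators import dag
from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig

profile_config = ProfileConfig(
    profile_name="trademarks",  # hypothetical dbt profile
    target_name="prod",
    profiles_yml_filepath="/home/airflow/gcs/data/profiles.yml",  # hypothetical path
)

@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
def dbt_models():
    # Cosmos parses the dbt project and renders each model (and its tests)
    # as an individual Airflow task inside this group, mirroring the dbt DAG
    DbtTaskGroup(
        group_id="dbt",
        project_config=ProjectConfig("/home/airflow/gcs/dags/dbt/trademarks"),  # hypothetical path
        profile_config=profile_config,
    )

dbt_models()
```

Because every model becomes its own task, a failed model can be retried on its own instead of rerunning the entire dbt build.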
| File | Description | Records | Source |
|---|---|---|---|
| TM_application_main_2024-03-06.csv | Contains basic information about the trademark application filed, including the primary key (Application Number). | 1,971,623 | Download |
| TM_interested_party_2024-03-06.csv | Contains detailed information about the interested parties (Applicant, Registrant, Agent, etc.). | 4,604,423 | Download |
| TM_cipo_classification_2024-03-06.csv | Contains the Nice Classifications of the trademark. | 6,262,267 | Download |
| TM_opposition_case_2024-03-06.csv | Contains information on the opposition case, including details of the plaintiff and defendant. | 40,216 | Download |
| cipo_status_codes.csv | Mapping of CIPO status code IDs and descriptions. | 42 | Link |
| wipo_status_codes.csv | Mapping of WIPO (World Intellectual Property Organization) status code IDs and descriptions. | 17 | Link |
| party_type_codes.csv | Mapping of party type code IDs and descriptions. | 7 | Link |
| nice_classification_codes.csv | Mapping of Nice classification of goods and services IDs and descriptions. | 46 | Link |
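For context on how these files move through the pipeline: each CSV gets an explicit schema in Spark and is rewritten as Parquet, then loaded into a partitioned, clustered BigQuery table. The sketch below shows the idea for the application file; the bucket, table ID, partition/cluster columns, and column subset are hypothetical, not the repo's actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType
from google.cloud import bigquery

# --- Spark job (runs on Dataproc): apply a schema and convert CSV to Parquet ---
spark = SparkSession.builder.appName("tm-application-to-parquet").getOrCreate()

# Hypothetical subset of TM_application_main columns
schema = StructType([
    StructField("application_number", StringType(), nullable=False),
    StructField("filing_date", DateType(), nullable=True),
    StructField("registration_date", DateType(), nullable=True),
    StructField("cipo_status_code", StringType(), nullable=True),
])

(spark.read.option("header", True).schema(schema)
    .csv("gs://example-trademarks-lake/raw/TM_application_main_2024-03-06.csv")
    .write.mode("overwrite")
    .parquet("gs://example-trademarks-lake/parquet/tm_application_main/"))

# --- Load step: create a partitioned, clustered BigQuery table from the Parquet ---
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(field="registration_date"),
    clustering_fields=["cipo_status_code"],
)
client.load_table_from_uri(
    "gs://example-trademarks-lake/parquet/tm_application_main/*.parquet",
    "example-project.trademarks.tm_application_main",  # hypothetical table ID
    job_config=job_config,
).result()
```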
> [!NOTE]
> These instructions have only been tested on macOS. YMMV on other platforms.
- Have an active GCP account with billing enabled.
- You've installed Terraform.
- gcloud CLI is installed.
  ```sh
  # macOS install using Homebrew
  brew install --cask google-cloud-sdk
  ```
- You have GNU Make 3.81 or newer installed.
- Decide on a project name and set it as the `$PROJECT_ID` environment variable.
  ```sh
  export PROJECT_ID={{YOUR_PROJECT_NAME}}
  ```
- Set the `$GCP_EMAIL` environment variable to the email associated with your active GCP account.
  ```sh
  export GCP_EMAIL={{YOUR_GCP_EMAIL}}
  ```
- Set the `$BILLING_ACCOUNT_ID` environment variable to the value of the `ACCOUNT_ID` returned by `gcloud billing accounts list` that you wish to link to this Google Cloud project.
  ```sh
  export BILLING_ACCOUNT_ID={{ACCOUNT_ID}}
  ```
- Set `$GCP_REGION` to your desired GCP region.
  ```sh
  export GCP_REGION={{REGION}}
  ```
Run the following commands from the root project directory.
- Verify the environment variables are correctly set.
  ```sh
  make env-test
  ```
- Initialize a new GCP project and service account.
  ```sh
  make gcp-up
  ```
- Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the absolute path of the service account key. Terraform will need this to authenticate.
  ```sh
  export GOOGLE_APPLICATION_CREDENTIALS={{full_path_to_keyfile}}
  ```
  On macOS:
  ```sh
  export GOOGLE_APPLICATION_CREDENTIALS=$(realpath keys/owner-sa-key.json)
  ```
- Enable all the Google APIs required by the project.
  ```sh
  make enable-gcp-services
  ```
- Provision infrastructure. Type `yes` to approve actions. This step can take 40+ minutes to complete.
  ```sh
  make -f tf.Makefile up
  ```
  If this step fails, you can try running:
  ```sh
  make -f tf.Makefile retry
  ```
- Complete dbt-core setup.
  ```sh
  make dbt-setup
  ```
- Navigate to your Cloud Composer environments in the Google Cloud console.
- In the Airflow webserver column, follow the link to access the Airflow UI.
- To initialize the data pipeline, start the `upload_raw_trademark_files_to_gcs` DAG by activating the ▶️ (Trigger DAG) button under the Actions column. Subsequent DAGs are automatically triggered upon successful completion of upstream DAGs. Execution of the entire pipeline can take over 20 minutes.
- Progress of each DAG can be monitored from the Airflow UI.
> [!IMPORTANT]
> The `dbt` DAG is a task group composed of 20 tasks and dependencies. Depending on resource availability, you may need to rerun some tasks manually.
- Deprovision project-related infrastructure. Type `yes` to approve actions.
  ```sh
  make -f tf.Makefile down
  ```
- Delete the GCP project. Type `Y` to confirm.
  ```sh
  make gcp-down
  ```