• In this project, our objective is to build a data pipeline for a multi-model database: JSON documents and a graph, stored in Azure Cosmos DB through its SQL and Gremlin APIs.
• We use the Olist Brazilian E-Commerce dataset from Kaggle to build and exercise the pipeline.
• The first step in our data pipeline extracts data from the CSV files, performs data cleansing, and loads the data into Microsoft SQL Server using Talend Open Studio; an illustrative sketch of this step follows below.
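The Talend job itself is built in the Studio GUI, so there is no source code to show; as a rough illustration only, the same extract-cleanse-load step could be sketched in Python with pandas and SQLAlchemy. The connection string, table name, and cleansing rules below are placeholders, not the project's actual configuration (the CSV file name matches the Kaggle dataset).

```python
# Illustrative Python equivalent of the Talend extract/cleanse/load step.
# Connection string, table name, and cleansing rules are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@localhost/olist_staging"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Extract: read one of the Olist CSV files.
orders = pd.read_csv("olist_orders_dataset.csv")

# Cleanse: drop duplicates, parse timestamps, remove rows missing keys.
orders = orders.drop_duplicates(subset="order_id")
orders["order_purchase_timestamp"] = pd.to_datetime(
    orders["order_purchase_timestamp"], errors="coerce"
)
orders = orders.dropna(subset=["order_id", "customer_id"])

# Load: replace the staging table in SQL Server.
orders.to_sql("stg_orders", engine, if_exists="replace", index=False)
```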
• In the next step, Apache Airflow orchestrates the pipeline: Python tasks export the staged data as JSON and load it into Azure Blob Storage containers. Once the JSON files are created in the Blob container, an Azure Data Factory pipeline loads the data into Cosmos DB.
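The upload half of that task can be sketched with the azure-storage-blob SDK. This is a minimal sketch: the connection string, container name, blob name, and sample record are placeholder assumptions.

```python
# Minimal sketch of uploading exported JSON to Azure Blob Storage.
# Connection string, container, blob name, and record are placeholders.
import json
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("olist-json")

# Rows previously exported from the SQL Server staging tables.
records = [{"id": "o1", "status": "delivered"}]

blob = container.get_blob_client("orders.json")
blob.upload_blob(json.dumps(records), overwrite=True)
```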
• The data in Cosmos DB is then used for visualization and reporting with Power BI and for graph exploration in a Gremlin IDE.
• We have scheduled our Talend workflow jobs with Windows Task Scheduler to run daily at 14:00 EST. This kicks off the full ETL and staging workflow and loads data into the local staging SQL Server database.
• Next, we have set up a DAG in Apache Airflow to schedule the pipeline run, which exports the staged data from SQL Server as JSON (shaped for the Cosmos DB SQL API) and loads it into the Azure Blob Storage containers daily at 15:00 EST; a sketch of the DAG follows below.
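A minimal Airflow 2.x DAG for this schedule might look like the following. The DAG id, task id, and callable name are illustrative assumptions; only the daily 15:00 Eastern schedule comes from the project description.

```python
# Minimal Airflow DAG sketch for the daily 15:00 EST export.
# DAG id, task id, and callable are illustrative assumptions.
from datetime import datetime

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

def export_sql_to_blob():
    """Query SQL Server, serialize rows to JSON, upload to Blob Storage."""
    ...  # see the upload sketch above

with DAG(
    dag_id="olist_sql_to_blob",
    start_date=datetime(2023, 1, 1, tzinfo=pendulum.timezone("America/New_York")),
    schedule_interval="0 15 * * *",  # daily at 15:00 America/New_York
    catchup=False,
) as dag:
    PythonOperator(
        task_id="export_sql_to_blob",
        python_callable=export_sql_to_blob,
    )
```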
• As soon as a JSON file is created in the Azure Blob container, a storage event trigger starts the Azure Data Factory pipeline automatically, which loads the JSON data into Azure Cosmos DB using the copy activity's 'upsert' write behavior; this handles the delta load and inserts or updates documents accordingly.
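The upsert behavior is configured on the Data Factory copy activity's Cosmos DB sink; per document, its effect matches calling upsert through the Cosmos DB Python SDK, sketched below. The account URL, key, database, container, and sample document are placeholders.

```python
# Sketch of Cosmos DB upsert semantics (what the ADF sink applies per document).
# Account URL, key, database, container, and document are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("olist").get_container_client("orders")

doc = {"id": "o1", "status": "shipped"}
# upsert_item inserts the document if its id is new, otherwise replaces it,
# which is what makes repeated daily loads idempotent.
container.upsert_item(doc)
```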
• Also, a Gremlin data load script is scheduled with Windows Task Scheduler to run daily at 15:00 EST, loading the reporting data into the Azure Cosmos DB Gremlin API graph database.
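Such a load script would talk to the Gremlin API through the gremlinpython driver. A minimal sketch follows; the endpoint, database and graph names, key, and the sample vertex are placeholder assumptions.

```python
# Minimal sketch of loading a vertex into Cosmos DB's Gremlin API.
# Endpoint, database/graph names, key, and vertex data are placeholders.
from gremlin_python.driver import client, serializer

g = client.Client(
    "wss://<account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Cosmos DB requires a partition key property on every vertex (here 'pk').
query = (
    "g.addV('product')"
    ".property('id', 'p1')"
    ".property('pk', 'p1')"
    ".property('category', 'electronics')"
)
g.submit(query).all().result()
g.close()
```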
Tools and technologies used:
• Microsoft SQL Server
• Talend Open Studio for DI
• Microsoft Azure (Blob Storage, Data Factory, Cosmos DB for SQL & Gremlin APIs)
• Apache Airflow
• Power BI
• Gremlin Graph Database IDE