Project Summary

This project uses Apache Airflow to schedule tasks that extract JSON data from S3 buckets, transform it to fit a star schema, and load it into a data warehouse on AWS Redshift. The data consists of songs and user activity on a music streaming app. User activity and song metadata are recorded in separate JSON files, which are kept in S3 buckets. Airflow schedules tasks to create tables, load data into staging tables, load data from the staging tables into the fact and dimension tables, and run rudimentary quality checks on the data. The DAG is shown below:
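
As a rough textual stand-in for the task ordering described above, here is a minimal sketch in plain Python (no Airflow required). The task names are hypothetical, and loading the fact table before the dimension tables is an assumption; the actual task IDs and dependencies are defined in dags/etl_dag.py and may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; the real task IDs live in dags/etl_dag.py.
# Each key maps to the set of tasks that must finish before it can start.
dependencies = {
    "create_tables": set(),
    "stage_events": {"create_tables"},
    "stage_songs": {"create_tables"},
    "load_fact_table": {"stage_events", "stage_songs"},
    # Assumption: dimensions are loaded after the fact table.
    "load_dimension_tables": {"load_fact_table"},
    "run_quality_checks": {"load_dimension_tables"},
}

# static_order() yields the tasks in an order that respects every dependency.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The two staging tasks have no dependency on each other, so a scheduler such as Airflow is free to run them in parallel.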

Execution

Clone the project and open the folder:

Open and run the Jupyter notebook create_cluster.ipynb to create a new Redshift cluster and connect to it. Alternatively, run python create_cluster.py in the terminal.
Cluster creation will take some time. Once it has been created and a connection established, the endpoint (HOST) and the Amazon Resource Name (ARN) will be printed. Copy these and fill out the relevant fields in the config file dwh.cfg, namely ARN and DWH_HOST.
Ensure that Docker is running and type docker-compose up in your terminal.
Once the Airflow webserver is running, type 0.0.0.0:8080 in your browser to access the Airflow web UI.
Create connections for accessing S3 and Redshift using the Admin dropdown. Name them 'aws-credentials' and 'redshift', respectively.
Turn on the DAG to initiate the ETL.
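
Before starting the containers, it can help to confirm that the ARN and DWH_HOST fields in dwh.cfg were actually filled in. The following is a minimal sketch using Python's standard configparser. The section names and the sample values are placeholders, not the real layout of dwh.cfg; only the key names ARN and DWH_HOST come from the steps above.

```python
import configparser

# Assumed layout of dwh.cfg: the section names and values are hypothetical
# placeholders; only the keys ARN and DWH_HOST are named in this README.
SAMPLE_CFG = """
[IAM_ROLE]
ARN = arn:aws:iam::123456789012:role/example-role

[CLUSTER]
DWH_HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com
"""

def find_missing(cfg_text, required=("ARN", "DWH_HOST")):
    """Return the required keys that are absent or empty in every section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case; configparser lowercases by default
    parser.read_string(cfg_text)
    filled = {
        key
        for section in parser.sections()
        for key, value in parser.items(section)
        if value.strip()
    }
    return [key for key in required if key not in filled]

print(find_missing(SAMPLE_CFG))
```

To check the real file, read its contents and pass them in, for example `find_missing(open("dwh.cfg").read())`. An empty list means both fields have values.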

Files

create_cluster.ipynb: Creates a Redshift cluster and connects to it. Alternatively, use create_cluster.py.
docker-compose.yml: Builds the required images.
dags/etl_dag.py: Contains the DAG and associated tasks.
sql/create_tables.sql: SQL queries for creating tables.
plugins/helpers/sql_queries.py: Helper class with SQL queries used for inserting data.
plugins/operators/: Contains files with custom operators for moving data from S3 to Redshift, loading the fact and dimension tables, and conducting data quality checks.
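
To illustrate what a rudimentary data quality check can look like, here is a minimal sketch that fails when a table has no rows. It is not the operator from plugins/operators/. The function name, table names, and columns are hypothetical, and an in-memory SQLite database stands in for Redshift so the example runs without a cluster.

```python
import sqlite3

def check_not_empty(conn, tables):
    """Raise ValueError if any listed table has zero rows; otherwise return row counts."""
    counts = {}
    for table in tables:
        # Table names come from a trusted, hard-coded list, so formatting
        # them into the query string is acceptable in this sketch.
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        if count == 0:
            raise ValueError(f"Data quality check failed: {table} is empty")
        counts[table] = count
    return counts

# Stand-in data: SQLite replaces Redshift, and the tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song_id TEXT, title TEXT)")
conn.execute("CREATE TABLE users (user_id INTEGER, level TEXT)")
conn.execute("INSERT INTO songs VALUES ('S1', 'Example Song')")

print(check_not_empty(conn, ["songs"]))
```

Calling `check_not_empty(conn, ["users"])` raises ValueError because that table is empty. In an Airflow operator, raising an exception in this way would mark the task as failed.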

sunnykan/airflow-data-pipeline