In this project, I build a relational database schema and an ETL pipeline for a music streaming app called Sparkify.
The starting point is two categories of data: song data and log data. Song data is the metadata on artists and songs, such as artist name, artist country, song title, and song duration. Log data is the event log of user activity coming from the app, such as which song a particular user listened to.
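To make the two inputs concrete, here is a rough sketch of one record of each type. The field names are assumptions for illustration and may not match the actual files exactly.

```python
# Hypothetical single records; the field names are assumptions for
# illustration, not taken from the real Sparkify files.
song_record = {
    "song_id": "SOXYZ123ABC",
    "title": "Example Song",
    "duration": 215.4,
    "artist_id": "ARABC456DEF",
    "artist_name": "Example Artist",
    "artist_location": "Example Country",
}

log_record = {
    "userId": "42",
    "song": "Example Song",
    "artist": "Example Artist",
    "sessionId": 118,
    "ts": 1541903636796,  # event timestamp in epoch milliseconds
}
```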
- Creating the data model: The data model follows a star schema, consisting of one fact table (songplays) and four dimension tables (users, songs, time, artists), as shown in the ERD below. It is optimized for queries on song play analysis (a sketch of the fact table follows this list).
- Building the ETL pipeline: The ETL process extracts JSON files residing in local directories, transforms the data into pandas dataframes, and loads them into a local Postgres instance (see the single-file sketch after this list).
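As a sketch of the star schema, the fact-table definition below is written in the style of sql_queries.py, which keeps the queries as Python strings. The exact columns and types are assumptions for illustration.

```python
# A minimal sketch of the songplays fact table in the style of
# sql_queries.py; columns and types are assumptions for illustration.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP REFERENCES time (start_time),
    user_id     INT       REFERENCES users (user_id),
    song_id     VARCHAR   REFERENCES songs (song_id),
    artist_id   VARCHAR   REFERENCES artists (artist_id),
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```

Each dimension table's key appears in the fact table, which is what keeps song play queries to a single join.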
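And as a sketch of the extract-transform-load flow for a single song file, under the assumption of a local data directory and placeholder credentials:

```python
import glob

import pandas as pd
import psycopg2

# Connect to the local database; the credentials are placeholders.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Extract: pick one JSON file from a hypothetical local directory.
filepath = glob.glob("data/song_data/**/*.json", recursive=True)[0]

# Transform: load the JSON records into a pandas dataframe.
df = pd.read_json(filepath, lines=True)

# Load: insert the relevant columns into the songs dimension table.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
cur.execute(
    "INSERT INTO songs (song_id, title, artist_id, year, duration) "
    "VALUES (%s, %s, %s, %s, %s)",
    song_data,
)
conn.commit()
conn.close()
```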
- `create_tables.py`: Spins up a local Postgres database called sparkifydb and creates the tables in the schema
- `sql_queries.py`: Contains all the queries used to create, drop, and insert data into our tables
- `test.ipynb`: Used to test whether our database operations worked successfully
- `etl.ipynb`: Runs the ETL on only one file from each data repository; mainly for exploring and testing before fully automating the ETL
- `etl.py`: Processes the entirety of the log and song data (a sketch of the traversal appears below)
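A hedged sketch of how etl.py might walk the full data directories; the helper signature is an assumption, and the real script's structure may differ.

```python
import glob
import os

def process_data(cur, conn, filepath, func):
    """Apply a per-file handler to every JSON file nested under filepath."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))

    for datafile in all_files:
        func(cur, datafile)  # transform the file and insert its rows
        conn.commit()        # commit after each file so progress persists
```

etl.py would call this twice, once with a song-file handler and once with a log-file handler.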
- glob
- psycopg2
- postgres
- pandas
- json
- os