In this project, I build a relational database schema and an ETL pipeline for a music streaming app called Sparkify.
The starting point is two categories of data: song data and log data. Song data is the metadata on artists and songs, such as artist name, artist country, song title, and song duration. Log data is the event log of user activity coming from the app, such as which song a particular user listened to.
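To make the two inputs concrete, here is a rough sketch of one record of each type. The field names are assumptions for illustration and may not match the actual files exactly.

```python
# Hypothetical single records; the field names are assumptions for
# illustration, not taken from the real Sparkify files.
song_record = {
    "song_id": "SOXYZ123ABC",
    "title": "Example Song",
    "duration": 215.4,
    "artist_id": "ARABC456DEF",
    "artist_name": "Example Artist",
    "artist_location": "Example Country",
}

log_record = {
    "userId": "42",
    "song": "Example Song",
    "artist": "Example Artist",
    "sessionId": 118,
    "ts": 1541903636796,  # event timestamp in epoch milliseconds
}
```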
- Creating the data model: The data model follows a star schema, consisting of one fact table (songplays) and four dimension tables (users, songs, time, artists), as shown in the ERD below. It is optimized for queries on song play analysis (a sketch of the fact table follows this list).
- Building the ETL pipeline: The ETL process extracts JSON files residing in local directories, transforms the data into pandas dataframes, and loads them into a local Postgres instance (see the single-file sketch after this list).
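As a sketch of the star schema, the fact-table definition below is written in the style of sql_queries.py, which keeps the queries as Python strings. The exact columns and types are assumptions for illustration.

```python
# A minimal sketch of the songplays fact table in the style of
# sql_queries.py; columns and types are assumptions for illustration.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP REFERENCES time (start_time),
    user_id     INT       REFERENCES users (user_id),
    song_id     VARCHAR   REFERENCES songs (song_id),
    artist_id   VARCHAR   REFERENCES artists (artist_id),
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```

Each dimension table's key appears in the fact table, which is what keeps song play queries to a single join.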
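And as a sketch of the extract-transform-load flow for a single song file, under the assumption of a local data directory and placeholder credentials:

```python
import glob

import pandas as pd
import psycopg2

# Connect to the local database; the credentials are placeholders.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Extract: pick one JSON file from a hypothetical local directory.
filepath = glob.glob("data/song_data/**/*.json", recursive=True)[0]

# Transform: load the JSON records into a pandas dataframe.
df = pd.read_json(filepath, lines=True)

# Load: insert the relevant columns into the songs dimension table.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
cur.execute(
    "INSERT INTO songs (song_id, title, artist_id, year, duration) "
    "VALUES (%s, %s, %s, %s, %s)",
    song_data,
)
conn.commit()
conn.close()
```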
- `create_tables.py`: Spins up a local Postgres database called sparkifydb and creates the tables in the schema
- `sql_queries.py`: Contains all the queries used to create, drop, and insert data into our tables
- `test.ipynb`: Used to test whether our database operations worked successfully
- `etl.ipynb`: Runs the ETL on only one file from each data repository; mainly for exploring and testing before fully automating the ETL
- `etl.py`: Processes the entirety of the log and song data (a sketch of the traversal appears below)
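A hedged sketch of how etl.py might walk the full data directories; the helper signature is an assumption, and the real script's structure may differ.

```python
import glob
import os

def process_data(cur, conn, filepath, func):
    """Apply a per-file handler to every JSON file nested under filepath."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))

    for datafile in all_files:
        func(cur, datafile)  # transform the file and insert its rows
        conn.commit()        # commit after each file so progress persists
```

etl.py would call this twice, once with a song-file handler and once with a log-file handler.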
- glob
- psycopg2
- postgres
- pandas
- json
- os