Data-Engineering-Capstone-Project

Data Engineering Capstone Project - Udacity Data Engineering Expert Track.

In this project, I gathered several datasets, explored and assessed the data, cleaned it, defined and built a data model suited to the use case, and ran an ETL pipeline to populate it.

Project Details:

The purpose of the Udacity Data Engineering capstone project is to combine the tools, technologies, and concepts I learned throughout the program.

To do that, I used Apache Spark, AWS services, Python, data warehouse modeling, and big data concepts. First I gathered the datasets to work with, then explored, assessed, and cleaned the data. After that, I defined and built a data model that fits the use case. Finally, I ran the ETL pipeline to populate the model.

Project Datasets:

In this project, I worked with four datasets. The main dataset covers immigration to the United States, and the supplementary datasets include airport codes, U.S. city demographics, and temperature data.

  • I94 Immigration Data:

    This data comes from the US National Tourism and Trade Office. You can find it here.

  • World Temperature Data:

    This dataset came from Kaggle. You can read more about it here.

  • U.S. City Demographic Data:

    This data comes from OpenSoft. You can read more about it here.

  • Airport Code Table:

    This is a simple table of airport codes and corresponding cities. It comes from here.
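
As a rough illustration of how these datasets could be loaded, here is a minimal PySpark sketch. The file names, the semicolon separator for the demographics file, and the use of the saurfang spark-sas7bdat reader for the I94 SAS files are assumptions for the example, not details confirmed by this README.

```python
from pyspark.sql import SparkSession

# Build a Spark session; the spark-sas7bdat package is one common way to read
# the I94 SAS files (assumed here, not confirmed by the README).
spark = (
    SparkSession.builder
    .appName("capstone-etl")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .getOrCreate()
)

# Hypothetical file paths -- adjust to wherever the raw data actually lives.
immigration_df = spark.read.format("com.github.saurfang.sas.spark") \
    .load("data/i94_apr16_sub.sas7bdat")
temperature_df = spark.read.csv("data/GlobalLandTemperaturesByCity.csv",
                                header=True, inferSchema=True)
demographics_df = spark.read.csv("data/us-cities-demographics.csv",
                                 header=True, sep=";", inferSchema=True)
airport_df = spark.read.csv("data/airport-codes_csv.csv",
                            header=True, inferSchema=True)
```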

Tools and Technologies:

  • Apache Airflow.
  • Apache Spark.
  • AWS Services.
  • Python 3.
  • ETL (Extract, Transform, Load).
  • Data Warehouse Concepts.
  • Data Modeling.
  • Cloud Computing Concepts.
  • Big Data and NoSQL concepts.
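
Apache Airflow is listed above as the orchestration tool; a minimal DAG skeleton along the following lines could tie the extract, transform, and load steps together. The DAG id, schedule, and task callables are illustrative placeholders, not the project's actual pipeline.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real ETL steps.
def extract():
    pass

def transform():
    pass

def load():
    pass

with DAG(
    dag_id="capstone_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@monthly",  # assumed cadence; I94 data is released monthly
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```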

Project Steps:

Step 1: Scope the Project and Gather Data.

  • Identify and gather the data needed for the project. Explain the end use cases the data is being prepared for (e.g., analytics table, app back-end, source-of-truth database, etc.).

Step 2: Explore and Assess the Data.

  • Explore the data to identify data quality issues, such as missing values and duplicate records (see the sketch below).
  • Clean the data.
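
Continuing from the loading sketch above, a typical cleaning pass in this step might look like the following. The column name cicid is an assumption about the I94 record id, not something stated in this README.

```python
from pyspark.sql import functions as F

# Drop rows where every column is null, then remove duplicate records.
immigration_clean = (
    immigration_df
    .dropna(how="all")
    .dropDuplicates(["cicid"])   # 'cicid' is assumed to be the record id
)

# Quick profile of missing values per column to spot quality issues.
missing_counts = immigration_clean.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c)
    for c in immigration_clean.columns
])
missing_counts.show()
```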

Step 3: Define the Data Model.

  • Map out the conceptual data model and explain why this model was chosen (an illustrative sketch follows this list).
  • Plan how to pipeline the data into the data model.
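
The README does not state which model was chosen; a star schema with an immigration fact table and a few dimension tables is one common choice for this kind of project, so the sketch below is an assumption built on the dataframes from the earlier sketches, with guessed column names.

```python
# Assumed star schema: an immigration fact table plus dimension tables.
# Column names are illustrative guesses at the I94/demographics/airport fields.
fact_immigration = immigration_clean.select(
    "cicid", "i94yr", "i94mon", "i94port", "i94addr", "arrdate", "depdate"
)

dim_demographics = demographics_df.select(
    "City", "State", "Median Age", "Total Population"
).dropDuplicates(["City", "State"])

dim_airport = airport_df.select(
    "ident", "type", "name", "iso_country", "municipality"
).dropDuplicates(["ident"])

# Write the model out as partitioned Parquet, e.g. to S3 or local storage.
fact_immigration.write.mode("overwrite") \
    .partitionBy("i94yr", "i94mon") \
    .parquet("output/fact_immigration")
```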

Step 4: Run ETL to Model the Data.

  • Create the data pipelines and the data model.
  • Include a data dictionary.
  • Run data quality checks to ensure the pipeline ran as expected (see the example after this list).
  • Enforce integrity constraints on the relational database (e.g., unique keys, data types).
  • Write unit tests for the scripts to ensure they do the right thing.
  • Run source/count checks to ensure completeness.
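
To illustrate the checks listed above, here is a small sketch of row-count and unique-key quality checks, applied to the assumed fact table from the modeling sketch. The function names and the table/column names are illustrative, not taken from the project code.

```python
def check_not_empty(df, table_name):
    """Fail loudly if a table produced by the pipeline has no rows."""
    count = df.count()
    if count == 0:
        raise ValueError(f"Data quality check failed: {table_name} is empty")
    print(f"Data quality check passed: {table_name} has {count:,} rows")

def check_unique_key(df, table_name, key_column):
    """Verify that the given key column contains no duplicate values."""
    total = df.count()
    distinct = df.select(key_column).distinct().count()
    if total != distinct:
        raise ValueError(
            f"Data quality check failed: {key_column} is not unique in {table_name}"
        )
    print(f"Data quality check passed: {key_column} is unique in {table_name}")

# Example usage against the assumed fact table from the modeling sketch.
check_not_empty(fact_immigration, "fact_immigration")
check_unique_key(fact_immigration, "fact_immigration", "cicid")
```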