Builded an ETL pipeline using Python, Pandas, Python dictionary methods and regular expressions to extract and transform the data. Created four CSV files and use the CSV file data to create an ERD and a table schema. Finally, uploaded the CSV file data into a Postgres database. It involves extracting data from multiple sources, cleaning and transforming the data using Jupyter Notebook with pandas, numpy, and datetime packages, and loading the cleaned data into a relational database using pgAdmin.
Hey, let's play with "Crowdfunding" data files and win the Data Science by extracting completely, transforming appropriately and loading in the PgAdmin engine as Postgres database.
Current trends in the world of economic funding and investment have revealed exponential growth with regard to crowdfunding. Crowdfunding works by taking small sums of capital from a variety of people in order to fund up and coming business ideas and projects. Crowdfunding campaigns have proven to be very successful by raising funds without the upfront fees.
This ETL process for 'Crowdfunding' data break into two deliverables.
Deliverable 1: Extract & Transform Using Jupyter Notebook.
Deliverable 2: Load the data to Postgres Database.
Before you begin, ensure you have the following installed:
- Python 3.6 or higher
- Numpy
- JSON
- Pandas (for data analysis)
We get the data resources from two files 'crowdfundig.xlsx' and 'contacts.xlsx' using Pandas.
Extract and transform the crowdfunding.xlsx data to create a 'crowdfunding_info_df' DataFrame.
Split each "category & sub-category" column value into "category" and "subcategory".
To create the category and subcategory identification numbers, use a list comprehension to add the "cat" string or the "subcat" string to each number in the category or the subcategory array, respectively.
Create the category DataFrame as 'category_df' and subcatgeory DataFrame as 'subcategory_df'.
Show the 'category_df' DataFrame with top five rows.
Show the 'subcategory_df' DataFrame with top five rows.
Create a copy of the 'crowdfunding_info_df' to transform the crowdfunding.xlsx data.
Rename the specific columns and sets their appropriate data types for 'campaign_df' DataFrame.
Convert the 'launched_date' and 'end_date' columns to UTC datetime format.
Drop the unwanted columns for campaign DataFrame.
Confirm the number of columns after dropping in the DataFrame and then export the campaign DataFrame as campaign.csv.
Extracting and transforming the data from the 'contacts.xlsx' excel data.
Iterate through the 'contact_info_df' DataFrame to get the data values of all rows in a list.
Create a 'new_contact_info_df' DataFrame for contacts data.
Split each "name" column value into a first and last name, and place each in a new column.
Reorder the columns and Display the new DataFrame i.e. 'new_contact_info_df' with first ten rows.
Extracting and transforming the data from the 'contacts.xlsx' excel data into 'regex_contact_info_df' DataFrame.
Extract the four-digit contact ID number. Extract the "contact_id", "name", and "email" columns by using regular expressions.
Split each "name" column value into a first and a last name, and place each in a new column.
Display the created 'new_regex_contact_info' DataFrame with first ten rows.
- PostgreSQL database
- pgAdmin
Inspect the four CSV files, and then sketch an Entity Relationship Diagram of the tables.
To create the sketch, we use a QuickDBD as one of best tool for design ERD.
Here is how the database model prepared.
During the inspection of the data files, we discovered some dependencies between the tables with repect to the columns of data.
We set the 'contact_id' as primary key for 'contacts' table, however in the 'campaign' table this acts as a foreign key as the 'campaing' data is dependent to 'contacts' data. Considering this dependency, we set many to one relationship between them.
Likewise, the 'campaign' table is dependent on the category and subcategory table with 'category_id' and 'subcategory_id' as foreign keys. Considering the unique values the same fields are primary keys in the 'category' and 'subcategory' table.
Create the 'crowdfunding_db' database in pgAdmin using SQL.
Create a table schema for each of the four CSV files.
Specify the data types, primary keys, foreign keys, and other constraints.
Create the tables in the correct order to handle the foreign keys.
we can see how successfully the process for all four files completed.
Initially, we wrote the below 'select' queries in order to fetch the data from their respective tables.
Tables data at a glance:
Display the results of 'contacts' table for the first 10 rows.
Exhibit the results of 'category' table for the first 10 rows.
List the results of 'subcategory' table for the first 10 rows.
Show the results of 'campaign' table for the first 10 rows.