Skip to content

Latest commit

 

History

History
79 lines (65 loc) · 3.57 KB

README.md

File metadata and controls

79 lines (65 loc) · 3.57 KB

Crowdfunding ETL Data Analysis


Overview

The project involves building an ETL pipeline to extract and transform crowdfunding data and then load it into a PostgreSQL database. This process extensively uses Python and Pandas for data transformation and includes handling foreign key constraints to ensure data integrity during data loading.


Installation

  • Tools and Languages:

    • Used Python, Pandas library, and SQL.
    • Installed VS Code and pgAdmin4 for development and database management.
  • Dependencies:

    • Imported pandas, NumPy, json, and re (regex module) libraries.
  • Database:

    • Used PostgreSQL, managed via pgAdmin4.

Code examples

  • Example 1 (comprehension method)

      # Use a list comprehension to add "cat" to each category_id. 
          cat_ids = [f'cat{cat_id}' for cat_id in category_ids]
      # Use a list comprehension to add "subcat" to each subcategory_id.
          scat_ids = [f'subcat{subcat_id}' for subcat_id in subcategory_ids]
      # Displaying results 
          print(cat_ids)
          print(scat_ids)
  • Example 2 (comprehension method)

     # Initialize an empty lists to store the list values and keys
         dict_values = []
         column_names = []
     # Iterate through the DataFrame.
         for i, row in contact_info_df.iterrows():
         data = row.iloc[0]
     # Convert each row to a Python dictionary.
         converted_data = json.loads(data)
    
     # Use a list comprehension to get the keys from the converted data.
         columns = [k for k,v in converted_data.items()]
     # Use a list comprehension to get the values for each row.
         row_values = [v for k, v in converted_data.items()]
     # Append the keys and list values to the lists created in step 1.  
         column_names.append(columns)
         dict_values.append(row_values)
     # Print out the list of values for each row and columns names
       print(column_names[0])
       print(dict_values)
  • Example 3 (Regex)

     # Extract the name of the contact and add it to a new column.
         contacts_next = pd.DataFrame(contacts_df)
     # Function to extract name from contact_info using regex
         def extract_name(contact_info): #define the function
     # Set paramenters of the search: any range and any characters and whitespaces after "name" except the double quote.
         match = re.search(r'"name":\s*"([^"]+)"', contact_info) 
         return match.group(1) if match else '' #save results of the search if the match is found
     # Apply the function to the contact_info column and create a new column 'name'
         contacts_next['name'] = contacts_next['contact_info'].apply(extract_name)
     # Display the first few rows of the new DataFrame
         print(contacts_next.head())

Roadmap

  • The ETL_Mini_Project_OSLinden file contains the main ETL code using Pandas libraries, comprehensive data manipulation, and Regex.
  • The crowdfunding_db_schema is a SQL code to upload extracted and transformed data frames.
  • The Resources folder includes original data files and transformed data frames.
  • The Images_SQL_Data_Upload folder features screenshots showing that the schema was executed and data was successfully uploaded into each table.