using Pentaho Data Integration (PDI)/Kettle and Microsoft SQL Server 18 ⚙
.: 📄 Dataset taken from Kaggle :.
- About Project
- Objectives
- Data Set Description
- Connection Configuration
- ETL Process
- Star Schema
- Before & After ETL Comparison
-
This repository contains files to create data warehouse such as:
- ETL files using Pentaho Data Integration (PDI)
- Codes to create OLAP (SQL)
- Codes to select data from OLTP (SQL)
- Codes to perform random testing (SQL)
for credit card applicant. The dataset is provided by Seanny (rikdifos).
-
This project will also create:
- 2 dimension tables (Applicant_Dimension and CreditRecord_Dimension),
- Time dimension (Time_Dimension), and
- 1 fact table (CreditCard_Fact).
using PDI and Microsoft SQL Server 18.
- Perform ETL using PDI for both datasets.
- Create time dimension using PDI.
- Create fact table using PDI.
- The dataset description can be seen here.
username: sa
pass: qwer
- Importing application table from OLTP.
- Sort data based on applicant ID.
- Filter duplicate applicant ID.
- Replace some values to make it easier to understand.
- Add new columns with constant date (October 1, 2021).
- Calculate DOB and date of applicant start working based on current date (October 1, 2021).
- Calculate age of applicant based on current year (2021).
- Filter applicant data which has null values.
- Filter applicant data who is less than 21 y.o.
- Adding Index Applicant (to replace ID as primary key).
- Select columns that will entered OLAP.
- Exporting application table to OLAP (Application Dimension).
- Importing credit record table from OLTP.
- Sort data based on applicant ID.
- Add new columns with constant date (October 1, 2021).
- Calculate loan payment's month based on current date (October 1, 2021).
- Adding CreditRecord_ID (to replace Applicant ID as primary key).
- Select columns that will entered OLAP.
- Exporting application table to OLAP (Credit Record Dimension).
- Generate a column with specific date (January 1, 2016).
- Add row with sequence from 1 to 99999.
- Caluclating start date with sequence data to make next date (ex: January 2, 2016; January 3, 2016)
- Creating new columns (Day, Months, and Year).
- Creating month number and month name.
- Combine 'Month' from Calculator node to 'No_Month' from Data Grid node.
- Creating time ID using JavaScript code.
- Select columns that will entered OLAP.
- Exporting time dimension to OLAP.
- Importing Credit Record dimension from OLAP.
- Importing Application dimension from OLAP.
- Join both dimension tables based on applicant ID.
- Filter applicant ID that doesn't exists in both tables.
- Importing Time dimension from OLAP.
- Join application & credit record dimension with time dimension.
- Replace C, X, 0 with 'Good Debt' (C: loan for that month is already paid; X: no loan for that month; 0: loan is 1 to 29 days overdue).
- Replace 1, 2, 3, 4, 5 with 'Bad Debt' (1: loan is 30 to 59 days overdue; 2: loan is 60 to 89 days overdue; 3: loan is 90 to 119 days overdue; 4: loan is 120 to 149 days overdue; 5: loan is more than 150 days overdue)
- Creating 2 copies from 'Status' column ('Good_Debt' and 'Bad_Debt').
- Good_Debt: Good Debt will be change to 1, while Bad Debt will be change to 0
- Bad_Debt: Good Debt will be change to 0, while Bad Debt will be change to 1
- To create date & time when ETL was performed.
- Select columns that will entered OLAP.
- Exporting fact table to OLAP.
- This section will show the data structure before & after ETL.
👉 If you find this project useful, please ⭐ this repository 😆!
👉 More about myself: here