Credit Card Fraud Detection
is the graduation project idea chosen by the teammates in DEPI (Digital Egypt Pioneers Initiative).
Credit card fraud is a major source of financial trouble not only for consumers, but for banks too.
Credit card fraud detection is a set of tools and protocols that card issuers use to detect suspicious activity that could indicate a fraud attempt. These tools are generally proactive, aiming to stop credit card fraud before it starts. They also help to prevent financial losses caused by credit card fraud.
The dataset contains credit card transactions made in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. The dataset link and more details about it are available here.
Project workflow is as follows:
Data was collected and downloaded from here.
- unzipped the dataset and extracted the CSV file from the directory
- initialized the local directory as a Git repository using the command `git init`, after making sure that I was in the correct directory in one of two ways: navigating with the `cd` command followed by the path, or choosing "Git Bash Here" directly from the shortcut menu
- renamed the `master` branch to `main` using the command `git branch -m master main`
- created a repository on GitHub with the same name as the problem and connected the local repo and the cloud repo together using `git remote add origin 'url'`
- created the preprocessing notebook.
Before feeding the data into the model, a data preprocessing step is done first. The preprocessing techniques used:
- Ensuring the data is clean (no null values, outliers, ...etc)
- Checking the class imbalance (solved using sampling)
- PCA was already applied on the data, so we ensured that all the values are normalized using the standard scaler for better training.
- exported the cleansed data
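The scaling step in the list above could look roughly like the sketch below. It is only an illustration: the toy data frame and the column names `V1`, `V2`, `Amount`, and `Class` are assumptions standing in for the real CSV, and the notebook may select columns differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real CSV (column names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "V2": rng.normal(size=1000),
    "Amount": rng.exponential(scale=80.0, size=1000),
    "Class": rng.integers(0, 2, size=1000),
})

# Scale every feature column but leave the target column untouched.
feature_cols = [c for c in df.columns if c != "Class"]

scaler = StandardScaler()
scaled = df.copy()
scaled[feature_cols] = scaler.fit_transform(df[feature_cols])

# After scaling, each feature has mean ~0 and population standard deviation ~1.
print(scaled[feature_cols].mean().round(3))
print(scaled[feature_cols].std(ddof=0).round(3))
```

`StandardScaler` standardizes each column independently, so features on very different scales end up comparable before training.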
- After importing the dataset in the notebook using `pandas`, we gained some insights about the dataset using the `.info()` function as follows:
- Only the target column is of integer type and the rest of the columns are of float type. They all have the same count (no missing values); you can check for null values using `df.isnull().sum()` to sum the null values of each column.
- The dataset is large, so we have to check the class imbalance to prevent bias:
- After visualizing the class balance, we found that one of the two classes is far smaller than the other, so there is a class imbalance problem to be solved.
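The inspection steps described above can be sketched as follows. The toy data frame, its size, and the column names (`V1`, `Amount`, `Class`) are assumptions made for illustration; with the real CSV, `df` would come from `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real CSV (names and sizes are assumptions).
rng = np.random.default_rng(42)
n_fraud, n_legit = 20, 9980
n = n_fraud + n_legit
df = pd.DataFrame({
    "V1": rng.normal(size=n),
    "Amount": rng.exponential(scale=80.0, size=n),
    "Class": np.r_[np.ones(n_fraud, dtype=int), np.zeros(n_legit, dtype=int)],
})

df.info()                                   # dtypes and non-null counts per column
null_counts = df.isnull().sum()             # missing values in each column
class_counts = df["Class"].value_counts()   # rows per class
class_share = df["Class"].value_counts(normalize=True) * 100  # percentage per class

print(null_counts)
print(class_counts)
print(class_share.round(3))
```

A bar plot of `class_counts` (for example with `class_counts.plot(kind="bar")`, if matplotlib is installed) is one way to visualize the imbalance.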
- There are multiple techniques to resolve class imbalance: GANs, SMOTE, resampling, etc. We used `resampling` using the `scikit-learn` library.
- The data was resampled into a smaller sample (undersampling), because oversampling would make the data too large.
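A minimal sketch of random undersampling is shown below. It assumes `sklearn.utils.resample` is the scikit-learn helper in question, which the text does not confirm, and it uses a hypothetical toy data frame with an assumed `Class` target column instead of the real CSV.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced toy frame (column names and sizes are assumptions).
rng = np.random.default_rng(0)
n_fraud, n_legit = 50, 5000
df = pd.DataFrame({
    "V1": rng.normal(size=n_fraud + n_legit),
    "Class": np.r_[np.ones(n_fraud, dtype=int), np.zeros(n_legit, dtype=int)],
})

fraud = df[df["Class"] == 1]   # minority class
legit = df[df["Class"] == 0]   # majority class

# Randomly keep only as many majority rows as there are minority rows.
legit_down = resample(legit, replace=False, n_samples=len(fraud), random_state=42)

# Recombine the two classes and shuffle the rows.
balanced = (
    pd.concat([fraud, legit_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["Class"].value_counts())
```

Undersampling keeps training fast but discards most majority-class rows, which is the information loss the GAN-based augmentation tries to avoid.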
- Another technique that was applied is augmenting the minority class using GANs:
- Augmentation using GAN: To avoid information loss, we applied Generative Adversarial Networks (GANs) to synthetically generate new fraud examples, thereby increasing the size of the fraud class to match the non-fraudulent transactions. While this technique provided more balanced data, it led to a large dataset, slowing down the computation due to the sheer number of non-fraud cases (over 200k).
- We just needed to prove a point using the GAN model: that it can be used to augment numerical data that has already been processed using PCA, which added another challenge for the GAN network. The idea was inspired by this blog: Fraud Detection With GANs