
Credit Card Fraud Detection

  • Credit Card Fraud Detection is the graduation project idea chosen by our team for DEPI (the Digital Egypt Pioneers Initiative).

Table Of Contents:

  1. About The Idea
  2. The Dataset
  3. Project Workflow
  4. Data Preprocessing
  5. Model Selection And Training
  6. Inference And Evaluation
  7. Model Deployment
  8. Workflow
  9. Acknowledgements

1- About The Idea:

Credit card fraud is a major source of financial trouble, not only for consumers but also for banks.

Credit card fraud detection is a set of tools and protocols that card issuers use to detect suspicious activity that could indicate a fraud attempt. These tools are generally proactive, aiming to stop credit card fraud before it starts, and they help prevent the financial losses that fraud causes.

Why is it important?

Credit Card Fraud Detection helps to prevent financial losses caused by credit card fraud.

2- The Dataset:

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. You can find the dataset and read more about it here.

3- Project Workflow

Project workflow is as follows:

Project Workflow

Data was collected and downloaded from here.

Setting up the project workspace on the local PC:

  1. Unzipped the dataset and extracted the CSV file from the archive.
  2. Initialized the local directory as a Git repository, after making sure to be in the correct directory (either by navigating with the cd command followed by the path, or by using "Git Bash Here" from the shortcut menu), using the command
    git init
  3. Renamed the master branch to main using the command:
    git branch -m master main
  4. Created a repository on GitHub with the same name as the project and connected the local and remote repositories together using
    git remote add origin 'url'
  5. Created the preprocessing notebook.

4- Preprocessing

Before feeding the data into the model, a data preprocessing step is performed first. The preprocessing techniques used:

  1. Ensuring the data is clean (no null values, outliers, etc.)
  2. Checking the class imbalance (solved using sampling)
  3. PCA had already been applied to the data, so we ensured that all the values are normalized using the standard scaler for better training (a sketch of this step follows this list).
  4. Exporting the cleansed data.
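A minimal sketch of the scaling and export step (items 3 and 4 above), assuming the cleaned DataFrame is named df; the output file name is an assumption, not the project's actual value:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Separate features and target ('Class' is the label column in this dataset)
    X = df.drop(columns=["Class"])
    y = df["Class"]

    # Standardize the feature columns so they share a comparable scale
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

    # Export the cleansed, scaled data for the training step (file name is illustrative)
    pd.concat([X_scaled, y], axis=1).to_csv("cleansed_data.csv", index=False)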

1. Cleansing the data:

  • After importing the dataset into the notebook using pandas, we gained some insights about it using the .info() function, as follows:

dataset insights

  • Only the target column is of type integer; the rest of the columns are of type float. They all have the same count (no missing values); you can check for null values using df.isnull().sum(), which sums the null values of each column.
  • Because the dataset is so large, we also have to check the class imbalance to prevent bias (a quick check is shown in the sketch below):
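A minimal sketch of these checks, assuming the extracted CSV is named creditcard.csv (the Kaggle default; adjust the path if needed):

    import pandas as pd

    # Load the extracted CSV file
    df = pd.read_csv("creditcard.csv")

    # Column dtypes and non-null counts
    df.info()

    # Confirm there are no missing values in any column
    print(df.isnull().sum())

    # Check the class balance: 0 = legitimate, 1 = fraud
    print(df["Class"].value_counts())
    print(df["Class"].value_counts(normalize=True) * 100)  # percentages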

class imbalance

  • After visualizing the class balance, we found that one class is far smaller than the other, so there is a class imbalance problem here that needs to be solved.

  • There are multiple techniques for resolving class imbalance: GANs, SMOTE, resampling, etc. We used resampling with the scikit-learn library.

  • The data was resampled down to a small, balanced subset (undersampling), because oversampling would make the dataset too large (see the sketch after the figure below).

resampling
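A minimal sketch of the undersampling step with scikit-learn, assuming the DataFrame is named df and the label column is Class; matching the fraud-class size exactly is an assumption for illustration:

    import pandas as pd
    from sklearn.utils import resample

    fraud = df[df["Class"] == 1]
    non_fraud = df[df["Class"] == 0]

    # Downsample the majority (non-fraud) class to the size of the fraud class
    non_fraud_down = resample(
        non_fraud,
        replace=False,            # sample without replacement
        n_samples=len(fraud),     # match the minority class size
        random_state=42,          # reproducibility
    )

    # Combine and shuffle to form the balanced training set
    balanced = pd.concat([fraud, non_fraud_down]).sample(frac=1, random_state=42)
    print(balanced["Class"].value_counts())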

  • Another technique we applied was augmenting the minority class using GANs:
    • Augmentation using GAN: To avoid the information loss that comes with undersampling, we applied Generative Adversarial Networks (GANs) to synthetically generate new fraud examples, increasing the size of the fraud class to match the non-fraudulent transactions. While this technique produced more balanced data, it resulted in a large dataset that slowed down computation due to the sheer number of non-fraud cases (over 200k).
    • Our goal was to prove a point with the GAN model: it can be used to augment numerical data, even data that had already been transformed with PCA, which added another challenge for the GAN. The idea was inspired by this blog: Fraud Detection With GANs. A minimal sketch is shown below.
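A minimal sketch of a tabular GAN for generating synthetic fraud rows, written with Keras; the network sizes, latent dimension, and training settings here are assumptions for illustration, not the project's exact configuration:

    import numpy as np
    from tensorflow.keras import layers, models

    latent_dim = 32
    n_features = 30  # e.g. V1..V28 + Time + Amount (assumption)

    # Generator: maps random noise to a synthetic transaction feature vector
    generator = models.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_features, activation="linear"),
    ])

    # Discriminator: scores a feature vector as real (1) or synthetic (0)
    discriminator = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    discriminator.compile(optimizer="adam", loss="binary_crossentropy")

    # Combined model: freeze the discriminator here so that training the
    # combined model only updates the generator (the discriminator keeps
    # training through its own compiled model above)
    discriminator.trainable = False
    gan = models.Sequential([generator, discriminator])
    gan.compile(optimizer="adam", loss="binary_crossentropy")

    def train_gan(real_fraud, epochs=2000, batch_size=64):
        for _ in range(epochs):
            # Train the discriminator on half real, half generated samples
            idx = np.random.randint(0, real_fraud.shape[0], batch_size)
            noise = np.random.normal(size=(batch_size, latent_dim))
            fake = generator.predict(noise, verbose=0)
            discriminator.train_on_batch(real_fraud[idx], np.ones((batch_size, 1)))
            discriminator.train_on_batch(fake, np.zeros((batch_size, 1)))

            # Train the generator (via the combined model) to fool the discriminator
            noise = np.random.normal(size=(batch_size, latent_dim))
            gan.train_on_batch(noise, np.ones((batch_size, 1)))

    # Usage: train on the scaled fraud rows, then sample synthetic frauds
    # train_gan(fraud_scaled.values)
    # synthetic = generator.predict(np.random.normal(size=(10_000, latent_dim)), verbose=0)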