The motivation for this repository is the difficulty that the dataset presents when we define the target and the features. One of the problems involves several sources of data leakage.
There are several attempts on Kaggle with low metrics, particularly when the training set is restricted to features whose information was available before the loan was granted, and we want to try to improve on them:
https://www.kaggle.com/datasets/devanshi23/loan-data-2007-2014/data
We use various data preprocessing techniques, such as SelectKBest with information value, binning, up-sampling with imblearn, one-hot encoding, and imputers.
loan_status (our target) has the following values:
- Current
- Fully Paid
- Charged Off
- Late (31-120 days)
- In Grace Period
- Does not meet the credit policy. Status:Fully Paid
- Late (16-30 days)
- Default
- Does not meet the credit policy. Status:Charged Off
The main point we must consider is that these values belong to different moments in the loan life span.
Those that belong to the end of a loan:
- Fully Paid
- Charged Off
- Does not meet the credit policy. Status:Fully Paid
- Default
- Does not meet the credit policy. Status:Charged Off
Those that belong to the middle of a loan's term:
- Current
- Late (31-120 days)
- Late (16-30 days)
In Grace Period, by contrast, belongs to the beginning.
On top of this, we should consider:
- All loans, regardless of how they end, were at some earlier point In Grace Period.
- All loans, regardless of how they end, were at some earlier point Current and/or Late.
"Good loans" (1):
- Fully Paid
"Bad loans" (0):
- Charged Off
- Does not meet the credit policy. Status:Fully Paid
- Default
- Does not meet the credit policy. Status:Charged Off
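
The good/bad grouping above could be encoded roughly as follows. This is a minimal sketch, not the repository's actual code: the names GOOD_STATUSES, BAD_STATUSES and encode_status are hypothetical, and statuses that do not mark the end of a loan are mapped to None so they can be dropped.

```python
from typing import Optional

# Status groups copied from the lists above; the constant names are illustrative.
GOOD_STATUSES = {"Fully Paid"}
BAD_STATUSES = {
    "Charged Off",
    "Does not meet the credit policy. Status:Fully Paid",
    "Default",
    "Does not meet the credit policy. Status:Charged Off",
}


def encode_status(status: str) -> Optional[int]:
    """Return 1 for a good loan, 0 for a bad loan, None for an unfinished loan."""
    if status in GOOD_STATUSES:
        return 1
    if status in BAD_STATUSES:
        return 0
    # Current, Late and In Grace Period are not end states, so they get no label.
    return None


# Toy input to show the behaviour; not taken from the dataset.
statuses = ["Current", "Fully Paid", "Charged Off", "In Grace Period", "Default"]
encoded = [encode_status(s) for s in statuses]
print(encoded)
```

With a pandas DataFrame df, something like df["loan_status"].map(encode_status) followed by dropping the rows with a missing label would apply the same idea.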
We only consider end-of-loan categories in the target, and the X_train set should contain only features whose information was available before the loan was granted.
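
One way to enforce the restriction on features is to keep an explicit list of columns that are only known after the loan was granted and exclude them, together with the target, before training. The sketch below is illustrative only: the column names in ASSUMED_POST_GRANT_COLUMNS are assumptions that should be checked against the dataset's data dictionary, and pre_grant_features is a hypothetical helper.

```python
from typing import FrozenSet, Iterable, List

# Assumed examples of columns filled in after the loan was granted.
# Verify each name against the dataset before relying on this list.
ASSUMED_POST_GRANT_COLUMNS: FrozenSet[str] = frozenset(
    {"total_pymnt", "recoveries", "last_pymnt_d", "out_prncp"}
)


def pre_grant_features(
    columns: Iterable[str],
    post_grant: FrozenSet[str] = ASSUMED_POST_GRANT_COLUMNS,
    target: str = "loan_status",
) -> List[str]:
    """Keep the columns that are neither post-grant information nor the target."""
    return [c for c in columns if c not in post_grant and c != target]


# Toy column list to show the behaviour; not the full dataset schema.
columns = ["loan_amnt", "annual_inc", "total_pymnt", "recoveries", "loan_status"]
kept = pre_grant_features(columns)
print(kept)
```

With a pandas DataFrame df, df[pre_grant_features(df.columns)] would then select the candidate features for X_train. An explicit exclusion list is used here because leaky columns cannot be detected reliably from the data alone.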