Lending has long been an important part of daily life for organizations and individuals alike, and growing financial constraints have made it more or less inevitable. Although lending benefits both lenders and borrowers and is an essential activity of financial organizations, it carries considerable risk. Credit risk is the risk that a borrower will be unable to repay a loan by the date agreed between the lender and the borrower in the loan agreement. This is a major concern for financial institutions because it can result in “credit default”, which can have severe consequences for the lender, including bankruptcy. A thorough evaluation and verification of a borrower's ability to repay a loan within the agreed period can minimize credit risk, and will therefore benefit financial institutions around the world.
I used two datasets to benchmark my results. The first relates to the Dream Housing Finance company, which provides housing loans, and contains 614 instances and 13 attributes. The second contains information on credit card clients in Taiwan from April 2005 to September 2005, and has 30,000 instances and 25 attributes. Both datasets are multivariate, with attributes of integer, categorical and real data types.
I applied multiple models to each dataset so that the final choice could rest on a comparative analysis of all of them. The task is a binary classification problem, so every model used is a classifier. Before any model was applied, each dataset went through data cleaning, exploratory data analysis and pre-processing. Six models were applied to each dataset: Logistic Regression, Decision Tree, Random Forest Classifier, Support Vector Machine (SVM), Naïve Bayes, and XGBoost Classifier.
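As an illustration of this kind of side-by-side comparison, the following is a minimal sketch and not the code used in this study. It assumes scikit-learn is installed, uses synthetic data from `make_classification` in place of the real datasets, and substitutes scikit-learn's `GradientBoostingClassifier` for the XGBoost Classifier so that no extra package is required. The helper names `build_models` and `compare_models` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return the six classifiers to compare, keyed by display name."""
    return {
        # Scale-sensitive models are wrapped in a pipeline with a scaler.
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        # probability=True is needed so the SVM exposes predict_proba for AUC.
        "SVM": make_pipeline(
            StandardScaler(), SVC(probability=True, random_state=seed)
        ),
        "Naive Bayes": GaussianNB(),
        # Assumption: a stand-in for XGBoost, not the classifier used in the study.
        "Gradient Boosting (XGBoost stand-in)": GradientBoostingClassifier(
            random_state=seed
        ),
    }


def compare_models(X, y, seed=0):
    """Fit every model on one stratified split and report accuracy and AUC."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    results = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        default_scores = model.predict_proba(X_test)[:, 1]
        results[name] = {
            "accuracy": accuracy_score(y_test, model.predict(X_test)),
            "auc": roc_auc_score(y_test, default_scores),
        }
    return results


# Synthetic placeholder data; the real datasets are not reproduced here.
X, y = make_classification(
    n_samples=614, n_features=12, n_informative=6, weights=[0.7, 0.3], random_state=0
)
results = compare_models(X, y)
for name, metrics in sorted(results.items(), key=lambda kv: kv[1]["auc"], reverse=True):
    print(f"{name:38s} accuracy={metrics['accuracy']:.3f}  AUC={metrics['auc']:.3f}")
```

Evaluating every model on the same stratified split keeps the comparison fair. With real data, cross-validation would give a more stable ranking than a single split.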
This study was conducted to find an algorithm that predicts loan default accurately, and so saves financial institutions from lending to defaulting customers and incurring losses. The XGBoost Classifier, a relatively new classifier with a reputation for winning Kaggle competitions, proved to be the best classifier for predicting whether a loan would default, on both the large and the small dataset, with the Random Forest Classifier a close second. It surpassed the other classifiers in both accuracy and AUC score. Moreover, the results achieved on the smaller dataset were considerably better than those on the larger one. For example, the highest AUC score on the ‘loan data set’ was 0.89, whereas that on ‘default of credit card clients’ was only 0.78. This may be due to the many discrepancies in the ‘default of credit card clients’ dataset: it had errors, values that were not listed in the data description, and trends that could not be explained logically. I could not remove these values either, as they carried important information and removing them reduced model accuracy. The ‘loan data set’, on the other hand, gave better results even though it contained missing values. This supports the view that a model is only as good as the data it is trained on.
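To make the two reported metrics concrete, the following is a small illustrative sketch in plain Python, not the evaluation code used in this study. It computes accuracy at a fixed threshold and AUC as the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score. The labels and scores are invented for illustration, and the function names are hypothetical.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Share of cases classified correctly when scores >= threshold mean default."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random defaulter outscores a random non-defaulter."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Invented example: 8 non-defaulters (0) and 2 defaulters (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
model_scores = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.35, 0.45, 0.40, 0.60]
constant_scores = [0.2] * 10  # a "model" that never predicts default

print(accuracy(y_true, model_scores), auc(y_true, model_scores))        # 0.9 0.9375
print(accuracy(y_true, constant_scores), auc(y_true, constant_scores))  # 0.8 0.5
```

The constant scorer still reaches 0.8 accuracy on this imbalanced example because most borrowers do not default, but its AUC of 0.5 shows that it cannot rank borrowers at all. This is one reason to report AUC alongside accuracy.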