The purpose of this analysis is to use machine learning to assess the credit risk of loan applicants in a LendingClub dataset. The project is a supervised learning problem because the data includes labeled outcomes. Because the risk classes in the dataset are heavily imbalanced, the following resampling techniques and ensemble classifiers were applied to balance them and improve the predictions:
- RandomOverSampler
- SMOTE
- ClusterCentroids
- SMOTEENN
- BalancedRandomForestClassifier
- EasyEnsembleClassifier
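
To make the idea of rebalancing concrete, the sketch below shows what naive random oversampling does, using only the Python standard library. It illustrates the technique and is not the imbalanced-learn `RandomOverSampler` implementation; the function name `random_oversample` and the toy rows and labels are assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that belong to this class in the original data.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement to close the gap to the majority count.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (hypothetical): 9 low-risk rows and 1 high-risk row.
rows = [[i] for i in range(10)]
labels = ["low_risk"] * 9 + ["high_risk"]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

In practice, resampling of this kind is applied to the training split only, so that the test set keeps the original class balance.
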
Originally, the dataset had over 100,000 loan applicants from Q1 2019. The loan status was used to label each application as "high" or "low" risk: applicants whose status was "current" or "loan status" were labeled "low risk", and the rest of the data was labeled "high risk". Cleaning the data reduced the dataset to 68,470 applicants, nearly all of them (99%) classified as "low risk".
The RandomOverSampler model had a balanced accuracy score of 64%.
- The high risk precision was 1% with a recall of 66%, which gives an F1 score of 2%.
- The low risk precision was 100% with a recall of 62%.
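
The scores reported for this model are linked: balanced accuracy is the mean of the per-class recalls, and F1 is the harmonic mean of precision and recall. The sketch below checks this with the rounded figures above. The helper names are made up for this example, and small differences from reported scores are to be expected because the inputs are rounded.

```python
def balanced_accuracy(recalls):
    """Mean of the per-class recall values."""
    return sum(recalls) / len(recalls)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded RandomOverSampler figures quoted in the text.
high_risk = {"precision": 0.01, "recall": 0.66}
low_risk = {"precision": 1.00, "recall": 0.62}

bal_acc = balanced_accuracy([high_risk["recall"], low_risk["recall"]])
high_f1 = f1_score(high_risk["precision"], high_risk["recall"])

print(f"balanced accuracy: {bal_acc:.2f}")
print(f"high-risk F1:      {high_f1:.2f}")
```

The calculation reproduces the balanced accuracy of about 64% and the high risk F1 score of about 2%. It also shows why F1 stays so low: with a precision of 1%, even a high recall cannot lift the harmonic mean.
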
The SMOTE algorithm had a balanced accuracy score of 65.8%, which is somewhat better than the previous model.
- The high risk precision was again only 1%, and the recall dropped slightly to 62%.
- The low risk precision was still 100%, and the recall improved to 69%.
The ClusterCentroids undersampling algorithm had a balanced accuracy score of 54.4%, lower than both oversampling models.
- The high risk precision was 1% with a recall of 69%. The F1 score was 1%.
- The low risk precision was 100%, but the recall of 40% was low compared to the oversampling models.
SMOTEENN (Synthetic Minority Oversampling Technique + Edited Nearest Neighbors) had a balanced accuracy score of 64.8%.
- The high risk precision was 1% and the recall was 72%, which brought the F1 score to 2%.
- The low risk precision was still 100%, with a recall of 57%.
The BalancedRandomForestClassifier brought the balanced accuracy score to 78.8%.
- The high risk precision increased to 3% with a recall of 70%, giving an F1 score of 6%.
- The low risk precision was still 100%, with a high recall of 87%.
The EasyEnsembleClassifier had the best balanced accuracy score, at 93.1%.
- The high risk precision increased to 9% and the recall increased to 92%, giving the highest F1 score of 16%.
- The low risk precision was still 100%, and the recall jumped to 94%.
After reviewing the results, the EasyEnsembleClassifier model clearly performed best, with a balanced accuracy score of 93.1% and a precision of 9% when predicting high risk applicants. Its recall was also the highest for both high risk applicants (92%) and low risk applicants (94%). It is the best of the six models for assessing credit risk when lending to applicants because it matched or outperformed the others on every metric reported here.
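
The comparison can be summarized by ranking the reported balanced accuracy scores. The sketch below hard-codes the figures from this report. The pairing of each score with a model name follows the order of the technique list, which is assumed here, and the variable names are chosen for illustration only.

```python
# Balanced accuracy scores as reported in this analysis (percent).
# The score-to-model pairing follows the order of the technique list (assumed).
balanced_accuracy = {
    "RandomOverSampler": 64.0,
    "SMOTE": 65.8,
    "ClusterCentroids": 54.4,
    "SMOTEENN": 64.8,
    "BalancedRandomForestClassifier": 78.8,
    "EasyEnsembleClassifier": 93.1,
}

# Sort from the highest to the lowest score.
ranked = sorted(balanced_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:<32}{score:>6.1f}%")

best_model = ranked[0][0]
```

The two ensemble classifiers rank first and second, ahead of all four resampling approaches.
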