Analysis of different machine learning models' performance on predicting credit default

LJD0/Credit_Risk_Analysis

Credit_Risk_Analysis

Tools Used
  • VSCode 1.78
  • Python
    • pandas
    • scikit-learn
    • numpy
    • imblearn

Overview

This project predicts the risk of loan defaults using machine learning techniques. The results cover six models: four resampling techniques and two ensemble classifiers. Model performance is measured using balanced accuracy score, precision, and recall.

The purpose of this analysis is to assess different machine learning models for credit risk prediction. By evaluating the performance of each model, we can determine how effectively it identifies high-risk and low-risk loans. This analysis provides insight into the strengths and weaknesses of four sampling techniques (Random Oversampling, Cluster Centroid Undersampling, SMOTE Oversampling, and SMOTEENN Combination Sampling) and two classifier models (Balanced Random Forest Classifier and Easy Ensemble Classifier).
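To make the idea behind random oversampling concrete, here is a minimal standard-library sketch. It is not the imblearn implementation used in this project; the function name `random_oversample` and the toy data are hypothetical and exist only for illustration. The sketch duplicates randomly chosen minority-class rows until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (illustrative helper only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Draw with replacement to fill the gap up to the target size.
        for _ in range(target - n):
            j = rng.choice(idx)
            out_rows.append(rows[j])
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (hypothetical): 2 high-risk loans among 8 low-risk loans.
rows = [[i] for i in range(10)]
labels = ["high_risk"] * 2 + ["low_risk"] * 8

new_rows, new_labels = random_oversample(rows, labels)
balanced_counts = Counter(new_labels)
print(balanced_counts)
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class imbalance.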

We will look at metrics including the balanced accuracy score, precision, and recall. These metrics will allow us to make an informed decision about which model is best suited to credit risk analysis.
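As a rough illustration of how these metrics relate to a confusion matrix, the sketch below recomputes them from the Random Over Sampler matrix reported in the Results section. It assumes the matrix rows are the actual classes (high risk, then low risk) and the columns are the predicted classes in the same order, which is consistent with the reported recall values. The function name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(matrix):
    """Per-class precision/recall and balanced accuracy from a 2x2 matrix.

    Assumed layout: rows = actual [high risk, low risk],
    columns = predicted [high risk, low risk].
    """
    (tp, fn), (fp, tn) = matrix
    recall_high = tp / (tp + fn)      # share of high-risk loans caught
    recall_low = tn / (tn + fp)       # share of low-risk loans cleared
    precision_high = tp / (tp + fp)   # share of high-risk flags that are right
    precision_low = tn / (tn + fn)    # share of low-risk calls that are right
    return {
        # Balanced accuracy is the mean of the per-class recalls.
        "balanced_accuracy": (recall_high + recall_low) / 2,
        "recall_high": recall_high,
        "recall_low": recall_low,
        "precision_high": precision_high,
        "precision_low": precision_low,
    }


# Random Over Sampler confusion matrix from the Results section.
ros = metrics_from_confusion([[72, 29], [6582, 10522]])
for name, value in ros.items():
    print(f"{name}: {value:.4f}")
```

Because balanced accuracy averages the two recalls, it is not inflated by the much larger low-risk class the way plain accuracy would be.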

Results

| Metric | Random Over Sampler | SMOTE Oversampling | Cluster Centroids | SMOTEENN Sampling | Balanced Random Forest | Easy Ensemble Classifier |
|---|---|---|---|---|---|---|
| Balanced Accuracy Score | 0.6640 | 0.6556 | 0.5455 | 0.6424 | 0.7885 | 0.9317 |
| Confusion Matrix (True/False) | [[72, 29], [6582, 10522]] | [[64, 37], [5514, 11590]] | [[67, 34], [9791, 7313]] | [[71, 30], [7154, 9950]] | [[71, 30], [2153, 14951]] | [[93, 8], [983, 16121]] |
| Precision (average) | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| Precision (high risk) | 0.01 | 0.01 | 0.01 | 0.01 | 0.03 | 0.09 |
| Precision (low risk) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Recall (average) | 0.62 | 0.68 | 0.42 | 0.58 | 0.87 | 0.94 |
| Recall (high risk) | 0.71 | 0.63 | 0.66 | 0.70 | 0.70 | 0.92 |
| Recall (low risk) | 0.62 | 0.68 | 0.43 | 0.58 | 0.87 | 0.94 |
| F1 Score (average) | 0.76 | 0.80 | 0.59 | 0.73 | 0.93 | 0.97 |
| F1 Score (high risk) | 0.02 | 0.02 | 0.01 | 0.02 | 0.06 | 0.16 |
| F1 Score (low risk) | 0.76 | 0.81 | 0.60 | 0.73 | 0.93 | 0.97 |

Summary

Performance varies widely between the models. The Easy Ensemble Classifier and the Balanced Random Forest Classifier are the best-performing models, with balanced accuracy scores of 0.9317 and 0.7885 respectively. Both have average F1 scores above 0.9 (0.97 and 0.93), which indicates a better balance of precision and recall overall. These averages are dominated by the low-risk class, however: the high-risk F1 scores are only 0.16 and 0.06.
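The gap between average and per-class F1 can be checked directly from the Easy Ensemble Classifier confusion matrix in the Results section. The sketch below assumes rows are the actual classes (high risk, then low risk), columns are the predicted classes, and that the reported "average" is weighted by class support, an assumption that matches the reported values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Easy Ensemble Classifier confusion matrix from the Results section.
# Assumed layout: rows = actual [high risk, low risk],
# columns = predicted [high risk, low risk].
(tp, fn), (fp, tn) = [[93, 8], [983, 16121]]

f1_high = f1(tp / (tp + fp), tp / (tp + fn))
f1_low = f1(tn / (tn + fn), tn / (tn + fp))

# Support = number of actual loans in each class.
n_high, n_low = tp + fn, fp + tn
f1_weighted = (n_high * f1_high + n_low * f1_low) / (n_high + n_low)

print(f"high-risk F1: {f1_high:.2f}")
print(f"low-risk F1:  {f1_low:.2f}")
print(f"weighted F1:  {f1_weighted:.2f}")
```

With only 101 high-risk loans against 17,104 low-risk loans, the weighted average is almost entirely determined by the low-risk class, so the high-risk row is the more informative one when comparing models.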

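The four resampling approaches (Random Oversampling, Cluster Centroid Undersampling, SMOTE Oversampling, and SMOTEENN Combination Sampling) have lower balanced accuracy scores (0.5455 to 0.6640) and lower average recall (0.42 to 0.68). Their high-risk precision is 0.01 in every case, and Cluster Centroid Undersampling is the weakest of the group.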

Overall, the Easy Ensemble Classifier is best able to predict credit risk. By combining multiple weak learners, it outperformed the other models on balanced accuracy, recall, and F1 score, and it had the highest high-risk precision (0.09). It caught 93 of the 101 high-risk loans while misclassifying 983 of the 17,104 low-risk loans, so its high-risk flags still include many false positives. Even so, it produced the most balanced predictions of the six models, which makes it the best recommendation for our credit risk assessment tasks.