Group Details:
- Group 3 Members: Mitchell Lor, Frewoini Mebrahtu, Eric Johnson, Lucinda Hodgson
Project Title: Predicting Corporate Credit Ratings
Data Source: Kaggle Dataset Link
Overview:
Introduction: This project applies data analysis and machine learning to predict corporate credit ratings in support of investment decisions. Using a dataset sourced from Kaggle, the group preprocesses the data and then applies several machine learning models for predictive analysis. The models are optimized and evaluated to check the accuracy and reliability of the predicted ratings.
Project Details:
Data Acquisition and Preprocessing:
Initial Attempts:
- Deep learning techniques were explored first, but they ran into roadblocks caused by overfitting and imbalanced data.
- The overfitting was attributed to the simplicity of the data and led to poor generalization.
- The data was imbalanced, with investment-grade loans dominating, which posed challenges for deep learning.
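To put the imbalance in context, it can help to measure each class's share and the accuracy of a trivial model that always predicts the majority class. The following is a minimal sketch using only the standard library; the label names and the 80/20 split are hypothetical and are not taken from the project's dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of the dataset as a fraction in [0, 1]."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a trivial model that always predicts the most common class."""
    return max(class_distribution(labels).values())


# Hypothetical labels: 80 investment-grade rows and 20 junk rows.
labels = ["investment_grade"] * 80 + ["junk"] * 20
shares = class_distribution(labels)
baseline = majority_baseline_accuracy(labels)

print(shares)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

A model's accuracy is only informative to the extent that it exceeds this baseline, which is why per-class metrics matter on imbalanced data.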
Model Evaluation:
- Three models were developed and evaluated:
- Model 1: Loss - 0.636, Accuracy - 0.667
- Model 2: Loss - 0.490, Accuracy - 0.791
- Model 3: Loss - 0.439, Accuracy - 0.797
Random Forest Model for Credit Rating Forecasting:
- A Random Forest Classifier model was employed to forecast credit ratings based on a curated dataset.
- Data preprocessing involved loading and cleaning data, extracting essential features, and incorporating dummy variables for categorical data representation.
- The dataset was split into training and testing sets, and standard scaling was applied for consistent feature scaling.
- A Random Forest Classifier with 500 decision trees was trained on the scaled data to capture complex relationships.
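The steps above can be sketched with pandas and scikit-learn roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, and the column names, the binary target, `test_size`, `stratify`, and `random_state` are all hypothetical choices. Only the 500 trees, the dummy variables, the train/test split, and the standard scaling come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned dataset; column names are illustrative.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "current_ratio": rng.normal(1.5, 0.5, n),
    "debt_equity_ratio": rng.normal(1.0, 0.4, n),
    "sector": rng.choice(["Energy", "Tech", "Finance"], n),
})
# Synthetic target (assumption): 1 = investment grade, 0 = junk.
signal = df["current_ratio"] - df["debt_equity_ratio"] + rng.normal(0, 0.3, n)
df["investment_grade"] = (signal > 0.3).astype(int)

# Dummy variables for the categorical column.
X = pd.get_dummies(df.drop(columns="investment_grade"), columns=["sector"], dtype=float)
y = df["investment_grade"]

# Train/test split, stratified so both sets keep the same class mix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Random forest with 500 decision trees, as described above.
model = RandomForestClassifier(n_estimators=500, random_state=1)
model.fit(X_train_scaled, y_train)
test_accuracy = model.score(X_test_scaled, y_test)
print(f"test accuracy on synthetic data: {test_accuracy:.3f}")
```

Fitting the scaler on the training split alone keeps information from the test set out of the preprocessing step.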
Model Evaluation and Feature Importance Analysis:
- The model's performance was evaluated using standard metrics: a confusion matrix, an accuracy score, and a classification report.
- Additionally, a feature importance analysis was conducted to identify the significant contributors to credit rating prediction.
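As an illustration of the evaluation described above, the sketch below computes the three metrics and ranks impurity-based feature importances with scikit-learn. It assumes synthetic, imbalanced data from `make_classification`; the feature names, the 100-tree forest, and the random seeds are hypothetical and chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 80% class 0); not the project's dataset.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, weights=[0.8, 0.2], random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)           # rows: true class, columns: predicted class
acc = accuracy_score(y_test, y_pred)            # fraction of correct predictions
report = classification_report(y_test, y_pred)  # per-class precision, recall, and F1

print(cm)
print(f"accuracy: {acc:.3f}")
print(report)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

On imbalanced data, the per-class recall in the classification report is usually more telling than the overall accuracy.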
Search API to Test Model:
The machine learning results are encouraging. Experimentation with various models revealed a pattern: some models are better at predicting positive outcomes, while others are better at identifying negative outcomes. The random forest models were the top performers, with a 95% accuracy rate on the test dataset.
However, challenges arose in real-world deployment, particularly when predicting junk credit status (S&P BB+ or lower). Despite oversampling and undersampling to address the class imbalance, the models struggled to identify junk credit accurately. They did, however, consistently predict good credit status correctly.
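The junk threshold and the oversampling idea can be sketched in plain Python as follows. This is a hedged illustration, not the project's implementation: `SP_SCALE`, `is_junk`, and `random_oversample` are hypothetical helpers, the example ratings are made up, and the sketch assumes every rating is written in S&P long-term notation.

```python
import random

# S&P long-term rating scale, ordered from best to worst.
SP_SCALE = [
    "AAA", "AA+", "AA", "AA-", "A+", "A", "A-",
    "BBB+", "BBB", "BBB-", "BB+", "BB", "BB-",
    "B+", "B", "B-", "CCC+", "CCC", "CCC-", "CC", "C", "D",
]
JUNK_START = SP_SCALE.index("BB+")  # BB+ or lower counts as junk


def is_junk(rating):
    """True if the rating is BB+ or lower; raises ValueError for unknown ratings."""
    return SP_SCALE.index(rating.strip().upper()) >= JUNK_START


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Made-up example: four investment-grade ratings and two junk ratings.
ratings = ["AAA", "A-", "BBB", "BBB-", "BB+", "B"]
labels = [is_junk(r) for r in ratings]
rows = list(range(len(ratings)))  # row ids standing in for feature vectors
bal_rows, bal_labels = random_oversample(rows, labels)
print(labels)
print(bal_labels.count(True), bal_labels.count(False))
```

Oversampling is normally applied to the training split only, so that duplicated rows never appear in the data used for evaluation.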
To improve model performance, alternative methods such as k-fold cross-validation and feature engineering were explored. One notable limitation was the absence of industry sector information in our API. Sector information was available in the training and testing datasets, and model performance improved when it was used, but these features were dropped because of constraints in the API's data retrieval capabilities. Incorporating industry sector data would therefore likely improve prediction accuracy.
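For reference, k-fold cross-validation on imbalanced labels is often done in a stratified way, so that each fold keeps roughly the overall class mix. The sketch below builds stratified folds with the standard library only; `stratified_kfold_indices` is a hypothetical helper, the labels are made up, and scikit-learn's `StratifiedKFold` offers the same idea in library form.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold keeps roughly the overall class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # Shuffle within each class, then deal indices round-robin into k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves as the test set exactly once.
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Made-up labels: 40 investment-grade rows and 10 junk rows.
labels = ["investment_grade"] * 40 + ["junk"] * 10
splits = list(stratified_kfold_indices(labels, k=5))
for train_idx, test_idx in splits:
    junk_in_test = sum(labels[i] == "junk" for i in test_idx)
    print(len(train_idx), len(test_idx), junk_in_test)
```

Averaging a metric over the folds gives a steadier estimate than a single train/test split, which matters when the minority class is small.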