This repository contains the solution model for the Kaggle competition: TAU Intro2DS - Final Assignment - Spring 2023.

Best model so far: `XGBoost_model(1).pickle`
The code has been organized into multiple files for better modularity and readability. Here's a brief description of each file:

- `data_loading.py`: Contains functions to load the datasets.
- `preprocessing.py`: Includes data preprocessing steps such as handling missing values, feature engineering, and encoding categorical variables.
- `model.py`: Defines the model training and evaluation pipeline.
- `visualize.py`: Provides functions for visualizing model performance, feature importance, and data distributions.
- `submission.py`: Handles predictions and saves the submission file.
- `main.py`: The main script that orchestrates the entire process.
To run the solution model, follow these steps:

- Ensure that you have the necessary dependencies installed. You can find them listed in the `requirements.txt` file.
- Place the competition datasets (`personal_info_train.csv`, `personal_info_test.csv`, `measurements_results_train.csv`, `measurements_results_test.csv`) in the same directory as the code files.
- Run the `main.py` script.
- Data Loading: The datasets are loaded using the `load_datasets()` function from `data_loading.py`.
- Data Preprocessing: The loaded datasets are preprocessed using the `preprocess_data()` function from `preprocessing.py`. This step includes handling missing values, feature engineering, and encoding categorical variables.
- Model Training and Evaluation: The preprocessed data is used to train and evaluate the model using the `train_and_evaluate()` function from `model.py`. The model pipeline includes a column transformer for one-hot encoding categorical features and an XGBoost classifier.
- Model Visualization: The model's performance and interpretability are visualized using the `visualize()` function from `visualize.py`. The function plots the confusion matrix, ROC curve, feature importance, and other relevant visualizations.
- Prediction and Submission: The model is used to predict the test set probabilities, and the predictions are saved to a submission file using the `predict_and_save_results()` function from `submission.py`. The submission file is named `mysubmission-XGBoost(1).csv`.
- Model Persistence: The trained model is saved in a pickle file named `XGBoost_model(1).pickle` for future use.
Feel free to explore the individual files to understand the implementation details and customize the code as per your requirements.
For any questions or clarifications, please refer to the competition's Kaggle page or reach out via the Moodle platform.
The solution model is based on the XGBoost algorithm, a popular gradient boosting machine learning library. This script uses healthcare data, including personal and health measurements, to train and test the model.
The data is loaded using the `load_datasets()` function from `data_loading.py`. It comes from two separate datasets: `personal_info_train.csv` (and its test counterpart) and `measurements_results_train.csv` (and its test counterpart). These datasets are merged based on the 'patient_id' column to create unified datasets for both training and testing. The personal information dataset contains demographic and personal details of the patients, while the measurements dataset holds the results of various tests and measurements.
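The loading-and-merging step can be sketched with pandas as follows. This is a simplified stand-in for the repository's actual `load_datasets()` function, not its exact implementation; only the file names and the 'patient_id' merge key come from the README:

```python
import pandas as pd

def load_and_merge(personal_path, measurements_path):
    """Load the two CSVs and merge them on 'patient_id' (simplified sketch)."""
    personal = pd.read_csv(personal_path)
    measurements = pd.read_csv(measurements_path)
    # An inner merge keeps only patients present in both files.
    return personal.merge(measurements, on="patient_id", how="inner")
```

The same helper would be called once for the train pair and once for the test pair of files.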
Extensive data preprocessing is performed using the `preprocess_data()` function from `preprocessing.py` to prepare the datasets for modeling. The following steps are carried out:
- Data merging: The `personal_info_train` and `measurements_train` datasets are merged based on the 'patient_id' column to create the `train` dataset. Similarly, the `personal_info_test` and `measurements_test` datasets are merged to create the `test` dataset.
- Removing duplicates: Duplicate records in the `train` dataset are identified and removed based on the 'patient_id' column using the `drop_duplicates()` function.
- Dropping unnecessary columns: The 'country' and 'region' columns are dropped from both the `train` and `test` datasets using the `drop()` function.
- Mapping gender to numeric values: The 'gender' column in both the `train` and `test` datasets is mapped to numeric values using a mapping dictionary.
- Fixing height and weight outliers: Outliers in the 'height' and 'weight' columns are addressed by applying specific transformations. Heights below 10 are multiplied by 100, and weights above 200 are divided by 1000.
- Calculating BMI and filling missing values: BMI (Body Mass Index) is calculated for records with valid height and weight values. Missing BMI values are imputed using the median BMI value calculated from the available data.
- Handling missing values and adding flags: Certain columns ('test_2', 'test_6', 'test_8', 'test_10', 'test_12', 'test_15') are checked for missing values. New columns with '_flag' suffixes are added to indicate whether a value is missing or not. Missing values are imputed with the median value of each respective column.
- Calculating average steps and filling missing values: The columns 'steps_day_1', 'steps_day_3', 'steps_day_4', and 'steps_day_5' are used to calculate the average number of steps for each record. Missing values in the 'steps_day_2' column are filled with the calculated average.
- Filling missing categorical columns: Missing values in the categorical columns ('HMO', 'city', 'employment') are filled with the string 'NaN' using the `fillna()` function.
- Converting dates and extracting features: Date columns ('created_at' and 'birth_date') are converted to the datetime format using the `pd.to_datetime()` function. From the 'created_at' column, the year and fractional month values are extracted and stored in the 'created_year' column. From the 'birth_date' column, the age in years and fractional years is calculated relative to the current date. The original date columns are dropped.
- Separating the target variable: The 'label' column is separated from the `train` dataset and stored as the `target` variable. The 'patient_id' column is also dropped from the `train` dataset.
- Storing patient IDs for submission: The 'patient_id' column in the `test` dataset is stored separately as the `test_ids` variable. The 'patient_id' column is dropped from the `test` dataset.
- Encoding categorical columns: Categorical columns ('employment', 'HMO', 'city') in the `train` dataset are encoded using `LabelEncoder` from the `sklearn.preprocessing` module. The encoded values are stored back into the respective columns. The same label encoders are applied to the `test` dataset, with exception handling in case any encoding issues occur.
- Printing processing completion message: A message is printed to indicate that the data preprocessing steps have been completed.
- Returning preprocessed data: The preprocessed `train` and `test` datasets, the `target` variable, the `test_ids` variable, and the list of categorical columns (`cat_cols`) are returned as the output of the `preprocess_data()` function.
Please note that these explanations provide a high-level overview of the preprocessing steps. For more detailed information, please refer to the code comments and the specific preprocessing functions used in the code. The preprocessed datasets are saved to 'final_procc.csv' for further use.
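A few of the numeric cleaning steps above can be sketched in pandas. The thresholds and the '_flag' convention come from the description; the function itself is an illustrative condensation, not the repository's `preprocess_data()`, and only two of the flagged test columns are shown for brevity:

```python
import pandas as pd

def clean_numeric(df):
    """Fix height/weight outliers, derive BMI, flag and impute missing tests (sketch)."""
    df = df.copy()
    # Heights recorded in metres (< 10) are converted to centimetres.
    df.loc[df["height"] < 10, "height"] *= 100
    # Weights recorded in grams (> 200) are converted to kilograms.
    df.loc[df["weight"] > 200, "weight"] /= 1000
    # BMI = weight (kg) / height (m) squared; missing values get the median BMI.
    df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
    df["bmi"] = df["bmi"].fillna(df["bmi"].median())
    # Add a missing-value flag column, then impute with the column median.
    for col in ["test_2", "test_6"]:
        df[col + "_flag"] = df[col].isna().astype(int)
        df[col] = df[col].fillna(df[col].median())
    return df
```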
The solution model employs a pipeline defined in `model.py` that consists of a column transformer for one-hot encoding categorical features and an XGBoost classifier. The training data is split into a training set and a validation set.

The model is trained and evaluated using the `train_and_evaluate()` function from `model.py`. The function returns the trained model pipeline (`model_pipeline`) along with the training and validation ROC AUC scores, validation features, labels, and predictions.
The `train_and_evaluate` function in the code performs the following steps:
- Column Transformation and Preprocessing: It creates a column transformer, `preprocessor`, using `make_column_transformer` from `sklearn.compose`. The column transformer applies one-hot encoding to the categorical columns specified in `categorical_columns` and leaves the remaining columns unchanged.
- Model Pipeline Creation: It creates a pipeline, `model_pipeline`, using `make_pipeline` from `sklearn.pipeline`. The pipeline consists of the `preprocessor` and an `XGBClassifier` from `xgboost`. The XGBoost classifier is configured with specific hyperparameters.
- Data Splitting: It splits the `training_data` into training and validation sets using `train_test_split` from `sklearn.model_selection`. The validation set size is set to 20% of the training data.
- Model Fitting: It fits the `model_pipeline` to the training data and labels using the `fit` method. The fitted model is then ready for making predictions.
- Prediction and Evaluation: It predicts the probabilities for the training and validation sets using the `predict_proba` method. These probabilities are used to calculate the ROC AUC (Area Under the Receiver Operating Characteristic Curve) scores for both the training and validation sets using `roc_auc_score` from `sklearn.metrics`. Additionally, it predicts the labels for the validation set using the `predict` method.
- Results Return: It returns the `model_pipeline`, train and validation AUC scores, the validation features, labels, and predictions.
- Printing Messages: It prints messages to indicate the progress of fitting the model and the successful fitting of the model.
The function encapsulates the training and evaluation process of an XGBoost-based model. It preprocesses the data, creates a pipeline with the XGBoost classifier, fits the model, and returns the necessary information for further analysis and evaluation.
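The steps above can be sketched as follows. Note two deliberate simplifications: the repository uses `xgboost.XGBClassifier` with tuned hyperparameters, but this sketch substitutes scikit-learn's `GradientBoostingClassifier` so it runs without xgboost installed, and the `random_state` value is illustrative:

```python
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for xgboost.XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

def train_and_evaluate_sketch(training_data, target, categorical_columns):
    # One-hot encode the categorical columns; pass the rest through unchanged.
    preprocessor = make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
        remainder="passthrough",
    )
    model_pipeline = make_pipeline(preprocessor, GradientBoostingClassifier())
    # Hold out 20% of the training data for validation.
    X_train, X_val, y_train, y_val = train_test_split(
        training_data, target, test_size=0.2, random_state=42
    )
    model_pipeline.fit(X_train, y_train)
    # ROC AUC on both splits, plus hard labels for the validation set.
    train_auc = roc_auc_score(y_train, model_pipeline.predict_proba(X_train)[:, 1])
    val_auc = roc_auc_score(y_val, model_pipeline.predict_proba(X_val)[:, 1])
    val_predictions = model_pipeline.predict(X_val)
    return model_pipeline, train_auc, val_auc, X_val, y_val, val_predictions
```

Comparing `train_auc` against `val_auc` is the quickest overfitting check this design affords: a large gap suggests the classifier's hyperparameters need more regularization.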
The `visualize()` function from `visualize.py` is used to generate visualizations of the model's performance. This function plots the confusion matrix, ROC curve, feature importance, and other relevant visualizations based on the trained model pipeline (`model_pipeline`), validation features, labels, and predictions.

The `visualize` function in the code performs the following visualization steps:
- Confusion Matrix Visualization: It plots the confusion matrix using `ConfusionMatrixDisplay` from `sklearn.metrics`. The confusion matrix is based on the true labels (`validation_labels`) and predicted labels (`validation_predictions`). The plot provides insights into the performance of the model in terms of true positive, true negative, false positive, and false negative predictions.
- Decision Tree Visualization: It plots the first decision tree of the model using `plot_tree` from `xgboost`. The decision tree visualizes the hierarchical structure of the model's decision-making process.
- Correlation Matrix Visualization: It computes a correlation matrix, `corr_matrix`, from `data_df.corr()`. The correlation matrix represents the pairwise correlation between different features in the dataset. It is plotted using `sns.heatmap` from `seaborn` to visualize the strength and direction of the correlations.
- Feature Importance Visualization: It creates a horizontal bar plot to visualize the feature importance of the model. The importance of each feature is obtained from the `feature_importances_` attribute of the `xgbclassifier` step in the `model_pipeline`. The features and their importance values are sorted and plotted using `plt.barh`. The plot helps identify the most influential features in the model.
- Scaling Features: It performs feature scaling using `StandardScaler` from `sklearn.preprocessing`. The numerical features of `data_df` are scaled using `scaler.fit_transform` and stored in `scaled_features`. Scaling ensures that features are on a similar scale, which can be beneficial for certain machine learning algorithms.
- Skewness and Kurtosis Calculation: It calculates and prints the skewness and kurtosis values for each numerical feature in `data_df`. Skewness measures the asymmetry of the data distribution, while kurtosis measures the thickness of the tails. These statistics provide insights into the distribution characteristics of the numerical features.
The function combines different visualization techniques to gain insights into the model's performance, the importance of features, correlations among features, and the distribution of numerical features.
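The skewness/kurtosis step can be reproduced with pandas alone; this small sketch (not the repository's exact code) returns the statistics per numerical column instead of printing them:

```python
import pandas as pd

def distribution_stats(data_df):
    """Return skewness and kurtosis for each numerical column (sketch)."""
    numeric = data_df.select_dtypes(include="number")
    # pandas uses Fisher's definition of kurtosis, so a normal distribution scores ~0.
    return pd.DataFrame({"skew": numeric.skew(), "kurtosis": numeric.kurt()})
```

Strongly skewed features surfaced this way are natural candidates for a log or power transform before modeling.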
The `predict_and_save_results()` function from `submission.py` is utilized to make predictions on the test data using the trained model pipeline (`model_pipeline`). The function saves the predicted probabilities, training and validation ROC AUC scores, and the test data along with the corresponding IDs to a submission file (`mysubmission-XGBoost(1).csv`).

The `predict_and_save_results` function in the code performs the following steps:
- Making Predictions: It uses the `model_pipeline` to predict probabilities for the `test_data` using `predict_proba`. The predicted probabilities are stored in `test_predictions_proba`.
- Creating Submission DataFrame: It creates a submission dataframe (`submission_df`) containing the `test_ids` and the predicted probabilities (`test_predictions_proba`).
- Saving Submission: It saves the submission dataframe to a CSV file in the "Submission Files" directory. The file name includes a timestamp to differentiate submissions.
- Saving Model: It saves the trained `model_pipeline` to a pickle file in the "Models" directory. The file name also includes a timestamp.
- Calculating Metrics: It calculates various evaluation metrics based on the `validation_labels` (true labels) and `validation_predictions` (predicted labels). The metrics include accuracy, precision, recall, F1-score, log loss, Matthews correlation coefficient (MCC), balanced accuracy, and the elements of the confusion matrix (true positive, false positive, false negative, true negative).
- Logging Metrics: It creates a metrics dictionary containing the calculated metrics, along with other model parameters and information such as the timestamp, model name, learning rate, number of estimators, maximum depth, minimum child weight, gamma, subsample, colsample_bytree, reg_lambda, reg_alpha, and scale_pos_weight.
- Saving Metrics: It saves the metrics dictionary to a CSV file named "model_metrics.csv". If the file already exists, it appends the new metrics to the existing file. The file contains information about multiple model runs.
Overall, the function predicts the probabilities for the test data, saves the submission, saves the trained model, calculates evaluation metrics, and logs the metrics for further analysis and comparison.
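The saving side of this workflow can be sketched as below. The directory names come from the description above, but the exact file-name pattern and column names are illustrative assumptions, not the repository's verbatim code:

```python
import os
import pickle
from datetime import datetime

import pandas as pd

def save_submission_and_model(test_ids, test_predictions_proba, model_pipeline,
                              out_dir="Submission Files", model_dir="Models"):
    """Save a timestamped submission CSV and a pickled model (simplified sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(model_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # The submission pairs each patient ID with its predicted probability.
    submission_df = pd.DataFrame({"patient_id": test_ids,
                                  "prediction": test_predictions_proba})
    csv_path = os.path.join(out_dir, f"mysubmission-XGBoost-{stamp}.csv")
    submission_df.to_csv(csv_path, index=False)
    # Persist the fitted pipeline so it can be reloaded without retraining.
    model_path = os.path.join(model_dir, f"XGBoost_model-{stamp}.pickle")
    with open(model_path, "wb") as f:
        pickle.dump(model_pipeline, f)
    return csv_path, model_path
```

Timestamped file names keep every run's artifacts side by side, which is what makes the appended `model_metrics.csv` log useful for comparing runs afterwards.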
The trained model is also saved in a pickle file named `XGBoost_model(1).pickle` for future use.
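Reloading that pickle in a later session looks like this; the dictionary below is a placeholder standing in for the fitted pipeline object:

```python
import pickle

# Save: done once after training (the dict is a stand-in for the fitted pipeline).
with open("XGBoost_model(1).pickle", "wb") as f:
    pickle.dump({"model": "fitted-pipeline-placeholder"}, f)

# Load: restore the trained pipeline, e.g. to call predict_proba on new data.
with open("XGBoost_model(1).pickle", "rb") as f:
    model = pickle.load(f)
```

Note that unpickling a model requires the same library versions (xgboost, scikit-learn) that produced it, so pin them via `requirements.txt`.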