Goal: Develop a predictive model that analyzes children's physical activity and fitness data to identify early signs of problematic internet use.
The results were to my satisfaction: I earned my first medal and finished in the top 3% of all submissions (out of 3,600 teams).
Healthy Brain Network (HBN) dataset - consists of a clinical sample of about 3,800 children who have undergone clinical and research screenings. The dataset comes in two forms: parquet files containing the accelerometer (actigraphy) time series, and CSV files containing the remaining tabular data. The tabular data has 80 unique features divided into 11 categories, which are detailed in `data_dictionary.csv`. The repository is organized as follows:
- data
  - processed - location for processed files
    - train_processed.csv
    - test_processed.csv
  - raw - location for the raw data (the parquet time-series files and the CSVs listed below)
    - data_dictionary.csv
    - sample_submission.csv
    - test.csv
    - train.csv
- notebooks - all of the notebooks used for the project
  - Kaggle_Submission.ipynb - the notebook used for the Kaggle submission
  - Problematic_Internet_Use_HyperTuning.ipynb - hyperparameter tuning based on the processed train set
  - Problematic_Internet_Use_EDA.ipynb - the detailed EDA process
  - TimeSeries_EDA.ipynb - EDA and the transformation process for the time series
- submission.csv - the predictions on the test set
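For reference, a minimal sketch of loading the tabular files from this layout. The paths follow the tree above; the commented parquet read is illustrative only, since the exact partitioning of the actigraphy files inside `data/raw` may differ:

```python
import pandas as pd

# Tabular data, paths per the repository tree above
train = pd.read_csv("data/raw/train.csv")
test = pd.read_csv("data/raw/test.csv")
data_dictionary = pd.read_csv("data/raw/data_dictionary.csv")

# Actigraphy time series (illustrative path; the actual parquet
# files may be partitioned, e.g. per participant)
# series = pd.read_parquet("data/raw/series_train.parquet")

print(train.shape, test.shape)
```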
My EDA process is detailed in `Problematic_Internet_Use_EDA.ipynb`. The data is extremely messy: some features have over 80% missing values, some contain extreme outliers that fall outside humanly possible ranges (such as a BMI of 0), and there is a lot of multicollinearity, which can cause overfitting. Therefore, the first step of the project was to clean the data and capture valuable insights. For example, the features BMI, Height, and Weight have the strongest correlation with the target, which can later be used in feature engineering.
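As an illustration of this cleaning, here is a minimal sketch. The 80% cutoff matches the missingness mentioned above, but the column name `Physical-BMI` and the 0.9 correlation cutoff are assumptions for the example, not the exact notebook code:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("data/raw/train.csv")

# Drop features with more than 80% missing values
missing_frac = train.isna().mean()
train = train.drop(columns=missing_frac[missing_frac > 0.8].index)

# Null out physically impossible measurements, e.g. a BMI of 0
# ("Physical-BMI" is an assumed column name; check data_dictionary.csv)
train.loc[train["Physical-BMI"] <= 0, "Physical-BMI"] = np.nan

# Surface strongly correlated feature pairs (multicollinearity candidates)
corr = train.select_dtypes(include="number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))
```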
For the final solution, I used an ensemble of regression models consisting of XGBoost, LightGBM, and CatBoost. After experimenting with each model on its own, I concluded that this combination provides the most robust solution. The models' performance was measured with the Quadratic Weighted Kappa (QWK) metric, which measures the agreement between two outcomes. It typically ranges from 0 (random agreement) to 1 (complete agreement). The metric is well suited for the task because it accounts for the magnitude of the error, not just whether a prediction is wrong. I tuned the hyperparameters for each model individually using Optuna in `Problematic_Internet_Use_HyperTuning.ipynb`.
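A sketch of the ensemble idea, assuming the target takes the ordinal values 0-3; the hyperparameters shown are placeholders, not the Optuna-tuned values from the tuning notebook:

```python
import numpy as np
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import cohen_kappa_score

models = [
    XGBRegressor(n_estimators=500, learning_rate=0.05),
    LGBMRegressor(n_estimators=500, learning_rate=0.05),
    CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=0),
]

def ensemble_predict(models, X_train, y_train, X_valid):
    """Train each regressor and average their continuous predictions."""
    preds = [m.fit(X_train, y_train).predict(X_valid) for m in models]
    return np.mean(preds, axis=0)

def qwk(y_true, y_pred_continuous):
    """Round continuous predictions to classes 0-3, then score with QWK."""
    y_pred = np.clip(np.round(y_pred_continuous), 0, 3).astype(int)
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")
```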
To evaluate the model's performance, I employed a 5-fold cross-validation strategy to produce Out-of-Fold (OOF) predictions: the data is divided into five equal parts, and each part serves once as a holdout set for evaluation while the remaining four are used for training. This gave a more reliable estimate of generalization. In addition, I optimized the classification thresholds. Since the models are regressors, their continuous outputs must be cut into the ordinal classes, and the default cut points (plain rounding) often did not maximize the score under the QWK metric. Adjusting these thresholds significantly improved the model's performance. In the end, while I report both scores, the final predictions use the optimized thresholds, as they showed superior results.
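A sketch of both ideas together, assuming `X`, `y` (a feature DataFrame and the 0-3 target), and a `base_model` such as one of the regressors above already exist. Plain rounding is the baseline; the cut points are then tuned with Nelder-Mead against the OOF predictions:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.base import clone
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

# 5-fold Out-of-Fold predictions: every row is predicted by a model
# that never saw it during training.
oof = np.zeros(len(X))
for tr_idx, va_idx in StratifiedKFold(5, shuffle=True, random_state=42).split(X, y):
    m = clone(base_model)
    m.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    oof[va_idx] = m.predict(X.iloc[va_idx])

def to_classes(pred, thresholds):
    """Cut continuous predictions into ordinal classes 0-3."""
    return np.digitize(pred, np.sort(thresholds))

def neg_qwk(thresholds):
    return -cohen_kappa_score(y, to_classes(oof, thresholds), weights="quadratic")

# Plain rounding corresponds to cut points [0.5, 1.5, 2.5]; optimize
# them directly against QWK on the OOF predictions.
res = minimize(neg_qwk, x0=[0.5, 1.5, 2.5], method="Nelder-Mead")
print("QWK, default rounding:    ", -neg_qwk(np.array([0.5, 1.5, 2.5])))
print("QWK, optimized thresholds:", -res.fun)
```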