Goal: Develop a predictive model that analyzes children's physical activity and fitness data to identify early signs of problematic internet use.
The results were to my satisfaction: I earned my first medal and finished in the top 3% of all submissions (out of 3,600 teams).
Healthy Brain Network (HBN) dataset - consists of a clinical sample of about 3,800 children who have undergone clinical and research screenings. The dataset comes in two forms: parquet files containing the accelerometer (actigraphy) time series, and CSV files containing the remaining tabular data. The tabular data has 80 unique features divided into 11 categories, which are detailed in `data_dictionary.csv`. The repository is organized as follows:
- data
  - processed - location for processed files
    - train_processed.csv
    - test_processed.csv
  - raw - location for the raw data (the parquet time-series files and the CSVs listed below)
    - data_dictionary.csv
    - sample_submission.csv
    - test.csv
    - train.csv
- notebooks - all of the notebooks used for the project
  - Kaggle_Submission.ipynb - the notebook used for the Kaggle submission
  - Problematic_Internet_Use_HyperTuning.ipynb - hyperparameter tuning based on the processed train set
  - Problematic_Internet_Use_EDA.ipynb - the detailed EDA process
  - TimeSeries_EDA.ipynb - EDA and the transformation process for the time series
- submission.csv - the predictions on the test set
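For reference, a minimal sketch of loading the tabular files from this layout. The paths follow the tree above; the commented parquet read is illustrative only, since the exact partitioning of the actigraphy files inside `data/raw` may differ:

```python
import pandas as pd

# Tabular data, paths per the repository tree above
train = pd.read_csv("data/raw/train.csv")
test = pd.read_csv("data/raw/test.csv")
data_dictionary = pd.read_csv("data/raw/data_dictionary.csv")

# Actigraphy time series (illustrative path; the actual parquet
# files may be partitioned, e.g. per participant)
# series = pd.read_parquet("data/raw/series_train.parquet")

print(train.shape, test.shape)
```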
My EDA process is detailed in `Problematic_Internet_Use_EDA.ipynb`. The data is extremely messy: some features have over 80% missing values, some contain extreme outliers that fall outside humanly possible ranges (such as a BMI of 0), and there is a lot of multicollinearity, which can cause overfitting. Therefore, the first step of the project was to clean the data and capture valuable insights. For example, the features BMI, Height, and Weight have the strongest correlation with the target, which can later be used in feature engineering.
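As an illustration of this cleaning, here is a minimal sketch. The 80% cutoff matches the missingness mentioned above, but the column name `Physical-BMI` and the 0.9 correlation cutoff are assumptions for the example, not the exact notebook code:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("data/raw/train.csv")

# Drop features with more than 80% missing values
missing_frac = train.isna().mean()
train = train.drop(columns=missing_frac[missing_frac > 0.8].index)

# Null out physically impossible measurements, e.g. a BMI of 0
# ("Physical-BMI" is an assumed column name; check data_dictionary.csv)
train.loc[train["Physical-BMI"] <= 0, "Physical-BMI"] = np.nan

# Surface strongly correlated feature pairs (multicollinearity candidates)
corr = train.select_dtypes(include="number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))
```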
For the final solution, I used an ensemble of regression models consisting of XGBoost, LightGBM, and CatBoost. After experimenting with each model on its own, I concluded that this combination provides the most robust solution. The models' performance was measured with the Quadratic Weighted Kappa (QWK) metric, which measures the agreement between two outcomes. It typically ranges from 0 (random agreement) to 1 (complete agreement). The metric is well suited for the task because it accounts for the magnitude of the error, not just whether a prediction is wrong. I tuned the hyperparameters for each model individually using Optuna in `Problematic_Internet_Use_HyperTuning.ipynb`.
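A sketch of the ensemble idea, assuming the target takes the ordinal values 0-3; the hyperparameters shown are placeholders, not the Optuna-tuned values from the tuning notebook:

```python
import numpy as np
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import cohen_kappa_score

models = [
    XGBRegressor(n_estimators=500, learning_rate=0.05),
    LGBMRegressor(n_estimators=500, learning_rate=0.05),
    CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=0),
]

def ensemble_predict(models, X_train, y_train, X_valid):
    """Train each regressor and average their continuous predictions."""
    preds = [m.fit(X_train, y_train).predict(X_valid) for m in models]
    return np.mean(preds, axis=0)

def qwk(y_true, y_pred_continuous):
    """Round continuous predictions to classes 0-3, then score with QWK."""
    y_pred = np.clip(np.round(y_pred_continuous), 0, 3).astype(int)
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")
```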
To evaluate the model's performance, I employed a 5-fold cross-validation strategy to produce Out-of-Fold (OOF) predictions: the data is divided into five equal parts, and each part serves once as a holdout set for evaluation while the remaining four are used for training. This gave a more reliable estimate of generalization. In addition, I optimized the classification thresholds. Since the models are regressors, their continuous outputs must be cut into the ordinal classes, and the default cut points (plain rounding) often did not maximize the score under the QWK metric. Adjusting these thresholds significantly improved the model's performance. In the end, while I report both scores, the final predictions use the optimized thresholds, as they showed superior results.
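A sketch of both ideas together, assuming `X`, `y` (a feature DataFrame and the 0-3 target), and a `base_model` such as one of the regressors above already exist. Plain rounding is the baseline; the cut points are then tuned with Nelder-Mead against the OOF predictions:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.base import clone
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

# 5-fold Out-of-Fold predictions: every row is predicted by a model
# that never saw it during training.
oof = np.zeros(len(X))
for tr_idx, va_idx in StratifiedKFold(5, shuffle=True, random_state=42).split(X, y):
    m = clone(base_model)
    m.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    oof[va_idx] = m.predict(X.iloc[va_idx])

def to_classes(pred, thresholds):
    """Cut continuous predictions into ordinal classes 0-3."""
    return np.digitize(pred, np.sort(thresholds))

def neg_qwk(thresholds):
    return -cohen_kappa_score(y, to_classes(oof, thresholds), weights="quadratic")

# Plain rounding corresponds to cut points [0.5, 1.5, 2.5]; optimize
# them directly against QWK on the OOF predictions.
res = minimize(neg_qwk, x0=[0.5, 1.5, 2.5], method="Nelder-Mead")
print("QWK, default rounding:    ", -neg_qwk(np.array([0.5, 1.5, 2.5])))
print("QWK, optimized thresholds:", -res.fun)
```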