
LTA Mobility Sensing Project for Land Transport Master Plan (LTMP) 2040


Table of Contents 📑

  • Project Objective
  • Project Outline
    • Part 1
    • Part 2
    • Part 3
  • Code and Resources Used

Project Objective

Location information about commuter activities is vital for planning for travel disruptions and for infrastructure development. The Mobility Sensing Project aims to find novel ways to identify travel patterns from GPS data and other multi-sensor data collected on smartphones. This will be transformative in providing personalised travel information.

Project Outline

Part 1


  • Retrieve data from the AWS S3 bucket (see the sketch below)
  • Download to the local machine as a CSV file
  • Assisted in data collection through a trial mobility sensing application created by LTA's software engineer
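A minimal sketch of the retrieval step, assuming boto3 with the usual AWS credential setup; the bucket name, object key and local file name below are placeholders, not the project's actual values:

```python
import boto3
import pandas as pd

# Placeholder bucket/key/file names - not the project's actual values.
BUCKET = "example-mobility-sensing-bucket"
KEY = "exports/sensor_log_001.csv"
LOCAL_PATH = "sensor_log_001.csv"

s3 = boto3.client("s3")  # credentials come from the standard AWS config/env

# Download the object to the local machine as a CSV file ...
s3.download_file(BUCKET, KEY, LOCAL_PATH)

# ... and take a quick look to confirm the export is readable.
df = pd.read_csv(LOCAL_PATH)
print(df.shape)
print(df.head())
```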

Part 2


  • Load and merge the 3 files into Python
  • Perform an initial cleaning phase, such as removing duplicates and converting epoch timestamps to date and time (see the sketch below)
  • Selected raw features: Accelerometer, Gyroscope, Magnetometer
  • Excluded sensors:
    • Barometer (not all users have a barometer sensor in their phones)
    • GPS location (uses too much battery power and is sampled at a vastly different frequency)
    • Light sensor (not useful based on previous literature; readings are affected when the phone is in a bag or pocket, etc.)
    • Pedometer (only captures walking; unable to differentiate between different vehicles)
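A minimal sketch of the load-and-clean step, assuming three CSV exports and a millisecond epoch column named `timestamp` (both names are assumptions, not confirmed by the project):

```python
import pandas as pd

# Placeholder file names for the three exports pulled from S3.
files = ["sensor_log_001.csv", "sensor_log_002.csv", "sensor_log_003.csv"]

# Load and concatenate the three files into a single frame.
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Initial cleaning: drop exact duplicates, then convert the epoch timestamp
# (assumed here to be in milliseconds) into a human-readable datetime.
df = df.drop_duplicates()
df["datetime"] = pd.to_datetime(df["timestamp"], unit="ms")
```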

Part 3

Statistical Feature-Based Approach

Data Exploration
  • Countplot
  • Boxplots
  • Density Plots
  • Pair plot
  • Aggregation Plots
    • Mean Plots
    • Median Plots
    • Time Series Plots

Data Denoising
  • Kalman Filter
    • Kalman filtering is an algorithm that provides estimates of unknown variables given the measurements observed over time

    • The Kalman filter algorithm consists of two stages: prediction and update (a minimal sketch follows below)
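A minimal sketch of the two stages on a single sensor channel, using a simple 1-D constant-value Kalman filter; the project's actual script (by Guanlan) may differ:

```python
import numpy as np

def kalman_1d(measurements, process_var=1e-3, meas_var=1e-1):
    """Smooth one sensor channel with a 1-D constant-value Kalman filter."""
    x = measurements[0]   # initial state estimate
    p = 1.0               # initial estimate uncertainty
    smoothed = []
    for z in measurements:
        # Prediction stage: the state is assumed constant, so only the
        # uncertainty grows by the process variance.
        p = p + process_var
        # Update stage: blend the prediction with the new measurement z
        # using the Kalman gain.
        k = p / (p + meas_var)
        x = x + k * (z - x)
        p = (1 - k) * p
        smoothed.append(x)
    return np.array(smoothed)

# Example usage on one accelerometer axis (column name is a placeholder):
# df["acc_x_smooth"] = kalman_1d(df["acc_x"].to_numpy())
```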

Data Preparation/Scaling
  • Standardization (Z-score Normalization)
    • Standardize the training set using the training-set means and standard deviations to prevent data leakage
      • Subtract the mean, divide by the standard deviation
      • Necessary to normalize the data before performing PCA
      • General principle: anything you learn must be learned from the model's training data
  • Apply Principal Component Analysis (PCA) on the scaled dataset (see the sketch at the end of this section)
    • Instead of choosing the number of components manually, we set the fraction of the input variance that the generated components should explain
      • Typically, we want the explained variance to be between 95–99%. We will use 95% here.

      • As usual, to prevent data leakage, we fit PCA on the training set and then transform the test set using the already-fitted PCA.

      • Cumulative Summation of the Explained Variance

      • Most important feature of each principal component

      • Number of times each sensor type appeared as the most important feature of a principal component

        | Sensor Type   | Count |
        | ------------- | ----- |
        | Accelerometer | 15    |
        | Gyroscope     | 21    |
        | Magnetometer  | 12    |

      • Number of times each statistical type appeared as the most important feature of a principal component

        | Statistical Type | Count |
        | ---------------- | ----- |
        | Mean             | 5     |
        | Variance         | 8     |
        | Skewness         | 11    |
        | Kurtosis         | 13    |
        | Others           | 11    |
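A minimal sketch of the scaling and PCA steps described above, assuming `X_train` and `X_test` are the extracted feature matrices (names are illustrative):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Fit the scaler on the training data only, then reuse it on the test data,
# so no information from the test set leaks into the preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Passing a float to n_components asks sklearn to keep just enough
# components to explain 95% of the variance of the scaled training data.
pca = PCA(n_components=0.95).fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(pca.n_components_, pca.explained_variance_ratio_.cumsum()[-1])
```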
Feature Extraction
  • Main idea: transform raw signal patterns into features that serve as a compressed representation
  • For each time-series variable, key statistical features will be extracted to measure different properties of that variable
  • Main groups of the calculated statistical measures (non-exhaustive):
    • Measures of Central Tendency (Mean, Median)
    • Measures of Variability (Variance, Standard Deviation, IQR, Range)
    • Measures of Shape (Skewness, Kurtosis)
    • Measures of Position (Percentiles)
    • Measures of Impurity (Entropy)
      • It is not ideal to use the entropy function from sklearn, as it assumes a discrete distribution of the data. Instead, a ready-made non-parametric k-nearest-neighbour entropy estimator is used.
  • Determine a moving window size over which to calculate the above features (an initial window size of 10 is chosen based on the literature); a sketch follows at the end of this section
  • Before extracting the features, a magnitude vector is calculated from each sensor to obtain 3 extra data sources
    • The formula for the magnitude of a vector generalizes to arbitrary dimensions. For example, if a = (a₁, a₂, a₃, a₄) is a four-dimensional vector, its magnitude is ‖a‖ = √(a₁² + a₂² + a₃² + a₄²).
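A minimal sketch of the windowed feature extraction, assuming tri-axial sensor columns such as `acc_x`, `acc_y`, `acc_z` (placeholder names); the k-NN entropy estimator mentioned above is a separate script and is omitted here:

```python
import numpy as np
import pandas as pd

WINDOW = 10  # initial window size from the literature

def add_magnitude(df, prefix):
    """Add the magnitude vector sqrt(x^2 + y^2 + z^2) for one sensor."""
    cols = [f"{prefix}_x", f"{prefix}_y", f"{prefix}_z"]
    df[f"{prefix}_mag"] = np.sqrt((df[cols] ** 2).sum(axis=1))
    return df

def window_features(series, window=WINDOW):
    """Rolling statistical features for a single time-series column."""
    r = series.rolling(window)
    return pd.DataFrame({
        f"{series.name}_mean": r.mean(),
        f"{series.name}_var": r.var(),
        f"{series.name}_std": r.std(),
        f"{series.name}_skew": r.skew(),
        f"{series.name}_kurt": r.kurt(),
        f"{series.name}_iqr": r.quantile(0.75) - r.quantile(0.25),
        f"{series.name}_range": r.max() - r.min(),
        f"{series.name}_p90": r.quantile(0.90),
    })

# Example usage (column names are placeholders):
# df = add_magnitude(df, "acc")
# features = window_features(df["acc_mag"])
```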
Classifier Training/Tuning/Evaluation (Before and after upsampling/downsampling techniques)
  • Train-Test Split

    • Perform a 70/30 train-test split with stratification to ensure a balanced Y distribution in both the train and test sets (see the sketch below)
    • Set a seed for reproducible results
    • To minimize data leakage when developing predictive models:
      • Perform data preparation within the training set
      • Hold back a test set for a final sanity check of the developed models
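A minimal sketch of the stratified split, with `X` and `y` standing in for the extracted features and activity labels (names and seed are illustrative):

```python
from sklearn.model_selection import train_test_split

# 70/30 stratified split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
```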
  • Trial Modeling Stage


    • Before proceeding with PCA, build a few common classifiers using sklearn first. This allows us to compare the results before and after using PCA (see the sketch after the results tables below).

    • Perform cross-validation on the training set to reduce training time

    • Results might not be as good with less data, but that does not matter for now; the goal is to gain insights into as many models as possible first

    • Keep the default settings for most models

    • Models tested: LogisticRegression, SVC, LinearSVC, KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, GaussianNB, MLPClassifier

    • Random Forest performed the best (94.866%) in terms of mean F1 score with 5-fold cross-validation, followed by the Support Vector Machine (SVM) with an RBF kernel and the MLPClassifier.

    • Insights:

      • Due to the class imbalance, accuracy is not a good metric to use here. F1 score is a more suitable performance metric.

      • Upsampling techniques may also be required later to address the class imbalance issue.

      • Summary of trial modeling results

        • Before PCA (216 features)

          | Model              | F1 Score | Standard Deviation (%) |
          | ------------------ | -------- | ---------------------- |
          | LogisticRegression | 0.91717  | 0.48                   |
          | SVC                | 0.94076  | 1.64                   |
          | LinearSVC          | 0.91584  | 0.89                   |
          | KNeighbors         | 0.92328  | 1.62                   |
          | DecisionTree       | 0.92687  | 2.91                   |
          | RandomForest       | 0.94866  | 3.25                   |
          | GaussianNB         | 0.33205  | 14.72                  |
          | MLPClassifier      | 0.93195  | 1.62                   |

        • After PCA (48 features)

          | Model              | F1 Score | Standard Deviation (%) |
          | ------------------ | -------- | ---------------------- |
          | LogisticRegression | 0.92014  | 0.72                   |
          | SVC                | 0.93772  | 1.43                   |
          | LinearSVC          | 0.91704  | 0.25                   |
          | KNeighbors         | 0.92413  | 1.58                   |
          | DecisionTree       | 0.93004  | 3.22                   |
          | RandomForest       | 0.95004  | 3.20                   |
          | GaussianNB         | 0.52965  | 20.92                  |
          | MLPClassifier      | 0.93880  | 2.56                   |
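A minimal sketch of the trial modeling loop described above, with mostly default model settings; the seeds and convergence options (`max_iter`) are illustrative adjustments, not the project's exact configuration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "LinearSVC": LinearSVC(),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "MLPClassifier": MLPClassifier(max_iter=500),
}

# 5-fold cross-validation on the training set only; F1 is used instead of
# accuracy because of the class imbalance.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    print(f"{name}: {scores.mean():.5f} (+/- {scores.std():.4f})")
```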
    • Boosting Algorithms tried

      • AdaBoost (Adaptive Boosting)


      • CatBoost


        • Hyper-parameter tuning is seldom needed when using CatBoost.
        • CatBoost, which is built on techniques such as ordered boosting and random permutations, helps ensure that the model is not overfitting. It also implements symmetric trees, which eliminates parameters like min_child_leafs. We can further tune parameters such as learning_rate, random_strength and the L2 regulariser, but the results usually do not vary much.
      • Light GBM


        • Gradient boosting framework that uses tree based learning algorithms

        • Designed to be distributed and efficient with the following advantages:

          • Faster training speed and higher efficiency
          • Lower memory usage
          • Better accuracy
          • Support of parallel and GPU learning
          • Capable of handling large-scale data
        • Insights:

          • F1 macro-average will be used in place of accuracy or F1-weighted, based on considerations for the precision-recall tradeoff

          • F1 macro-average computes the metric independently for each class and then takes the average (hence treating all classes equally). This is ideal for imbalanced datasets.

          • As LightGBM only takes around 10-15 seconds to run and the default parameters already look quite promising, it is worth tuning the hyperparameters to aim for better results. We will attempt random search first, which is more flexible and more efficient than a grid search (see the sketch at the end of this section). We will also utilize the train set and test set now.

          • If the close-to-optimal region of hyperparameters occupies at least 5% of the grid surface, then a random search with 60 trials will find that region with high probability (95%). A higher number of trials can be used if the running time is low enough, to avoid unlucky searches.

          • Best average F1 macro reached: 0.7271336837697994
            • The lower score compared to the normal F1 score used by the earlier models indicates that F1 macro is more suitable for tuning, as it recognizes the precision-recall tradeoff. Hopefully, the hyperparameters are now tuned to balance precision and recall better.

          • Optimal parameters for the best estimator
            • {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.5989147519171187, 'importance_type': 'split', 'is_unbalance': True, 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 152, 'min_child_weight': 0.01, 'min_split_gain': 0.0, 'n_estimators': 150, 'n_jobs': -1, 'num_leaves': 49, 'objective': 'binary', 'random_state': 0, 'reg_alpha': 2, 'reg_lambda': 0, 'silent': True, 'subsample': 0.8711234237005314, 'subsample_for_bin': 200000, 'subsample_freq': 0}

          • Confusion Matrix Definition

            | Error                | Definition                          |
            | -------------------- | ----------------------------------- |
            | True Positives (TP)  | Predicted Not MRT correctly         |
            | True Negatives (TN)  | Predicted MRT correctly             |
            | False Positives (FP) | Predicted Not MRT but actual is MRT |
            | False Negatives (FN) | Predicted MRT but actual is Not MRT |

          • Confusion Matrix Results

            | Error                | Value |
            | -------------------- | ----- |
            | True Positives (TP)  | 18986 |
            | True Negatives (TN)  | 752   |
            | False Positives (FP) | 377   |
            | False Negatives (FN) | 115   |
          • The model optimizes recall instead of precision. In this case, recall can be thought of as the model's ability to find all the data points of interest (MRT) in the dataset. A precision-recall tradeoff is common in many scenarios, and it often boils down to the business problem that the company wants to solve or improve on.

        • Final LightGBM with random search

          | Metric    | Score  |
          | --------- | ------ |
          | Accuracy  | 0.9778 |
          | F1 Score  | 0.9883 |
          | Precision | 0.9812 |
          | Recall    | 0.9956 |
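A minimal sketch of the random search over LGBMClassifier; the search space below is only a guess suggested by the listed optimal parameters (num_leaves, min_child_samples, subsample, ...), not the project's exact grid:

```python
from scipy.stats import randint, uniform
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space for illustration.
param_dist = {
    "n_estimators": randint(100, 400),
    "num_leaves": randint(20, 80),
    "min_child_samples": randint(20, 300),
    "colsample_bytree": uniform(0.4, 0.6),   # uniform on [0.4, 1.0]
    "subsample": uniform(0.5, 0.5),          # uniform on [0.5, 1.0]
    "reg_alpha": [0, 1, 2, 5],
    "is_unbalance": [True, False],
}

search = RandomizedSearchCV(
    LGBMClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=60,             # 60 trials, per the 5% / 95% rule of thumb above
    scoring="f1_macro",    # macro-averaged F1 to respect the class imbalance
    cv=5,
    random_state=0,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_score_)
print(search.best_params_)
```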
  • Experimentation with oversampling and undersampling techniques

    • ADASYN: Adaptive Synthetic Sampling Method for Imbalanced Data

      • ADASYN works similarly to the regular SMOTE. However, the number of samples generated for each x_i is proportional to the number of samples in a given neighborhood that are not from the same class as x_i. Therefore, more samples are generated in areas where the nearest-neighbour rule is not respected.
    • Borderline-SMOTE: Over-Sampling Method in Imbalanced Data Sets

      • Only the minority examples near the borderline are over-sampled, unlike in regular SMOTE and ADASYN (a sketch using imblearn follows below).
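A minimal sketch of the resampling experiments with imblearn, applied to the training set only so the test set keeps the real class distribution (variable names are illustrative):

```python
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority class in the training set only.
X_train_ada, y_train_ada = ADASYN(random_state=0).fit_resample(X_train, y_train)
X_train_bsm, y_train_bsm = BorderlineSMOTE(random_state=0).fit_resample(X_train, y_train)

# Simple random undersampling of the majority class for comparison.
X_train_down, y_train_down = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
```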
    • Original vs Upsample (SMOTE) Scatterplot

      • Original data points
      • Upsampled (SMOTE) data points

    • Original vs Downsample Scatterplot

      • Original data points
      • Downsampled data points

    • LightGBM using hyperopt with the SMOTE dataset (see the sketch below)

      • fmin - the main function used to minimize the objective
      • tpe and anneal - optimization approaches
        • TPE (Tree-structured Parzen Estimator) is the default algorithm for hyperopt. It uses a Bayesian approach to optimization: at every step it tries to build a probabilistic model of the function and choose the most promising parameters for the next step.
      • hp - includes different distributions for variables
      • Trials - used for logging
      • Insights:
        • Results are slightly better than using random search
        • False positives are higher by an insignificant amount
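A minimal sketch of the hyperopt loop described above, with a small illustrative search space and the Borderline-SMOTE training set from the previous sketch (both assumptions, not the project's exact setup):

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# Illustrative search space using hp.* distributions.
space = {
    "num_leaves": hp.quniform("num_leaves", 20, 80, 1),
    "min_child_samples": hp.quniform("min_child_samples", 20, 300, 1),
    "learning_rate": hp.loguniform("learning_rate", -4, 0),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

def objective(params):
    model = LGBMClassifier(
        num_leaves=int(params["num_leaves"]),
        min_child_samples=int(params["min_child_samples"]),
        learning_rate=params["learning_rate"],
        subsample=params["subsample"],
        random_state=0,
    )
    # fmin minimizes, so return the negative macro-F1 from cross-validation
    # on the resampled training set.
    score = cross_val_score(model, X_train_bsm, y_train_bsm,
                            cv=5, scoring="f1_macro").mean()
    return {"loss": -score, "status": STATUS_OK}

trials = Trials()  # logs every evaluation
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=60, trials=trials)
print(best)
```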
  • Ensemble learning using heterogeneous algorithms


    • Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
    • Test out the ensemble using different base algorithms.
    • Ensemble 1
      • Final LightGBM with Borderline-SMOTE and Bayesian optimisation (hyperopt)
      • Random Forest
    • Ensemble vote methods (a VotingClassifier sketch follows at the end of this sub-section)
      • In hard voting (also known as majority voting), every individual classifier votes for a class, and the majority wins. In statistical terms, the predicted target label of the ensemble is the mode of the distribution of individually predicted labels.

      • In soft voting, every individual classifier provides a probability value that a specific data point belongs to a particular target class. The predictions are weighted by the classifier's importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote.

      • Comparison of F1 Macro results

        | Model                  | Score  |
        | ---------------------- | ------ |
        | lgbm_hyperopt_final    | 0.9816 |
        | Random Forest          | 0.9788 |
        | Voting_Classifier_Hard | 0.9812 |
        | Voting_Classifier_Soft | 0.9824 |
      • The ensemble that used soft voting performed the best.

      • The false positive count is still quite high, although the overall result is good.

    • Ensemble 2 (attempt to combine low-correlation models to get a higher precision score)
      • Final LightGBM with Borderline-SMOTE and Bayesian optimisation (hyperopt)
      • Random Forest
      • K-Nearest Neighbour
        • Only hard voting is used, since k-nearest neighbour does not output probabilities here.
        • Insight:
          • Adding KNeighborsClassifier to the ensemble reduced false positives only slightly, while false negatives nearly doubled.
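A minimal sketch of the two voting ensembles, with untuned placeholder estimators standing in for the tuned models described above; the resampled training set names come from the earlier sketches:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier

# Base models are placeholders for the tuned estimators described above.
lgbm = LGBMClassifier(random_state=0)
rf = RandomForestClassifier(random_state=0)
knn = KNeighborsClassifier()

# Ensemble 1: hard and soft voting over LightGBM + Random Forest.
vote_hard = VotingClassifier([("lgbm", lgbm), ("rf", rf)], voting="hard")
vote_soft = VotingClassifier([("lgbm", lgbm), ("rf", rf)], voting="soft")

# Ensemble 2: add KNN; hard voting only, as described above.
vote_knn = VotingClassifier([("lgbm", lgbm), ("rf", rf), ("knn", knn)],
                            voting="hard")

for clf in (vote_hard, vote_soft, vote_knn):
    clf.fit(X_train_bsm, y_train_bsm)
    print(clf.score(X_test, y_test))
```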
  • Creating a Stacking ensemble


    • The main idea behind the structure of a stacked generalization is to use one or more first level models, make predictions using these models and then use these predictions as features to fit one or more second level models on top. To avoid overfitting, cross-validation is usually used to predict the OOF (out-of-fold) part of the training set.
    • Train a plain XGBoost model first.
    • Define the first-level models: LightGBM, Random Forest and K-Nearest Neighbour.
    • Use the first-level models to make out-of-fold predictions (see the sketch below).
    • Fit the second-level model on the new S_train and the same Y_train_smote.
    • Results are around the same as the LightGBM model alone, probably because the other models are not highly complementary.
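A minimal sketch of the stacking scheme, using sklearn's cross_val_predict for the out-of-fold predictions rather than the vecstack helper listed in the packages (which wraps the same idea); variable names such as X_train_smote and y_train_smote mirror the ones mentioned above and are illustrative:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

first_level = [
    LGBMClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    KNeighborsClassifier(),
]

# Build S_train from out-of-fold (OOF) predictions so the second-level model
# never sees predictions made on data the first-level models were fit on.
S_train = np.column_stack([
    cross_val_predict(m, X_train_smote, y_train_smote, cv=5)
    for m in first_level
])

# Fit each first-level model on the full training set to produce S_test.
S_test = np.column_stack([
    m.fit(X_train_smote, y_train_smote).predict(X_test) for m in first_level
])

# Second-level model: XGBoost fit on the stacked predictions.
meta = XGBClassifier(random_state=0)
meta.fit(S_train, y_train_smote)
print(meta.score(S_test, y_test))
```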

Code and Resources Used

  • Database: AWS S3
  • Packages: json, io, boto3, pandas, numpy, matplotlib, seaborn, datetime, sklearn, math, scipy, catboost, lightgbm, imblearn, hyperopt, xgboost, vecstack
  • Special mention to Guanlan for the Kalman Filter script
