Nanodegree Machine Learning Engineer

Final project proposal

Leticia Portella July 21st, 2018

Proposal

Subject History and problem description

Brazilian roads are very dangerous and have been responsible for many deaths over the years. Responsibility for the roads is split among local, regional, and national police. The National Road Police publishes open data on all accidents that happened on federal roads, along with their main characteristics.

These are some news stories about road accidents in Brazil:

Brazil ranks fifth in the world in traffic deaths, according to the WHO
Traffic in Brazil kills 47 thousand people a year and leaves 400 thousand with some lasting injury
Traffic accidents cost R$ 19 billion a year, and Brazil remains far from its target

The main idea of the project is to estimate what kind of victims an accident can cause, based on road characteristics and weather conditions at the time of the accident. If the type of victims can be predicted, we can identify the most dangerous roads and weather conditions.

A similar analysis can be found here.

Dataset

The dataset was collected from the National Road Police website. I will use the datasets from 2016 and 2017, which have the following variables:

id - accident identification
data_inversa - accident date
dia_semana - weekday of the accident
horario - time of the accident
uf - state of the road
br - federal road number
km - kilometer of the road where the accident took place
municipio - county where the accident took place
causa_acidente - accident cause
tipo_acidente - kind of accident
classificacao_acidente - if the accident had victims (with injured victims, with deaths, without victims)
fase_dia - time of day the accident happened (day, night, dawn, dusk)
sentido_via - road direction on the point where the accident happened (ascending, descending, not specified)
condicao_metereologica - meteorological condition
tipo_pista - kind of road (single, double, multiple)
tracado_via - road layout (tunnel, curve, straight, etc.)
uso_solo - whether the land around the road is in use (yes/no)
pessoas - number of people involved
mortos - number of dead people
feridos_leves - number of people with minor injuries
feridos_graves - number of people with severe injuries
ilesos - number of unharmed people
ignorados - number of people involved for whom there is no injury information
feridos - total number of injured people
veiculos - number of vehicles
latitude
longitude
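
The two yearly files could be combined into one table along these lines. This is only a sketch: the inline rows are made up, only a handful of the columns above are included, and the `;` separator and the helper name `load_years` are assumptions to check against the real files.

```python
import io

import pandas as pd

# Two tiny made-up extracts standing in for the 2016 and 2017 files.
# The separator and the row values are assumptions, not real data.
csv_2016 = io.StringIO(
    "id;data_inversa;uf;br;mortos\n"
    "1;2016-01-01;SP;116;0\n"
    "2;2016-01-02;RJ;101;1\n"
)
csv_2017 = io.StringIO(
    "id;data_inversa;uf;br;mortos\n"
    "3;2017-01-01;MG;381;0\n"
)


def load_years(sources, sep=";"):
    """Read one CSV per year and stack them into a single DataFrame."""
    frames = [
        pd.read_csv(src, sep=sep, parse_dates=["data_inversa"])
        for src in sources
    ]
    # ignore_index gives the combined table a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


accidents = load_years([csv_2016, csv_2017])
```

With real files, the `io.StringIO` objects would be replaced by the paths of the downloaded CSVs.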

Solution description

The target variable (classificacao_acidente) is categorical, so it will be encoded as the following integers:

Class 0 - Car accidents with no victims
Class 1 - Car accidents with injured victims
Class 2 - Car accidents with fatal victims

The model must therefore be a supervised classification algorithm that predicts which class an accident is most likely to fall into, given the road and weather conditions.
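
The class encoding above might be applied with a plain dictionary lookup. In this sketch the Portuguese label spellings in `CLASS_MAP` are assumptions, and the rows are made up; both should be checked against the real CSVs.

```python
import pandas as pd

# Assumed raw labels (hypothetical spellings; verify against the real data).
CLASS_MAP = {
    "Sem Vítimas": 0,          # no victims
    "Com Vítimas Feridas": 1,  # injured victims
    "Com Vítimas Fatais": 2,   # deaths
}

# Made-up rows for illustration only.
df = pd.DataFrame({"classificacao_acidente": [
    "Sem Vítimas", "Com Vítimas Fatais", "Com Vítimas Feridas", "Sem Vítimas",
]})

# Strip stray whitespace before mapping; labels missing from CLASS_MAP
# become NaN, which makes unexpected categories easy to spot.
df["target"] = df["classificacao_acidente"].str.strip().map(CLASS_MAP)
unmapped = int(df["target"].isna().sum())
```

Counting `unmapped` before training is a cheap guard against a label that is spelled differently in one of the yearly files.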

Benchmark

Chong, Abraham & Paprzycki (2005) did a similar study, in which they used neural networks to classify the degree of injury of victims in traffic accidents. Their study had 5 injury classes: no injury, possible injury, non-incapacitating injury, incapacitating injury, and fatal injury. The features were based on both road and driver characteristics. The road characteristics were similar to those found in the National Road Police datasets. The authors reported an accuracy of about 60%.

I will try the following models:

Logistic Regression - used as a baseline model against which all others are compared
Gaussian Naive Bayes - chosen for its high performance; we probably won't need to worry about feature interactions
Random Forest Classifier - powerful and hard to overfit, but not so fast on large datasets

The baseline model will be built first and used as a point of comparison for the other two models. I intend to use GridSearchCV to find the best hyperparameters for each model.
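
A minimal sketch of that comparison loop is shown here. The data is synthetic (`make_classification`), and the hyperparameter grids and the `recall_macro` scoring choice are illustrative assumptions, not decisions made in this proposal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data standing in for the encoded accident features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grids only; the values are assumptions, not tuned choices.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "gaussian_nb": (GaussianNB(), {"var_smoothing": [1e-9, 1e-7]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [None, 5]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 3-fold cross-validated search over each model's grid.
    search = GridSearchCV(estimator, grid, scoring="recall_macro", cv=3)
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_recall_macro": search.best_score_,
        # score() reuses the scoring passed to GridSearchCV (macro recall).
        "test_recall_macro": search.score(X_test, y_test),
    }
```

Keeping the logistic regression in the same loop means the baseline and the other two models are scored in exactly the same way.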

Since most of the features are categorical, I will use the Pandas get_dummies function to one-hot encode them. If the resulting dataset is too large for my computer to process, I'll use PCA to reduce its dimensionality.
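
The encoding and the optional reduction step could look roughly like this. The rows are made up, the category values are the English translations used in the dataset section rather than the real labels, and the 95% variance threshold is an assumption.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Made-up rows using three of the categorical columns from the dataset.
features = pd.DataFrame({
    "fase_dia": ["day", "night", "day", "dawn", "night", "dusk"],
    "tipo_pista": ["single", "double", "single", "multiple", "double", "single"],
    "tracado_via": ["straight", "curve", "straight", "curve", "tunnel", "straight"],
})

# One indicator column per category value (4 + 3 + 3 = 10 columns here).
encoded = pd.get_dummies(features, dtype=float)

# Keep as many components as needed to explain at least 95% of the variance.
# A fractional n_components requires the full SVD solver.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(encoded)
```

The number of retained components is available afterwards as `pca.n_components_`.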

I do not expect to find outliers, since both the target and most of the features are categorical, so outlier treatment should not be necessary.

Evaluation metrics

I intend to use a confusion matrix, as explained in this post, as well as F1 score, recall, and precision. I would prioritize recall, since a false negative is worse than a false positive. If an accident is predicted to cause injuries when it actually caused none, the road would be considered dangerous, which is not a bad outcome. A false negative, however, means that a potentially dangerous road was not flagged as dangerous.
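
These metrics can be computed per class with scikit-learn, as in the sketch here. The labels are made up purely to show the shape of the output.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up labels: 0 = no victims, 1 = injured victims, 2 = deaths.
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0, 2, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# One value per class, in the order given by `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0
)
```

The entries below the diagonal of `cm` are accidents predicted as less severe than they really were, which are the false negatives this section is most concerned about.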

Project design

The project will be organized as follows, based on cookiecutter-data-science:

A data folder containing the CSV files
A notebooks folder containing the notebooks on insights and model training
A model folder containing the final version of the model
A requirements.txt with software requirements for this project

By the end of the project, I would like to have a model that predicts the type of victims with good accuracy and indicates the most dangerous roads and conditions.
