Leticia Portella July 21st, 2018
The brazilian roads are very dangerous being responsible for several deaths through the years. The responsabilities of the roads are splited in local, regional and national polices. The National Road Police has open data of all accidents that happened on federal roads and their main characteristics.
These are some news about road accidents in Brazil:
- Brasil é o quinto país do mundo em mortes no trânsito, segundo OMS
- Trânsito no Brasil mata 47 mil por ano e deixa 400 mil com alguma sequela
- Acidentes de trânsito custam R$ 19 bi por ano, e Brasil fica longe de meta
The main idea of the project is try to estimate what kind of victims an accident can cause based on road characteristics and climate conditions in the time of the accident. If the type of victims can be predicted, it means we can analyze the most dangerous roads and climate characteristics.
Similar analysis can be checked here.
The dataset was collected on the site of the National Road Police. I will the datasets from 2017 and 2016 with the current variables available:
- id - accident identification
- data_inversa - accident date
- dia_semana - weekday of the accident
- horario - time of the accident
- uf - state of the road
- br - federal road number
- km - kilometer of the road where the accident took place
- municipio - county where the accident took place
- causa_acidente - accident cause
- tipo_acidente - kind of accident
- classificacao_acidente - if the accident had victims (with injured victims, with deaths, without victims)
- fase_dia - time of day the accident happened (day, night, dawn, dusk)
- sentido_via - road direction on the point where the accident happened (ascending, descending, not specified)
- condicao_metereologica - meteorological condition
- tipo_pista - kind of road (single, double, multiple)
- tracado_via - road layout (tunnel, curve, straight, etc...)
- uso_solo - if the soil is being used (yes/no)
- pessoas - number of people involved
- mortos - number of dead people
- feridos_leves - number of people with small injuries
- feridos_graves - number of people with severe injuries
- ilesos - number of unharmed people
- ignorados - number of people involved but with no information of injuries
- feridos - number of all injuried people
- veiculos - number of veicules
- latitude
- longitude
The target variable (classificacao_acidente
) is a categorical variable. Thus,
it will be converted to the following numbers:
- Class 0 - Car accidents with no victims
- Class 1 - Car accidents with injured victims
- Class 2 - Car accidents with death victims
This way, the model must be a supervised machine learning algorithm that will try to define which class the accident will likely fall given the roads and climate conditions.
Chong, Abraham & Paprzycki, 2005
did a simillar study, where they used neural network for trying to classify
victims degree of injuries on traffic accidents. On this study, they had 5 classes
of injures, including no injury
, possible injury
, non-incapacitating injury
,
incapacitating injury
, and fatal injury
.
The features where based both on road and driver's characteristics. The road
characteristics were similiar to what we found in the datasets of National Road Police.
The authors found a ~60% accuracy on predicting
I will try to use the following models: * Logistic Regresssion - to use it as a Baseline model to compare all other * Gaussian Naive Bayes - chose by its high perfomance and we probably won't need to worry with feature interaction * Random Forest Classifier - powerfull, hard to overfit but not so fast with large datas
The baseline model will be contruct first and be used as comparison with the other two models. I intend to use GridSearchCV to define the best hyperparameters for each model.
Since most of features are categorical, I will use the method get_dummies
of
Pandas to treat these features. If the dataset is too large to be processed by
my computer, I'll use PCA to reduce dimensionality.
I believe that I will not find outliers, since both the target feature and most features are categorical, thus a outlier treatment won't be necessary.
I intend to use confunsion matrix, as explained
on this post,
as well as f1_scores, recall and precision. In this case, I would prioritize
the recall values, since a false negative
is worse than a false positive
. This
is because if an accident predicted injuries
instead of no injuries
, it would
mean that the road should be considered dangerous (which is not a bad final result).
However, when a result has a false negative
, it indicated that a potentially
dangerous road was not classified as a dangerous one.
The project will be devided as described, based on cookiecutter-data-science:
- A
data
folder containing the csvs files - A
notebooks
folder containing the notebooks on insights and model training - A
model
folder containing the final version of the model - A
requirements.txt
with software requirements for this project
By the end of the project, I will like to have a model that has a good accuracy on predicting the type of victims, indicating the most dangerous roads and conditions.