This is the capstone (final) project of the Azure ML Engineer path. The goal is to train one model with AutoML and one with HyperDrive, and then deploy the better of the two.
The dataset describes stroke events. According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. The dataset is composed of these columns:
- id: unique identifier
- gender: "Male", "Female" or "Other"
- age: age of the patient
- hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- ever_married: "No" or "Yes"
- work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- Residence_type: "Rural" or "Urban"
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient
For the data analysis, the following columns need to be encoded as numbers:
- gender: Male 1; Female 0;
- work_type: Private 1; Self-employed 0; others -1;
- bmi: N/A 0;
- Residence_type: Urban 1; Rural 0;
- smoking_status: smokes 1; never smoked -1; formerly smoked 2; others 0;
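The mapping above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual preprocessing code: the function name `encode_record` and the lookup-table names are hypothetical, and the code -1 for gender "Other" is an assumption, since the list above does not assign it a value.

```python
# Lookup tables taken from the list above.
GENDER = {"Male": 1, "Female": 0}
WORK_TYPE = {"Private": 1, "Self-employed": 0}
RESIDENCE = {"Urban": 1, "Rural": 0}
SMOKING = {"smokes": 1, "never smoked": -1, "formerly smoked": 2}


def encode_record(record: dict) -> dict:
    """Return a copy of one dataset row with the listed columns encoded."""
    encoded = dict(record)
    # Assumption: "Other" is not covered by the list, so it is encoded as -1.
    encoded["gender"] = GENDER.get(record["gender"], -1)
    # Every work type other than Private / Self-employed becomes -1.
    encoded["work_type"] = WORK_TYPE.get(record["work_type"], -1)
    encoded["Residence_type"] = RESIDENCE[record["Residence_type"]]
    # "Unknown" (and anything else unlisted) becomes 0.
    encoded["smoking_status"] = SMOKING.get(record["smoking_status"], 0)
    # Missing BMI values such as "N/A" become 0.
    try:
        encoded["bmi"] = float(record["bmi"])
    except (TypeError, ValueError):
        encoded["bmi"] = 0.0
    return encoded


row = {
    "gender": "Female",
    "work_type": "children",
    "Residence_type": "Urban",
    "smoking_status": "Unknown",
    "bmi": "N/A",
}
print(encode_record(row))
```

Columns that are not mentioned in the list (for example age or hypertension) pass through unchanged.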
The task is to predict strokes from each patient's diseases and habits. The dataset covers people with different diseases and habits, so we can look for correlations between these factors and stroke episodes. We then pick the best-performing model between HyperDrive and AutoML.
The dataset is free and can be downloaded from Kaggle. I downloaded it, saved a copy in my GitHub repo and used that copy in Azure ML.
I loaded the dataset and registered it inside my workspace using the code below:
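As a hedged illustration of this step, registration with the Azure ML Python SDK (v1, `azureml-core`) might look like the sketch below. The function name, the dataset name `stroke-dataset`, the description text and the CSV URL are placeholders, not the values used in this project.

```python
def register_stroke_dataset(csv_url: str, name: str = "stroke-dataset"):
    """Create a tabular dataset from a CSV URL and register it in the workspace."""
    # Imported inside the function so the sketch can be loaded
    # even where the Azure ML SDK is not installed.
    from azureml.core import Dataset, Workspace

    # Assumes a config.json describing the workspace is available locally.
    ws = Workspace.from_config()
    dataset = Dataset.Tabular.from_delimited_files(path=csv_url)
    return dataset.register(
        workspace=ws,
        name=name,
        description="Stroke prediction dataset (placeholder description)",
        create_new_version=True,
    )


# Hypothetical usage; replace the URL with the raw CSV link in your own repo:
# dataset = register_stroke_dataset("https://example.com/path/to/stroke-data.csv")
```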
It is then visible in the Datasets section, as shown below:
We tried different metrics and got the best result with AUC_weighted. For the AutoML training we use:
- primary_metric: AUC_weighted - the arithmetic mean of the per-class AUC scores, weighted by the number of true instances in each class;
- n_cross_validations: 2 - number of cross-validation folds, 2 in this case;
- experiment_timeout_minutes: 30 - maximum duration of the experiment, in minutes.
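To make the weighting concrete, the sketch below computes a weighted one-vs-rest AUC in plain Python. It only illustrates the definition given above and is not how Azure AutoML computes the metric internally. The function names and the sample values are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_auc(y_true, class_scores):
    """Average the one-vs-rest AUC of each class, weighted by its true-instance count.

    class_scores maps each class label to the scores predicted for that class.
    """
    total = len(y_true)
    result = 0.0
    for cls, scores in class_scores.items():
        one_vs_rest = [1 if y == cls else 0 for y in y_true]
        support = sum(one_vs_rest)  # number of true instances of this class
        result += support / total * binary_auc(one_vs_rest, scores)
    return result


# Made-up labels and predicted stroke probabilities.
y_true = [0, 0, 1, 1]
p_stroke = [0.1, 0.4, 0.35, 0.8]
scores = {1: p_stroke, 0: [1 - p for p in p_stroke]}
print(weighted_auc(y_true, scores))
```

For a binary problem with complementary scores, as in this example, both classes have the same AUC, so the weighted value equals the ordinary AUC. The weighting only changes the result when there are more than two classes.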
The AutoML configuration is shown below:
import logging

from azureml.train.automl import AutoMLConfig

# compute_target, dataset and project_folder must be defined beforehand.
automl_settings = {
    "n_cross_validations": 2,
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 5,
    "primary_metric": "AUC_weighted",
    "verbosity": logging.INFO,
}
automl_config = AutoMLConfig(
    compute_target=compute_target,
    task="classification",
    training_data=dataset,
    label_column_name="stroke",
    path=project_folder,
    enable_early_stopping=True,
    featurization="auto",
    debug_log="automl_errors.log",
    **automl_settings,
)
We could try to improve the result by increasing the number of cross-validation folds and the experiment timeout. Changing the metric is less promising: when we tried accuracy, the result was worse.
The best AutoML model is built from several estimators. One of them, the RandomForestClassifier, has these parameters:
- min_samples_leaf=0.01
- max_features=None
- min_samples_split=0.056842105263157895
- n_estimators=50
- n_jobs=1
The best AutoML pipeline was a VotingEnsemble:
With HyperDrive we tune the following hyperparameters:
- Regularization C: regularization helps avoid overfitting and so gives more accurate predictions; we choose a value between 0.1 and 1.0
- Max iteration: maximum number of iterations allowed to converge (50, 100, 150, 200 or 250)
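The search space described above can be pictured with the following plain-Python sketch of random sampling. It is not the HyperDrive API: the names `C_RANGE`, `MAX_ITER_OPTIONS` and `sample_hyperparameters`, the keys `C` and `max_iter`, and the choice of uniform random sampling are assumptions made for illustration. In the Azure ML SDK, ranges like these are expressed with the `uniform` and `choice` parameter expressions from `azureml.train.hyperdrive`.

```python
import random

# Search space taken from the list above.
C_RANGE = (0.1, 1.0)                        # continuous range for C
MAX_ITER_OPTIONS = (50, 100, 150, 200, 250)  # discrete options for max iterations


def sample_hyperparameters(rng: random.Random) -> dict:
    """Draw one candidate configuration from the search space."""
    return {
        "C": rng.uniform(*C_RANGE),
        "max_iter": rng.choice(MAX_ITER_OPTIONS),
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
for candidate in (sample_hyperparameters(rng) for _ in range(3)):
    print(candidate)
```

Each sampled configuration corresponds to one training run whose metric is then compared with the others.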
We set the primary metric to Accuracy with the goal MAXIMIZE and got an accuracy of ~95%. The parameters of the best model are reported below:
We deployed the best model, the HyperDrive one, with Application Insights enabled but without authentication. WARNING: this is a test deployment; in production, enable authentication.
We first deployed the model to a local service to verify that it runs correctly. The script endpoint.py can be used to test it locally:
Here we create a sample input, convert it to JSON and add a header to the REST call that consumes the endpoint. The service returns 1 when the model predicts a stroke for the input data:
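A request of this kind might look like the sketch below, which uses only the Python standard library. It is not the project's test script: the function name `build_scoring_request`, the scoring URI, the `{"data": [...]}` payload shape and the sample values are all assumptions, and the payload must match whatever the scoring script of the deployed model expects.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri: str, records: list) -> urllib.request.Request:
    """Wrap the records in a JSON body and attach the content-type header."""
    body = json.dumps({"data": records}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    # Supplying data makes urllib issue a POST request.
    return urllib.request.Request(scoring_uri, data=body, headers=headers)


# Hypothetical, already encoded patient record.
sample = {
    "gender": 1, "age": 67, "hypertension": 0, "heart_disease": 1,
    "work_type": 1, "Residence_type": 1, "avg_glucose_level": 228.69,
    "bmi": 36.6, "smoking_status": 2,
}
request = build_scoring_request("http://localhost:6789/score", [sample])

# Sending the request needs a running service, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

If authentication were enabled on the service, the request would also need an `Authorization` header carrying the service key or token.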
We then deployed the model to a remote service:
As with the local service, we create a sample input, convert it to JSON and add a header to the REST call that consumes the remote endpoint. The service returns 1 when the model predicts a stroke for the input data:
Here is the status of the deployed model in ML Studio:
In the future, the number of iterations could be increased to fit the model to the data better. We also need more data in the dataset, because 5000 records make for a small dataset.