Udacity Machine Learning Engineer - Capstone project

This is the capstone (final) project of the Azure ML Engineer path. The goal is to train one model with AutoML and another with HyperDrive, then deploy the better of the two.

The dataset contains records of stroke events. According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. The dataset has the following columns:

  1. id: unique identifier
  2. gender: "Male", "Female" or "Other"
  3. age: age of the patient
  4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
  5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
  6. ever_married: "No" or "Yes"
  7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
  8. Residence_type: "Rural" or "Urban"
  9. avg_glucose_level: average glucose level in blood
  10. bmi: body mass index
  11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
  12. stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

For data analysis, the following columns need to be encoded as numbers:

  • gender: Male 1; Female 0;
  • work_type: Private 1; Self-employed 0; others -1;
  • bmi: N/A 0;
  • Residence_type: Urban 1; Rural 0;
  • smoking_status: smokes 1; never smoked -1; formerly smoked 2; others 0;
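
The mapping above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual preprocessing code: the function name `encode_row` and the lookup tables are hypothetical, and the fallback of -1 for a gender of "Other" is an assumption, since the list does not say how that value is encoded.

```python
# Lookup tables taken from the list above
GENDER = {"Male": 1, "Female": 0}
WORK_TYPE = {"Private": 1, "Self-employed": 0}
RESIDENCE = {"Urban": 1, "Rural": 0}
SMOKING = {"smokes": 1, "never smoked": -1, "formerly smoked": 2}


def encode_row(row):
    """Return a copy of `row` with the categorical columns replaced by numeric codes."""
    encoded = dict(row)
    # Assumption: "Other" is not covered by the list, so it falls back to -1 here.
    encoded["gender"] = GENDER.get(row["gender"], -1)
    # Every work type other than Private / Self-employed becomes -1.
    encoded["work_type"] = WORK_TYPE.get(row["work_type"], -1)
    encoded["Residence_type"] = RESIDENCE[row["Residence_type"]]
    # Every smoking status not listed (e.g. "Unknown") becomes 0.
    encoded["smoking_status"] = SMOKING.get(row["smoking_status"], 0)
    # A missing BMI ("N/A") becomes 0; anything else is parsed as a float.
    bmi = row["bmi"]
    encoded["bmi"] = 0.0 if bmi in ("N/A", "", None) else float(bmi)
    return encoded


# Hypothetical sample row, for illustration only
sample = {
    "gender": "Male",
    "work_type": "Govt_job",
    "Residence_type": "Urban",
    "smoking_status": "Unknown",
    "bmi": "N/A",
}
encoded_sample = encode_row(sample)
```

Columns that are not in the list (for example ever_married) are copied through unchanged.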


The task is to predict strokes from the patients' diseases and habits. The dataset includes people with different conditions and lifestyles, so we can look for correlations between these factors and stroke episodes. We then compare HyperDrive and AutoML to find the best-performing model.


The dataset is freely available and can be downloaded from Kaggle. I downloaded it, saved it in my GitHub repo, and used it from there in Azure ML.

I loaded the dataset and registered it in my workspace using the code below:


It is then visible in the Datasets section, as shown below:


Automated ML

We tried several primary metrics and got the best result with AUC_weighted. The training uses the following settings:

  • primary_metric: AUC_weighted - the arithmetic mean of the per-class scores, weighted by the number of true instances in each class;
  • n_cross_validations: 2 - number of cross-validation folds to run, in this case 2;
  • experiment_timeout_minutes: 30 - maximum experiment duration in minutes.
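
To make the weighted AUC definition concrete, here is a small pure-Python sketch that computes a one-vs-rest AUC per class and averages the results weighted by class support. It is an illustration of the definition only, not the code AutoML runs internally; the function names and the toy data are hypothetical.

```python
from collections import Counter


def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_weighted(y_true, proba):
    """Support-weighted mean of one-vs-rest AUCs.

    proba[i][c] is the predicted probability of class c for sample i.
    """
    support = Counter(y_true)
    total = len(y_true)
    result = 0.0
    for cls, count in support.items():
        one_vs_rest = [1 if y == cls else 0 for y in y_true]
        cls_scores = [p[cls] for p in proba]
        result += (count / total) * binary_auc(one_vs_rest, cls_scores)
    return result


# Toy data: three "no stroke" samples (class 0) and one "stroke" sample (class 1)
y_true = [0, 0, 0, 1]
proba = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
score = auc_weighted(y_true, proba)
```

With an imbalanced target such as stroke, a metric like this is less dominated by the majority class than plain accuracy, which may be why it worked better as the primary metric here.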

The AutoML configuration is shown below:

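import logging

from azureml.train.automl import AutoMLConfig

# Settings shared by every AutoML iteration
automl_settings = {
    "n_cross_validations": 2,
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 5,
    "primary_metric": "AUC_weighted",
    "verbosity": logging.INFO,
}

# compute_target and project_folder are defined elsewhere
automl_config = AutoMLConfig(compute_target=compute_target,
                             task="classification",
                             path=project_folder,
                             enable_early_stopping=True,
                             featurization="auto",
                             debug_log="automl_errors.log",
                             # the training data and label column arguments are not shown in this excerpt
                             **automl_settings)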


We could try to improve the result by increasing the number of cross-validation folds and the experiment timeout. We also tried accuracy as the primary metric, but it gave a worse result.

Best AutoML model: [image: best_model_automl]

The best AutoML model combines several estimators. The RandomForestClassifier, for example, has these parameters:

  • min_samples_leaf=0.01
  • max_features=None
  • min_samples_split=0.056842105263157895
  • n_estimators=50
  • n_jobs=1

[image: best_model1] [image: best_model2]

The best AutoML pipeline was VotingEnsemble: [image: automl1]

Best run: [image: automl2]

Best run curves: [image: automl3]

Hyperparameter Tuning

We tune the following hyperparameters with HyperDrive:

  • Regularization C: regularization helps avoid overfitting and so gives more accurate predictions; we choose a value between 0.1 and 1.0
  • Max iterations: maximum number of iterations to converge, chosen from (50, 100, 150, 200, 250)
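
The search space above can be illustrated with a short pure-Python sketch that draws candidate configurations from it. It assumes random sampling with C drawn uniformly from its range, which the text does not confirm; `SEARCH_SPACE` and `sample_config` are hypothetical names and this is not the HyperDrive API.

```python
import random

# Search space as described above: a continuous range for C, a discrete set for max_iter
SEARCH_SPACE = {
    "C": (0.1, 1.0),
    "max_iter": [50, 100, 150, 200, 250],
}


def sample_config(rng):
    """Draw one candidate: C uniformly from its range, max_iter from its choices."""
    low, high = SEARCH_SPACE["C"]
    return {
        "C": rng.uniform(low, high),
        "max_iter": rng.choice(SEARCH_SPACE["max_iter"]),
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
candidates = [sample_config(rng) for _ in range(5)]
```

Each candidate would then be trained and scored, and the run with the best primary metric kept.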


We set the primary metric to Accuracy with the goal MAXIMIZE and reached an accuracy of ~95%. The parameters of the best model are reported below:





Model Deployment

We deployed the best model, the HyperDrive one, with Application Insights enabled but without authentication. WARNING: this is only a test; in production, enabling authentication is the better choice.


We first deployed the model as a local service to verify that it runs correctly. A script file is provided to test it locally: [image: model2]

Here we create a sample input, convert it to JSON, and add a header to the REST call that consumes the endpoint. The output is 1 if the model predicts a stroke for the input data: [image: model3]

We then deployed the model as a remote service: [image: model4]

Here we create a sample input, convert it to JSON, and add a header to the REST call that consumes the endpoint. The output is 1 if the model predicts a stroke for the input data: [image: model5]
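
The request steps described above (sample input, JSON conversion, header, REST call) can be sketched with the Python standard library. This is only an illustration under assumptions: the `{"data": [...]}` payload shape, the field values, the helper name `build_scoring_request` and the placeholder URI are all hypothetical and must be replaced with what the deployed scoring script actually expects.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows):
    """Build a POST request that carries the input rows as a JSON body."""
    # Assumed payload shape; the real scoring script defines the expected format.
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    # Supplying `data` makes urllib send the request as a POST.
    return urllib.request.Request(scoring_uri, data=body, headers=headers)


# Hypothetical patient record, for illustration only
sample = {
    "gender": 1,
    "age": 67,
    "hypertension": 0,
    "heart_disease": 1,
    "ever_married": "Yes",
    "work_type": 1,
    "Residence_type": 1,
    "avg_glucose_level": 228.69,
    "bmi": 36.6,
    "smoking_status": 2,
}

# Placeholder URI; use the scoring URI of the local or remote service instead.
request = build_scoring_request("http://localhost:6789/score", [sample])

# To call a running endpoint, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

If authentication is enabled on the endpoint, an Authorization header carrying the service key would have to be added to the headers as well.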

Delete unused resources: [image: model6]

Here is the status of the deployed model in ML Studio: [image: model7]

Endpoint logs: [image: model8]

Screen Recording


Future improvement

In the future, the number of iterations could be increased to better fit the model to the data. We also need more data in the dataset, because 5000 records make for a small dataset.


