By now, we have a good understanding of how Azure Machine Learning works. In this challenge, we'll take a data set and use Automated Machine Learning to automatically try out different classification algorithms. Automated Machine Learning is currently able to perform classification, regression and also forecasting.
Note: As of May 2019, Automated Machine Learning can also be used directly from the Azure Portal. In this challenge, we'll use the Portal; in the next challenge, we'll be using code.
For this challenge, we'll use the Pima Indians Diabetes dataset, which involves predicting the onset of diabetes within 5 years in Pima Indians, given medical details. Before getting started, have a look at the data set: pima-indians-diabetes.csv.
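If you want to get a first feel for the data, a minimal pandas sketch like the one below works - assuming you have downloaded the cleansed, header-carrying version of the CSV (linked further down) into your working directory:

```python
# Quick look at the data set (assumes pima-indians-diabetes.csv with headers
# is in the current directory - adjust the path as needed).
import pandas as pd

df = pd.read_csv('pima-indians-diabetes.csv')
print(df.shape)                        # number of rows and columns
print(df.head())                       # first few samples
print(df['diabetes'].value_counts())   # class balance of the target column
```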
Note: You can find more datasets for trying out AutoML on this website or obviously on Kaggle - by the way, the Wine Quality Dataset also makes for a nice demo.
A word of caution: always make sure to only use properly formatted csv files with Automated Machine Learning. In particular, incomplete lines/rows, e.g. rows missing a few commas, can easily throw off the service.
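If you want to catch such rows before uploading, a small local check along these lines helps (a sketch, assuming the CSV sits in your working directory):

```python
# Flag rows whose number of fields differs from the header row - a common
# cause of problems when uploading a CSV to Automated Machine Learning.
import csv

with open('pima-indians-diabetes.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print(f"Row {line_number}: {len(row)} fields, expected {len(header)}")
```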
In your Machine Learning workspace, navigate to the Automated ML section and select + New Automated ML run.
The process consists of three steps: creating or selecting a dataset, configuring the run, and setting the task type and settings.
Give our new dataset a name or select an existing one. For this challenge, we will use a cleansed version of the data set with headers, which you can find here: pima-indians-diabetes.csv
Then we can either re-use our previous storage or create a new one (in this case we can just use our existing one):
Next, we view the settings and preview, and select Use headers from the first file:
We will also see a preview of our data, where we can exclude features and specify which columns we want to include. In this challenge, we leave the schema as it is:
Lastly we confirm the details:
Then we can name our experiment and choose the compute to run on: we can re-use our Compute VM, create a new Azure Machine Learning compute cluster, or re-use the cluster from challenge 2. The Create a new compute window should be self-explanatory after the last challenges (set the minimum and maximum number of nodes to 1):
Or we simply use our existing compute:
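For reference, creating (or re-using) such a cluster with the v1 Python SDK looks roughly like the sketch below - the cluster name and VM size are just example values:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()
cluster_name = 'automl-cluster'   # example name

if cluster_name in ws.compute_targets:
    # Re-use the existing cluster
    compute_target = ws.compute_targets[cluster_name]
else:
    # Provision a small cluster with min/max nodes set to 1
    config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                   min_nodes=1,
                                                   max_nodes=1)
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```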
Lastly, we can configure the Task type and settings tab:
Here we make sure we set the job type to Classification and define diabetes as the target column.
Under View additional configuration settings, we can further configure our AutoML job and select our optimization metric, concurrency, etc. Let's set Training job time (minutes) to 10. This means our training will terminate after a maximum of 10 minutes.
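For comparison, the same settings expressed with the v1 Python SDK would look roughly like this sketch (we'll use code in the next challenge anyway; the dataset name is an example, and parameter names such as experiment_timeout_minutes can differ between SDK versions):

```python
from azureml.core import Workspace, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name='pima-indians-diabetes')   # example dataset name

automl_config = AutoMLConfig(task='classification',
                             primary_metric='accuracy',
                             training_data=dataset,
                             label_column_name='diabetes',
                             experiment_timeout_minutes=10,    # stop after at most 10 minutes
                             max_concurrent_iterations=1)
```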
Once we start the training, it'll take ~6 minutes to prepare the experiment. Overall, the default 100 iterations would take quite a while, but since we limited the training time to 10 minutes, it'll terminate earlier. Once the job has finished, we should see something like this:
Below, we can see the metrics per iteration:
If we click one of the iterations, we'll get plenty of metrics for the evaluated pipeline:
Without doubt, it is important to understand what those metrics actually mean, since this allows us to judge whether the generated model(s) are useful or not. This link will help you understand the metrics of Automated Machine Learning.
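As a tiny illustration of two of the reported metrics, here is how accuracy and AUC can be computed with scikit-learn on made-up labels and scores (not the actual challenge results):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                    # actual labels (made-up)
y_score = [0.2, 0.4, 0.8, 0.3, 0.1, 0.9]        # predicted probabilities (made-up)
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

print("Accuracy:", accuracy_score(y_true, y_pred))   # share of correct predictions
print("AUC:     ", roc_auc_score(y_true, y_score))   # ranking quality across all thresholds
```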
Next, we can deploy one of the iterations to ACI (Azure Container Instances).
On the details screen for each iteration, we can download the model's .pkl file and also directly deploy it to ACI. Let's deploy one of the models:
In the same screen, we can also download the yaml for the Conda environment used and, more importantly, the score.py - this helps us understand what data we need to input into our API!
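The generated score.py differs in detail from model to model, but it follows the usual Azure ML init()/run() pattern - roughly like this sketch (the model name is an example):

```python
import json
import joblib
import pandas as pd
from azureml.core.model import Model

def init():
    # Runs once when the container starts: load the registered .pkl model.
    global model
    model_path = Model.get_model_path('diabetes-model')   # example model name
    model = joblib.load(model_path)

def run(raw_data):
    # raw_data is the JSON body we POST to the API (see the scoring code below).
    data = pd.DataFrame(json.loads(raw_data)['data'])
    predictions = model.predict(data)
    return json.dumps({'result': predictions.tolist()})
```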
We can see how AutoML is first creating an image, and then starts the deployment to a new Azure Container Instance.
Once the deployment has finished (~7 minutes), we can find the scoring URI in our Workspace under Deployments --> diabetes-api --> Details:
Finally, we can score one or more data samples using the following Python code (just run the code in one of the former notebooks and replace url):
```python
import requests
import json

url = 'Replace with your URL'
headers = {'Content-Type': 'application/json'}

# Two example patients - the keys match the feature columns of the training data
data = {"data": [{
    "times_pregnant": 6,
    "glucose": 148,
    "blood_pressure": 72,
    "triceps_thickness": 35,
    "insulin": 0,
    "bmi": 33.6,
    "diabetes_pedigree": 0.627,
    "age": 50
},
{
    "times_pregnant": 1,
    "glucose": 85,
    "blood_pressure": 66,
    "triceps_thickness": 29,
    "insulin": 0,
    "bmi": 26.6,
    "diabetes_pedigree": 0.351,
    "age": 31
}]}

resp = requests.post(url, data=json.dumps(data), headers=headers)
print("Prediction Results:", resp.json())
```
Pretty easy, right?
More details can be found here. We can relatively easily re-use the code from challenge 3 and just swap out the score.py and the conda.yml to programmatically deploy the model.
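A rough sketch of what that programmatic deployment could look like with the v1 SDK - the file and service names below are assumptions, and the AutoML-generated artifacts may need slightly different handling:

```python
from azureml.core import Workspace
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

# Assumes the downloaded .pkl, score.py and conda.yml sit next to this script.
model = Model.register(ws, model_path='model.pkl', model_name='diabetes-automl')
env = Environment.from_conda_specification(name='automl-env', file_path='conda.yml')
inference_config = InferenceConfig(entry_script='score.py', environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, 'diabetes-api', [model], inference_config, aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```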
At this point:
- We took the Pima Indians Diabetes Dataset and ran Automated Machine Learning for classification on it
- We evaluated 25 algorithms and achieved an accuracy of ~77.9% (your accuracy might vary, since the process is not necessarily deterministic)
- We took the best performing model and deployed it to ACI (similar to challenge 3)
- If we don't like the model yet, we could start further experimentation by taking the best performing pre-processing & algorithm pipeline and use it as a starting point
So far, we have focused on deploying models to Azure Container Instances, which is great for testing scenarios. For production grade deployments, we want to use Azure Kubernetes Service, which we'll do in the fifth challenge.