Predicting CPU relative performance

Dataset

Overview

The CPU performance dataset is taken from the UCI Machine Learning Repository.

Task

The task is to predict the relative performance of computer processors, represented by the ERP feature in the dataset.

The project demonstrates the creation of the model using Azure AutoML and HyperDrive runs. The overall workflow, which aims to find the best performing model, is shown in the 'Project workflow' diagram.

Access

The zipped dataset in CSV format is uploaded to the Azure ML workspace. After the data is cleansed, the resulting dataset is registered in the workspace for further use.

Preparation

Features

Here is a list of dataset features:

  1. Vendor name - nominal, categorical: adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, dg, formation, four-phase, gould, harris, honeywell, hp, ibm, ipl, magnuson, microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, sratus, wang
  2. Model Name - string, many unique values
  3. MYCT - integer, machine cycle time in nanoseconds
  4. MMIN - integer, minimum main memory in kilobytes
  5. MMAX - integer, maximum main memory in kilobytes
  6. CACH - integer, cache memory in kilobytes
  7. CHMIN - integer, minimum channels in units
  8. CHMAX - integer, maximum channels in units
  9. PRP - integer, published relative performance
  10. ERP - integer, estimated relative performance (the value to predict)

Data transformation and registration

I perform the same data transformation for both the AutoML run and the HyperDrive run so that the results are as comparable as possible.

The cleansing script uses pandas to transform features:

  • categorical feature transformation: one-hot encoding of the vendor name
  • column dropping: the 'Model Name' column is removed

The resulting dataset is registered in the ML workspace.

import zipfile

import pandas as pd
from azureml.data.dataset_factory import TabularDatasetFactory

from scripts.cleansing import clean_data

def get_cleaned_dataset(ws):
    found = False
    ds_key = "machine-cpu"
    description_text = "CPU performance dataset (UCI)."

    # Reuse the dataset if it is already registered in the workspace
    if ds_key in ws.datasets.keys():
        found = True
        ds_cleaned = ws.datasets[ds_key]

    # Otherwise, create it from the zipped CSV file
    if not found:

        with zipfile.ZipFile("./data/machine.zip", "r") as zip_ref:
            zip_ref.extractall("data")

        # Read the extracted CSV file into a DataFrame
        data = pd.read_csv('./data/machine.csv')
        # DataFrame with cleaned data
        cleaned_data = clean_data(data)
        exported_df = 'cleaned-machine-cpu.parquet'
        cleaned_data.to_parquet(exported_df)
        # Register the dataset in the workspace, using the experimental functionality
        # to upload and register a pandas DataFrame in one call
        ds_cleaned = TabularDatasetFactory.register_pandas_dataframe(dataframe=cleaned_data,
                                                                     target=(ws.get_default_datastore(), exported_df),
                                                                     name=ds_key, description=description_text,
                                                                     show_progress=True)
    return ds_cleaned
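
A hypothetical call from the notebook, assuming ws holds the current Workspace:

# 'ws' is assumed to be the Workspace object used throughout the notebook
dataset = get_cleaned_dataset(ws)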

Automated ML

Predicting the CPU relative performance value is a regression task.

According to Azure documentation:

Metrics like r2_score and spearman_correlation can better represent the quality of model when the scale of the value-to-predict covers many orders of magnitude.

The 'ERP' minimum value is 15 and the maximum is 1238, so the value-to-predict spans roughly two orders of magnitude.

I'll be using the r2_score metric. It is supported both by AutoML and by GradientBoostingRegressor, which I use for the HyperDrive run.

The resulting configuration looks like this:

import logging

from azureml.train.automl import AutoMLConfig

auto_ml_directory_name = 'auto_ml_run'

auto_ml_directory = create_folder(project_folder, auto_ml_directory_name)

automl_settings = {
    "experiment_timeout_minutes": 40, #15 minutes is the minimum
    "enable_early_stopping": True,
    "primary_metric": 'r2_score', # the same as hyperdrive
    "featurization": 'auto',
    "verbosity": logging.DEBUG,
    "n_cross_validations": 10
}

automl_config = AutoMLConfig(compute_target=compute_target,
                             max_concurrent_iterations=3, #4 nodes
                             task= "regression",
                             training_data=dataset,
                             label_column_name="ERP",
                             debug_log = "automl_errors.log",
                             path = auto_ml_directory,
                             enable_onnx_compatible_models=True,
                             **automl_settings
                            )
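
The configured run can then be submitted to an experiment; a minimal sketch, where the experiment name is an illustrative assumption:

from azureml.core import Experiment

# Submit the AutoML configuration; the experiment name is an assumption
experiment = Experiment(ws, 'machine-cpu-automl')
auto_ml_run = experiment.submit(automl_config, show_output=True)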

Results

The best model, with r2_score=0.9597, is a VotingEnsemble, according to the results available in Azure ML Studio (see the 'AutoML run models' screenshot).

The best model's explanation and metrics can also be found in Azure ML Studio (see the 'Best run explanation' and 'Best run metrics' screenshots).

The parameters of the best run can be extracted using a helper function provided in an Azure tutorial:

from pprint import pprint

# Helper function copied from Azure tutorial 
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features#scaling-and-normalization
def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            pprint(step[1].get_params())
            print()
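
A minimal sketch of applying the helper, assuming auto_ml_run is the completed AutoML run:

# Retrieve the best run and its fitted pipeline, then print the pipeline steps
best_run, fitted_model = auto_ml_run.get_output()
print_model(fitted_model)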

And here is the result:

Model step details:

datatransformer
{'enable_dnn': None,
 'enable_feature_sweeping': None,
 'feature_sweeping_config': None,
 'feature_sweeping_timeout': None,
 'featurization_config': None,
 'force_text_dnn': None,
 'is_cross_validation': None,
 'is_onnx_compatible': None,
 'logger': None,
 'observer': None,
 'task': None,
 'working_dir': None}

prefittedsoftvotingregressor
{'estimators': [('33',
                 Pipeline(memory=None,
         steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('extratreesregressor',
                 ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0,
                                     criterion='mse', max_depth=None,
                                     max_features=0.5, max_leaf_nodes=None,
                                     max_samples=None,
                                     min_impurity_decrease=0.0,
                                     min_impurity_split=None,
                                     min_samples_leaf=0.0023646822772690063,
                                     min_samples_split=0.005285388593079247,
                                     min_weight_fraction_leaf=0.0,
                                     n_estimators=100, n_jobs=1,
                                     oob_score=False, random_state=None,
                                     verbose=0, warm_start=False))],
         verbose=False)),
                ('27',
                 Pipeline(memory=None,
         steps=[('maxabsscaler', MaxAbsScaler(copy=True)),
                ('extratreesregressor',
                 ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0,
                                     criterion='mse', max_depth=None,
                                     max_features=0.4, max_leaf_nodes=None,
                                     max_samples=None,
                                     min_impurity_decrease=0.0,
                                     min_impurity_split=None,
                                     min_samples_leaf=0.003466237459044996,
                                     min_samples_split=0.000630957344480193,
                                     min_weight_fraction_leaf=0.0,
                                     n_estimators=400, n_jobs=1,
                                     oob_score=False, random_state=None,
                                     verbose=0, warm_start=False))],
         verbose=False)),
                ('45',
                 Pipeline(memory=None,
         steps=[('standardscalerwrapper',
                 <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper object at 0x7f0431282ba8>),
                ('extratreesregressor',
                 ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0,
                                     criterion='mse', max_depth=None,
                                     max_features=0.6, max_leaf_nodes=None,
                                     max_samples=None,
                                     min_impurity_decrease=0.0,
                                     min_impurity_split=None,
                                     min_samples_leaf=0.004196633747563344,
                                     min_samples_split=0.002602463309528381,
                                     min_weight_fraction_leaf=0.0,
                                     n_estimators=600, n_jobs=1,
                                     oob_score=False, random_state=None,
                                     verbose=0, warm_start=False))],
         verbose=False)),
                ('14',
                 Pipeline(memory=None,
         steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('extratreesregressor',
                 ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0,
                                     criterion='mse', max_depth=None,
                                     max_features=None, max_leaf_nodes=None,
                                     max_samples=None,
                                     min_impurity_decrease=0.0,
                                     min_impurity_split=None,
                                     min_samples_leaf=0.001953125,
                                     min_samples_split=0.0010734188827013528,
                                     min_weight_fraction_leaf=0.0,
                                     n_estimators=10, n_jobs=1, oob_score=False,
                                     random_state=None, verbose=0,
                                     warm_start=False))],
         verbose=False)),
                ('8',
                 Pipeline(memory=None,
         steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('extratreesregressor',
                 ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0,
                                     criterion='mse', max_depth=None,
                                     max_features=None, max_leaf_nodes=None,
                                     max_samples=None,
                                     min_impurity_decrease=0.0,
                                     min_impurity_split=None,
                                     min_samples_leaf=0.0028629618034842247,
                                     min_samples_split=0.005285388593079247,
                                     min_weight_fraction_leaf=0.0,
                                     n_estimators=100, n_jobs=1,
                                     oob_score=False, random_state=None,
                                     verbose=0, warm_start=False))],
         verbose=False)),
                ('9',
                 Pipeline(memory=None,
         steps=[('standardscalerwrapper',
                 <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper object at 0x7f0431086f60>),
                ('elasticnet',
                 ElasticNet(alpha=0.05357894736842105, copy_X=True,
                            fit_intercept=True, l1_ratio=0.6873684210526316,
                            max_iter=1000, normalize=False, positive=False,
                            precompute=False, random_state=None,
                            selection='cyclic', tol=0.0001,
                            warm_start=False))],
         verbose=False))],
 'weights': [0.4,
             0.06666666666666667,
             0.06666666666666667,
             0.26666666666666666,
             0.06666666666666667,
             0.13333333333333333]}

We can also monitor the results of the run with the RunDetails widget in the notebook (see the 'RunDetails widget of the AutoML run' screenshot).

Possible improvements

The column 'Model Name' was dropped from the dataset, since it contains quasi-random text. A possible improvement would be to split this column into two, 'model name' and 'model subname', since it has the format "name"/"subname". The new 'model name' column is a nominal feature with high cardinality; it could be encoded using hashing, for example with scikit-learn's FeatureHasher, as sketched below. The 'model subname', however, remains too random to be taken into consideration, so it could be dropped.
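
A minimal sketch of this transformation (not the project's code); the example rows and the number of hashed columns are illustrative assumptions:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'Model Name': ['470v/7', 'dps/8/52', '580-5840']})

# Split into 'model name' and 'model subname' on the first '/'
split = df['Model Name'].str.split('/', n=1, expand=True)
df['model name'] = split[0]  # the 'model subname' part (split[1]) is dropped

# Hash the high-cardinality nominal feature into a fixed number of columns
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[name] for name in df['model name']])
hashed_df = pd.DataFrame(hashed.toarray(),
                         columns=[f'model_name_hash_{i}' for i in range(8)])

df = pd.concat([df.drop(columns=['Model Name', 'model name']), hashed_df], axis=1)
print(df.head())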

Many machine learning algorithms perform better when the dataset features are on a relatively similar scale and/or close to normally distributed; linear regression is one of these algorithms. We can see that the input features do not have the same scale, so the numeric features could be normalized using one of the scalers available in scikit-learn, for example MinMaxScaler, as sketched below.
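
A minimal sketch of such a normalization with illustrative values; in the project this would be applied to the numeric columns of the cleaned dataset before training:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative rows; the real values come from the cleaned dataset
df = pd.DataFrame({'MYCT': [125, 29, 480],
                   'MMIN': [256, 8000, 512],
                   'MMAX': [6000, 32000, 8000]})

# CACH, CHMIN, CHMAX and PRP would be scaled in the same way
scaler = MinMaxScaler()  # scales each feature to the [0, 1] range
df[['MYCT', 'MMIN', 'MMAX']] = scaler.fit_transform(df[['MYCT', 'MMIN', 'MMAX']])
print(df)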

Another improvement for the AutoML run could be restricting the list of algorithms to use, with the help of the allowed_models parameter. For example, we can see that ElasticNet, XGBoostRegressor, LassoLars and GradientBoosting perform well on this data, so they could be put in the allowed_models list, as sketched below.
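
For illustration, the restriction could look like the snippet below; treat the exact model-name list as an assumption to verify against the AutoML supported-model identifiers:

# Hypothetical refinement: restrict AutoML to the algorithms that performed well
automl_settings["allowed_models"] = ['ElasticNet', 'XGBoostRegressor',
                                     'LassoLars', 'GradientBoosting']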

Hyperparameter Tuning

For hyperparameter tuning I've chosen the Gradient Boosting regression algorithm with a Huber loss function. Gradient Boosting performs well on regression tasks.

I will tune 2 major hyperparameters of the GradientBoosting, which strongly interact with each other:

  • n_estimators - the number of weak learners (i.e. regression trees); the number of boosting stages to perform
  • learning_rate - a value in the range (0.0, 1.0] that controls overfitting via shrinkage (the coefficient of contribution of each weak learner)

The hyperdrive-run notebook sets up the HyperDrive run. I use the random hyperparameter sampling configuration, which is less time-consuming than GridParameterSampling and gives good results.

from azureml.train.hyperdrive import BanditPolicy, RandomParameterSampling, choice, uniform

ps = RandomParameterSampling(
    {
        '--learning_rate': uniform(0.01, 0.3),  # contribution of each tree: uniform distribution
        '--n_estimators': choice(100, 150, 200, 250, 300, 350),  # number of learners
    }
)

I use a uniform distribution for the learning rate between 0.01 and 0.3, since smaller values give better results when coupled with a high number of learners. The number of learners (a.k.a. n_estimators) is a choice option among the list of provided values.

For the early stopping policy I chose BanditPolicy, which compares the performance of the current run (after the specified number of evaluation intervals) with the best score so far, and cancels the run if its performance falls short of that score by more than the slack factor.

policy = BanditPolicy(slack_factor=0.01, delay_evaluation = 50)

To avoid premature termination of training runs, I use a delay_evaluation of 50.

The primary metric is the same as for the AutoML run: r2_score, which we need to maximize, since the closer r2_score is to 1, the better the model performs.
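
A minimal sketch of how the sampling configuration, the policy and the metric could be assembled into the HyperDrive configuration; the training script name, environment and run counts are illustrative assumptions (the hyperdrive-run notebook holds the actual values):

from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

# The training script wrapper; 'train.py' and 'env' are assumptions
src = ScriptRunConfig(source_directory='./scripts', script='train.py',
                      compute_target=compute_target, environment=env)

hd_config = HyperDriveConfig(run_config=src,
                             hyperparameter_sampling=ps,
                             policy=policy,
                             primary_metric_name='r2_score',
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                             max_total_runs=40,      # illustrative
                             max_concurrent_runs=4)  # illustrative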

Results

We can see that we've obtained the best result with n_estimators=150 and learning_rate=0.1889, reaching r2_score=0.964356. This model slightly outperforms the AutoML model, which has r2_score=0.95975.

The trained models are listed in the 'HyperDrive run models' screenshot.

Possible improvements

  • normalizing the numeric features to the range of 0 to 1 before training, using MinMaxScaler
  • standardizing numeric features whose distribution is close to normal, using StandardScaler
  • performing another HyperDrive run with GridParameterSampling over a smaller range of parameter values to tune them even further, for example n_estimators between 150 and 200 with a step of 10 (choice) and learning_rate between 0.15 and 0.18 with a step of 0.01 (choice); see the sketch after this list
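
A minimal sketch of the refined sampling space suggested in the last point; the grids simply enumerate the ranges proposed above:

from azureml.train.hyperdrive import GridParameterSampling, choice

# Narrower grids around the best values found by the random-sampling run
refined_ps = GridParameterSampling(
    {
        '--n_estimators': choice(150, 160, 170, 180, 190, 200),
        '--learning_rate': choice(0.15, 0.16, 0.17, 0.18),
    }
)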

We can monitor the results of the run with the RunDetails widget in the notebook (see the 'RunDetails widget of the HyperDrive run' screenshot).

Model Deployment

The best performing model is the one produced by the HyperDrive run, so I will be deploying the sklearn model. For deployment we should use exactly the same environment as for training; in my case I used the "AzureML-AutoML" curated environment. The scoring script loads the model from the workspace registry by its name and passes the received payload to the predict() function.
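
A minimal sketch of what such a scoring script can look like; the actual scripts/score.py in the repository may differ in its details:

# scripts/score.py (sketch): loads the registered sklearn model and scores payloads
import json

import joblib
import pandas as pd
from azureml.core.model import Model

def init():
    global model
    # Resolve the path of the model registered under this name and load it
    model_path = Model.get_model_path('best-model-machine-cpu-hd')
    model = joblib.load(model_path)

def run(raw_data):
    try:
        # The payload columns match the cleaned training data
        data = pd.DataFrame(json.loads(raw_data)['data'])
        return model.predict(data).tolist()
    except Exception as e:
        return {'error': str(e)}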

The resulting inference configuration:

from azureml.core import Environment
from azureml.core.model import InferenceConfig

env = Environment.get(workspace=ws, name="AzureML-AutoML")

inference_config = InferenceConfig(entry_script='./scripts/score.py',
                                   environment=env)

Deployment configuration with authentication enabled:

from azureml.core.webservice import Webservice, AciWebservice
from azureml.core.model import Model

print("Prepare ACI deployment configuration")
# Enable application insights
config = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                            memory_gb = 1,
                                            enable_app_insights=True,
                                            auth_enabled=True)

And the model can be deployed via Model class:

from azureml.core.model import Model

# Retrieve the registered model if it is not already defined in the notebook
try:
    model_hd
except NameError:
    model_hd = Model(ws, 'best-model-machine-cpu-hd')

# Deploy to ACI using the curated environment and the scoring script
service_name_hd = 'machine-cpu-service-hd'
service_hd = Model.deploy(workspace=ws, name=service_name_hd, models=[model_hd],
                          overwrite=True, deployment_config=config,
                          inference_config=inference_config)

service_hd.wait_for_deployment(show_output=True)
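
After the deployment completes, the service state and logs can be checked, for example:

# Hypothetical post-deployment checks
print(service_hd.state)       # expected to be 'Healthy'
print(service_hd.get_logs())  # container logs, useful for troubleshooting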

Querying the endpoint

Now that the model is deployed and its status is "Healthy", we can get the endpoint URI either in Machine Learning Studio or via the SDK, using the scoring_uri property of the service. The deployed model details are visible in ML Studio (see the 'Model's endpoint details' screenshot).

For the HTTP request to be authenticated, I add an Authorization header with the service's authentication key value:

headers = {'Content-Type': 'application/json', 'Accept': 'application/json'}

if service_hd.auth_enabled:
    headers['Authorization'] = 'Bearer '+ service_hd.get_keys()[0]

The input for the service is in JSON format. The data item was transformed the same way as for training. See cleansing.py for more details.

Sending POST request to the scoring URI:

import json

import requests

scoring_uri = service_hd.scoring_uri

input_payload = json.dumps({
    "data":
         [{"MYCT": 29,"MMIN": 8000,"MMAX": 32000,"CACH": 32,"CHMIN": 8,"CHMAX": 32,"PRP": 208,"vendor_adviser": 0,
           "vendor_amdahl": 1,"vendor_apollo": 0,"vendor_basf": 0,"vendor_bti": 0,"vendor_burroughs": 0,
           "vendor_c.r.d": 0,"vendor_cambex": 0,"vendor_cdc": 0,"vendor_dec": 0,"vendor_dg": 0,"vendor_formation": 0,
           "vendor_four-phase": 0,"vendor_gould": 0,"vendor_harris": 0,"vendor_honeywell": 0,"vendor_hp": 0,"vendor_ibm": 0,
           "vendor_ipl": 0,"vendor_magnuson": 0,"vendor_microdata": 0,"vendor_nas": 0,"vendor_ncr": 0,"vendor_nixdorf": 0,
           "vendor_perkin-elmer": 0,"vendor_prime": 0,"vendor_siemens": 0,"vendor_sperry": 0,"vendor_sratus": 0,        
           "vendor_wang": 0}],
    'method': 'predict' 
})

response = requests.post(
        scoring_uri, data=input_payload, headers=headers)
print(response.status_code)
print(response.elapsed)
print(response.json())

Screen Recording

The screencast demonstrates:

  • A working model
  • Demo of the deployed model
  • Demo of a sample request sent to the endpoint and its response

About

ND Udacity: Machine Learning Engineer with Microsoft Azure - Capstone project
