Optimize Models in the Cloud using AWS Automatic Model Tuning

When training ML models, hyperparameter tuning is a step taken to find the best performing training model. In this article we will apply a random algorithm of Automated Hyperparameter Tuning to train a BERT-based natural language processing (NLP) classifier. The model analyzes customer feedback and classifies the messages into positive, neutral, and negative sentiments.

Pranath Fernando


February 14, 2023

1 Introduction

In earlier articles we introduced AWS cloud services for data science, and showed how it can help with different stages of the data science & machine learning workflow.

When training ML models, hyperparameter tuning is a step taken to find the best performing training model. In this project we will apply a random algorithm of Automated Hyperparameter Tuning to train a BERT-based natural language processing (NLP) classifier.

We will use the raw Women’s Clothing Reviews dataset - and will prepare it to train a deep learning BERT-based natural language processing (NLP) model. The model will be used to classify customer reviews into positive (1), neutral (0) and negative (-1) sentiment.

Amazon SageMaker supports Automated Hyperparameter Tuning. It runs multiple training jobs on the training dataset using the hyperparameter ranges specified by the user. Then it chooses the combination of hyperparameters that leads to the best model candidate. The choice is made based on the objective metrics, e.g. maximization of the validation accuracy.

For the choice of hyperparameters combinations, SageMaker supports two different types of tuning strategies: random and Bayesian. This capability can be further extended by providing an implementation of a custom tuning strategy as a Docker container.

In this project we will perform the following three steps:

First, let’s install and import the required modules.

import boto3
import sagemaker
import pandas as pd
import botocore

config = botocore.config.Config(user_agent_extra='dlai-pds/c3/w1')

# low-level service client of the boto3 session
sm = boto3.client(service_name='sagemaker', 

sess = sagemaker.Session(sagemaker_client=sm)

bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = sess.boto_region_name

2 Configure dataset and Hyperparameter Tuning Job (HTP)

2.1 Configure dataset

Let’s set up the paths and copy the data to the S3 bucket:

processed_train_data_s3_uri = 's3://{}/transformed/data/sentiment-train/'.format(bucket)
processed_validation_data_s3_uri = 's3://{}/transformed/data/sentiment-validation/'.format(bucket)
processed_test_data_s3_uri = 's3://{}/transformed/data/sentiment-test/'.format(bucket)

Upload the data to the S3 bucket:

!aws s3 cp --recursive ./data/sentiment-train $processed_train_data_s3_uri
!aws s3 cp --recursive ./data/sentiment-validation $processed_validation_data_s3_uri
!aws s3 cp --recursive ./data/sentiment-test $processed_test_data_s3_uri
upload: data/sentiment-train/part-algo-1-womens_clothing_ecommerce_reviews.tsv to s3://sagemaker-us-east-1-058323655887/transformed/data/sentiment-train/part-algo-1-womens_clothing_ecommerce_reviews.tsv
upload: data/sentiment-validation/part-algo-1-womens_clothing_ecommerce_reviews.tsv to s3://sagemaker-us-east-1-058323655887/transformed/data/sentiment-validation/part-algo-1-womens_clothing_ecommerce_reviews.tsv
upload: data/sentiment-test/part-algo-1-womens_clothing_ecommerce_reviews.tsv to s3://sagemaker-us-east-1-058323655887/transformed/data/sentiment-test/part-algo-1-womens_clothing_ecommerce_reviews.tsv

Check the existence of those files in the S3 bucket:

!aws s3 ls --recursive $processed_train_data_s3_uri
2023-02-13 17:36:41    4894416 transformed/data/sentiment-train/part-algo-1-womens_clothing_ecommerce_reviews.tsv
!aws s3 ls --recursive $processed_validation_data_s3_uri
2023-02-13 17:36:42     276522 transformed/data/sentiment-validation/part-algo-1-womens_clothing_ecommerce_reviews.tsv
!aws s3 ls --recursive $processed_test_data_s3_uri
2023-02-13 17:36:43     273414 transformed/data/sentiment-test/part-algo-1-womens_clothing_ecommerce_reviews.tsv

Now we set up a dictionary of the input training and validation data channels, wrapping the corresponding S3 locations in a TrainingInput object.

from sagemaker.inputs import TrainingInput

data_channels = {
    'train': processed_train_data_s3_uri, 
    'validation': processed_validation_data_s3_uri 

There is no need to create a test data channel, as the test data is used later at the evaluation stage and does not need to be wrapped into the sagemaker.inputs.TrainingInput function.

2.2 Configure Hyperparameter Tuning Job

Model hyperparameters need to be set prior to starting the model training as they control the process of learning. Some of the hyperparameters you will set up as static - they will not be explored during the tuning job. For the non-static hyperparameters we will set the range of possible values to be explored.

First, we configure static hyperparameters including the instance type, instance count, maximum sequence length, etc. For the purposes of this project, we will use a relatively small instance type. Please refer to this link for additional instance types that may work for your use cases.

max_seq_length=128 # maximum number of input tokens passed to BERT model
freeze_bert_layer=False # specifies the depth of training within the network


Some of these will be passed into the PyTorch estimator and tuner in the hyperparameters argument. Let’s set up the dictionary for that:

    'freeze_bert_layer': freeze_bert_layer,
    'max_seq_length': max_seq_length,
    'epochs': epochs,
    'train_steps_per_epoch': train_steps_per_epoch,
    'validation_batch_size': validation_batch_size,
    'validation_steps_per_epoch': validation_steps_per_epoch,
    'seed': seed,
    'run_validation': run_validation

Now we configure hyperparameter ranges to explore in the Tuning Job. The values of the ranges typically come from prior experience, research papers, or other models similar to the task you are trying to do.

from sagemaker.tuner import IntegerParameter
from sagemaker.tuner import ContinuousParameter
from sagemaker.tuner import CategoricalParameter
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(0.00001, 0.00005, scaling_type='Linear'), # specifying continuous variable type, the tuning job will explore the range of values
    'train_batch_size': CategoricalParameter([128, 256]), # specifying categorical variable type, the tuning job will explore only listed values

2.3 Set up evaluation metrics

Choose loss and accuracy as the evaluation metrics. The regular expressions Regex will capture the values of metrics that the algorithm will emit.

metric_definitions = [
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_acc: ([0-9.]+)'},

In the Tuning Job, we will be maximizing validation accuracy as the objective metric.

3 Run Tuning Job

3.1 Set up the RoBERTa and PyTorch script to run on SageMaker

We will now prepare the PyTorch model to run as a SageMaker Training Job. The estimator takes into the entry point a separate Python file, which will be called during the training. We can open and review this file src/train.py.

For more information on the PyTorchEstimator, see the documentation here: https://sagemaker.readthedocs.io/

from sagemaker.pytorch import PyTorch as PyTorchEstimator
# Note: we don't have to rename the PyTorch estimator,
# but this is useful for code clarity, especially when a few modules of 'sagemaker.pytorch' are used

estimator = PyTorchEstimator(

3.2 Launch the Hyperparameter Tuning Job

A hyperparameter tuning job runs a series of training jobs that each test a combination of hyperparameters for a given objective metric (i.e. validation:accuracy). In this project, we will use a Random search strategy to determine the combinations of hyperparameters - within the specific ranges - to use for each training job within the tuning job. For more information on hyperparameter tuning search strategies, please see the following documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html

When the tuning job completes, we can select the hyperparameters used by the best-performing training job relative to the objective metric.

The max_jobs parameter is a stop criteria that limits the number of overall training jobs (and therefore hyperparameter combinations) to run within the tuning job.

The max_parallel_jobs parameter limits the number of training jobs (and therefore hyperparameter combinations) to run in parallel within the tuning job. This parameter is often used in combination with the Bayesian search strategy when you want to test a smaller set of training jobs (less than the max_jobs), learn from the smaller set of training jobs, then apply Bayesian methods to determine the next set of hyperparameters used by the next set of training jobs. Bayesian methods can improve hyperparameter-tuning performance in some cases.

The early_stopping_type parameter is used by SageMaker hyper-parameter tuning jobs to automatically stop a training job if the job is not improving the objective metrics (i.e. validation:accuracy) relative to previous training jobs within the tuning job. For more information on early stopping, please see the following documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-early-stopping.html.

Let’s set up the Hyperparameter Tuner.

from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    max_jobs=2, # maximum number of jobs to run
    max_parallel_jobs=2, # maximum number of jobs to run in parallel
    early_stopping_type='Auto' # early stopping criteria

Now we launch the SageMaker Hyper-Parameter Tuning (HPT) Job.


3.3 Check Tuning Job status

We can see the Tuning Job status in the console.

tuning_job_name = tuner.latest_tuning_job.job_name

CPU times: user 1.37 s, sys: 191 ms, total: 1.56 s
Wall time: 24min 53s

The results of the SageMaker Hyperparameter Tuning Job are available on the analytics of the tuner object. The dataframe function converts the result directly into the dataframe. We can explore the results with the following lines of the code:

import time

time.sleep(10) # slight delay to allow the analytics to be calculated

df_results = tuner.analytics().dataframe()
(2, 8)
df_results.sort_values('FinalObjectiveValue', ascending=0)
learning_rate train_batch_size TrainingJobName TrainingJobStatus FinalObjectiveValue TrainingStartTime TrainingEndTime TrainingElapsedTimeSeconds
0 0.000020 "128" pytorch-training-230213-1736-002-23e15b91 Completed 73.050003 2023-02-13 17:38:06+00:00 2023-02-13 18:01:09+00:00 1383.0
1 0.000017 "128" pytorch-training-230213-1736-001-44bd7477 Completed 72.269997 2023-02-13 17:38:02+00:00 2023-02-13 18:01:24+00:00 1402.0

When training and tuning at scale, it is important to continuously monitor and use the right compute resources. While we have the flexibility of choosing different compute options how do you choose the specific instance types and sizes to use? There is no standard answer for this. It comes down to understanding the workload and running empirical testing to determine the best compute resources to use for the training.

SageMaker Training Jobs emit CloudWatch metrics for resource utilization. We can review them in the AWS console.

4 Evaluate the results

An important part of developing a model is evaluating the model with a test data set - one that the model has never seen during its training process. The final metrics resulting from this evaluation can be used to compare competing machine learning models. The higher the value of these metrics, the better the model is able to generalize.

4.1 Show the best candidate

Let’s now show the best candidate - the one with the highest accuracy result.

learning_rate train_batch_size TrainingJobName TrainingJobStatus FinalObjectiveValue TrainingStartTime TrainingEndTime TrainingElapsedTimeSeconds
0 0.00002 "128" pytorch-training-230213-1736-002-23e15b91 Completed 73.050003 2023-02-13 17:38:06+00:00 2023-02-13 18:01:09+00:00 1383.0

4.2 Evaluate the best candidate

Let’s pull the information about the best candidate from the dataframe and then take the Training Job name from the column TrainingJobName.

best_candidate = df_results.sort_values('FinalObjectiveValue', ascending=0).iloc[0]

best_candidate_training_job_name = best_candidate['TrainingJobName']
print('Best candidate Training Job name: {}'.format(best_candidate_training_job_name))
Best candidate Training Job name: pytorch-training-230213-1736-002-23e15b91

Now lets show the accuracy result for the best candidate.

best_candidate_accuracy = best_candidate['FinalObjectiveValue'] 

print('Best candidate accuracy result: {}'.format(best_candidate_accuracy))
Best candidate accuracy result: 73.05000305175781

We can use the function describe_training_job of the service client to get some more information about the best candidate. The result is in dictionary format. Let’s check that it has the same Training Job name:

best_candidate_description = sm.describe_training_job(TrainingJobName=best_candidate_training_job_name)

best_candidate_training_job_name2 = best_candidate_description['TrainingJobName']

print('Training Job name: {}'.format(best_candidate_training_job_name2))
Training Job name: pytorch-training-230213-1736-002-23e15b91

Now lets pull the Tuning Job and Training Job Amazon Resource Name (ARN) from the best candidate training job description.

dict_keys(['TrainingJobName', 'TrainingJobArn', 'TuningJobArn', 'ModelArtifacts', 'TrainingJobStatus', 'SecondaryStatus', 'HyperParameters', 'AlgorithmSpecification', 'RoleArn', 'InputDataConfig', 'OutputDataConfig', 'ResourceConfig', 'StoppingCondition', 'CreationTime', 'TrainingStartTime', 'TrainingEndTime', 'LastModifiedTime', 'SecondaryStatusTransitions', 'FinalMetricDataList', 'EnableNetworkIsolation', 'EnableInterContainerTrafficEncryption', 'EnableManagedSpotTraining', 'TrainingTimeInSeconds', 'BillableTimeInSeconds', 'ProfilingStatus', 'WarmPoolStatus', 'ResponseMetadata'])
best_candidate_tuning_job_arn = best_candidate_description['TuningJobArn'] 
best_candidate_training_job_arn = best_candidate_description['TrainingJobArn'] 
print('Best candidate Tuning Job ARN: {}'.format(best_candidate_tuning_job_arn))
print('Best candidate Training Job ARN: {}'.format(best_candidate_training_job_arn))
Best candidate Tuning Job ARN: arn:aws:sagemaker:us-east-1:058323655887:hyper-parameter-tuning-job/pytorch-training-230213-1736
Best candidate Training Job ARN: arn:aws:sagemaker:us-east-1:058323655887:training-job/pytorch-training-230213-1736-002-23e15b91

Next, we pull the path of the best candidate model in the S3 bucket. We will need it later to set up the Processing Job for the evaluation.

model_tar_s3_uri = sm.describe_training_job(TrainingJobName=best_candidate_training_job_name)['ModelArtifacts']['S3ModelArtifacts']

To perform model evaluation we will use a scikit-learn-based Processing Job. This is essentially a generic Python Processing Job with scikit-learn pre-installed. We can specify the version of scikit-learn we wish to use. Also we need to pass the SageMaker execution role, processing instance type and instance count.

from sagemaker.sklearn.processing import SKLearnProcessor

processing_instance_type = "ml.c5.2xlarge"
processing_instance_count = 1

processor = SKLearnProcessor(

The model evaluation Processing Job will be running the Python code from the file src/evaluate_model_metrics.py. You can open and review the file.

Let’s launch the Processing Job, passing the defined above parameters, custom script, path and the S3 bucket location of the test data.

from sagemaker.processing import ProcessingInput, ProcessingOutput

        ProcessingOutput(s3_upload_mode="EndOfJob", output_name="metrics", source="/opt/ml/processing/output/metrics"),
    arguments=["--max-seq-length", str(max_seq_length)],

Job Name:  sagemaker-scikit-learn-2023-02-13-18-04-08-342
Inputs:  [{'InputName': 'model-tar-s3-uri', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-058323655887/pytorch-training-230213-1736-002-23e15b91/output/model.tar.gz', 'LocalPath': '/opt/ml/processing/input/model/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'evaluation-data-s3-uri', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-058323655887/transformed/data/sentiment-test/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-058323655887/sagemaker-scikit-learn-2023-02-13-18-04-08-342/input/code/evaluate_model_metrics.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'metrics', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-058323655887/sagemaker-scikit-learn-2023-02-13-18-04-08-342/output/metrics', 'LocalPath': '/opt/ml/processing/output/metrics', 'S3UploadMode': 'EndOfJob'}}]

We can see the information about the Processing Jobs using the describe function. The result is in dictionary format. Let’s pull the Processing Job name:

scikit_processing_job_name = processor.jobs[-1].describe()["ProcessingJobName"]

print('Processing Job name: {}'.format(scikit_processing_job_name))
Processing Job name: sagemaker-scikit-learn-2023-02-13-18-04-08-342

Now lets pull the Processing Job status from the Processing Job description.

dict_keys(['ProcessingInputs', 'ProcessingOutputConfig', 'ProcessingJobName', 'ProcessingResources', 'StoppingCondition', 'AppSpecification', 'RoleArn', 'ProcessingJobArn', 'ProcessingJobStatus', 'LastModifiedTime', 'CreationTime', 'ResponseMetadata'])
scikit_processing_job_status = processor.jobs[-1].describe()['ProcessingJobStatus'] 
print('Processing job status: {}'.format(scikit_processing_job_status))
Processing job status: InProgress

Let’s monitor the Processing Job.

from pprint import pprint

running_processor = sagemaker.processing.ProcessingJob.from_processing_name(
    processing_job_name=scikit_processing_job_name, sagemaker_session=sess

processing_job_description = running_processor.describe()

{'AppSpecification': {'ContainerArguments': ['--max-seq-length', '128'],
                      'ContainerEntrypoint': ['python3',
                      'ImageUri': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3'},
 'CreationTime': datetime.datetime(2023, 2, 13, 18, 4, 9, 1000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2023, 2, 13, 18, 4, 9, 766000, tzinfo=tzlocal()),
 'ProcessingInputs': [{'AppManaged': False,
                       'InputName': 'model-tar-s3-uri',
                       'S3Input': {'LocalPath': '/opt/ml/processing/input/model/',
                                   'S3CompressionType': 'None',
                                   'S3DataDistributionType': 'FullyReplicated',
                                   'S3DataType': 'S3Prefix',
                                   'S3InputMode': 'File',
                                   'S3Uri': 's3://sagemaker-us-east-1-058323655887/pytorch-training-230213-1736-002-23e15b91/output/model.tar.gz'}},
                      {'AppManaged': False,
                       'InputName': 'evaluation-data-s3-uri',
                       'S3Input': {'LocalPath': '/opt/ml/processing/input/data/',
                                   'S3CompressionType': 'None',
                                   'S3DataDistributionType': 'FullyReplicated',
                                   'S3DataType': 'S3Prefix',
                                   'S3InputMode': 'File',
                                   'S3Uri': 's3://sagemaker-us-east-1-058323655887/transformed/data/sentiment-test/'}},
                      {'AppManaged': False,
                       'InputName': 'code',
                       'S3Input': {'LocalPath': '/opt/ml/processing/input/code',
                                   'S3CompressionType': 'None',
                                   'S3DataDistributionType': 'FullyReplicated',
                                   'S3DataType': 'S3Prefix',
                                   'S3InputMode': 'File',
                                   'S3Uri': 's3://sagemaker-us-east-1-058323655887/sagemaker-scikit-learn-2023-02-13-18-04-08-342/input/code/evaluate_model_metrics.py'}}],
 'ProcessingJobArn': 'arn:aws:sagemaker:us-east-1:058323655887:processing-job/sagemaker-scikit-learn-2023-02-13-18-04-08-342',
 'ProcessingJobName': 'sagemaker-scikit-learn-2023-02-13-18-04-08-342',
 'ProcessingJobStatus': 'InProgress',
 'ProcessingOutputConfig': {'Outputs': [{'AppManaged': False,
                                         'OutputName': 'metrics',
                                         'S3Output': {'LocalPath': '/opt/ml/processing/output/metrics',
                                                      'S3UploadMode': 'EndOfJob',
                                                      'S3Uri': 's3://sagemaker-us-east-1-058323655887/sagemaker-scikit-learn-2023-02-13-18-04-08-342/output/metrics'}}]},
 'ProcessingResources': {'ClusterConfig': {'InstanceCount': 1,
                                           'InstanceType': 'ml.c5.2xlarge',
                                           'VolumeSizeInGB': 30}},
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '2328',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Mon, 13 Feb 2023 18:04:09 GMT',
                                      'x-amzn-requestid': '27108fc5-7782-41b6-ac72-25de5e9245dc'},
                      'HTTPStatusCode': 200,
                      'RequestId': '27108fc5-7782-41b6-ac72-25de5e9245dc',
                      'RetryAttempts': 0},
 'RoleArn': 'arn:aws:iam::058323655887:role/sagemaker-studio-vpc-firewall-us-east-1-sagemaker-execution-role',
 'StoppingCondition': {'MaxRuntimeInSeconds': 7200}}

.........................................................................!CPU times: user 338 ms, sys: 40.8 ms, total: 379 ms
Wall time: 6min 9s

4.3 Inspect the processed output data

Let’s take a look at the results of the Processing Job. Get the S3 bucket location of the output metrics:

processing_job_description = running_processor.describe()

output_config = processing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "metrics":
        processed_metrics_s3_uri = output["S3Output"]["S3Uri"]


List the content of the folder:

!aws s3 ls $processed_metrics_s3_uri/
2023-02-13 18:10:13      21764 confusion_matrix.png
2023-02-13 18:10:13         56 evaluation.json

The test accuracy can be pulled from the evaluation.json file.

import json
from pprint import pprint

metrics_json = sagemaker.s3.S3Downloader.read_file("{}/evaluation.json".format(

print('Test accuracy: {}'.format(json.loads(metrics_json)))
Test accuracy: {'metrics': {'accuracy': {'value': 0.7378640776699029}}}

Copy image with the confusion matrix generated during the model evaluation into the folder generated.

!aws s3 cp $processed_metrics_s3_uri/confusion_matrix.png ./generated/

import time
time.sleep(10) # Slight delay for our notebook to recognize the newly-downloaded file
download: s3://sagemaker-us-east-1-058323655887/sagemaker-scikit-learn-2023-02-13-18-04-08-342/output/metrics/confusion_matrix.png to generated/confusion_matrix.png

Lets show and review the confusion matrix, which is a table of all combinations of true (actual) and predicted labels. Each cell contains the number of the reviews for the corresponding sentiments.

We can see that the highest numbers of the reviews appear in the diagonal cells, which are the correct predictions for each sentiment class.

5 Acknowledgements

I’d like to express my thanks to the great Deep Learning AI Practical Data Science on AWS Specialisation Course which i completed, and acknowledge the use of some images and other materials from the training course in this article.