This reference guide compiles best practices, prescriptive guidance, and code samples for running large-scale machine learning training workloads with TPU v4, TPU v5p, and TPU v5e on Google Kubernetes Engine (GKE).
The guide covers two main topics:
- Configuring a GKE-based environment for large-scale training on Cloud TPUs: how to configure a GKE cluster and optimize it for running large-scale machine learning training workloads on Cloud TPUs.
- Defining, submitting, and monitoring training jobs.
The diagram below depicts a high-level architecture of the training environment.
The foundation of the environment is a regional, VPC-native GKE cluster. The cluster has two types of node pools:
- A single node pool with CPU-only nodes
- Several TPU node pools
This cluster topology supports running both single-slice and multislice TPU training jobs.
The environment is supported by the following components:
- Cloud Storage buckets for storing training datasets and artifacts produced by training jobs, such as logs and checkpoints.
- Artifact Registry for packaging and managing the training, data processing, and other components of a training workload as Docker container images.
- Vertex AI TensorBoard for tracking and visualizing training metrics.
- Cloud Monitoring for collecting and analyzing non-functional performance metrics.
- Cloud Logging for managing logs produced by training workloads.
- An Identity and Access Management (IAM) service account that training workloads impersonate to access Google Cloud services, such as Cloud Storage and Vertex AI TensorBoard.
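For reference, Workload Identity impersonation is configured by annotating a Kubernetes service account with the IAM service account it maps to. The sketch below uses the default names described later in this guide (`wid-ksa`, `<prefix>-wid-sa`, namespace `tpu-training`); substitute your own values:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: wid-ksa                  # Kubernetes service account used by training pods
  namespace: tpu-training
  annotations:
    # Binds the Kubernetes service account to the IAM service account,
    # letting pods impersonate it when calling Google Cloud APIs.
    iam.gke.io/gcp-service-account: <prefix>-wid-sa@<PROJECT_ID>.iam.gserviceaccount.com
```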
The following diagram illustrates the process of submitting and processing training workloads in the training environment.
In this guide we advocate using the Kubernetes JobSet API as the preferred method of coordinating large-scale distributed machine learning training workloads on Kubernetes. When combined with the Kubernetes Kueue job queuing API, it provides flexible and comprehensive training job orchestration.
The training environment's Kueue configuration consists of a single ClusterQueue and multiple LocalQueues. This topology provides basic multi-tenancy and supports managing and prioritizing jobs submitted by multiple teams.
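To illustrate, a minimal sketch of this topology follows. The queue names match the defaults used by this environment, but the `tpu-v5e-16` ResourceFlavor and its quota are assumptions for a single v5litepod-16 node pool:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v5e-16                    # assumed flavor for a v5litepod-16 node pool
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 4x4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}               # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: tpu-v5e-16
      resources:
      - name: "google.com/tpu"
        nominalQuota: 16              # total TPU chips in the flavor's node pool
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: tpu-training-jobs
  namespace: tpu-training
spec:
  clusterQueue: cluster-queue         # every local queue points at the single cluster queue
```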
All training workloads are represented as JobSet resources. A JobSet resource may contain multiple job types, such as a core distributed training job and an auxiliary job that manages TensorBoard logs and other artifacts generated by the training job.
JobSet workloads are submitted to a namespaced LocalQueue that points to a ClusterQueue. As illustrated in the diagram, in our reference implementation, there is a single cluster queue.
Kueue monitors when resources (such as TPU slices) required by a workload (JobSet) are available, and then decides when to admit the workload and how to allocate the workload's components to the cluster's node pools.
For example, a training workload can contain two types of jobs:
- A multislice distributed training job
- A job that uploads TensorBoard logs generated by the training job to Vertex AI TensorBoard
When all the resources required by this workload become available, the training job's workers are started on the requested number of TPU slices. The TensorBoard uploader is started on one of the nodes in the CPU node pool.
If the compute resources required by other submitted workloads are not available, these workloads are queued and scheduled for admission based on the priorities that have been defined in the Kueue configuration.
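The sketch below outlines what such a two-job workload could look like as a JobSet. The images and slice shape are illustrative placeholders rather than the actual manifests shipped with this repo:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example-training-workload
  namespace: tpu-training
  labels:
    kueue.x-k8s.io/queue-name: tpu-training-jobs  # admit via the namespaced local queue
spec:
  replicatedJobs:
  - name: trainer
    replicas: 2                       # one Job per TPU slice in a multislice run
    template:
      spec:
        completionMode: Indexed
        parallelism: 4                # one pod per TPU VM in a v5litepod-16 slice
        completions: 4
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 4x4
            containers:
            - name: trainer
              image: <YOUR-TRAINING-IMAGE>        # placeholder
              resources:
                limits:
                  google.com/tpu: 4               # TPU chips per VM
  - name: tensorboard-uploader        # runs on a CPU node; uploads logs to Vertex AI TensorBoard
    replicas: 1
    template:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: uploader
              image: <YOUR-UPLOADER-IMAGE>        # placeholder
```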
To submit a JobSet-defined workload, you need to create a YAML JobSet resource definition. There are a few different ways to do this. In this guide, we demonstrate two approaches:
- Using Kustomize, which helps you create YAML JobSet resource definitions directly.
- Using xpk, which provides an easy-to-use Python-based CLI.
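For example, a workload submission with xpk might look like the following sketch; the cluster name, workload name, and slice configuration are placeholders:

```bash
# Submit a training workload to an existing cluster with xpk (names are placeholders)
xpk workload create \
  --cluster my-tpu-cluster \
  --workload my-training-run \
  --tpu-type v5litepod-16 \
  --num-slices 2 \
  --command "python3 train.py"
```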
The deployment process is automated using Cloud Build, Terraform, and Kustomize. The Cloud Build configuration file defines two deployment stages:
In the first stage, a Terraform configuration is applied, which:
- Creates a network, a subnet, and IP ranges for GKE pods and services.
- Creates a VPC-native cluster.
- Creates a node pool with CPU-only nodes.
- Creates a specified number of TPU node pools.
- Creates an IAM service account for Workload Identity and an IAM service account to be used as a custom node pool service account.
- Configures the cluster for Workload Identity.
- Creates a Google Cloud Storage bucket.
- Creates a Vertex AI TensorBoard instance.
- Creates an Artifact Registry repository.
In the second stage, the JobSet and Kueue custom resources are installed and Kueue is configured as described in the previous section.
Warning
Your project must have sufficient quota to provision TPU resources. If it does not, you can request a higher quota limit.
Before submitting the Cloud Build build, you need to:
- Create a new Google Cloud project or select an existing one.
- Enable the necessary services.
- Configure an automation service account and an automation Google Cloud storage bucket.
The following services are required by the base environment:
```
cloudbuild.googleapis.com
artifactregistry.googleapis.com
cloudkms.googleapis.com
cloudresourcemanager.googleapis.com
compute.googleapis.com
container.googleapis.com
iam.googleapis.com
iamcredentials.googleapis.com
serviceusage.googleapis.com
stackdriver.googleapis.com
storage-component.googleapis.com
storage.googleapis.com
sts.googleapis.com
aiplatform.googleapis.com
```
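If you are enabling these services manually rather than through the bootstrap Terraform described below, a single gcloud command is sufficient:

```bash
# Enable the required services (run once per project)
gcloud services enable \
  cloudbuild.googleapis.com artifactregistry.googleapis.com cloudkms.googleapis.com \
  cloudresourcemanager.googleapis.com compute.googleapis.com container.googleapis.com \
  iam.googleapis.com iamcredentials.googleapis.com serviceusage.googleapis.com \
  stackdriver.googleapis.com storage-component.googleapis.com storage.googleapis.com \
  sts.googleapis.com aiplatform.googleapis.com \
  --project <PROJECT_ID>
```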
You also need a Cloud Storage bucket for managing Terraform state and other Terraform artifacts, and a service account that Terraform will impersonate when provisioning the environment. The service account should have the following project-level roles:
```
iam.securityAdmin
iam.serviceAccountAdmin
compute.networkAdmin
container.admin
iam.serviceAccountUser
storage.admin
artifactregistry.admin
aiplatform.user
serviceusage.serviceUsageConsumer
```
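If you are creating and configuring the automation service account manually, the following sketch shows one way to do it; the account name `tf-automation-sa` is a placeholder:

```bash
# Create the automation service account (name is a placeholder)
gcloud iam service-accounts create tf-automation-sa --project <PROJECT_ID>

# Grant each required role at the project level
for ROLE in iam.securityAdmin iam.serviceAccountAdmin compute.networkAdmin \
            container.admin iam.serviceAccountUser storage.admin \
            artifactregistry.admin aiplatform.user serviceusage.serviceUsageConsumer; do
  gcloud projects add-iam-policy-binding <PROJECT_ID> \
    --member="serviceAccount:tf-automation-sa@<PROJECT_ID>.iam.gserviceaccount.com" \
    --role="roles/${ROLE}"
done
```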
If you lack administrative-level permissions to enable GCP services or to create and configure service accounts in your project, your project administrator must perform these tasks. However, if you are a project owner, you can enable the services and create and configure the automation service account as part of the Configure automation settings step.
During this step, Terraform is configured to utilize the specified automation bucket and service account. Optionally, if configured, it can also enable the necessary services and create both the automation service account and the automation bucket.
- Clone this repo.
- Change the current folder to `environment/0-bootstrap`.
- Copy the `terraform.tfvars.tmpl` file to `terraform.tfvars`.
- Modify the `terraform.tfvars` file to reflect your environment (a filled-in example is sketched after this list):
  - `project_id` - your project ID
  - `deletion_protection` - set to `true` to protect your cluster and GCS buckets from accidental deletion by Terraform apply/destroy commands. Unless this field is set to `false`, a `terraform destroy` or `terraform apply` that would delete the cluster or non-empty GCS buckets will fail.
  - `create_automation_bucket` - set to `true` to create a new automation bucket; set to `false` to use an existing bucket
  - `automation_bucket` - the name and location of the bucket you want to use for automation. If you use an existing bucket, the `location` field is ignored.
  - `create_automation_sa` - set to `true` to create a new automation service account; set to `false` to use an existing service account
  - `automation_sa_name` - the name of the automation service account to be used by Terraform for impersonation
  - `enable_apis` - set to `true` to enable the services listed in the `services` variable
  - `services` - the list of services to enable in your project
  - `roles` - the list of roles to assign to the automation service account. These roles are only assigned to a newly created account; if you use an existing account, this list is ignored.
- Execute the `terraform init` command.
- Execute the `terraform apply` command.
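As an example, a filled-in `terraform.tfvars` for this stage might look like the sketch below. The values are placeholders, and the exact variable shapes should be confirmed against the stage's `variables.tf`:

```hcl
# environment/0-bootstrap/terraform.tfvars (illustrative values only)
project_id               = "my-project-id"
deletion_protection      = true
create_automation_bucket = true
automation_bucket = {
  name     = "my-project-automation"
  location = "us-central1"
}
create_automation_sa = true
automation_sa_name   = "tf-automation-sa"
enable_apis          = true
```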
The Terraform configuration generates prepopulated template files for configuring the Terraform backend and providers, which can be used in the following setup stages. These template files are stored in the `gs://<YOUR-AUTOMATION-BUCKET>/providers` and `gs://<YOUR-AUTOMATION-BUCKET>/tfvars` folders.
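You can verify that the templates were generated by listing the folders:

```bash
# List the generated backend and provider templates (bucket name is a placeholder)
gsutil ls gs://<YOUR-AUTOMATION-BUCKET>/providers gs://<YOUR-AUTOMATION-BUCKET>/tfvars
```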
To be able to impersonate the automation service account, the Cloud Build service account needs the `iam.serviceAccountTokenCreator` role on the automation service account:
```bash
AUTOMATION_SERVICE_ACCOUNT=<AUTOMATION_SERVICE_ACCOUNT_EMAIL>
CLOUD_BUILD_SERVICE_ACCOUNT=<PROJECT_NUMBER>@cloudbuild.gserviceaccount.com

gcloud iam service-accounts add-iam-policy-binding $AUTOMATION_SERVICE_ACCOUNT \
  --member="serviceAccount:$CLOUD_BUILD_SERVICE_ACCOUNT" \
  --role='roles/iam.serviceAccountTokenCreator'
```
Replace `<PROJECT_NUMBER>` with your project number and `<AUTOMATION_SERVICE_ACCOUNT_EMAIL>` with the email of your automation service account. If you created the automation service account using the bootstrap Terraform, you can retrieve its email by executing the `terraform output automation_sa` command from the `environment/0-bootstrap` folder.
If you haven't already done so in the bootstrap stage, clone this repository now:

```bash
git clone https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git
```

Change the current directory to `ai-infrastructure/tpu-training-on-gke/environment`.
To configure the Terraform steps in the build, copy the `terraform.tfvars.tmpl` template file in the `1-base-infrastructure` folder to `terraform.tfvars`. Modify the `terraform.tfvars` file to align it with your specific environment. At the very least, you should set the following variables (an example is sketched after this list):
- `project_id` - your project ID
- `region` - your region for a VPC and a GKE cluster
- `prefix` - the prefix that will be added to the default names of resources provisioned by the configuration
- `tensorboard_config.region` - the region of a TensorBoard instance
- `create_artifact_registry` - set to `true` to create a new Artifact Registry
- `cpu_node_pools` - the `terraform.tfvars.tmpl` template provides an example configuration for a single autoscaling node pool
- `tpu_node_pools` - the template shows an example configuration for two TPU node pools: one with a single v5litepod-4 slice and the other with a single v5litepod-16 slice. Modify the `tpu_node_pools` variable to provision different TPU node pool configurations, as described below.
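As an illustration, a minimal `terraform.tfvars` for this stage might look like the following sketch. The values are placeholders, and the exact shape of the node pool variables should be confirmed against the `terraform.tfvars.tmpl` template and `variables.tf`:

```hcl
# environment/1-base-infrastructure/terraform.tfvars (illustrative values only)
project_id = "my-project-id"
region     = "us-central2"
prefix     = "tpu-training"

tensorboard_config = {
  region = "us-central1"        # region of the Vertex AI TensorBoard instance
}

create_artifact_registry = true

# Assumed node pool shape -- confirm field names against the template
tpu_node_pools = [
  {
    name     = "tpu-v5litepod-16"
    tpu_type = "v5litepod-16"   # a TPU type name from the table below
  },
]
```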
If you wish to modify other default settings, such as the default name suffixes for a cluster or GCS bucket names, you can override the defaults specified in the `variables.tf` file within your `terraform.tfvars` file.
When configuring TPU node pools, ensure that you set the TPU type to one of the following values:
TPU type name | Slice type | Slice topology | TPU VM type | Number of VMs in a slice | Number of chips in a VM |
---|---|---|---|---|---|
v5litepod-4 | tpu-v5-lite-podslice | 2x2 | ct5lp-hightpu-4t | 1 | 4 |
v5litepod-16 | tpu-v5-lite-podslice | 4x4 | ct5lp-hightpu-4t | 4 | 4 |
v5litepod-32 | tpu-v5-lite-podslice | 4x8 | ct5lp-hightpu-4t | 8 | 4 |
v5litepod-64 | tpu-v5-lite-podslice | 8x8 | ct5lp-hightpu-4t | 16 | 4 |
v5litepod-128 | tpu-v5-lite-podslice | 8x16 | ct5lp-hightpu-4t | 32 | 4 |
v5litepod-256 | tpu-v5-lite-podslice | 16x16 | ct5lp-hightpu-4t | 64 | 4 |
v4-8 | tpu-v4-podslice | 2x2x1 | ct4p-hightpu-4t | 1 | 4 |
v4-16 | tpu-v4-podslice | 2x2x2 | ct4p-hightpu-4t | 2 | 4 |
v4-32 | tpu-v4-podslice | 2x2x4 | ct4p-hightpu-4t | 4 | 4 |
v4-64 | tpu-v4-podslice | 2x4x4 | ct4p-hightpu-4t | 8 | 4 |
v4-128 | tpu-v4-podslice | 4x4x4 | ct4p-hightpu-4t | 16 | 4 |
v4-256 | tpu-v4-podslice | 4x4x8 | ct4p-hightpu-4t | 32 | 4 |
v4-512 | tpu-v4-podslice | 4x8x8 | ct4p-hightpu-4t | 64 | 4 |
v4-1024 | tpu-v4-podslice | 8x8x8 | ct4p-hightpu-4t | 128 | 4 |
v4-1536 | tpu-v4-podslice | 8x8x12 | ct4p-hightpu-4t | 192 | 4 |
v4-2048 | tpu-v4-podslice | 8x8x16 | ct4p-hightpu-4t | 256 | 4 |
v4-4096 | tpu-v4-podslice | 8x16x16 | ct4p-hightpu-4t | 512 | 4 |
v5p-8 | tpu-v5p-slice | 2x2x1 | ct5p-hightpu-4t | 1 | 4 |
v5p-16 | tpu-v5p-slice | 2x2x2 | ct5p-hightpu-4t | 2 | 4 |
v5p-32 | tpu-v5p-slice | 2x2x4 | ct5p-hightpu-4t | 4 | 4 |
v5p-64 | tpu-v5p-slice | 2x4x4 | ct5p-hightpu-4t | 8 | 4 |
v5p-128 | tpu-v5p-slice | 4x4x4 | ct5p-hightpu-4t | 16 | 4 |
v5p-256 | tpu-v5p-slice | 4x4x8 | ct5p-hightpu-4t | 32 | 4 |
v5p-384 | tpu-v5p-slice | 4x4x12 | ct5p-hightpu-4t | 48 | 4 |
v5p-512 | tpu-v5p-slice | 4x8x8 | ct5p-hightpu-4t | 64 | 4 |
v5p-640 | tpu-v5p-slice | 4x4x20 | ct5p-hightpu-4t | 80 | 4 |
v5p-768 | tpu-v5p-slice | 4x8x12 | ct5p-hightpu-4t | 96 | 4 |
v5p-896 | tpu-v5p-slice | 4x4x28 | ct5p-hightpu-4t | 112 | 4 |
v5p-1024 | tpu-v5p-slice | 8x8x8 | ct5p-hightpu-4t | 128 | 4 |
v5p-1152 | tpu-v5p-slice | 4x12x12 | ct5p-hightpu-4t | 144 | 4 |
v5p-1280 | tpu-v5p-slice | 4x8x20 | ct5p-hightpu-4t | 160 | 4 |
v5p-1408 | tpu-v5p-slice | 4x4x44 | ct5p-hightpu-4t | 176 | 4 |
v5p-1536 | tpu-v5p-slice | 8x8x12 | ct5p-hightpu-4t | 192 | 4 |
v5p-1664 | tpu-v5p-slice | 4x4x52 | ct5p-hightpu-4t | 208 | 4 |
v5p-1792 | tpu-v5p-slice | 4x8x28 | ct5p-hightpu-4t | 224 | 4 |
v5p-1920 | tpu-v5p-slice | 4x12x20 | ct5p-hightpu-4t | 240 | 4 |
v5p-2048 | tpu-v5p-slice | 8x8x16 | ct5p-hightpu-4t | 256 | 4 |
v5p-2176 | tpu-v5p-slice | 4x4x68 | ct5p-hightpu-4t | 272 | 4 |
v5p-2304 | tpu-v5p-slice | 8x12x12 | ct5p-hightpu-4t | 288 | 4 |
v5p-2432 | tpu-v5p-slice | 4x4x76 | ct5p-hightpu-4t | 304 | 4 |
v5p-2560 | tpu-v5p-slice | 8x8x20 | ct5p-hightpu-4t | 320 | 4 |
v5p-2688 | tpu-v5p-slice | 4x12x28 | ct5p-hightpu-4t | 336 | 4 |
v5p-2816 | tpu-v5p-slice | 4x8x44 | ct5p-hightpu-4t | 352 | 4 |
v5p-2944 | tpu-v5p-slice | 4x4x92 | ct5p-hightpu-4t | 368 | 4 |
v5p-3072 | tpu-v5p-slice | 4x12x16 | ct5p-hightpu-4t | 384 | 4 |
v5p-3200 | tpu-v5p-slice | 4x20x20 | ct5p-hightpu-4t | 400 | 4 |
v5p-3328 | tpu-v5p-slice | 4x8x52 | ct5p-hightpu-4t | 416 | 4 |
v5p-3456 | tpu-v5p-slice | 12x12x12 | ct5p-hightpu-4t | 432 | 4 |
v5p-3584 | tpu-v5p-slice | 8x8x28 | ct5p-hightpu-4t | 448 | 4 |
v5p-3712 | tpu-v5p-slice | 4x4x116 | ct5p-hightpu-4t | 464 | 4 |
v5p-3840 | tpu-v5p-slice | 8x12x20 | ct5p-hightpu-4t | 480 | 4 |
v5p-3968 | tpu-v5p-slice | 4x4x124 | ct5p-hightpu-4t | 496 | 4 |
v5p-4096 | tpu-v5p-slice | 8x16x16 | ct5p-hightpu-4t | 512 | 4 |
v5p-4224 | tpu-v5p-slice | 4x12x44 | ct5p-hightpu-4t | 528 | 4 |
v5p-4352 | tpu-v5p-slice | 4x8x68 | ct5p-hightpu-4t | 544 | 4 |
v5p-4480 | tpu-v5p-slice | 4x20x28 | ct5p-hightpu-4t | 560 | 4 |
v5p-4608 | tpu-v5p-slice | 12x12x16 | ct5p-hightpu-4t | 576 | 4 |
v5p-4736 | tpu-v5p-slice | 4x4x148 | ct5p-hightpu-4t | 592 | 4 |
v5p-4864 | tpu-v5p-slice | 4x8x76 | ct5p-hightpu-4t | 608 | 4 |
v5p-4992 | tpu-v5p-slice | 4x12x52 | ct5p-hightpu-4t | 624 | 4 |
v5p-5120 | tpu-v5p-slice | 8x16x20 | ct5p-hightpu-4t | 640 | 4 |
v5p-5248 | tpu-v5p-slice | 4x4x164 | ct5p-hightpu-4t | 656 | 4 |
v5p-5376 | tpu-v5p-slice | 8x12x28 | ct5p-hightpu-4t | 672 | 4 |
v5p-5504 | tpu-v5p-slice | 4x4x172 | ct5p-hightpu-4t | 688 | 4 |
v5p-5632 | tpu-v5p-slice | 8x8x44 | ct5p-hightpu-4t | 704 | 4 |
v5p-5760 | tpu-v5p-slice | 12x12x20 | ct5p-hightpu-4t | 720 | 4 |
v5p-5888 | tpu-v5p-slice | 4x8x92 | ct5p-hightpu-4t | 736 | 4 |
v5p-6016 | tpu-v5p-slice | 4x4x188 | ct5p-hightpu-4t | 752 | 4 |
v5p-6144 | tpu-v5p-slice | 12x16x16 | ct5p-hightpu-4t | 768 | 4 |
v5p-6272 | tpu-v5p-slice | 4x28x28 | ct5p-hightpu-4t | 784 | 4 |
v5p-6400 | tpu-v5p-slice | 8x20x20 | ct5p-hightpu-4t | 800 | 4 |
v5p-6528 | tpu-v5p-slice | 4x12x68 | ct5p-hightpu-4t | 816 | 4 |
v5p-6656 | tpu-v5p-slice | 8x8x52 | ct5p-hightpu-4t | 832 | 4 |
v5p-6784 | tpu-v5p-slice | 4x4x212 | ct5p-hightpu-4t | 848 | 4 |
v5p-6912 | tpu-v5p-slice | 12x12x24 | ct5p-hightpu-4t | 864 | 4 |
v5p-7040 | tpu-v5p-slice | 4x20x44 | ct5p-hightpu-4t | 880 | 4 |
v5p-7168 | tpu-v5p-slice | 8x16x28 | ct5p-hightpu-4t | 896 | 4 |
v5p-7296 | tpu-v5p-slice | 4x12x76 | ct5p-hightpu-4t | 912 | 4 |
v5p-7424 | tpu-v5p-slice | 4x8x116 | ct5p-hightpu-4t | 928 | 4 |
v5p-7552 | tpu-v5p-slice | 4x4x236 | ct5p-hightpu-4t | 944 | 4 |
v5p-7680 | tpu-v5p-slice | 12x16x20 | ct5p-hightpu-4t | 960 | 4 |
v5p-7808 | tpu-v5p-slice | 4x4x244 | ct5p-hightpu-4t | 976 | 4 |
v5p-7936 | tpu-v5p-slice | 4x8x124 | ct5p-hightpu-4t | 992 | 4 |
v5p-8064 | tpu-v5p-slice | 12x12x28 | ct5p-hightpu-4t | 1008 | 4 |
v5p-8192 | tpu-v5p-slice | 16x16x16 | ct5p-hightpu-4t | 1024 | 4 |
v5p-8320 | tpu-v5p-slice | 4x20x52 | ct5p-hightpu-4t | 1040 | 4 |
v5p-8448 | tpu-v5p-slice | 8x12x44 | ct5p-hightpu-4t | 1056 | 4 |
v5p-8704 | tpu-v5p-slice | 8x8x68 | ct5p-hightpu-4t | 1088 | 4 |
v5p-8832 | tpu-v5p-slice | 4x12x92 | ct5p-hightpu-4t | 1104 | 4 |
v5p-8960 | tpu-v5p-slice | 8x20x28 | ct5p-hightpu-4t | 1120 | 4 |
v5p-9216 | tpu-v5p-slice | 12x16x24 | ct5p-hightpu-4t | 1152 | 4 |
v5p-9472 | tpu-v5p-slice | 4x8x148 | ct5p-hightpu-4t | 1184 | 4 |
v5p-9600 | tpu-v5p-slice | 12x20x20 | ct5p-hightpu-4t | 1200 | 4 |
v5p-9728 | tpu-v5p-slice | 8x8x76 | ct5p-hightpu-4t | 1216 | 4 |
v5p-9856 | tpu-v5p-slice | 4x28x44 | ct5p-hightpu-4t | 1232 | 4 |
v5p-9984 | tpu-v5p-slice | 8x12x52 | ct5p-hightpu-4t | 1248 | 4 |
v5p-10240 | tpu-v5p-slice | 16x16x20 | ct5p-hightpu-4t | 1280 | 4 |
v5p-10368 | tpu-v5p-slice | 12x12x36 | ct5p-hightpu-4t | 1296 | 4 |
v5p-10496 | tpu-v5p-slice | 4x8x164 | ct5p-hightpu-4t | 1312 | 4 |
v5p-10752 | tpu-v5p-slice | 12x16x28 | ct5p-hightpu-4t | 1344 | 4 |
v5p-10880 | tpu-v5p-slice | 4x20x68 | ct5p-hightpu-4t | 1360 | 4 |
v5p-11008 | tpu-v5p-slice | 4x8x172 | ct5p-hightpu-4t | 1376 | 4 |
v5p-11136 | tpu-v5p-slice | 4x12x116 | ct5p-hightpu-4t | 1392 | 4 |
v5p-11264 | tpu-v5p-slice | 8x16x44 | ct5p-hightpu-4t | 1408 | 4 |
v5p-11520 | tpu-v5p-slice | 12x20x24 | ct5p-hightpu-4t | 1440 | 4 |
v5p-11648 | tpu-v5p-slice | 4x28x52 | ct5p-hightpu-4t | 1456 | 4 |
v5p-11776 | tpu-v5p-slice | 8x8x92 | ct5p-hightpu-4t | 1472 | 4 |
v5p-11904 | tpu-v5p-slice | 4x12x124 | ct5p-hightpu-4t | 1488 | 4 |
v5p-12032 | tpu-v5p-slice | 4x8x188 | ct5p-hightpu-4t | 1504 | 4 |
v5p-12160 | tpu-v5p-slice | 4x20x76 | ct5p-hightpu-4t | 1520 | 4 |
v5p-12288 | tpu-v5p-slice | 16x16x24 | ct5p-hightpu-4t | 1536 | 4 |
v5p-13824 | tpu-v5p-slice | 12x24x24 | ct5p-hightpu-4t | 1728 | 4 |
v5p-17920 | tpu-v5p-slice | 16x20x28 | ct5p-hightpu-4t | 2240 | 4 |
By default, the following names and identifiers are used when configuring Workload Identity Federation and Kueue:
- The IAM service account for Workload Identity - `<prefix>-wid-sa`
- The Kubernetes service account - `wid-ksa`
- The ClusterQueue name - `cluster-queue`
- The LocalQueue name - `tpu-training-jobs`
- The namespace for the Workload Identity Kubernetes service account and the LocalQueue - `tpu-training`

If you want to change these defaults, create a `terraform.tfvars` file in the `2-gke-config` folder and override the default values from the `environment/2-gke-config/variables.tf` file.
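For example, an override file might look like the sketch below; the variable names here are assumptions for illustration, so confirm them against `environment/2-gke-config/variables.tf` before use:

```hcl
# environment/2-gke-config/terraform.tfvars (illustrative; variable names are
# assumptions -- confirm against environment/2-gke-config/variables.tf)
namespace        = "ml-team"        # namespace for the WID service account and local queue
ksa_name         = "ml-team-ksa"    # Kubernetes service account name
local_queue_name = "ml-team-jobs"   # LocalQueue name
```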
To initiate the build, execute the following command:
```bash
export PROJECT_ID=<PROJECT_ID>
export AUTOMATION_BUCKET=<YOUR_AUTOMATION_BUCKET>
export AUTOMATION_ACCOUNT=<YOUR_AUTOMATION_ACCOUNT>
export ENV_NAME=<ENV_STATE_FOLDER>
export JOBSET_API_VERSION=v0.3.0
export KUEUE_API_VERSION=v0.5.3

gcloud builds submit \
  --project $PROJECT_ID \
  --config cloudbuild.provision.yaml \
  --substitutions _JOBSET_API_VERSION=$JOBSET_API_VERSION,_KUEUE_API_VERSION=$KUEUE_API_VERSION,_AUTOMATION_BUCKET=$AUTOMATION_BUCKET,_ENV_NAME=$ENV_NAME,_AUTOMATION_ACCOUNT=$AUTOMATION_ACCOUNT \
  --timeout "2h" \
  --machine-type=e2-highcpu-32
```
Replace the following values:
- `<PROJECT_ID>` - your project ID
- `<YOUR_AUTOMATION_BUCKET>` - your automation bucket
- `<YOUR_AUTOMATION_ACCOUNT>` - your automation service account
- `<ENV_STATE_FOLDER>` - the name of the folder within your automation bucket where Terraform state and other artifacts will be managed
The examples in this repo have been tested with the `v0.4.0` version of the JobSet API and the `v0.5.3` version of the Kueue API.
To track the progress of the build, you can either follow the link displayed in Cloud Shell or visit the Cloud Build page on the Google Cloud Console.
The `examples` folder contains code samples that demonstrate how to configure, submit, and manage a number of different training workloads. Refer to the README in the `examples` folder for detailed instructions.
To destroy the environment and clean up all the provisioned resources:
```bash
export PROJECT_ID=<PROJECT_ID>
export AUTOMATION_BUCKET=<YOUR_AUTOMATION_BUCKET>
export ENV_NAME=<ENV_STATE_FOLDER>

gcloud builds submit \
  --project $PROJECT_ID \
  --config cloudbuild.destroy.yaml \
  --substitutions _AUTOMATION_BUCKET=$AUTOMATION_BUCKET,_ENV_NAME=$ENV_NAME \
  --timeout "2h" \
  --machine-type=e2-highcpu-32
```