Deploy TPU workloads in GKE Standard


This page provides a foundation for learning how to accelerate machine learning (ML) workloads using TPUs in Google Kubernetes Engine (GKE). TPUs are designed for matrix multiplication processing, such as large-scale deep learning model training. Because TPUs are optimized to handle the enormous datasets and complex models of ML, they are more cost-effective and energy-efficient for ML workloads than general-purpose processors. In this guide, you learn how to deploy ML workloads by using Cloud TPU accelerators, configure quotas for TPUs, configure upgrades for node pools that run TPUs, and monitor TPU workload metrics.

This tutorial is intended for machine learning (ML) engineers and platform admins and operators who are interested in using Kubernetes container orchestration to manage large-scale model training, tuning, and inference workloads with TPUs. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, ensure that you're familiar with TPU concepts and terminology, as described in About TPUs in GKE.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update .
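
For example, you can do both from a terminal (a sketch; PROJECT_ID is a placeholder for your project ID):

    # Enable the GKE API for your project.
    gcloud services enable container.googleapis.com --project=PROJECT_ID

    # Make sure the gcloud CLI components are current.
    gcloud components update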

Plan your TPU configuration

Plan your TPU configuration based on your model and how much memory it requires. Before you use this guide to deploy your workloads on TPU, complete the planning steps in Plan your TPU configuration .

Ensure that you have TPU quota

The following sections help you ensure that you have enough quota when using TPUs in GKE.

Quota for on-demand or Spot VMs

If you are creating a TPU slice node pool with on-demand or Spot VMs, you must have sufficient TPU quota available in the region that you want to use.

Creating a TPU slice node pool that consumes a TPU reservation doesn't require any TPU quota.¹ You can safely skip this section for reserved TPUs.

Creating an on-demand or Spot TPU slice node pool in GKE requires Compute Engine API quota. Compute Engine API quota (compute.googleapis.com) is not the same as Cloud TPU API quota (tpu.googleapis.com), which is needed when creating TPUs with the Cloud TPU API.

To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:

  1. Go to the Quotas page in the Google Cloud console:

    Go to Quotas

  2. In the Filter box, do the following:

    1. Use the following table to select and copy the property of the quota based on the TPU version and machine type. For example, if you plan to create on-demand TPU v5e nodes whose machine type begins with ct5lp- , enter Name: TPU v5 Lite PodSlice chips .

      | TPU version, machine type begins with | Property and name of the quota for on-demand instances | Property and name of the quota for Spot² instances |
      |---|---|---|
      | TPU v3, ct3- | Dimensions (e.g. location): tpu_family:CT3 | Not applicable |
      | TPU v3, ct3p- | Dimensions (e.g. location): tpu_family:CT3P | Not applicable |
      | TPU v4, ct4p- | Name: TPU v4 PodSlice chips | Name: Preemptible TPU v4 PodSlice chips |
      | TPU v5e, ct5lp- | Name: TPU v5 Lite PodSlice chips | Name: Preemptible TPU v5 Lite Podslice chips |
      | TPU v5p, ct5p- | Name: TPU v5p chips | Name: Preemptible TPU v5p chips |
      | TPU Trillium, ct6e- | Dimensions (e.g. location): tpu_family:CT6E | Name: Preemptible TPU slices v6e |
    2. Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a . TPU quota is regional, so all zones within the same region consume the same TPU quota.

If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the region that you need, and you must request a TPU quota adjustment .

When a TPU reservation is created, both the limit and current use values for the corresponding quota increase by the number of chips in the TPU reservation. For example, when a reservation is created for 16 TPU v5e chips whose machine type begins with ct5lp- , then both the Limit and Current usage for the TPU v5 Lite PodSlice chips quota in the relevant region increase by 16.

¹ When creating a TPU slice node pool, use the --reservation and --reservation-affinity=specific flags to create a reserved instance. TPU reservations are available when purchasing a commitment.

² When creating a TPU slice node pool, use the --spot flag to create a Spot instance.

Quotas for additional GKE resources

You may need to increase the following GKE-related quotas in the regions where GKE creates your resources.

  • Persistent Disk SSD (GB) quota: The boot disk of each Kubernetes node requires 100 GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes that you anticipate creating and 100 GB (nodes × 100 GB).
  • In-use IP addresses quota: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes that you anticipate creating.
  • Ensure that max-pods-per-node aligns with the subnet range: Each Kubernetes node uses secondary IP ranges for Pods. For example, a max-pods-per-node value of 32 requires 64 IP addresses, which translates to a /26 subnet per node. Note that this range shouldn't be shared with any other cluster. To avoid exhausting the IP address range, use the --max-pods-per-node flag to limit the number of Pods allowed to be scheduled on a node. The quota for max-pods-per-node should be set at least as high as the maximum number of GKE nodes that you anticipate creating.

To request an increase in quota, see Request a quota adjustment .

Ensure reservation availability

To create a TPU slice node pool using a reservation, the reservation must have sufficient available TPU chips at the time of node pool creation.

To see which reservations exist within a project and how many TPU chips within a TPU reservation are available, view a list of your reservations .
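
You can also list reservations from the CLI (a sketch; PROJECT_ID is a placeholder, and the output columns vary by reservation type):

    # List Compute Engine reservations, including TPU reservations, in the project.
    gcloud compute reservations list --project=PROJECT_ID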

Options for provisioning TPUs in GKE

GKE lets you use TPUs directly in individual workloads by using Kubernetes nodeSelectors in your workload manifest or by creating Standard mode node pools with TPUs.

Alternatively, you can request TPUs by using custom compute classes. Custom compute classes let platform administrators define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware.

For instructions, see the Provision TPUs using custom compute classes section.

Create a cluster

Create a GKE cluster in Standard mode in a region with available TPUs.

Best practice :

Use regional clusters, which provide high availability of the Kubernetes control plane.

    gcloud container clusters create CLUSTER_NAME \
        --location LOCATION \
        --cluster-version VERSION

Replace the following:

  • CLUSTER_NAME : the name of the new cluster.
  • LOCATION : the region with your TPU capacity available.
  • VERSION : the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. To learn about the minimum GKE versions available by TPU machine type, see TPU availability in GKE .
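
For example, the following sketch creates a regional cluster (the name, region, and version are illustrative; 1.33.0-gke.1712000 is the flex-start minimum mentioned later on this page, so substitute a version that's available to your project and supports your TPU):

    gcloud container clusters create tpu-demo-cluster \
        --location us-west4 \
        --cluster-version 1.33.0-gke.1712000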

Create a node pool

You can create a single or multi-host TPU slice node pool.

Create a single-host TPU slice node pool

You can create a single-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

    gcloud container node-pools create NODE_POOL_NAME \
        --location=LOCATION \
        --cluster=CLUSTER_NAME \
        --node-locations=NODE_ZONES \
        --machine-type=MACHINE_TYPE \
        [--sandbox=type=gvisor]

Replace the following:

  • NODE_POOL_NAME : The name of the new node pool.
  • LOCATION : The name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE .
  • CLUSTER_NAME : The name of the cluster.
  • NODE_ZONES : The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE : The type of machine to use for nodes. For more information about TPU compatible machine types, use the table in Choose the TPU version .

Optionally, you can also use the following flags:

  • --num-nodes= NUM_NODES : The initial number of nodes in the node pool in each zone. If you omit this flag, GKE assigns the default of 3 .

    Best practice :

    If you use the --enable-autoscaling flag for the node pool, set --num-nodes to 0 so that the autoscaler provisions additional nodes as soon as your workloads demand them.

  • --reservation= RESERVATION_NAME : The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see About Cloud TPU reservations .

  • --node-labels cloud.google.com/gke-workload-type=HIGH_AVAILABILITY : Tells GKE that the single-host TPU slice node pool is part of a collection. Use this flag if the following conditions apply:

    • The node pool runs inference workloads.
    • The node pool uses TPU Trillium.
    • The node pool doesn't use Spot VMs.

    To learn more about collection scheduling management, see Manage collection scheduling in single-host TPU slices .

  • --enable-autoscaling : Create a node pool with autoscaling enabled. Requires the following additional flags:

    • --total-min-nodes= TOTAL_MIN_NODES : Minimum number of all nodes in the node pool.
    • --total-max-nodes= TOTAL_MAX_NODES : Maximum number of all nodes in the node pool.
    • --location-policy=ANY : prioritize usage of unused reservations and reduce the preemption risk of Spot VMs.
  • --spot : Sets the node pool to use Spot VMs for the nodes in the node pool. This cannot be changed after node pool creation.

  • --flex-start : Sets the node pool to use flex-start provisioning mode. Flex-start is supported in GKE version 1.33.0-gke.1712000 or later.

  • --sandbox=type=gvisor : Provisions a node with GKE Sandbox enabled. Requires TPU v4 and later versions. For more information, see GKE Sandbox .

For a full list of all the flags that you can specify, see the gcloud container node-pools create reference.
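
For example, the following sketch creates an autoscaled single-host TPU v5e slice node pool (the names, zone, and machine type are illustrative; confirm availability for your region first):

    gcloud container node-pools create tpu-v5e-pool \
        --location=us-west4 \
        --cluster=tpu-demo-cluster \
        --node-locations=us-west4-a \
        --machine-type=ct5lp-hightpu-4t \
        --num-nodes=0 \
        --enable-autoscaling \
        --total-min-nodes=0 \
        --total-max-nodes=4 \
        --location-policy=ANY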

Terraform

  1. Ensure that you use the version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:
    resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
      provider       = google
      project        = PROJECT_ID
      cluster        = CLUSTER_NAME
      name           = POOL_NAME
      location       = CLUSTER_LOCATION
      node_locations = [NODE_ZONES]

      node_config {
        machine_type = MACHINE_TYPE

        reservation_affinity {
          consume_reservation_type = "SPECIFIC_RESERVATION"
          key                      = "compute.googleapis.com/reservation-name"
          values                   = [RESERVATION_LABEL_VALUES]
        }

        spot       = true
        flex_start = false
      }
    }
 

Replace the following:

  • NODE_POOL_RESOURCE_NAME : The name of the node pool resource in the Terraform template.
  • PROJECT_ID : Your project ID.
  • CLUSTER_NAME : The name of the existing cluster.
  • POOL_NAME : The name of the node pool to create.
  • CLUSTER_LOCATION : The compute location of the cluster. Specify the region where the TPU version is available. To learn more, see Select a TPU version and topology .
  • NODE_ZONES : The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE : The type of TPU machine to use. To see TPU compatible machine types, use the table in Choose the TPU version .

Optionally, you can also use the following variables:

  • autoscaling : Create a node pool with autoscaling enabled. For a single-host TPU slice, GKE scales between the TOTAL_MIN_NODES and TOTAL_MAX_NODES values.
    • TOTAL_MIN_NODES : Minimum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
    • TOTAL_MAX_NODES : Maximum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
  • RESERVATION_NAME : If you use a Cloud TPU reservation , this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider .
  • spot : Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs .
  • flex_start : Sets the node pool to use flex-start provisioning mode. Can't be set to true if spot is enabled. Flex-start is supported in GKE version 1.33.0-gke.1712000 or later.
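
After you add the block, the standard Terraform workflow applies it; for example:

    terraform init    # Install the google provider (version 4.84.0 or later).
    terraform plan    # Preview the node pool that Terraform will create.
    terraform apply   # Create the TPU slice node pool.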

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the zone based on the TPU version you want to use. To identify an available zone, see TPU availability in GKE .

  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT3: TPU v3, single-host device
    • CT3P: TPU v3, multi-host pod slice
    • CT4P: TPU v4
    • CT5LP: TPU v5e
    • CT5P: TPU v5p
    • CT6E: TPU Trillium (v6e)

  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Choose the TPU version table to learn how to define the machine type and TPU topology that create a single-host TPU slice node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

Create a multi-host TPU slice node pool

You can create a multi-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

    gcloud container node-pools create POOL_NAME \
        --location=LOCATION \
        --cluster=CLUSTER_NAME \
        --node-locations=NODE_ZONES \
        --machine-type=MACHINE_TYPE \
        --tpu-topology=TPU_TOPOLOGY \
        [--num-nodes=NUM_NODES] \
        [--spot] \
        [--flex-start] \
        [--enable-autoscaling --max-nodes MAX_NODES] \
        [--reservation-affinity=specific --reservation=RESERVATION_NAME] \
        [--node-labels cloud.google.com/gke-nodepool-group-name=COLLECTION_NAME,cloud.google.com/gke-workload-type=HIGH_AVAILABILITY] \
        [--placement-type=COMPACT]

Replace the following:

  • POOL_NAME : The name of the new node pool.
  • LOCATION : The name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE .
  • CLUSTER_NAME : The name of the cluster.
  • NODE_ZONES : The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE : The type of machine to use for nodes. To learn more about the available machine types, see Choose the TPU version .
  • TPU_TOPOLOGY : The physical topology for the TPU slice. The format of the topology depends on the TPU version. To learn more about TPU topologies, use the table in Choose a topology .

    To learn more, see Topology .

Optionally, you can also use the following flags:

  • NUM_NODES : The number of nodes in the node pool. It must be zero or the product of the values defined in TPU_TOPOLOGY ( {A}x{B}x{C} ) divided by the number of chips in each VM. For multi-host TPU v4 and TPU v5e, the number of chips in each VM is four. Therefore, if your TPU_TOPOLOGY is 2x4x4 (TPU v4 with four chips in each VM), then NUM_NODES is 32/4, which equals 8. If you omit this flag, GKE calculates a default number of nodes based on the topology and machine type.
  • RESERVATION_NAME : The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see TPU reservation .
  • --spot : Sets the node pool to use Spot VMs for the TPU slice nodes. This cannot be changed after node pool creation. For more information, see Spot VMs .
  • --flex-start : Sets the node pool to use flex-start provisioning mode. Flex-start is supported in GKE version 1.33.0-gke.1712000 or later.
  • --enable-autoscaling : Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.

    • MAX_NODES : The maximum size of the node pool. The --max-nodes flag is required if --enable-autoscaling is supplied and must be equal to the product of the values defined in TPU_TOPOLOGY ( {A}x{B}x{C} ) divided by the number of chips in each VM.
  • --node-labels cloud.google.com/gke-nodepool-group-name= COLLECTION_NAME ,cloud.google.com/gke-workload-type=HIGH_AVAILABILITY : Tells GKE that the multi-host TPU slice node pool is a collection. Use this flag if the following conditions apply:

    • The node pool runs inference workloads.
    • The node pool uses TPU Trillium.
    • The node pool doesn't use Spot VMs, because Spot VMs don't support collection scheduling.

    To learn more about collection scheduling management, see Manage collection scheduling in multi-host TPU slices .

  • --placement-type=COMPACT : Create a node pool with compact placement enabled. This option must be used with the flag --tpu-topology . For more information, see Create a compact placement policy and TPU Topology .
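
For example, the following sketch creates a multi-host TPU v4 slice node pool with a 2x2x4 topology; 16 chips at four chips per VM yields four nodes (the names and zone are illustrative; confirm availability for your region first):

    gcloud container node-pools create tpu-v4-pool \
        --location=us-central2 \
        --cluster=tpu-demo-cluster \
        --node-locations=us-central2-b \
        --machine-type=ct4p-hightpu-4t \
        --tpu-topology=2x2x4 \
        --num-nodes=4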

Terraform

  1. Ensure that you use the version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:

    resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
      provider           = google
      project            = PROJECT_ID
      cluster            = CLUSTER_NAME
      name               = POOL_NAME
      location           = CLUSTER_LOCATION
      node_locations     = [NODE_ZONES]
      initial_node_count = NUM_NODES

      autoscaling {
        max_node_count  = MAX_NODES
        location_policy = "ANY"
      }

      node_config {
        machine_type = MACHINE_TYPE

        reservation_affinity {
          consume_reservation_type = "SPECIFIC_RESERVATION"
          key                      = "compute.googleapis.com/reservation-name"
          values                   = [RESERVATION_LABEL_VALUES]
        }

        spot       = true
        flex_start = false
      }

      placement_policy {
        type         = "COMPACT"
        tpu_topology = TPU_TOPOLOGY
      }
    }

    Replace the following:

    • NODE_POOL_RESOURCE_NAME : The name of the node pool resource in the Terraform template.
    • PROJECT_ID : Your project ID.
    • CLUSTER_NAME : The name of the existing cluster to add the node pool to.
    • POOL_NAME : The name of the node pool to create.
    • CLUSTER_LOCATION : Compute location for the cluster. We recommend having a regional cluster for higher reliability of the Kubernetes control plane. You can also use a zonal cluster. To learn more, see Select a TPU version and topology .
    • NODE_ZONES : The comma-separated list of one or more zones where GKE creates the node pool.
    • NUM_NODES : The number of nodes in the node pool. It must be zero or the product of the number of TPU chips divided by four, because in multi-host TPU slices each TPU slice node has four chips. For example, if TPU_TOPOLOGY is 4x8 , then there are 32 chips, which means NUM_NODES must be 8. To learn more about TPU topologies, use the table in Choose the TPU version .
    • TPU_TOPOLOGY : This indicates the desired physical topology for the TPU slice. The format of the topology depends on the TPU version you are using. To learn more about TPU topologies, use the table in Choose a topology .

    Optionally, you can also use the following variables:

    • RESERVATION_NAME : If you use TPU reservation , this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider .
    • autoscaling : Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
      • MAX_NODES : The maximum size of the node pool. It must be equal to the product of the values defined in TPU_TOPOLOGY ( {A}x{B}x{C} ) divided by the number of chips in each VM.
    • spot : Sets the node pool to use Spot VMs for the TPU slice nodes. This cannot be changed after node pool creation. For more information, see Spot VMs .
    • flex_start : Sets the node pool to use flex-start provisioning mode. Can't be set to true if spot is enabled.

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE .

  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT3P: For TPU v3.
    • CT4P: For TPU v4.
    • CT5LP: For TPU v5e.

  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Choose the TPU version table to learn how to define the machine type and TPU topology that create a multi-host TPU slice node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

How GKE handles capacity issues

If GKE cannot create your TPU slice node pool due to insufficient TPU capacity available, GKE returns an error message indicating the TPU slice nodes cannot be created due to lack of capacity.

If you are creating a single-host TPU slice node pool, the error message looks similar to this:

 2 nodes cannot be created due to lack of capacity. The missing nodes will be
created asynchronously once capacity is available. You can either wait for the
nodes to be up, or delete the node pool and try re-creating it again later. 

If you are creating a multi-host TPU slice node pool, the error message looks similar to this:

 The nodes (managed by ...) cannot be created now due to lack of capacity. They
will be created asynchronously once capacity is available. You can either wait
for the nodes to be up, or delete the node pool and try re-creating it again
later. 

Your TPU provisioning request can stay in the queue for a long time. While it's queued, the node pool remains in the "Provisioning" state.

When capacity becomes available, GKE creates the remaining nodes that were not created.

If you need capacity sooner, consider trying Spot VMs , though note that Spot VMs consume different quota than on-demand instances.

You can delete the queued TPU request by deleting the TPU slice node pool .
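
To check whether a queued node pool is still provisioning, you can inspect its status from the CLI (a sketch; the status field reports values such as PROVISIONING or RUNNING):

    gcloud container node-pools describe POOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=LOCATION \
        --format="value(status)"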

Run your workload on TPU slice nodes

This section explains how to prepare your workloads and examples of how you can run your workloads.

Prepare your workloads

TPU workloads have the following preparation requirements.

  1. Frameworks like JAX, PyTorch, and TensorFlow access TPU VMs using the libtpu shared library. libtpu includes the XLA compiler, TPU runtime software, and the TPU driver. Each release of PyTorch and JAX requires a certain libtpu.so version. To avoid package version conflicts, we recommend using a JAX AI image . To use TPUs in GKE, ensure that you use the following versions. Each TPU type is listed with its accelerator value (the value used in the cloud.google.com/gke-tpu-accelerator node selector):

    • TPU Trillium (v6e): tpu-v6e-slice
    • TPU v5e: tpu-v5-lite-podslice
    • TPU v5p: tpu-v5p-slice
    • TPU v4: tpu-v4-podslice
    • TPU v3: tpu-v3-slice or tpu-v3-device

    Recommended versions:

    • Recommended JAX AI image: jax0.4.35-rev1 or later
    • Recommended jax[tpu] version: 0.4.19 or later
    • Recommended torchxla[tpuvm] version: a nightly build from October 23, 2023
  2. Set the following environment variables for the container requesting the TPU resources:
    • TPU_WORKER_ID : A unique integer for each Pod. This ID denotes a unique worker ID in the TPU slice. The supported values for this field range from zero to the number of Pods minus one.
    • TPU_WORKER_HOSTNAMES : A comma-separated list of TPU VM hostnames or IP addresses that need to communicate with each other within the slice. There should be a hostname or IP address for each TPU VM in the slice. The list of IP addresses or hostnames is ordered and zero-indexed by TPU_WORKER_ID .
    • GKE automatically injects these environment variables by using a mutating webhook when a Job is created with the completionMode: Indexed , subdomain , and parallelism > 1 properties and requesting google.com/tpu resources. GKE adds a headless Service so that DNS records are added for the Pods backing the Service.

      When deploying TPU multi-host resources with KubeRay, GKE provides a deployable webhook as part of the experimental Terraform templates for running Ray on GKE. Instructions for running Ray on GKE with TPUs can be found in the experimental TPU User Guide . The mutating webhook injects these environment variables into Ray clusters requesting google.com/tpu resources and a multi-host cloud.google.com/gke-tpu-topology node selector.
    • In your workload manifest, add Kubernetes node selectors to ensure that GKE schedules your TPU workload on the TPU machine type and TPU topology you defined:

      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: TPU_ACCELERATOR
        cloud.google.com/gke-tpu-topology: TPU_TOPOLOGY

      Replace the following:

      • TPU_ACCELERATOR : The name of the TPU accelerator .
      • TPU_TOPOLOGY : The physical topology for the TPU slice. The format of the topology depends on the TPU version. To learn more, see Plan TPUs in GKE .
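
Optionally, after a multi-host Job is running, you can spot-check the injected environment variables on one of its Pods (a sketch; POD_NAME is a placeholder for one of the Job's Pods):

    kubectl exec POD_NAME -- sh -c 'echo "TPU_WORKER_ID=$TPU_WORKER_ID"; echo "TPU_WORKER_HOSTNAMES=$TPU_WORKER_HOSTNAMES"'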

After you complete the workload preparation, you can run a Job that uses TPUs.

The following sections show examples on how to run a Job that performs basic computation with TPUs.

Example 1: Run a workload that displays the number of available TPU chips in a TPU slice node pool

The following workload returns the number of TPU chips across all of the nodes in a multi-host TPU slice. To create a multi-host slice, the workload has the following parameters:

  • TPU version: TPU v4
  • Topology: 2x2x4

This version and topology selection result in a multi-host slice.

  1. Save the following manifest as available-chips-multihost.yaml :
    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-available-chips
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-available-chips
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice # Node selector to target TPU v4 slice nodes.
            cloud.google.com/gke-tpu-topology: 2x2x4 # Specifies the physical topology for the TPU slice.
          containers:
          - name: tpu-job
            image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
            ports:
            - containerPort: 8471 # Default port using which TPU VMs communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            securityContext:
              privileged: true # Required for GKE versions earlier than 1.28 to access TPUs.
            command:
            - bash
            - -c
            - |
              python -c 'import jax; print("TPU cores:", jax.device_count())' # Python command to count available TPU chips.
            resources:
              requests:
                cpu: 10
                memory: 407Gi
                google.com/tpu: 4 # Request 4 TPU chips for this workload.
              limits:
                cpu: 10
                memory: 407Gi
                google.com/tpu: 4 # Limit to 4 TPU chips for this workload.
  2. Deploy the manifest:
    kubectl create -f available-chips-multihost.yaml

    GKE runs a TPU v4 slice with four VMs (multi-host TPU slice). The slice has 16 interconnected TPU chips.

  3. Verify that the Job created four Pods:
    kubectl get pods

    The output is similar to the following:

    NAME                       READY   STATUS      RESTARTS   AGE
    tpu-job-podslice-0-5cd8r   0/1     Completed   0          97s
    tpu-job-podslice-1-lqqxt   0/1     Completed   0          97s
    tpu-job-podslice-2-f6kwh   0/1     Completed   0          97s
    tpu-job-podslice-3-m8b5c   0/1     Completed   0          97s
  4. Get the logs of one of the Pods:
    kubectl logs POD_NAME 
    

    Replace POD_NAME with the name of one of the created Pods. For example, tpu-job-podslice-0-5cd8r .

    The output is similar to the following:

    TPU cores: 16
  5. Optional: Remove the workload:
    kubectl delete -f available-chips-multihost.yaml

Example 2: Run a workload that displays the number of available TPU chips in the TPU slice

The following workload is a standalone Pod that displays the number of TPU chips attached to a specific node. To create a single-host node, the workload has the following parameters:

  • TPU version: TPU v5e
  • Topology: 2x4

This version and topology selection result in a single-host slice.

  1. Save the following manifest as available-chips-singlehost.yaml :
    apiVersion: v1
    kind: Pod
    metadata:
      name: tpu-job-jax-v5
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice # Node selector to target TPU v5e slice nodes.
        cloud.google.com/gke-tpu-topology: 2x4 # Specify the physical topology for the TPU slice.
      containers:
      - name: tpu-job
        image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
        ports:
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        securityContext:
          privileged: true # Required for GKE versions earlier than 1.28 to access TPUs.
        command:
        - bash
        - -c
        - |
          python -c 'import jax; print("Total TPU chips:", jax.device_count())'
        resources:
          requests:
            google.com/tpu: 8 # Request 8 TPU chips for this container.
          limits:
            google.com/tpu: 8 # Limit to 8 TPU chips for this container.
  2. Deploy the manifest:
    kubectl create -f available-chips-singlehost.yaml

    GKE provisions a node with a single-host TPU slice that uses TPU v5e. The node has eight TPU chips (a single-host TPU slice).

  3. Get the logs of the Pod:
    kubectl logs tpu-job-jax-v5

    The output is similar to the following:

    Total TPU chips: 8
  4. Optional: Remove the workload:
    kubectl delete -f available-chips-singlehost.yaml

Upgrade node pools using accelerators (GPUs and TPUs)

GKE automatically upgrades Standard clusters, including node pools. You can also manually upgrade node pools if you want your nodes on a later version sooner. To control how upgrades work for your cluster, use release channels , maintenance windows and exclusions , and rollout sequencing .

You can also configure a node upgrade strategy for your node pool, such as surge upgrades , blue-green upgrades or short-lived upgrades . By configuring these strategies, you can ensure that the node pools are upgraded in a way that achieves the optimal balance between speed and disruption for your environment. For multi-host TPU slice node pools , instead of using the configured node upgrade strategy, GKE atomically recreates the entire node pool in a single step. To learn more, see the definition of atomicity in Terminology related to TPU in GKE .

Using a node upgrade strategy temporarily requires GKE to provision additional resources, depending on the configuration. If Google Cloud has limited capacity for your node pool's resources—for example, you're seeing resource availability errors when trying to create more nodes with GPUs or TPUs—see Upgrade in a resource-constrained environment .
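
For example, a conservative surge configuration for a single-host TPU slice node pool might look like the following sketch (multi-host TPU slice node pools ignore the strategy and are recreated atomically, as noted earlier):

    gcloud container node-pools update POOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=LOCATION \
        --max-surge-upgrade=1 \
        --max-unavailable-upgrade=0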

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this guide, consider deleting the TPU slice node pools that no longer have scheduled workloads. If the running workloads must be gracefully terminated, use kubectl drain to clean up the workloads before you delete the nodes.
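
For example, you might drain the TPU slice nodes before deleting the pool (a sketch; the label filter assumes the TPU node labels shown earlier, and NODE_NAME is a placeholder):

    # List the nodes in the TPU slice node pool.
    kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice

    # Drain each node, ignoring DaemonSet-managed Pods.
    kubectl drain NODE_NAME --ignore-daemonsets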

  1. Delete a TPU slice node pool:

     gcloud container node-pools delete POOL_NAME \
         --location=LOCATION \
         --cluster=CLUSTER_NAME

    Replace the following:

    • POOL_NAME : The name of the node pool.
    • CLUSTER_NAME : The name of the cluster.
    • LOCATION : The compute location of the cluster.

Configure additional settings

The following sections describe the additional configurations you can apply to your TPU workloads.

Manage collection scheduling

In TPU Trillium, you can use collection scheduling to group TPU slice nodes. Grouping these TPU slice nodes makes it easier to adjust the number of replicas to meet the workload demand. Google Cloud controls software updates to ensure that sufficient slices within the collection are always available to serve traffic.

TPU Trillium supports collection scheduling for single-host and multi-host node pools that run inference workloads. The following describes how collection scheduling behavior depends on the type of TPU slice that you use:

  • Multi-host TPU slice: GKE groups multi-host TPU slices to form a collection. Each GKE node pool is a replica within this collection. To define a collection, create a multi-host TPU slice and assign a unique name to the collection. To add more TPU slices to the collection, create another multi-host TPU slice node pool with the same collection name and workload type.
  • Single-host TPU slice: GKE considers the entire single-host TPU slice node pool as a collection. To add more TPU slices to the collection, you can resize the single-host TPU slice node pool.

To manage a collection, perform any of these actions based on the type of node pool that you use.

Manage collection scheduling in multi-host TPU slice node pools

Use the following tasks to manage multi-host TPU slice node pools.

  • To check if a multi-host TPU slice pool is part of a collection, run the following command:

     gcloud container node-pools describe NODE_POOL_NAME \
         --location LOCATION \
         --cluster CLUSTER_NAME \
         --format="json" \
         | jq -r '"nodepool-group-name: \(.config.labels["cloud.google.com/gke-nodepool-group-name"] // "")\ngke-workload-type: \(.config.labels["cloud.google.com/gke-workload-type"] // "")"'

    The output is similar to the following:

     nodepool-group-name: NODE_POOL_COLLECTION_NAME
     gke-workload-type: HIGH_AVAILABILITY

    If the multi-host TPU slice pool is part of a collection, the output has the following labels:

    • cloud.google.com/gke-workload-type: HIGH_AVAILABILITY
    • cloud.google.com/gke-nodepool-group-name: COLLECTION_NAME
  • To get the list of collections in the cluster, run the following command:

     #!/bin/bash
     # Replace with your cluster name, project, and location
     CLUSTER_NAME=CLUSTER_NAME
     PROJECT=PROJECT_ID
     LOCATION=LOCATION

     declare -A collection_names

     # Get the list of all node pools in the cluster
     node_pools=$(gcloud container node-pools list --cluster "$CLUSTER_NAME" \
       --project "$PROJECT" --location "$LOCATION" --format="value(name)")

     # Iterate over each node pool
     for pool in $node_pools; do
       # Describe the node pool and extract labels using jq
       collection_name=$(gcloud container node-pools describe "$pool" \
         --cluster "$CLUSTER_NAME" \
         --project "$PROJECT" \
         --location "$LOCATION" \
         --format="json" | jq -r '.config.labels["cloud.google.com/gke-nodepool-group-name"]')

       # Add the collection name to the associative array if it's set
       # (jq prints the string "null" when the label is absent).
       if [[ -n "$collection_name" && "$collection_name" != "null" ]]; then
         collection_names["$collection_name"]=1
       fi
     done

     # Print the unique node pool collection names
     echo "Unique cloud.google.com/gke-nodepool-group-name values:"
     for name in "${!collection_names[@]}"; do
       echo "$name"
     done

    The output is similar to the following:

     Unique cloud.google.com/gke-nodepool-group-name values:
     COLLECTION_NAME_1
     COLLECTION_NAME_2
     COLLECTION_NAME_3
  • To get a list of node pools that belong to a collection, run the following command:

     #!/bin/bash
     TARGET_COLLECTION_NAME=COLLECTION_NAME
     CLUSTER_NAME=CLUSTER_NAME
     PROJECT=PROJECT_ID
     LOCATION=LOCATION

     matching_node_pools=()

     # Get the list of all node pools in the cluster
     node_pools=$(gcloud container node-pools list --cluster "$CLUSTER_NAME" \
       --project "$PROJECT" --location "$LOCATION" --format="value(name)")

     # Iterate over each node pool
     for pool in $node_pools; do
       # Get the value of the cloud.google.com/gke-nodepool-group-name label
       collection_name=$(gcloud container node-pools describe "$pool" \
         --cluster "$CLUSTER_NAME" \
         --project "$PROJECT" \
         --location "$LOCATION" \
         --format="json" | jq -r '.config.labels["cloud.google.com/gke-nodepool-group-name"]')

       # Check if the group name matches the target value
       if [[ "$collection_name" == "$TARGET_COLLECTION_NAME" ]]; then
         matching_node_pools+=("$pool")
       fi
     done

     # Print the list of matching node pools
     echo "Node pools with collection name '$TARGET_COLLECTION_NAME':"
     for pool in "${matching_node_pools[@]}"; do
       echo "$pool"
     done
    

    The output is similar to the following:

     Node pools with collection name 'COLLECTION_NAME':
     NODE_POOL_NAME_1
     NODE_POOL_NAME_2
     NODE_POOL_NAME_3
  • To scale up the collection, create another multi-host TPU slice node pool and add the cloud.google.com/gke-workload-type and cloud.google.com/gke-nodepool-group-name node labels. Use the same collection name in cloud.google.com/gke-nodepool-group-name and run the same workload type, as shown in the sketch after this list. If node auto-provisioning is enabled on the cluster, GKE automatically creates pools based on workload demands.

  • To scale down the collection, delete the node pool .

  • To delete the collection, remove all of the attached node pools. You can delete the node pool or delete the cluster . Deleting the cluster removes all of the collections in it.
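
For example, the following sketch adds a second multi-host slice to an existing collection (the pool name, machine type, and topology are illustrative; reuse the collection name and workload type of the first pool):

    gcloud container node-pools create tpu-pool-2 \
        --location=LOCATION \
        --cluster=CLUSTER_NAME \
        --node-locations=NODE_ZONES \
        --machine-type=ct6e-standard-4t \
        --tpu-topology=4x4 \
        --node-labels=cloud.google.com/gke-nodepool-group-name=COLLECTION_NAME,cloud.google.com/gke-workload-type=HIGH_AVAILABILITY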

Manage collection scheduling in single-host TPU slice node pools

Use the following tasks to manage single-host TPU slice node pools.

  • To check if a single-host TPU slice pool has collection scheduling enabled, run the following command:

     gcloud container node-pools describe NODE_POOL_NAME \
         --cluster CLUSTER_NAME \
         --project PROJECT_NAME \
         --location LOCATION \
         --format="json" \
         | jq -r '.config.labels["cloud.google.com/gke-workload-type"]'
    

    The output is similar to the following:

     HIGH_AVAILABILITY
    

    If the single-host TPU slice pool is part of a collection, the output has the cloud.google.com/gke-workload-type: HIGH_AVAILABILITY label.

  • To scale up the collection, resize the node pool manually, or automatically with node auto-provisioning, as shown in the sketch after this list.

  • To scale down the collection, delete the node pool .

  • To delete the collection, remove all of the attached node pools. You can delete the node pool or delete the cluster . Deleting the cluster removes all of the collections in it.
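
For example, a manual resize might look like this sketch (the target size is illustrative):

    gcloud container clusters resize CLUSTER_NAME \
        --node-pool POOL_NAME \
        --location LOCATION \
        --num-nodes 6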

Use Multislice

You can aggregate smaller slices together in a Multislice to handle larger training workloads. For more information, see Multislice TPUs in GKE .

Migrate your TPU reservation

If you have existing TPU reservations, you must first migrate them to the Compute Engine-based reservation system. You can also create a new Compute Engine-based reservation, in which case no migration is needed. To learn how to migrate your TPU reservations, see TPU reservation .

Enable logging

Logs emitted by containers running on GKE nodes, including TPU VMs, are collected by the GKE logging agent and sent to Cloud Logging, where they are visible.

Use GKE node auto-provisioning

You can configure GKE to automatically create and delete node pools to meet the resource demands of your TPU workloads. For more information, see Configuring Cloud TPUs .

Provision TPUs by using custom compute classes

You can also configure GKE to request TPUs during scaling operations that create new nodes by using custom compute classes .

You can specify TPU configuration options in your custom compute class specification. When a GKE workload uses that custom compute class, GKE attempts to provision TPUs that use your specified configuration when scaling up.

To provision TPUs with a custom compute class that follows the TPU rules and deploy the workload, complete the following steps:

  1. Save the following manifest as tpu-compute-class.yaml :

     apiVersion: cloud.google.com/v1
     kind: ComputeClass
     metadata:
       name: tpu-class
     spec:
       priorities:
       - tpu:
           type: tpu-v5-lite-podslice
           count: 4
           topology: 2x4
       - spot: true
         tpu:
           type: tpu-v5-lite-podslice
           count: 4
           topology: 2x4
       - flexStart:
           enabled: true
         tpu:
           type: tpu-v6e-slice
           count: 4
           topology: 2x4
       nodePoolAutoCreation:
         enabled: true
  2. Deploy the compute class:

     kubectl  
    apply  
    -f  
    tpu-compute-class.yaml 
    

    For more information about custom compute classes and TPUs, see TPU configuration .

  3. Save the following manifest as tpu-job.yaml :

     apiVersion: v1
     kind: Service
     metadata:
       name: headless-svc
     spec:
       clusterIP: None
       selector:
         job-name: tpu-job
     ---
     apiVersion: batch/v1
     kind: Job
     metadata:
       name: tpu-job
     spec:
       backoffLimit: 0
       completions: 4
       parallelism: 4
       completionMode: Indexed
       template:
         spec:
           subdomain: headless-svc
           restartPolicy: Never
           nodeSelector:
             cloud.google.com/compute-class: tpu-class
           containers:
           - name: tpu-job
             image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
             ports:
             - containerPort: 8471 # Default port using which TPU VMs communicate
             - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
             command:
             - bash
             - -c
             - |
               python -c 'import jax; print("TPU cores:", jax.device_count())'
             resources:
               requests:
                 cpu: 10
                 memory: MEMORY_SIZE
                 google.com/tpu: NUMBER_OF_CHIPS
               limits:
                 cpu: 10
                 memory: MEMORY_SIZE
                 google.com/tpu: NUMBER_OF_CHIPS

    Replace the following:

    • NUMBER_OF_CHIPS : the number of TPU chips for the container to use. Must be the same value for limits and requests , and equal to the value in the tpu.count field of the selected custom compute class.
    • MEMORY_SIZE : the maximum amount of memory that the TPU uses. Memory limits depend on the TPU version and topology that you use. To learn more, see Minimums and maximums for accelerators .
  4. Deploy the Job:

     kubectl  
    create  
    -f  
    tpu-job.yaml 
    

    When you create this Job, GKE automatically does the following:

    • Provisions nodes to run the Pods. Depending on the TPU type, topology, and resource requests that you specified, these nodes are either single-host slices or multi-host slices. Depending on the availability of TPU resources in the top priority, GKE might fall back to lower priorities to maximize obtainability.
    • Adds taints to the nodes and tolerations to the Pods to prevent any of your other workloads from running on the same nodes as TPU workloads.

    To learn more, see About custom compute classes .

  5. When you finish this section, you can avoid continued billing by deleting the resources you created:

     kubectl  
    delete  
    -f  
    tpu-job.yaml 
    

Configure auto repair for TPU slice nodes

If a TPU slice node in a multi-host TPU slice node pool is unhealthy, the entire node pool is recreated. In contrast, in a single-host TPU slice node pool, only the unhealthy TPU node is auto-repaired.

Conditions that result in unhealthy TPU slice nodes include the following:

  • Any TPU slice node with common node conditions .
  • Any TPU slice node with an unallocatable TPU count larger than zero.
  • Any VM instance in a TPU slice that is stopped (due to preemption) or is terminated.
  • Node maintenance: If any TPU slice node within a multi-host TPU slice node pool goes down for host maintenance, GKE recreates the entire TPU slice node pool.

You can see the repair status (including the failure reason) in the operation history . If the failure is caused by insufficient quota, contact your Google Cloud account representative to increase the corresponding quota.

Configure graceful termination for TPU slice nodes

In GKE clusters with the control plane running 1.29.1-gke.1425000 or later, TPU slice nodes support SIGTERM signals that alert the node of an imminent shutdown. On TPU nodes, the imminent shutdown notification is configurable for up to five minutes.

To configure GKE to terminate your workloads gracefully within this notification timeframe, follow the steps in Manage GKE node disruption for GPUs and TPUs .

Run containers without privileged mode

Containers running on nodes in GKE version 1.28 or later don't need to have privileged mode enabled to access TPUs. Nodes in GKE versions earlier than 1.28 require privileged mode.

If your TPU slice node runs a version earlier than 1.28, read the following section:

A container running on a VM in a TPU slice needs access to higher limits on locked memory so the driver can communicate with the TPU chips over direct memory access (DMA). To enable this, you must configure a higher ulimit . If you want to reduce the permission scope on your container, complete the following steps:

  1. Edit the securityContext to include the following fields:

      securityContext:
        capabilities:
          add: ["SYS_RESOURCE"]
  2. Increase the ulimit by running the following command inside the container before setting up your workloads to use TPU resources:

      ulimit -l 68719476736

For TPU v5e, running containers without privileged mode is available in clusters in version 1.27.4-gke.900 and later.

Observability and metrics

Dashboard

Node pool observability in the Google Cloud console is generally available. To view the status of your multi-host TPU node pools on GKE, go to the GKE TPU Node Pool Status dashboard provided by Cloud Monitoring:

Go to GKE TPU Node Pool Status

This dashboard gives you comprehensive insights into the health of your multi-host TPU node pools. For more information, see Monitor health metrics for TPU nodes and node pools .

In the Kubernetes Clusters page in the Google Cloud console, the Observability tab also displays TPU observability metrics, such as TPU usage, under the Accelerators > TPU heading. For more information, see View observability metrics .

The TPU dashboard is populated only if you have system metrics enabled in your GKE cluster.

Runtime metrics

In GKE version 1.27.4-gke.900 or later, TPU workloads that both use JAX version 0.4.14 or later and specify containerPort: 8431 export TPU utilization metrics as GKE system metrics . The following metrics are available in Cloud Monitoring to monitor your TPU workload's runtime performance:

  • Duty cycle: percentage of time over the past sampling period (60 seconds) during which the TensorCores were actively processing on a TPU chip. Larger percentage means better TPU utilization.
  • Memory used: amount of accelerator memory allocated in bytes. Sampled every 60 seconds.
  • Memory total: total accelerator memory in bytes. Sampled every 60 seconds.

These metrics are available in the Kubernetes node ( k8s_node ) and Kubernetes container ( k8s_container ) schemas.

Kubernetes container:

  • kubernetes.io/container/accelerator/duty_cycle
  • kubernetes.io/container/accelerator/memory_used
  • kubernetes.io/container/accelerator/memory_total

Kubernetes node:

  • kubernetes.io/node/accelerator/duty_cycle
  • kubernetes.io/node/accelerator/memory_used
  • kubernetes.io/node/accelerator/memory_total

Monitor health metrics for TPU nodes and node pools

When a training job has an error or terminates in failure, you can check metrics related to the underlying infrastructure to figure out if the interruption was caused by an issue with the underlying node or node pool.

Node status

In GKE version 1.32.1-gke.1357001 or later, the following GKE system metric exposes the condition of a GKE node:

  • kubernetes.io/node/status_condition

The condition field reports conditions on the node, such as Ready , DiskPressure , and MemoryPressure . The status field shows the reported status of the condition, which can be True , False , or Unknown . This is a metric with the k8s_node monitored resource type.

This PromQL query shows if a particular node is Ready :

    kubernetes_io:node_status_condition{monitored_resource="k8s_node",
      cluster_name="CLUSTER_NAME", node_name="NODE_NAME",
      condition="Ready", status="True"}

To help troubleshoot issues in a cluster, you might want to look at nodes that have exhibited other conditions:

    kubernetes_io:node_status_condition{monitored_resource="k8s_node",
      cluster_name="CLUSTER_NAME", condition!="Ready", status="True"}

You might want to specifically look at nodes that aren't Ready :

    kubernetes_io:node_status_condition{monitored_resource="k8s_node",
      cluster_name="CLUSTER_NAME", condition="Ready", status="False"}

If the query returns no data, the nodes are ready. The status condition is sampled every 60 seconds.

You can use the following query to understand the node status across the fleet:

    avg by (condition, status)(
      avg_over_time(
        kubernetes_io:node_status_condition{monitored_resource="k8s_node"}[${__interval}]))

Node pool status

The following GKE system metric for the k8s_node_pool monitored resource exposes the status of a GKE node pool:

  • kubernetes.io/node_pool/status

This metric is reported only for multi-host TPU node pools.

The status field reports the status of the node pool, such as Provisioning , Running , Error , Reconciling , or Stopping . Status updates happen after GKE API operations complete.

To verify if a particular node pool has Running status, use the following PromQL query:

    kubernetes_io:node_pool_status{monitored_resource="k8s_node_pool",
      cluster_name="CLUSTER_NAME", node_pool_name="NODE_POOL_NAME",
      status="Running"}

To monitor the number of node pools in your project grouped by their status, use the following PromQL query:

    count by (status)(
      count_over_time(
        kubernetes_io:node_pool_status{monitored_resource="k8s_node_pool"}[${__interval}]))

Node pool availability

The following GKE system metric shows whether a multi-host TPU node pool is available:

  • kubernetes.io/node_pool/multi_host/available

The metric has a value of True if all of the nodes in the node pool are available, and False otherwise. The metric is sampled every 60 seconds.

To check the availability of multi-host TPU node pools in your project, use the following PromQL query:

    avg by (node_pool_name)(
      avg_over_time(
        kubernetes_io:node_pool_multi_host_available{monitored_resource="k8s_node_pool",
          cluster_name="CLUSTER_NAME"}[${__interval}]))

Node interruption count

The following GKE system metric reports the count of interruptions for a GKE node since the last sample (the metric is sampled every 60 seconds):

  • kubernetes.io/node/interruption_count

The interruption_type field (such as TerminationEvent , MaintenanceEvent , or PreemptionEvent ) and the interruption_reason field (such as HostError , Eviction , or AutoRepair ) can help explain why a node was interrupted.

To get a breakdown of the interruptions and their causes in TPU nodes in the clusters in your project, use the following PromQL query:

   
    sum by (interruption_type, interruption_reason)(
      sum_over_time(
        kubernetes_io:node_interruption_count{monitored_resource="k8s_node"}[${__interval}]))

To see only the host maintenance events , update the query to filter for the HW/SW Maintenance value in interruption_reason . Use the following PromQL query:

   
    sum by (interruption_type, interruption_reason)(
      sum_over_time(
        kubernetes_io:node_interruption_count{monitored_resource="k8s_node",
          interruption_reason="HW/SW Maintenance"}[${__interval}]))

To see the interruption count aggregated by node pool, use the following PromQL query:

   
    sum by (node_pool_name, interruption_type, interruption_reason)(
      sum_over_time(
        kubernetes_io:node_pool_interruption_count{monitored_resource="k8s_node_pool",
          interruption_reason="HW/SW Maintenance",
          node_pool_name="NODE_POOL_NAME"}[${__interval}]))

Node pool times to recover (TTR)

The following GKE system metric reports the distribution of recovery period durations for GKE multi-host TPU node pools:

  • kubernetes.io/node_pool/accelerator/times_to_recover

Each sample recorded in this metric indicates a single recovery event for the node pool from a downtime period.

This metric is useful for tracking the multi-host TPU node pool time to recover and time between interruptions.

You can use the following PromQL query to calculate the mean time to recovery (MTTR) for the last 7 days in your cluster:

    sum(sum_over_time(
      kubernetes_io:node_pool_accelerator_times_to_recover_sum{monitored_resource="k8s_node_pool",
        cluster_name="CLUSTER_NAME"}[7d]))
    /
    sum(sum_over_time(
      kubernetes_io:node_pool_accelerator_times_to_recover_count{monitored_resource="k8s_node_pool",
        cluster_name="CLUSTER_NAME"}[7d]))

Node pool times between interruptions (TBI)

Node pool times between interruptions measures how long your infrastructure runs before experiencing an interruption. It is computed as an average over a window of time: the numerator measures the total time that your infrastructure was up, and the denominator measures the total number of interruptions to your infrastructure.

The following PromQL example shows the 7-day mean time between interruptions (MTBI) for the given cluster:

    sum(count_over_time(
      kubernetes_io:node_memory_total_bytes{monitored_resource="k8s_node",
        node_name=~"gke-tpu.*|gk3-tpu.*", cluster_name="CLUSTER_NAME"}[7d]))
    /
    sum(sum_over_time(
      kubernetes_io:node_interruption_count{monitored_resource="k8s_node",
        node_name=~"gke-tpu.*|gk3-tpu.*", cluster_name="CLUSTER_NAME"}[7d]))

Host metrics

In GKE version 1.28.1-gke.1066000 or later, VMs in a TPU slice export TPU utilization metrics as GKE system metrics . The following metrics are available in Cloud Monitoring to monitor your TPU host's performance:

  • TensorCore utilization: current percentage of the TensorCore that is utilized. The TensorCore value equals the sum of the matrix-multiply units (MXUs) plus the vector unit. The TensorCore utilization value is the number of TensorCore operations performed over the past sample period (60 seconds) divided by the supported number of TensorCore operations over the same period. A larger value means better utilization.
  • Memory bandwidth utilization: current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period (60 seconds) by the maximum supported bandwidth over the same period.

These metrics are located in the Kubernetes node ( k8s_node ) and Kubernetes container ( k8s_container ) schemas.

Kubernetes container:

  • kubernetes.io/container/accelerator/tensorcore_utilization
  • kubernetes.io/container/accelerator/memory_bandwidth_utilization

Kubernetes node:

  • kubernetes.io/node/accelerator/tensorcore_utilization
  • kubernetes.io/node/accelerator/memory_bandwidth_utilization
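
For example, to chart average TensorCore utilization per TPU node, you can reuse the pattern of the earlier queries. The following is a sketch that assumes the PromQL metric name follows the same naming convention as the other kubernetes.io metrics on this page:

    avg by (node_name)(
      avg_over_time(
        kubernetes_io:node_accelerator_tensorcore_utilization{monitored_resource="k8s_node",
          cluster_name="CLUSTER_NAME"}[${__interval}]))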

For more information, see Kubernetes metrics and GKE system metrics .

Known issues

  • Cluster autoscaler might incorrectly calculate capacity for new TPU slice nodes before those nodes report available TPUs. Cluster autoscaler might then perform an additional scale-up and, as a result, create more nodes than needed. After a regular scale-down operation, cluster autoscaler removes the extra nodes if they are not needed.
  • Cluster autoscaler cancels scale-up of TPU slice node pools that remain in waiting status for more than 10 hours. Cluster autoscaler retries such scale-up operations later. This behavior might reduce TPU obtainability for customers who don't use reservations.
  • Non-TPU workloads that tolerate the TPU taint can prevent scale-down of the node pool if they are recreated while the TPU slice node pool is draining.
  • The memory bandwidth utilization metric is not available for v5e TPUs.

What's next
