Cloud TPU deployments overview

This document explains the core concepts and advanced configurations available when you use Cluster Toolkit to deploy Google Kubernetes Engine (GKE) clusters with dedicated Cloud TPU node pools.

Cluster Toolkit can provision GKE clusters with dedicated Cloud TPU node pools designed for artificial intelligence (AI) and machine learning (ML) workloads. It provides blueprints that automate the deployment of the supported Cloud TPU architectures.

Key concepts for Cloud TPU deployments in Cluster Toolkit

To deploy a Cloud TPU cluster, you must understand how Cloud TPU hardware is grouped and allocated. You create each Cloud TPU slice as a single, fixed hardware unit with physical, high-speed interconnects between all of its nodes.

Cloud TPU in GKE

Cloud TPU is Google's custom-developed Application-Specific Integrated Circuit (ASIC) that is used to accelerate machine learning workloads. GKE provides full support for Cloud TPU node and node pool lifecycle management, including node creation, configuration, and automatic upgrades.

For more information about how Cloud TPU works with GKE, see About TPUs in GKE.

TPU slices

A Cloud TPU slice is a collection of Cloud TPU chips that are physically connected by a dedicated, ultra-low-latency network called the Inter-Chip Interconnect (ICI). This interconnect lets the nodes function as a single supercomputer.

In Cluster Toolkit, the num_slices parameter specifies the number of identical, independent Cloud TPU slices to create. Each slice corresponds to an individual node pool.

TPU topologies

The tpu_topology parameter defines the shape and total size of your Cloud TPU slice. Topologies describe the number and physical arrangement of the Cloud TPU chips in a slice.

  • Cloud TPU v6e chips are arranged in a 2D torus topology, such as 2x4.
  • Cloud TPU 7x chips have a direct connection to the nearest neighboring chips in 3 dimensions, forming a 3D torus interconnect topology, such as 2x2x1.

The numbers in the topology represent the number of chips along each dimension of the TPU slice.
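Because each number is a chip count along one dimension, the number of chips in a slice is the product of those numbers. The following Python sketch illustrates that arithmetic. The function names parse_topology and chips_in_slice are hypothetical helpers written for this example and are not part of Cluster Toolkit.

```python
from math import prod


def parse_topology(topology: str) -> tuple[int, ...]:
    """Split a topology string such as "2x4" into per-dimension chip counts.

    Hypothetical helper for illustration; not a Cluster Toolkit API.
    """
    # int() raises ValueError on empty or non-numeric parts.
    dims = tuple(int(part) for part in topology.lower().split("x"))
    if any(d < 1 for d in dims):
        raise ValueError(f"invalid topology: {topology!r}")
    return dims


def chips_in_slice(topology: str) -> int:
    """Return the number of chips in a slice: the product of its dimensions."""
    return prod(parse_topology(topology))


print(chips_in_slice("2x4"))    # 2D example from the text: 2 * 4 = 8
print(chips_in_slice("2x2x1"))  # 3D example from the text: 2 * 2 * 1 = 4
```

This covers only the arithmetic. Which topologies are valid for a given Cloud TPU version is defined in the Cloud TPU documentation, and the sketch does not check that.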

For more information about the architecture and configurations of these TPUs, see Cloud TPU v6e and Cloud TPU 7x in the Cloud TPU documentation.

Comparison of blueprint options

Cluster Toolkit provides both standard and advanced blueprints for deploying Cloud TPU v6e and Cloud TPU 7x clusters.

Both standard and advanced blueprints support the following features:

  • Multi-Virtual Private Cloud (VPC) network architecture: Provides the network architecture required for optimized high-throughput inter-chip communication.
  • Dedicated service accounts: Provides dedicated service accounts for the deployment, which helps improve security isolation.

For more information about blueprints, see the Cluster blueprint overview.

Standard blueprints

Standard blueprints provide a basic, functional cluster with a single TPU node pool and standard networking configuration. You can view the following standard blueprints on GitHub:

Advanced blueprints

To optimize high-throughput inter-chip communication and data pipelines, advanced blueprints add the following features to the standard blueprint:

  • Automatic bucket creation: Creates two Cloud Storage buckets to store training data and checkpoints.
  • Performance-tuned Cloud Storage FUSE mounts: Pre-configures Cloud Storage FUSE mounts in the cluster as Persistent Volumes for optimized high-throughput performance. Cloud Storage FUSE uses Workload Identity access. To learn how to mount Cloud Storage buckets in GKE, see Access Cloud Storage buckets with the Cloud Storage FUSE CSI driver.
  • Hyperdisk Balanced support: Provides highly available and consistent performance across GKE nodes. For more information, see About Hyperdisk for GKE.
  • Google Cloud Managed Lustre support: Provides a high-performance, fully managed parallel file system optimized for AI and high performance computing (HPC) workloads. For more information, see the Google Cloud Managed Lustre overview.
  • Filestore support: Provides managed Network File System (NFS) capabilities that let multiple Cloud TPU hosts share logs, code, or datasets. For more information, see the Filestore overview.

You can view the following advanced blueprints on GitHub:

Workload scheduling and consumption

Cluster Toolkit supports advanced scheduling and consumption models to help you manage costs and resource availability:

  • Kueue: A Kubernetes-native system for managing quotas and Job queuing. Cluster Toolkit can automatically calculate and set a Cloud TPU quota matching the total static Cloud TPU capacity of your cluster.
  • Dynamic Workload Scheduler: Cluster Toolkit supports the Dynamic Workload Scheduler (DWS) flex-start provisioning mode, which is a cost-effective fit for training jobs that can wait for capacity. The blueprints also support queued provisioning with DWS flex-start. For more information, see About GPU, TPU, and H4D consumption with flex-start provisioning mode.
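To get a feel for the size of the Kueue quota described above, you can estimate the static capacity of a cluster from num_slices and tpu_topology. The following Python sketch is an illustration only, not Cluster Toolkit's actual calculation. It assumes that the quota is counted in chips, that every slice is statically provisioned, and that all slices share one topology. The function name total_tpu_chips is hypothetical.

```python
from math import prod


def total_tpu_chips(num_slices: int, tpu_topology: str) -> int:
    """Estimate static capacity as slices multiplied by chips per slice.

    Assumption: quota is counted in chips and all slices are identical.
    Hypothetical helper; not how Cluster Toolkit necessarily computes it.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    # Chips per slice is the product of the topology dimensions, e.g. 2x4 -> 8.
    chips_per_slice = prod(int(dim) for dim in tpu_topology.lower().split("x"))
    return num_slices * chips_per_slice


print(total_tpu_chips(2, "2x4"))    # 2 slices * 8 chips = 16
print(total_tpu_chips(1, "2x2x1"))  # 1 slice * 4 chips = 4
```

Capacity obtained through flex-start is not static, so it falls outside this estimate.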

What's next
