Train a model using TPU7x (Ironwood)

This document describes how to provision TPU7x resources and gives an example of deploying a training workload using MaxText and XPK.

TPU7x is the first release in the Ironwood family, Google Cloud's seventh-generation TPU. The Ironwood generation is designed for large-scale AI training and inference. For more information, see TPU7x.

For more examples optimized for TPU7x, see Training Recipes for Ironwood TPU on GitHub.

Provision TPUs

You can provision and manage TPU7x using the following methods:

  • GKE: You can use GKE to provision and manage TPUs as a pool of accelerators for your containerized machine learning workloads. Use the Google Cloud CLI to create your GKE cluster manually for precise customization or to expand existing production GKE environments. For more information, see About TPUs in GKE.
  • GKE and XPK: XPK is a command-line tool that simplifies cluster creation and workload execution on GKE. It's designed for ML practitioners to provision TPUs and run training jobs without needing deep Kubernetes expertise. Use XPK to quickly create GKE clusters and run workloads for proof-of-concept and testing. For more information, see the XPK GitHub repository.
  • GKE and TPU Cluster Director: TPU Cluster Director is available through an All Capacity mode reservation, which gives you full access to all of your reserved capacity (without hold-backs) and full visibility into the TPU hardware topology, utilization status, and health status. For more information, see All Capacity mode overview.

Deploy a training workload with MaxText and XPK

Use the Accelerated Processing Kit (XPK) to create GKE clusters for proof-of-concept and testing. XPK is a command-line tool designed to simplify provisioning, managing, and running machine learning workloads.

The following sections show how to deploy a training workload using MaxText and XPK.

Before you begin

Before you start, complete the following steps:

  • Ensure you have a Google Cloud project with billing enabled.
  • Get access to TPU7x. For more information, contact your account team.
  • Ensure the account you're using with XPK has the roles listed in the XPK GitHub repository.

Install XPK and dependencies

  1. Install XPK by following the instructions in the XPK GitHub repository.

  2. Install Docker using the instructions provided by your administrator, or follow the official installation instructions. After installation, run the following commands to configure Docker and test the installation:

     gcloud auth configure-docker
     sudo usermod -aG docker $USER

     # Relaunch the terminal and reactivate your virtual environment after running this command.
     docker run hello-world  # Test Docker
     
    
  3. Set the following environment variables:

     export PROJECT_ID=YOUR_PROJECT_ID
     export ZONE=YOUR_ZONE
     export CLUSTER_NAME=YOUR_CLUSTER_NAME
     export ACCELERATOR_TYPE=YOUR_ACCELERATOR_TYPE
     export BASE_OUTPUT_DIR="gs://YOUR_BUCKET_NAME"
     export NUM_SLICES=YOUR_NUM_SLICES

    Replace the following:

    • YOUR_PROJECT_ID: Your Google Cloud project ID.
    • YOUR_ZONE: The zone in which to create the cluster. For Preview, only us-central1-c is supported.
    • YOUR_CLUSTER_NAME: The name of the new cluster.
    • YOUR_ACCELERATOR_TYPE: The TPU version and topology. For example, tpu7x-4x4x8. For a list of supported topologies, see Supported configurations.
    • YOUR_BUCKET_NAME: The name of your Cloud Storage bucket, which is used as the output directory for model training.
    • YOUR_NUM_SLICES: The number of TPU slices to provision, for example 1.
  4. If you don't have an existing Cloud Storage bucket, create one using the following command:

     gcloud storage buckets create ${BASE_OUTPUT_DIR} \
       --project=${PROJECT_ID} \
       --location=US \
       --default-storage-class=STANDARD \
       --uniform-bucket-level-access
    

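The accelerator type string encodes the TPU generation and the slice topology, and the chip count of a slice is the product of the topology dimensions (for example, tpu7x-4x4x8 is a 128-chip slice). The helper below is a convenience sketch for illustration only, not part of the official tooling:

```shell
# chips_in_topology: derive the chip count from an accelerator type such as
# "tpu7x-4x4x8" by multiplying the topology dimensions (4 * 4 * 8 = 128).
chips_in_topology() {
  local topology="${1#*-}"   # strip the generation prefix, leaving e.g. "4x4x8"
  local product=1 dim
  local IFS='x'
  for dim in $topology; do
    product=$((product * dim))
  done
  echo "$product"
}

chips_in_topology tpu7x-4x4x8   # prints 128
```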
Create a single-NIC, single slice cluster

Choose one of the following options to create your cluster. Using a custom network with 8,896 MTU is recommended for optimal performance.

Custom network

To create a custom network with 8,896 MTU and use it for your cluster, follow these steps:

  1. Set environment variables for the network and firewall names:

     export NETWORK_NAME=NETWORK_NAME
     export NETWORK_FW_NAME=FIREWALL_NAME
    

    Replace the following:

    • NETWORK_NAME: A name for the network.
    • FIREWALL_NAME: A name for the network firewall rule.
  2. Create a custom network with an MTU of 8,896:

    gcloud compute networks create ${NETWORK_NAME} \
      --mtu=8896 \
      --project=${PROJECT_ID} \
      --subnet-mode=auto \
      --bgp-routing-mode=regional
  3. Create a firewall rule that allows TCP, ICMP, and UDP traffic on your network:

    gcloud compute firewall-rules create ${NETWORK_FW_NAME} \
      --network=${NETWORK_NAME} \
      --allow tcp,icmp,udp \
      --project=${PROJECT_ID}
    
  4. Set an environment variable for the XPK cluster arguments to use the network you created:

     export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${NETWORK_NAME}"
    
  5. Create the XPK cluster. The following command provisions on-demand capacity:

    xpk cluster create --cluster=${CLUSTER_NAME} \
      --cluster-cpu-machine-type=n1-standard-8 \
      --num-slices=${NUM_SLICES} \
      --tpu-type=${ACCELERATOR_TYPE} \
      --zone=${ZONE} \
      --project=${PROJECT_ID} \
      --on-demand \
      --custom-cluster-arguments="${CLUSTER_ARGUMENTS}"
    

    To use reserved capacity, replace --on-demand with --reservation=RESERVATION_NAME. To use TPU Spot VMs, replace --on-demand with --spot.

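After creating the network, you can optionally confirm that it carries the recommended MTU. The gcloud query below is a standard describe command; the check_mtu helper is a convenience sketch, not part of the official steps:

```shell
# check_mtu: warn unless the reported MTU matches the recommended 8,896.
check_mtu() {
  if [ "$1" = "8896" ]; then
    echo "MTU OK"
  else
    echo "Warning: MTU is $1, expected 8896." >&2
    return 1
  fi
}

# Query the network's MTU and check it (requires gcloud credentials
# for your project):
#   check_mtu "$(gcloud compute networks describe "${NETWORK_NAME}" \
#     --project="${PROJECT_ID}" --format='value(mtu)')"
```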
Default network

If you don't require a high-MTU network, you can create a cluster that uses the default VPC network. The following command provisions on-demand capacity:

xpk cluster create --cluster=${CLUSTER_NAME} \
  --cluster-cpu-machine-type=n1-standard-8 \
  --num-slices=${NUM_SLICES} \
  --tpu-type=${ACCELERATOR_TYPE} \
  --zone=${ZONE} \
  --project=${PROJECT_ID} \
  --on-demand

To use reserved capacity, replace --on-demand with --reservation=RESERVATION_NAME. To use TPU Spot VMs, replace --on-demand with --spot.

Build or upload the MaxText Docker image

You can either build a Docker image locally using scripts provided by MaxText or use a prebuilt image.

Build locally

The following commands copy your local directory into the container:

# Make sure you're running in a virtual environment with Python 3.12.
# If nothing is printed, you have the correct version.
[[ "$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]] || { >&2 echo "Error: Python version must be 3.12."; false; }

# Clone MaxText
git clone https://github.com/AI-Hypercomputer/maxtext.git
cd maxtext
git checkout maxtext-tutorial-v1.0.0

# Build the Docker image
bash docker_build_dependency_image.sh MODE=stable JAX_VERSION=0.8.2

After the commands complete successfully, you should see an image named maxtext_base_image created locally. You can use this local image directly in the xpk workload command.

Upload image (optional)

After building the Docker image locally using the instructions in the previous section, you can upload the MaxText Docker image into the registry using the following command:

export CLOUD_IMAGE_NAME="${USER}-maxtext-runner"
bash docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME}

After this command completes successfully, you should see the MaxText image in gcr.io with the name gcr.io/PROJECT_ID/CLOUD_IMAGE_NAME.

Define the MaxText training command

Prepare the command to run your training script within the Docker container.

The MaxText 1B model is a configuration within the MaxText framework designed for training a language model with approximately 1 billion parameters. Use this model to experiment with small chip scales. Performance is not optimized.

export MAXTEXT_COMMAND="JAX_PLATFORMS=tpu,cpu \
ENABLE_PJRT_COMPATIBILITY=true \
python3 src/MaxText/train.py src/MaxText/configs/base.yml \
base_output_directory=${BASE_OUTPUT_DIR} \
dataset_type=synthetic \
per_device_batch_size=2 \
enable_checkpointing=false \
gcs_metrics=true \
run_name=maxtext_xpk \
steps=30"
 

Deploy the training workload

Run the xpk workload create command to deploy your training job. You must either specify the --base-docker-image flag to use the MaxText base image, or specify the --docker-image flag with the image you want to use. Optionally, include the --enable-debug-logs flag to enable debug logging.

xpk workload create \
  --cluster ${CLUSTER_NAME} \
  --base-docker-image maxtext_base_image \
  --workload maxtext-1b-$(date +%H%M) \
  --tpu-type=${ACCELERATOR_TYPE} \
  --zone ${ZONE} \
  --project ${PROJECT_ID} \
  --command "${MAXTEXT_COMMAND}" # [--enable-debug-logs]
 

Workload names must be unique within the cluster. In this example, $(date +%H%M) is appended to the workload name to ensure uniqueness.
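If you submit the same job more than once, a timestamp suffix keeps the workload names unique. The helper below simply generalizes the $(date +%H%M) suffix used in the command above; it is a convenience sketch, not an XPK feature:

```shell
# make_workload_name: append an HHMM timestamp so repeated submissions of
# the same job don't collide on the workload name within the cluster.
make_workload_name() {
  echo "$1-$(date +%H%M)"
}

make_workload_name maxtext-1b   # e.g. maxtext-1b-1423
```

Note that two submissions within the same minute would still collide; include seconds (for example, date +%H%M%S) if you need finer granularity.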
