Create TPU Flex-start VMs with Compute Engine

TPU Flex-start VMs, powered by Dynamic Workload Scheduler, offer a flexible, cost-effective way to access TPU resources for AI workloads for up to 7 days without long-term reservations. When you request TPU Flex-start VMs, your request remains in a queue until capacity is available. Once provisioned, the TPU VMs run for your specified duration.

TPU Flex-start VMs are a good fit for quick experimentation, small-scale testing, dynamic provisioning of TPUs for inference workloads, model fine-tuning, and workload runs that take less than 7 days. For more information about other TPU consumption options, see Cloud TPU consumption options.

You can delete your TPU resources at any time to stop billing. For more information about TPU pricing, see Cloud TPU pricing.

Limitations

TPU Flex-start VMs have the following limitations:

You can request TPU Flex-start VMs for a duration of up to 7 days.
You can request the following Cloud TPU versions and zones:
- TPU7x: us-central1-c
- TPU v6e: asia-northeast1-b, us-east5-a, us-south1-ai1b
- TPU v5p: us-east5-a
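
The two limitations above can be checked before you submit a request. The following is a minimal sketch, not an official tool: the helper names and the short version keys (tpu7x, v6e, v5p) are hypothetical, and the zone list is copied from the list above, so it goes stale if the supported zones change.

```shell
# Hypothetical helper: succeeds if the TPU version and zone pair appears in
# the list of supported zones above. The version keys are illustrative only.
is_supported_zone() {
  case "$1:$2" in
    tpu7x:us-central1-c) return 0 ;;
    v6e:asia-northeast1-b|v6e:us-east5-a|v6e:us-south1-ai1b) return 0 ;;
    v5p:us-east5-a) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeeds if a whole number of hours fits within the
# 7-day limit (7 days = 168 hours).
is_valid_duration_hours() {
  [ "$1" -gt 0 ] && [ "$1" -le 168 ]
}

if is_supported_zone v6e us-east5-a && is_valid_duration_hours 48; then
  echo "request looks valid"
fi
```

Checking locally only catches obvious mistakes early; the service remains the source of truth for which versions, zones, and durations it accepts.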

MIGs with TPUs have the following limitations:

Lifecycle operations: You can't stop, start, resume, or suspend TPU instances. To change configurations that require a restart or to stop incurring charges, you must delete the instances.
Regional MIG zone distribution: You must set the target distribution shape to ANY_SINGLE_ZONE.
Configuration updates in a MIG:
- You can't update a MIG that forms a multi-host TPU slice due to the defined accelerator topology.
- You can update a MIG that forms single-host TPU slices by using the automatic or selective methods. However, updates for single-host TPU slices don't support the restart (RESTART) action. If a restart is necessary and the most disruptive action allowed is replace (REPLACE), then the updater replaces the instance; otherwise, the update attempt fails with an error.
For a MIG that forms a multi-host TPU slice, the following limitations also apply:
- Target size policy: You must set the target size policy mode to BULK. After you set this mode, you can't change it.
- Target size: In bulk mode, you can set the target size to either 0 or the number of instances that are required to form the accelerator topology.
- Workload policy: You must specify a workload policy in which the accelerator topology is defined. After you set the workload policy, you can't change or remove the policy from the MIG.
Unsupported features: MIGs with TPUs don't support the following features:
- Instance flexibility
- Resize requests to obtain resources all at once
- Stateful configuration
- For a MIG that forms a multi-host TPU slice, the following are also not supported:

Before you begin

Before requesting TPU Flex-start VMs, you must:

Install the Google Cloud CLI
Create a Google Cloud project
Enable the Compute Engine API (compute.googleapis.com)
Ensure you have the required IAM roles:
- roles/compute.instanceAdmin.v1
- roles/iam.serviceAccountUser

For more information, see Set up a Google Cloud project for TPUs.
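
The API and role steps above can be scripted. The following is a minimal sketch under stated assumptions: the project ID and the principal are hypothetical placeholders, and the script only prints the commands unless you set GCLOUD=gcloud, so you can review them first. Granting roles at the project level is one option; your organization might manage IAM differently.

```shell
# Dry run by default: commands are printed, not executed.
# Set GCLOUD=gcloud in the environment to run them for real.
GCLOUD="${GCLOUD:-echo gcloud}"

# Hypothetical values; replace with your own project and principal.
PROJECT_ID="${PROJECT_ID:-my-project}"
MEMBER="${MEMBER:-user:alice@example.com}"

setup_project() {
  # Enable the Compute Engine API for the project.
  $GCLOUD services enable compute.googleapis.com --project="$PROJECT_ID"

  # Grant the two roles listed above to the principal.
  for role in roles/compute.instanceAdmin.v1 roles/iam.serviceAccountUser; do
    $GCLOUD projects add-iam-policy-binding "$PROJECT_ID" \
      --member="$MEMBER" --role="$role"
  done
}

setup_project
```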

Ensure you have sufficient preemptible quota to use TPU Flex-start VMs. If your workload requires more cores than your current allocation, you can request a quota increase. For details, see Cloud TPU quotas .

Create TPU Flex-start VMs with MIGs

To use TPU Flex-start VMs, you create a managed instance group (MIG) with a specific instance template configuration.

For general instructions on creating Flex-start VMs, see Create Flex-start VMs.

Create TPU Flex-start VMs with a multi-host slice

Create an instance template

Create an instance template specifying the FLEX_START provisioning model and your chosen run duration.

gcloud compute instance-templates create TEMPLATE_NAME \
    --machine-type=MACHINE_TYPE \
    --image-family=IMAGE_FAMILY \
    --image-project=IMAGE_PROJECT \
    --provisioning-model=FLEX_START \
    --instance-termination-action=DELETE \
    --max-run-duration=DURATION \
    --region=REGION \
    --maintenance-policy=TERMINATE
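
As an illustration, the command above can be wrapped in a small function that rejects run durations over 7 days before calling gcloud. This is a sketch under stated assumptions, not the documented procedure: the function name, the template name, the ct6e-standard-4t machine type, and the us-east5 region are example values chosen for this sketch, the image family and project are left as placeholders for you to supply, and the duration is assumed to be passed to --max-run-duration as a number of hours with an h suffix. The script prints the command unless you set GCLOUD=gcloud.

```shell
# Dry run by default: the command is printed, not executed.
# Set GCLOUD=gcloud in the environment to run it for real.
GCLOUD="${GCLOUD:-echo gcloud}"

# Hypothetical wrapper around the instance template command.
# Arguments: $1 template name, $2 machine type, $3 region,
#            $4 run duration in whole hours.
create_flex_start_template() {
  # Flex-start requests are limited to 7 days (168 hours).
  if [ "$4" -lt 1 ] || [ "$4" -gt 168 ]; then
    echo "run duration must be a positive number of hours, at most 168" >&2
    return 1
  fi

  # IMAGE_FAMILY and IMAGE_PROJECT stay as placeholders unless you set them.
  $GCLOUD compute instance-templates create "$1" \
    --machine-type="$2" \
    --image-family="${IMAGE_FAMILY:-IMAGE_FAMILY}" \
    --image-project="${IMAGE_PROJECT:-IMAGE_PROJECT}" \
    --provisioning-model=FLEX_START \
    --instance-termination-action=DELETE \
    --max-run-duration="${4}h" \
    --region="$3" \
    --maintenance-policy=TERMINATE
}

# Example values only; a 48-hour run on an assumed machine type and region.
create_flex_start_template my-tpu-template ct6e-standard-4t us-east5 48
```

Validating the duration locally avoids a round trip to the API for a request that would be rejected anyway.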