Cloud TPU inference

Serving refers to the process of deploying a trained machine learning model to a production environment, where it can be used for inference. Inference is supported on TPU v5e and newer versions. For serving workloads, latency service level objectives (SLOs) are a priority.

This document discusses serving a model on a single-host TPU. TPU slices with 8 or fewer chips have one TPU VM, or host, and are called single-host TPUs.

Get started

You will need a Google Cloud account and project to use Cloud TPU. For more information, see Set up a Cloud TPU environment.

You need to request the following quota for serving on TPUs:

  • On-demand v5e resources: TPUv5 lite pod cores for serving per project per zone
  • Preemptible v5e resources: Preemptible TPU v5 lite pod cores for serving per project per zone
  • On-demand v6e resources: TPUv6 cores per project per zone
  • Preemptible v6e resources: Preemptible TPUv6 cores per project per zone

For more information about TPU quota, see TPU quota.

Serve LLMs using JetStream

JetStream is a throughput- and memory-optimized engine for large language model (LLM) inference on XLA devices (TPUs). You can use JetStream with JAX and PyTorch/XLA models. For an example of using JetStream to serve a JAX LLM, see JetStream MaxText inference on v6e TPU.
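A JetStream server exposes the model behind a gRPC endpoint that streams generated tokens back to the client. The sketch below shows the general client pattern for such an endpoint; the stub module path, message names, and fields are illustrative placeholders rather than the exact JetStream proto, which varies by version. Consult the JetStream repository and the tutorial linked above for the real interface.

```python
# Illustrative gRPC client pattern for a JetStream-style decode endpoint.
# NOTE: the generated-stub import path and the request/response shapes below
# are placeholders; check the JetStream repository for the actual proto.
import grpc
from jetstream.core.proto import jetstream_pb2, jetstream_pb2_grpc  # placeholder import path


def stream_decode(prompt: str, server: str = "localhost:9000") -> None:
    channel = grpc.insecure_channel(server)
    stub = jetstream_pb2_grpc.OrchestratorStub(channel)  # placeholder stub name
    request = jetstream_pb2.DecodeRequest(  # placeholder message and fields
        text_content=jetstream_pb2.DecodeRequest.TextContent(text=prompt),
        max_tokens=128,
    )
    # Decode is a server-streaming RPC: responses arrive incrementally as
    # tokens are generated, which is what keeps time-to-first-token low.
    for response in stub.Decode(request):
        print(response)


if __name__ == "__main__":
    stream_decode("Explain what JetStream is in one sentence.")
```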

Serve LLMs using vLLM

vLLM is an open-source library designed for fast inference and serving of large language models (LLMs). You can use vLLM with PyTorch/XLA. For an example of using vLLM to serve a PyTorch LLM, see Serve an LLM using TPU Trillium on GKE with vLLM.
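The following is a minimal sketch of vLLM's offline Python API, assuming a vLLM installation built with TPU (PyTorch/XLA) support and a model that fits on a single-host TPU; the model name and sampling settings are illustrative. For online serving, as in the GKE tutorial linked above, you would typically run vLLM's OpenAI-compatible server instead and send it HTTP requests.

```python
# Minimal vLLM offline-inference sketch (illustrative). Assumes vLLM is
# installed with TPU support and the chosen model fits on the host's TPU.
from vllm import LLM, SamplingParams

# Model name is a placeholder; substitute the model you want to serve.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Sampling settings are illustrative.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

prompts = [
    "Explain what a TPU is in one sentence.",
    "Write a haiku about inference latency.",
]

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```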

Profiling

After you set up inference, you can use profilers to analyze performance and TPU utilization. For more information, see the profiling documentation for your framework (JAX or PyTorch/XLA).
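As a starting point, the following is a minimal sketch of capturing a trace with JAX's built-in profiler on a TPU VM; the log directory and the jitted workload are placeholders standing in for your model's forward pass. PyTorch/XLA provides an analogous profiler in torch_xla.debug.profiler.

```python
# Minimal JAX profiling sketch (illustrative): capture a trace of a jitted
# workload on TPU and inspect it later with TensorBoard's profile plugin.
import jax
import jax.numpy as jnp


@jax.jit
def forward(x, w):
    # Placeholder workload standing in for a model's forward pass.
    return jnp.tanh(x @ w)


x = jnp.ones((1024, 1024))
w = jnp.ones((1024, 1024))

# Writes the trace under /tmp/profile; point TensorBoard at this directory.
with jax.profiler.trace("/tmp/profile"):
    forward(x, w).block_until_ready()
```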
