This tutorial walks you through deploying a multi-host TPU inference service using Ray Serve LLM. By leveraging Ray's native TPU support to atomically co-schedule distributed engine workers across complex accelerator topologies, you can deploy large models over a multi-host TPU slice for inference.
This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving AI/ML workloads on distributed, multi-host TPU slices. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.
Before reading this page, ensure that you're familiar with the following:
Background
This section describes the key technologies used in this guide.
TPUs
Tensor Processing Units (TPUs) let you accelerate specific workloads running on your nodes, such as machine learning and data processing. The primary advantage of TPUs is performance at scale. This tutorial uses TPU Trillium, the sixth generation of Cloud TPU. Multi-host TPU slices consist of multiple physical nodes communicating using a high-speed inter-chip interconnect (ICI), which works well for high-throughput and low-latency serving.
vLLM on Ray
vLLM is a high-throughput, memory-efficient LLM serving engine. By integrating with Ray Serve, vLLM can scale across multiple hosts and access physical hardware topologies natively. This tutorial demonstrates using Ray Serve's LLMConfig and LLMServer deployments to orchestrate vLLM inference across multi-host slices, letting the framework handle topology distribution and placement group spreading automatically.
Objectives
This tutorial provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment that uses multi-host TPUs.
- Prepare your environment with a GKE cluster in Autopilot or Standard.
- Build a custom container image with baked-in dependencies.
- Deploy a Ray LLM Python script to your cluster to orchestrate vLLM inference over a TPU slice.
- Use Ray LLM to serve the Gemma 4 model through curl and an optional web chat interface.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
- Ensure your project has sufficient quota for TPU Trillium (v6e) capacity in your selected region. For more information, see Cloud TPU quotas.
- Ensure your GKE cluster uses GKE Dataplane V2 and satisfies the version requirement for DRANET: 1.35.2-gke.1842000 or later for both Standard and Autopilot.
- Ensure that you have the following IAM roles:
  - roles/container.admin
  - roles/iam.serviceAccountAdmin
Prepare your environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
- In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
- Create and activate a Python virtual environment:
  python3 -m venv ray-env
  source ray-env/bin/activate
- Install the Ray CLI:
  pip install "ray"
- Set the default environment variables:
  export PROJECT_ID=$(gcloud config get project)
  export CLUSTER_NAME=ray-llm-cluster
  export REGION=REGION
  export ZONE=ZONE
  export NAMESPACE=default
  export KSA_NAME=ray-ksa
  export GSA_NAME=tpu-reader-sa
  export NETWORK_NAME=${CLUSTER_NAME}-net
  export GS_BUCKET=BUCKET_NAME
  export REPO_NAME=ray-repo
  export CUSTOM_IMAGE_URI=REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/vllm-tpu-ray:vllm-tpu
  Replace the following:
  - PROJECT_ID: your Google Cloud project ID.
  - CLUSTER_NAME: the name of your cluster.
  - REGION: the region where your TPU Trillium capacity is available.
  - ZONE: the zone where your TPU Trillium capacity is available. For more information, see TPU availability in GKE.
  - REPOSITORY: the name of your Artifact Registry repository.
  - BUCKET_NAME: the name of your storage bucket.
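The CUSTOM_IMAGE_URI value follows the pattern REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG. As a minimal sketch, the following helper assembles that string and rejects values that still look like unreplaced placeholders. The function name and the placeholder check are hypothetical and are not part of the sample repository; the default image name and tag are the ones used in the export command.

```python
def artifact_registry_image_uri(region, project_id, repository,
                                image="vllm-tpu-ray", tag="vllm-tpu"):
    """Return REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG."""
    parts = {"region": region, "project_id": project_id, "repository": repository}
    for name, value in parts.items():
        # An empty or all-uppercase value (for example "REGION") is treated
        # as a placeholder that was never replaced. This is a heuristic.
        if not value or value.isupper():
            raise ValueError(f"{name} looks unset: {value!r}")
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"
```

For example, calling the helper with the illustrative values "us-east5", "my-project", and "ray-repo" produces a URI that you can compare against the variable you exported.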
Create and configure Google Cloud resources
Follow these instructions to create the required resources.
Create a GKE cluster and node pool
You can serve Gemma on TPUs in a GKE Autopilot or Standard cluster. GKE managed DRANET dynamically requests and manages high-performance networking resources for your distributed Pods, allowing GKE to automatically provision secondary high-speed networks for accelerator inter-communication without requiring manual VPC setup.
Autopilot
- In Cloud Shell, create the Autopilot cluster:
  gcloud container clusters create-auto ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --enable-ray-operator \
    --location=${REGION}
- Configure kubectl to communicate with your cluster:
  gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${REGION}
- To use GKE managed DRANET in Autopilot mode, deploy the custom ComputeClass resource provided in the repository to opt in to dynamic networking:
- Apply the manifest to your cluster:
  kubectl apply -f ai-ml/gke-ray/rayserve/llm/tpu/networking/dranet-compute-class.yaml
Standard
- In Cloud Shell, create a Standard cluster that enables the Ray operator and uses GKE Dataplane V2:
  gcloud container clusters create ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --addons=RayOperator,GcsFuseCsiDriver \
    --machine-type=n2-standard-8 \
    --enable-dataplane-v2 \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --location=${ZONE}
- Create a multi-host TPU slice node pool with the DRANET driver enabled:
  gcloud container node-pools create v6e-16 \
    --location=${ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct6e-standard-4t \
    --tpu-topology=4x4 \
    --num-nodes=4 \
    --enable-gvnic \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --accelerator-network-profile=auto \
    --node-labels=cloud.google.com/gke-networking-dra-driver=true
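The --tpu-topology and --num-nodes flags in the node pool command are related by simple arithmetic. The following sketch shows the calculation, assuming that each ct6e-standard-4t VM exposes four TPU chips; the function names are hypothetical and exist only for illustration.

```python
from math import prod


def chips_in_topology(topology):
    """Multiply the dimensions of a topology string such as "4x4"."""
    return prod(int(dim) for dim in topology.lower().split("x"))


def hosts_for_slice(topology, chips_per_host):
    """Return how many hosts (nodes) a slice with this topology needs."""
    chips = chips_in_topology(topology)
    if chips % chips_per_host != 0:
        raise ValueError(
            f"{chips} chips cannot be split evenly across hosts "
            f"with {chips_per_host} chips each"
        )
    return chips // chips_per_host
```

Under that assumption, a 4x4 topology contains 16 chips spread over 4 hosts, which matches the --num-nodes=4 value in the command.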
Configure storage and authentication
Create a Cloud Storage bucket and initialize a Rapid Cache instance to accelerate model loading, then configure authentication for Hugging Face:
- Create a storage bucket in your region and initialize the Rapid Cache instance in your TPU zone:
  gcloud storage buckets create gs://${GS_BUCKET} --project=${PROJECT_ID} --default-storage-class=STANDARD --location=${REGION}
  gcloud storage buckets anywhere-caches create gs://${GS_BUCKET} ${ZONE} \
    --ttl=1d \
    --admission-policy=ADMIT_ON_FIRST_MISS
- Configure identity links to help securely mount the weight bucket into your GKE Pods. First, create a dedicated IAM service account and grant it the Storage Object Admin role (roles/storage.objectAdmin) on the bucket:
  gcloud iam service-accounts create ${GSA_NAME}
  gcloud storage buckets add-iam-policy-binding gs://${GS_BUCKET} \
    --member="serviceAccount:${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"
- Create the Workload Identity Federation for GKE binding and annotate the Kubernetes ServiceAccount object:
  gcloud iam service-accounts add-iam-policy-binding ${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
  kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
  kubectl annotate serviceaccount ${KSA_NAME} --namespace ${NAMESPACE} iam.gke.io/gcp-service-account=${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
- To download the Gemma 4 model weights, you must acknowledge Google's license agreement on Hugging Face. Go to the Gemma 4 model page on Hugging Face.
- Sign in and accept the license terms by clicking Agree and access repository.
- Navigate to your Hugging Face account settings and generate an Access Token with the Read role.
- Export your Hugging Face token and create a Kubernetes secret so Ray can pull the model weights:
  export HF_TOKEN=YOUR_HUGGING_FACE_TOKEN
  kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN}
Build the custom container image
To ensure the multi-host environment has all required dependencies, build a custom image based on vLLM's TPU image and copy your serving script into it.
- Create an Artifact Registry repository:
  gcloud artifacts repositories create ${REPO_NAME} \
    --repository-format=docker \
    --location=${REGION}
- Authenticate Docker to your project:
  gcloud auth configure-docker ${REGION}-docker.pkg.dev
- Inspect the Dockerfile in the sample repository:
- Build and push the image to Artifact Registry:
  docker build -t ${CUSTOM_IMAGE_URI} .
  docker push ${CUSTOM_IMAGE_URI}
Pre-stage model weights to Cloud Storage
Before you deploy the RayCluster, pre-stage the model weights in your Cloud Storage bucket by using a standalone Kubernetes Job. Pre-staging improves model loading performance and helps ensure high availability across your distributed TPU slice. This decoupled approach allows for coordinated parallel streaming, which accelerates cluster startup.
- The manifest for the downloader job is available in the repository. Review the manifest configuration:
- Create the downloader job by applying the file in the repository:
  envsubst < ai-ml/gke-ray/rayserve/llm/tpu/components/model-downloader-job.yaml | kubectl apply -f -
- Monitor the job until the download stream reports success:
  kubectl logs -f job/model-downloader
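The downloader manifest itself is not reproduced here, so the following is only a rough sketch of what a pre-staging step could look like, not the repository's actual job. It assumes that the bucket is mounted into the Pod at /data (a hypothetical path), that the token is exposed as an HF_TOKEN environment variable (also an assumption), and that the huggingface_hub package is installed. It uses that library's snapshot_download function.

```python
import os
import posixpath


def staging_path(mount_root, model_id):
    """Directory under the bucket mount where the weights are written."""
    return posixpath.join(mount_root, model_id)


def download_weights(model_id="google/gemma-4-31B-it", mount_root="/data"):
    """Download a model snapshot into the mounted bucket path."""
    # Imported lazily so that the path helper works without the package.
    from huggingface_hub import snapshot_download

    target = staging_path(mount_root, model_id)
    snapshot_download(
        repo_id=model_id,
        local_dir=target,
        token=os.environ.get("HF_TOKEN"),  # assumed variable name
    )
    return target
```

Writing through the mounted path keeps the download logic independent of Cloud Storage client libraries, at the cost of relying on the mount being present.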
Create the inference script
The following Python script defines a Ray Serve application powered by Ray Serve's high-level LLMConfig wrapper.
- Inspect the serve_tpu_multihost.py script in the sample repository:
Understand the Ray LLM API
The script leverages Ray Serve's native ray.serve.llm library to abstract away the complexity of multi-host TPU orchestration. By wrapping the vLLM engine, Ray Serve LLM provides a high-performance, scalable framework specifically designed for highly distributed inference workloads in production.
Using the Ray LLM API provides several key benefits:
- Multi-node deployments: Ray Serve LLM enables users to serve massive models that span multiple distributed hosts (like a TPU multi-host slice) with automatic placement, coordination, and topology distribution natively.
- vLLM compatibility: Ray Serve LLM provides an OpenAI-compatible API that aligns with vLLM's server. You can also access vLLM's advanced feature set (such as structured output, multimodal capabilities, and reasoning models) while scaling the workload across your Kubernetes cluster.
- Production-ready features: Ray Serve LLM includes enterprise-grade capabilities like built-in autoscaling, custom request routing for maximized cache hits, and built-in integrations for metrics and observability.
In the provided inference script, the deployment is defined by two main components:
- LLMConfig: this object defines the serving configuration. It specifies the model source, the engine parameters for vLLM, and the accelerator_config. By setting {"kind": "tpu", "topology": "4x4"}, Ray Serve LLM automatically provisions a distributed placement group that maps exactly to your physical 16-chip TPU v6e slice.
- build_openai_app: this API automatically wraps the configured vLLM engine in an OpenAI-compatible FastAPI server, giving you an industry-standard REST API (like /v1/chat/completions) out of the box without writing any custom server code.
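The two components can be combined roughly as follows. This is a hedged sketch, not the contents of serve_tpu_multihost.py: the accelerator_config value is taken from the description above, the model ID is the one used in the curl example later in this tutorial, and the tensor_parallel_size value of 16 is an assumption chosen to match the 16-chip slice. The helper names are hypothetical, and the actual script might set different or additional fields.

```python
def llm_config_kwargs(model_id="google/gemma-4-31B-it", topology="4x4"):
    """Keyword arguments for LLMConfig, kept as plain data for inspection."""
    return {
        "model_loading_config": {"model_id": model_id, "model_source": model_id},
        # Described in this tutorial: requests a TPU slice placement group.
        "accelerator_config": {"kind": "tpu", "topology": topology},
        # Assumption: shard the model across all 16 chips of the slice.
        "engine_kwargs": {"tensor_parallel_size": 16},
    }


def build_app():
    """Build the OpenAI-compatible Ray Serve application."""
    # Imported lazily so the configuration can be reviewed without Ray.
    from ray.serve.llm import LLMConfig, build_openai_app

    llm_config = LLMConfig(**llm_config_kwargs())
    return build_openai_app({"llm_configs": [llm_config]})
```

Keeping the arguments in a plain dictionary makes it easy to print or diff the configuration before Ray constructs any deployments.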
Deploy the RayService
Deploy the Dynamic Resource Allocation (DRA) networking configuration and the RayService serving manifest:
- Request all available NetDevice interfaces on each node by deploying the ResourceClaimTemplate provided in the repository:
- Apply the template manifest to your cluster:
  kubectl apply -f ai-ml/gke-ray/rayserve/llm/tpu/networking/all-netdev-template.yaml
- The RayService serving manifest is available in the repository. Review the manifest configuration:
- Deploy the service by using the manifest:
Autopilot
- To deploy the service in an Autopilot cluster, you must first download the manifest and edit it locally to add the opt-in ComputeClass nodeSelector, which is required for DRANET networking on Autopilot:
  curl -O https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/main/ai-ml/gke-ray/rayserve/llm/tpu/ray-service.tpu-v6e-multihost.yaml
- Add the label under the nodeSelector field so that it looks like this:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
    cloud.google.com/gke-tpu-topology: 4x4
    cloud.google.com/compute-class: dranet-compute-class
- Then, deploy the service by using the modified local manifest:
  envsubst < ray-service.tpu-v6e-multihost.yaml | kubectl apply -f -
Standard
To deploy the service in a Standard cluster, deploy the manifest directly from the repository:
  envsubst < ai-ml/gke-ray/rayserve/llm/tpu/ray-service.tpu-v6e-multihost.yaml | kubectl apply -f -
Verification
- Wait for the RayService to be available:
  kubectl wait --for=condition=Ready --timeout=1800s rayservice/vllm-tpu-multihost
- To confirm the model loaded successfully, view the logs from the Ray head Pod:
  kubectl logs -f -l ray.io/node-type=head -c ray-head
Serve the model
In this section, you interact with the model. Make sure the model is fully downloaded before proceeding.
Set up port forwarding
Set up port forwarding to the model by running the following command:
  kubectl port-forward svc/vllm-tpu-multihost-head-svc 8000:8000 2>&1 >/dev/null &
Interact with the model using curl
This section shows how you can perform a basic smoke test to verify your deployed Gemma 4 model.
In a new terminal session, use curl to chat with your model:
  curl -X POST http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-4-31B-it",
      "messages": [
        {
          "role": "user",
          "content": "Why is GKE managed DRANET preferred for multi-host TPU networking?"
        }
      ],
      "max_tokens": 256
    }'
The output looks similar to the following:
  {
    "id": "chatcmpl-392692d3-5325-4832-a3a3-0b084c1045b0",
    "object": "chat.completion",
    "created": 1779883255,
    "model": "google/gemma-4-31B-it",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "To understand why GKE-managed **DRANET** (Distributed RANET) is preferred for multi-host TPU networking, it is first necessary to understand the fundamental challenge of TPU pods: **the need for massive, low-latency, all-to-all communication.**\n\nWhen you scale a model across multiple TPU hosts (multi-host), the hosts must synchronize gradients and weights constantly. Standard TCP/IP networking introduces too much overhead (latency and CPU jitter) for these operations.\n\nHere is the detailed breakdown of why GKE-managed DRANET is the preferred architecture:\n\n### 1. Bypassing the Kernel (Zero-Copy Networking)\nStandard networking requires the operating system kernel to handle packets, moving data from the network card to kernel space and then to user space.\n* **The DRANET Advantage:** DRANET implements a specialized networking stack that allows for **Kernel Bypass**. It enables the TPU hardware/drivers to write data directly into the memory of the destination host. This reduces latency and eliminates the CPU overhead associated with processing network interrupts.\n\n### 2. High-Bandwidth, Low-Latency Interconnect\nMulti-host TPU training relies on a specialized topology (like a 2D or 3D"
        },
        "finish_reason": "length"
      }
    ]
  }
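The same smoke test can be scripted. The following sketch uses only the Python standard library and assumes that the port forward from the previous step is still running on 127.0.0.1:8000. The function names are hypothetical. A finish_reason of "length", as in the sample output, indicates that the reply was cut off at the max_tokens limit.

```python
import json
import urllib.request


def build_chat_request(prompt, model="google/gemma-4-31B-it",
                       max_tokens=256, base_url="http://127.0.0.1:8000"):
    """Create the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body):
    """Return (text, finish_reason) from a parsed chat completion body."""
    choice = body["choices"][0]
    return choice["message"]["content"], choice["finish_reason"]


def chat(prompt):
    """Send one prompt and return the reply text and finish reason."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return extract_reply(json.load(resp))
```

To try it, call chat() with any prompt while the port forward is active and raise max_tokens if the finish reason is "length".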
(Optional) Interact with the model through a Gradio chat interface
In this section, you build a web chat application that lets you interact with your instruction-tuned model.
Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.
Deploy the chat interface
The manifest for the chat interface is available in the repository. Review the manifest configuration:
Apply the manifest:
  kubectl apply -f ai-ml/gke-ray/rayserve/llm/tpu/components/gradio.yaml
Wait for the deployment to be available:
  kubectl wait --for=condition=Available --timeout=900s deployment/gradio
Use the chat interface
In Cloud Shell, run the following command:
  kubectl port-forward service/gradio 8080:8080
This creates a port forward from Cloud Shell to the Gradio service.
Click the Web Preview icon, which can be found on the top right of the Cloud Shell taskbar. Click Preview on Port 8080. A new tab opens in your browser.
Interact with Gemma using the Gradio chat interface. Add a prompt and click Submit.
Observe model performance
To view the dashboards for observability metrics of a model running on KubeRay, you can use the dedicated Ray on GKE dashboards.
For detailed instructions on configuring your cluster and accessing the observability dashboards, see Collect and view logs and metrics for RayClusters on Google Kubernetes Engine (GKE).
Access the Ray Dashboard
To inspect the status of your Ray actors, view detailed application logs, and monitor node-level utilization natively in Ray, you can access the Ray Dashboard.
- Port-forward the Ray head node service to your local machine:
  kubectl port-forward svc/vllm-tpu-multihost-head-svc 8265:8265
- Open your browser and navigate to http://localhost:8265. If you are using Cloud Shell, click the Web Preview button and select Preview on port 8265.
- To view your vLLM deployments, model replica health, and query latencies, click the Serve tab.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the resources:
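- Delete the RayService:
  kubectl delete rayservice vllm-tpu-multihost
- Delete the GKE cluster. The following command applies to the zonal Standard cluster. If you created the Autopilot cluster in a region, use --location=${REGION} instead of --zone=${ZONE}:
  gcloud container clusters delete ${CLUSTER_NAME} --zone=${ZONE}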
What's next
- Learn about Ray on Kubernetes.
- Learn how to serve vLLM on GKE with TPUs.
- Learn more about TPUs in GKE.

