Serve Gemma open models using TPUs on GKE with Ray LLM


This tutorial walks you through deploying a multi-host TPU inference service using Ray Serve LLM. Ray's native TPU support atomically co-schedules distributed engine workers across complex accelerator topologies, so you can deploy large models over a multi-host TPU slice for inference.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving AI/ML workloads on distributed, multi-host TPU slices. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, ensure that you're familiar with the following:

Background

This section describes the key technologies used in this guide.

TPUs

Tensor Processing Units (TPUs) let you accelerate specific workloads running on your nodes, such as machine learning and data processing. The primary advantage of TPUs is performance at scale. This tutorial uses TPU Trillium, the sixth generation of Cloud TPU. Multi-host TPU slices consist of multiple physical nodes communicating using a high-speed inter-chip interconnect (ICI), which works well for high-throughput and low-latency serving.

vLLM on Ray

vLLM is a high-throughput, memory-efficient LLM serving engine. By integrating with Ray Serve, vLLM can scale across multiple hosts and access physical hardware topologies natively. This tutorial demonstrates using Ray Serve's LLMConfig and LLMServer deployments to orchestrate vLLM inference across multi-host slices, letting the framework handle topology distribution and placement group spreading automatically.

Objectives

This tutorial provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment that uses multi-host TPUs.

Prepare your environment with a GKE cluster in Autopilot or Standard.
Build a custom container image with baked-in dependencies.
Deploy a Ray LLM Python script to your cluster to orchestrate vLLM inference over a TPU slice.
Use Ray LLM to serve the Gemma 4 model through curl and an optional web chat interface.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.


If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set the compute/zone instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

Ensure your project has sufficient quota for TPU Trillium (v6e) capacity in your selected region. For more information, see Cloud TPU quotas.
Ensure your GKE cluster uses GKE Dataplane V2 and satisfies the version requirement for DRANET: 1.35.2-gke.1842000 or later, for both Standard and Autopilot.
Ensure that you have the following IAM roles:
- roles/container.admin
- roles/iam.serviceAccountAdmin

Prepare your environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.

Create and activate a Python virtual environment:

    python3 -m venv ray-env
    source ray-env/bin/activate

Install the Ray CLI:
```
pip install "ray"
```

Set the default environment variables:

    export PROJECT_ID=$(gcloud config get project)
    export CLUSTER_NAME=ray-llm-cluster
    export REGION=REGION
    export ZONE=ZONE
    export NAMESPACE=default
    export KSA_NAME=ray-ksa
    export GSA_NAME=tpu-reader-sa
    export NETWORK_NAME=${CLUSTER_NAME}-net
    export GS_BUCKET=BUCKET_NAME
    export REPO_NAME=ray-repo
    export CUSTOM_IMAGE_URI=REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/vllm-tpu-ray:vllm-tpu

Replace the following:

PROJECT_ID: your Google Cloud project ID.
CLUSTER_NAME: the name of your cluster.
REGION: the region where your TPU Trillium capacity is available.
ZONE: the zone where your TPU Trillium capacity is available. For more information, see TPU availability in GKE.
REPOSITORY: the name of your Artifact Registry repository.
BUCKET_NAME: the name of your storage bucket.
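
The CUSTOM_IMAGE_URI value follows the pattern REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG. The following minimal sketch shows how the pieces fit together and rejects values that are still placeholders. The helper name image_uri and the sample values (us-east5, my-project) are hypothetical, and the sketch assumes that REPOSITORY is the repository you later create from REPO_NAME (ray-repo).

```python
# Placeholder tokens from the tutorial that must be replaced with real values.
PLACEHOLDERS = {"REGION", "ZONE", "PROJECT_ID", "REPOSITORY", "BUCKET_NAME"}


def image_uri(region, project_id, repository,
              image="vllm-tpu-ray", tag="vllm-tpu"):
    """Assemble an Artifact Registry Docker image path (hypothetical helper)."""
    parts = {"region": region, "project_id": project_id, "repository": repository}
    for name, value in parts.items():
        # Reject empty values and values that were never substituted.
        if not value or value in PLACEHOLDERS:
            raise ValueError(f"{name} is unset or still a placeholder: {value!r}")
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


# Sample values are illustrative only.
print(image_uri("us-east5", "my-project", "ray-repo"))
```

Running a check like this before the build step can catch a CUSTOM_IMAGE_URI that still contains the literal text REGION or PROJECT_ID.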

Create and configure Google Cloud resources

Follow these instructions to create the required resources.

Create a GKE cluster and node pool

You can serve Gemma on TPUs in a GKE Autopilot or Standard cluster. GKE managed DRANET dynamically requests and manages high-performance networking resources for your distributed Pods, allowing GKE to automatically provision secondary high-speed networks for accelerator inter-communication without requiring manual VPC setup.

Autopilot

In Cloud Shell, create the Autopilot cluster:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
        --project=${PROJECT_ID} \
        --enable-ray-operator \
        --location=${REGION}

Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} \
        --location=${REGION}

To use GKE managed DRANET in Autopilot mode, deploy the custom ComputeClass resource provided in the repository to opt in to dynamic networking:

    apiVersion: cloud.google.com/v1
    kind: ComputeClass
    metadata:
      name: dranet-compute-class
    spec:
      nodePoolAutoCreation:
        enabled: true
      nodePoolConfig:
        dra:
          networking:
            enabled: true
      priorities:
      - machineType: ct6e-standard-4t
        acceleratorNetworkProfile: auto

Apply the manifest to your cluster:

    kubectl apply -f ai-ml/gke-ray/rayserve/llm/tpu/networking/dranet-compute-class.yaml

Standard

In Cloud Shell, create a Standard cluster that enables the Ray operator and uses GKE Dataplane V2:

    gcloud container clusters create ${CLUSTER_NAME} \
        --project=${PROJECT_ID} \
        --addons=RayOperator,GcsFuseCsiDriver \
        --machine-type=n2-standard-8 \
        --enable-dataplane-v2 \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --location=${ZONE}

Create a multi-host TPU slice node pool with the DRANET driver enabled:

    gcloud container node-pools create v6e-16 \
        --location=${ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct6e-standard-4t \
        --tpu-topology=4x4 \
        --num-nodes=4 \
        --enable-gvnic \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --accelerator-network-profile=auto \
        --node-labels=cloud.google.com/gke-networking-dra-driver=true

Configure storage and authentication

Create a Cloud Storage bucket and initialize a Rapid Cache instance to accelerate model loading, then configure authentication for Hugging Face:

In your TPU zone, create a storage bucket and initialize the Rapid Cache instance:

    gcloud storage buckets create gs://${GS_BUCKET} \
        --project=${PROJECT_ID} \
        --default-storage-class=STANDARD \
        --location=${REGION}
    gcloud storage buckets anywhere-caches create gs://${GS_BUCKET} ${ZONE} \
        --ttl=1d \
        --admission-policy=ADMIT_ON_FIRST_MISS

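Configure identity links to help securely mount the weight bucket into your GKE Pods. First, create a dedicated IAM service account and grant it the Storage Object Admin role on the bucket. This role allows both reading and writing objects, which the model downloader Job needs in order to write the weights through the same identity: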

    gcloud iam service-accounts create ${GSA_NAME}
    gcloud storage buckets add-iam-policy-binding gs://${GS_BUCKET} \
        --member="serviceAccount:${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role="roles/storage.objectAdmin"

Create the Workload Identity Federation for GKE binding and annotate the Kubernetes ServiceAccount object:

    gcloud iam service-accounts add-iam-policy-binding ${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role="roles/iam.workloadIdentityUser" \
        --member="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
    kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
    kubectl annotate serviceaccount ${KSA_NAME} --namespace ${NAMESPACE} \
        iam.gke.io/gcp-service-account=${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
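
The binding commands rely on two identifier formats: the IAM service account email and the Workload Identity Federation for GKE member string. As an illustration only, the following sketch builds both strings from the tutorial's default names. The helper names gsa_email and workload_identity_member and the project ID my-project are hypothetical.

```python
def gsa_email(gsa_name: str, project_id: str) -> str:
    """Email address of the IAM service account."""
    return f"{gsa_name}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member string that identifies a Kubernetes ServiceAccount."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


# Tutorial defaults: NAMESPACE=default, KSA_NAME=ray-ksa, GSA_NAME=tpu-reader-sa.
# "my-project" is an illustrative project ID.
print(gsa_email("tpu-reader-sa", "my-project"))
print(workload_identity_member("my-project", "default", "ray-ksa"))
```

The email is what the Kubernetes ServiceAccount annotation points to, and the member string is what receives the roles/iam.workloadIdentityUser role. A mismatch in the namespace or ServiceAccount name in either string is a common reason for a Pod failing to access the bucket.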

To download the Gemma 4 model weights, you must acknowledge Google's license agreement on Hugging Face. Go to the Gemma 4 model page on Hugging Face.
Sign in and accept the license terms by clicking Agree and access repository.
Navigate to your Hugging Face account settings and generate an Access Token with the Read role.

Export your Hugging Face token and create a Kubernetes secret so Ray can pull the model weights:

    export HF_TOKEN=YOUR_HUGGING_FACE_TOKEN
    kubectl create secret generic hf-secret \
        --from-literal=hf_api_token=${HF_TOKEN}

Build the custom container image

To ensure the multi-host environment has all required dependencies, build a custom image based on vLLM's TPU image and copy your serving script into it.

Create an Artifact Registry repository:

    gcloud artifacts repositories create ${REPO_NAME} \
        --repository-format=docker \
        --location=${REGION}

Authenticate Docker to your project:

    gcloud auth configure-docker ${REGION}-docker.pkg.dev

Inspect the Dockerfile in the sample repository:

    FROM vllm/vllm-tpu:v0.21.0
    ENV VLLM_TARGET_DEVICE=tpu
    ENV VLLM_XLA_CACHE_PATH=/data
    USER root
    RUN pip install --no-cache-dir -U \
        "https://s3-us-west-2.amazonaws.com/ray-wheels/master/75b85027a859439fae5634e49aa6443f6fbecfeb/ray-3.0.0.dev0-cp312-cp312-manylinux2014_x86_64.whl" && \
        pip install --no-cache-dir --no-deps "ray[llm]"
    COPY serve_tpu_multihost.py /home/ray/serve_tpu_multihost.py

Build and push the image to Artifact Registry:

    docker build -t ${CUSTOM_IMAGE_URI} .
    docker push ${CUSTOM_IMAGE_URI}

Pre-stage model weights to Cloud Storage

Before you deploy the RayCluster, pre-stage the model weights in your Cloud Storage bucket by using a standalone Kubernetes Job. Pre-staging improves model loading performance and helps ensure high availability across your distributed TPU slice. Because the download is decoupled from serving, the TPU hosts can stream the weights in parallel, which shortens cluster startup times.

The manifest for the downloader job is available in the repository. Review the manifest configuration:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: model-downloader
    spec:
      ttlSecondsAfterFinished: 60
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/memory-limit: "0"
        spec:
          serviceAccountName: ${KSA_NAME}
          restartPolicy: OnFailure
          containers:
          - name: downloader
            image: python:3.10-slim
            command: ["/bin/sh", "-c"]
            args:
            - |
              pip install -U huggingface_hub filelock
              python -c '
              import filelock
              class DummyLock:
                  def __init__(self, *args, **kwargs): pass
                  def __enter__(self): return self
                  def __exit__(self, *args): pass
                  def acquire(self, *args, **kwargs): pass
                  def release(self, *args, **kwargs): pass
              filelock.FileLock = DummyLock
              from huggingface_hub import snapshot_download
              snapshot_download(
                  repo_id="google/gemma-4-31B-it",
                  local_dir="/data/google/gemma-4-31B-it"
              )
              '
            env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
          volumes:
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: ${GS_BUCKET}
                mountOptions: "implicit-dirs"

Create the downloader job by applying the file in the repository:

    envsubst < ai-ml/gke-ray/rayserve/llm/tpu/components/model-downloader-job.yaml | kubectl apply -f -
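
The envsubst command replaces references such as ${KSA_NAME} and ${GS_BUCKET} with the values exported earlier. It substitutes an empty string for any variable that is not set, so a missing export can produce a manifest with an empty bucket name. The following sketch mimics the substitution in Python but fails loudly instead. It is an illustration, not part of the sample repository; the names render_manifest and VAR_PATTERN and the bucket name my-weights-bucket are hypothetical.

```python
import re

# Matches ${NAME} and $NAME, the two reference styles used in the manifests.
# "$(NODE_IP)" is intentionally not matched: it is a Kubernetes-side reference.
VAR_PATTERN = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def render_manifest(text: str, env: dict) -> str:
    """Substitute shell-style variables, raising if any is unset or empty."""
    names = {m.group(1) or m.group(2) for m in VAR_PATTERN.finditer(text)}
    missing = sorted(n for n in names if not env.get(n))
    if missing:
        raise ValueError("unset variables: " + ", ".join(missing))
    return VAR_PATTERN.sub(lambda m: env[m.group(1) or m.group(2)], text)


snippet = "serviceAccountName: ${KSA_NAME}\nbucketName: $GS_BUCKET\n"
print(render_manifest(snippet, {"KSA_NAME": "ray-ksa", "GS_BUCKET": "my-weights-bucket"}))
```

To check a real manifest before applying it, you could pass the file contents and dict(os.environ) to the same function.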

Monitor the job until the download stream reports success:
```
kubectl logs -f job/model-downloader
```

Create the inference script

The following Python script defines a Ray Serve application powered by Ray Serve's high-level LLMConfig wrapper.

Inspect the serve_tpu_multihost.py script in the sample repository:

    import os
    import ray
    from ray import serve
    from ray.serve.llm import LLMConfig, ModelLoadingConfig, LLMServingArgs, build_openai_app

    # Read configurations from environment variables
    MODEL_ID = os.environ.get("MODEL_ID", "google/gemma-4-31B-it")
    MODEL_SOURCE = os.environ.get("MODEL_SOURCE", "/data/google/gemma-4-31B-it")

    # TPU hardware options (i.e. TPU-V6E, TPU-V7X etc.)
    ACCELERATOR_TYPE = os.environ.get("ACCELERATOR_TYPE", "TPU-V6E")
    TPU_TOPOLOGY = os.environ.get("TPU_TOPOLOGY", "4x4")

    # vLLM engine parameters
    TENSOR_PARALLEL_SIZE = int(os.environ.get("TENSOR_PARALLEL_SIZE", "16"))
    MAX_MODEL_LEN = int(os.environ.get("MAX_MODEL_LEN", "8192"))
    MAX_NUM_BATCHED_TOKENS = int(os.environ.get("MAX_NUM_BATCHED_TOKENS", "4096"))

    # Define the multi-host TPU LLM config
    llm_config = LLMConfig(
        model_loading_config=dict(
            model_id=MODEL_ID,
            model_source=MODEL_SOURCE
        ),
        accelerator_type=ACCELERATOR_TYPE,
        accelerator_config={"kind": "tpu", "topology": TPU_TOPOLOGY},
        engine_kwargs={
            "tensor_parallel_size": TENSOR_PARALLEL_SIZE,
            "max_model_len": MAX_MODEL_LEN,
            "max_num_batched_tokens": MAX_NUM_BATCHED_TOKENS,
            "distributed_executor_backend": "ray",
        }
    )

    deployment = build_openai_app(
        LLMServingArgs(llm_configs=[llm_config])
    )

Understand the Ray LLM API

The script leverages Ray Serve's native ray.serve.llm library to abstract away the complexity of multi-host TPU orchestration. By wrapping the vLLM engine, Ray Serve LLM provides a high-performance, scalable framework specifically designed for highly distributed inference workloads in production.

Using the Ray LLM API provides several key benefits:

Multi-node deployments: Ray Serve LLM enables users to serve massive models that span multiple distributed hosts (like a TPU multi-host slice) with automatic placement, coordination, and topology distribution natively.
vLLM compatibility: Ray Serve LLM provides an OpenAI-compatible API that aligns with vLLM's server. You can also access vLLM's advanced feature set (such as structured output, multimodal capabilities, and reasoning models) while scaling the workload across your Kubernetes cluster.
Production-ready features: Ray Serve LLM includes enterprise-grade capabilities like built-in autoscaling, custom request routing for maximized cache hits, and built-in integrations for metrics and observability.

In the provided inference script, the deployment is defined by two main components:

LLMConfig: this object defines the serving configuration. It specifies the model source, the engine parameters for vLLM, and the accelerator_config. By setting {"kind": "tpu", "topology": "4x4"}, Ray Serve LLM automatically provisions a distributed placement group that maps exactly to your physical 16-chip TPU v6e slice.
build_openai_app: this API automatically wraps the configured vLLM engine in an OpenAI-compatible FastAPI server, giving you an industry-standard REST API (like /v1/chat/completions) out of the box without writing any custom server code.
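
The relationship between the topology string, the node pool, and the engine settings comes down to simple arithmetic. The following minimal sketch is not part of the sample repository. The helper names chips_in_topology and check_slice are hypothetical, and the sketch assumes, as in this tutorial's configuration, four TPU chips per ct6e-standard-4t host and one tensor-parallel shard per chip.

```python
import math


def chips_in_topology(topology: str) -> int:
    """Multiply the dimensions of a topology string such as "4x4"."""
    return math.prod(int(dim) for dim in topology.lower().split("x"))


def check_slice(topology: str, chips_per_host: int, num_hosts: int,
                tensor_parallel_size: int) -> int:
    """Verify that hosts and tensor parallelism cover the whole slice."""
    chips = chips_in_topology(topology)
    if chips != chips_per_host * num_hosts:
        raise ValueError(
            f"{topology} has {chips} chips, but {num_hosts} hosts x "
            f"{chips_per_host} chips provide {chips_per_host * num_hosts}")
    if tensor_parallel_size != chips:
        raise ValueError(
            f"tensor_parallel_size={tensor_parallel_size} does not match {chips} chips")
    return chips


# Values from this tutorial: 4x4 topology, 4 hosts with 4 chips each, TP size 16.
print(check_slice("4x4", chips_per_host=4, num_hosts=4, tensor_parallel_size=16))
```

If you change TPU_TOPOLOGY or TENSOR_PARALLEL_SIZE through environment variables, a check of this kind can flag a mismatch with the node pool before the placement group fails to schedule.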

Deploy the RayService

Deploy the Dynamic Resource Allocation (DRA) networking configuration and the RayService serving manifest:

Request all available NetDevice interfaces on each node by deploying the ResourceClaimTemplate provided in the repository:

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: all-netdev
    spec:
      spec:
        devices:
          requests:
          - name: req-netdev
            exactly:
              deviceClassName: netdev.google.com
              allocationMode: All

Apply the template manifest to your cluster:

    kubectl apply -f ai-ml/gke-ray/rayserve/llm/tpu/networking/all-netdev-template.yaml

The RayService serving manifest is available in the repository. Review the manifest configuration:

    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: vllm-tpu-multihost
      labels:
        ai.gke.io/model: "gemma-4-31B-it"
        ai.gke.io/inference-server: "vllm"
    spec:
      serveConfigV2: |
        http_options:
          host: 0.0.0.0
          port: 8000
        applications:
        - name: llm
          import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu_multihost:deployment
          runtime_env:
            working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
            env_vars:
              # Use local disk to prevent multi-host GCSFuse race conditions
              VLLM_XLA_CACHE_PATH: "/tmp/vllm_xla_cache"
      rayClusterConfig:
        headGroupSpec:
          rayStartParams: {}
          template:
            metadata:
              annotations:
                gke-gcsfuse/volumes: "true"
                gke-gcsfuse/cpu-limit: "0"
                gke-gcsfuse/memory-limit: "0"
                gke-gcsfuse/ephemeral-storage-limit: "0"
            spec:
              serviceAccountName: $KSA_NAME
              containers:
              - name: ray-head
                image: $CUSTOM_IMAGE_URI
                imagePullPolicy: Always
                ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                resources:
                  limits:
                    cpu: "2"
                    memory: 16Gi
                  requests:
                    cpu: "2"
                    memory: 16Gi
                volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
              volumes:
              - name: dshm
                emptyDir:
                  medium: Memory
              - name: gke-gcsfuse-cache
                emptyDir:
                  medium: Memory
              - name: gcs-fuse-csi-ephemeral
                csi:
                  driver: gcsfuse.csi.storage.gke.io
                  volumeAttributes:
                    bucketName: $GS_BUCKET
                    mountOptions: "implicit-dirs"
        workerGroupSpecs:
        - groupName: tpu-group
          replicas: 1
          minReplicas: 1
          maxReplicas: 1
          numOfHosts: 4
          rayStartParams: {}
          template:
            metadata:
              annotations:
                gke-gcsfuse/volumes: "true"
                gke-gcsfuse/cpu-limit: "0"
                gke-gcsfuse/memory-limit: "0"
                gke-gcsfuse/ephemeral-storage-limit: "0"
            spec:
              serviceAccountName: $KSA_NAME
              containers:
              - name: ray-worker
                image: $CUSTOM_IMAGE_URI
                imagePullPolicy: Always
                resources:
                  limits:
                    cpu: "20"
                    google.com/tpu: "4"
                    memory: 200Gi
                  requests:
                    cpu: "20"
                    google.com/tpu: "4"
                    memory: 200Gi
                  claims:
                  - name: netdev
                env:
                - name: HF_HOME
                  value: "/data/huggingface"
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: JAX_PLATFORMS
                  value: "tpu,cpu"
                - name: NODE_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                - name: VBAR_CONTROL_SERVICE_URL
                  value: $(NODE_IP):8353
                - name: TPU_MULTIHOST_BACKEND
                  value: "ray"
                - name: TPU_BACKEND_TYPE
                  value: "jax"
                - name: ENABLE_PJRT_COMPATIBILITY
                  value: "true"
                volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
              volumes:
              - name: dshm
                emptyDir:
                  medium: Memory
              - name: gke-gcsfuse-cache
                emptyDir:
                  medium: Memory
              - name: gcs-fuse-csi-ephemeral
                csi:
                  driver: gcsfuse.csi.storage.gke.io
                  volumeAttributes:
                    bucketName: $GS_BUCKET
                    mountOptions: "implicit-dirs"
              resourceClaims:
              - name: netdev
                resourceClaimTemplateName: all-netdev
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                cloud.google.com/gke-tpu-topology: 4x4

Deploy the service by using the manifest:

Autopilot

To deploy the service in an Autopilot cluster, you must first download the manifest and edit it locally to add the opt-in ComputeClass nodeSelector, which is required for DRANET networking on Autopilot:
```
curl -O https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/main/ai-ml/gke-ray/rayserve/llm/tpu/ray-service.tpu-v6e-multihost.yaml
```

Add the label under the nodeSelector field so that it looks like this:

    nodeSelector:
      cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
      cloud.google.com/gke-tpu-topology: 4x4
      cloud.google.com/compute-class: dranet-compute-class

Then, deploy the service by using the modified local manifest:

    envsubst < ray-service.tpu-v6e-multihost.yaml | kubectl apply -f -

Standard

To deploy the service in a Standard cluster, deploy the manifest directly from the repository:

    envsubst < ai-ml/gke-ray/rayserve/llm/tpu/ray-service.tpu-v6e-multihost.yaml | kubectl apply -f -

Verification

Wait for the RayService to be available:

    kubectl wait --for=condition=Ready --timeout=1800s rayservice/vllm-tpu-multihost

To confirm the model loaded successfully, view the logs from the Ray head Pod:

    kubectl logs -f -l ray.io/node-type=head -c ray-head

Serve the model

In this section, you interact with the model. Make sure the model is fully downloaded before proceeding.

Set up port forwarding

Set up port forwarding to the model by running the following command:

    kubectl port-forward svc/vllm-tpu-multihost-head-svc 8000:8000 2>&1 >/dev/null &

Interact with the model using curl

This section shows how you can perform a basic smoke test to verify your deployed Gemma 4 model.

In a new terminal session, use curl to chat with your model:

    curl -X POST http://127.0.0.1:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "google/gemma-4-31B-it",
          "messages": [
            {
              "role": "user",
              "content": "Why is GKE managed DRANET preferred for multi-host TPU networking?"
            }
          ],
          "max_tokens": 256
        }'

The output looks similar to the following:

    {
      "id": "chatcmpl-392692d3-5325-4832-a3a3-0b084c1045b0",
      "object": "chat.completion",
      "created": 1779883255,
      "model": "google/gemma-4-31B-it",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "To understand why GKE-managed **DRANET** (Distributed RANET) is preferred for multi-host TPU networking, it is first necessary to understand the fundamental challenge of TPU pods: **the need for massive, low-latency, all-to-all communication.**\n\nWhen you scale a model across multiple TPU hosts (multi-host), the hosts must synchronize gradients and weights constantly. Standard TCP/IP networking introduces too much overhead (latency and CPU jitter) for these operations.\n\nHere is the detailed breakdown of why GKE-managed DRANET is the preferred architecture:\n\n### 1. Bypassing the Kernel (Zero-Copy Networking)\nStandard networking requires the operating system kernel to handle packets, moving data from the network card to kernel space and then to user space.\n*   **The DRANET Advantage:** DRANET implements a specialized networking stack that allows for **Kernel Bypass**. It enables the TPU hardware/drivers to write data directly into the memory of the destination host. This reduces latency and eliminates the CPU overhead associated with processing network interrupts.\n\n### 2. High-Bandwidth, Low-Latency Interconnect\nMulti-host TPU training relies on a specialized topology (like a 2D or 3D"
          },
          "finish_reason": "length"
        }
      ]
    }
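
The same smoke test can be scripted. The following sketch builds the request with Python's standard library and reads the reply from the response body. It assumes that the port forward from the previous step is active on 127.0.0.1:8000; the helper names build_chat_request, read_reply, and chat are hypothetical. A finish_reason of "length", as in the sample output, indicates that the reply was cut off at max_tokens.

```python
import json
import urllib.request


def build_chat_request(prompt, model="google/gemma-4-31B-it",
                       base_url="http://127.0.0.1:8000", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_reply(response: dict):
    """Return (text, truncated) from a chat.completion response body."""
    choice = response["choices"][0]
    return choice["message"]["content"], choice["finish_reason"] == "length"


def chat(prompt: str):
    """Send the prompt to the forwarded endpoint; requires a running service."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return read_reply(json.load(resp))
```

For example, calling chat("Why is GKE managed DRANET preferred for multi-host TPU networking?") returns the reply text together with a flag that tells you whether to raise max_tokens.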

(Optional) Interact with the model through a Gradio chat interface

In this section, you build a web chat application that lets you interact with your instruction-tuned model.

Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.

Deploy the chat interface

The manifest for the chat interface is available in the repository. Review the manifest configuration:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.7
            resources:
              requests:
                cpu: "250m"
                memory: "512Mi"
              limits:
                cpu: "500m"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/v1/chat/completions"
            - name: HOST
              value: "http://vllm-tpu-multihost-serve-svc:8000"
            - name: LLM_ENGINE
              value: "openai-chat"
            - name: MODEL_ID
              value: "google/gemma-4-31B-it"
            - name: DISABLE_SYSTEM_MESSAGE
              value: "true"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio
    spec:
      selector:
        app: gradio
      ports:
      - protocol: TCP
        port: 8080
        targetPort: 7860
      type: ClusterIP

Apply the manifest:

    kubectl apply -f ai-ml/gke-ray/rayserve/llm/tpu/components/gradio.yaml

Wait for the deployment to be available:

    kubectl wait --for=condition=Available --timeout=900s deployment/gradio

Use the chat interface

In Cloud Shell, run the following command:

    kubectl port-forward service/gradio 8080:8080

This creates a port forward from Cloud Shell to the Gradio service.

Click the Web Preview button, which is at the top right of the Cloud Shell taskbar. Click Preview on port 8080. A new tab opens in your browser.

Interact with Gemma using the Gradio chat interface. Add a prompt and click Submit.

Observe model performance

To view the dashboards for observability metrics of a model running on KubeRay, you can use the dedicated Ray on GKE dashboards.

For detailed instructions on configuring your cluster and accessing the observability dashboards, see Collect and view logs and metrics for RayClusters on Google Kubernetes Engine (GKE).

Access the Ray Dashboard

To inspect the status of your Ray actors, view detailed application logs, and monitor node-level utilization natively in Ray, you can access the Ray Dashboard.

Port-forward the Ray head node service to your local machine:

    kubectl port-forward svc/vllm-tpu-multihost-head-svc 8265:8265

Open your browser and navigate to http://localhost:8265. If you are using Cloud Shell, click the Web Preview button and select Preview on port 8265.
To view your vLLM deployments, model replica health, and query latencies, click the Serve tab.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the resources:

Delete the RayService:

    kubectl delete rayservice vllm-tpu-multihost

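Delete the GKE cluster. The following command targets the zonal Standard cluster. If you created the Autopilot cluster, which uses --location=${REGION}, specify your region instead of the zone:

    gcloud container clusters delete ${CLUSTER_NAME} --zone=${ZONE}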

What's next

Learn about Ray on Kubernetes.
Learn how to serve vLLM on GKE with TPUs.
Learn more about TPUs in GKE.
