This document explains how to manage inference requests across multiple Ray Serve clusters on Google Kubernetes Engine (GKE) by configuring the Kubernetes Gateway API and GKE Inference Gateway. This configuration lets you centralize traffic management for multiple teams, distribute workloads across regions for higher capacity, and implement model-aware routing based on request body content.
Benefits of using GKE Inference Gateway and Ray Serve
Using GKE Inference Gateway and Ray Serve offers the following benefits:
- Path routing: configure each RayService with a path prefix, then serve them all from one Gateway that routes to multiple RayServices. For more information about setting up path prefix rules, see the Gateway API documentation.
- Model-aware routing: choose a RayService to route to based on the request body—for example, by extracting the requested model from an OpenAI-API JSON request.
- Governance: require API keys to use your service, or enforce quota for users by using Apigee for authentication and API management.
- Multi-region: split traffic across multiple GKE clusters with RayServices to attain higher availability or capacity with multi-cluster Gateways.
- Separation of concerns: use separate RayServices, which can be administered by separate teams, follow separate rollouts, and run on different topologies.
- Security: use the Gateway as the SSL terminator to help secure your user traffic over the internet. For more information, see Gateway security.
To configure routing, you need to deploy a Gateway, HTTPRoute, and RayService. A Kubernetes Service for each target Ray cluster is typically created by KubeRay. Ray Serve spreads request load in-cluster, with no need to create an InferencePool or Endpoint Picker.
Model-aware routing for Ray Serve on GKE
Model-aware routing is enabled by a body-based routing extension. Body-based routing lets you direct traffic to different RayServices based on the model named in the request body, so a single endpoint can serve many models hosted across multiple Ray clusters. Your users get simplified access, and your app developers keep control over configuring each Ray endpoint.
To configure model-aware routing, you deploy the following key components:
- A body-based router extension to extract model names from JSON payloads. This router extension is deployed by using Helm.
- A GKE Gateway (L7 regional internal Application Load Balancer) to handle the incoming traffic.
- HTTPRoute rules to direct traffic to the correct Ray Service by using headers populated by the router extension.
- Multiple Ray Serve clusters to manage the lifecycle and autoscaling of siloed models.
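Conceptually, the extraction that the body-based router performs is small: parse the JSON body and surface the `model` field as a routing header. The following is a minimal sketch of that idea, not the extension's actual implementation; the `extract_model_header` function is hypothetical, but the field and header names match the Helm values used in this guide:

```python
import json

def extract_model_header(body: bytes, field="model",
                         header="X-Gateway-Model-Name") -> dict:
    """Sketch of the body-field-to-header idea: pull the requested
    model out of an OpenAI-style JSON payload and expose it as a
    header that HTTPRoute rules can match on."""
    try:
        model = json.loads(body).get(field)
    except json.JSONDecodeError:
        return {}  # not JSON: no header is added, so no rule matches
    return {header: model} if model else {}

request_body = b'{"model": "gemma-2b-it", "messages": [{"role": "user", "content": "Hi"}]}'
print(extract_model_header(request_body))
# {'X-Gateway-Model-Name': 'gemma-2b-it'}
```

Requests without a `model` field produce no header, so they fall through without matching any model-specific route.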
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the `gcloud components update` command. Earlier gcloud CLI versions might not support running the commands in this document.
- Ensure that you have Helm installed.
- Create a Hugging Face account, if you don't already have one.
- Ensure that you have a Hugging Face token.
Prepare your environment
Set up environment variables:
```shell
export CLUSTER=$(whoami)-ray-bbr
export PROJECT_ID=$(gcloud config get-value project)
export LOCATION=us-central1-b
export REGION=us-central1
export HUGGING_FACE_TOKEN=YOUR_HUGGING_FACE_TOKEN
```

Replace YOUR_HUGGING_FACE_TOKEN with your Hugging Face access token.
Prepare your infrastructure
In this section, you set up a Ray-enabled, Gateway-enabled GKE cluster with L4 GPUs.
-   Create a cluster with the Ray Operator and Gateway API enabled:

    ```shell
    gcloud container clusters create ${CLUSTER} \
        --project ${PROJECT_ID} \
        --location ${LOCATION} \
        --cluster-version 1.35 \
        --gateway-api standard \
        --addons HttpLoadBalancing,RayOperator \
        --enable-ray-cluster-logging \
        --enable-ray-cluster-monitoring \
        --machine-type e2-standard-4
    ```

-   Create a GPU node pool for your model workloads:

    ```shell
    gcloud container node-pools create gpu-pool \
        --cluster=${CLUSTER} \
        --location=${LOCATION} \
        --accelerator="type=nvidia-l4,count=1,gpu-driver-version=latest" \
        --machine-type=g2-standard-8 \
        --num-nodes=4
    ```

-   Create a proxy-only subnet for the regional internal Application Load Balancer, which body-based routing requires:

    ```shell
    gcloud compute networks subnets create bbr-proxy-only-subnet \
        --purpose=REGIONAL_MANAGED_PROXY \
        --role=ACTIVE \
        --region=${REGION} \
        --network=default \
        --range=192.168.10.0/24
    ```

-   Deploy your Hugging Face secret:

    ```shell
    kubectl create secret generic hf-secret \
        --from-literal=hf_api_token=${HUGGING_FACE_TOKEN}
    ```
Deploy the body-based router for model-aware routing
The body-based router extension intercepts requests, parses the JSON body, and extracts the model field into an X-Gateway-Model-Name header.

-   Create a file named helm-values.yaml with the following content:

    ```yaml
    bbr:
      plugins:
      - type: "body-field-to-header"
        name: "openai-model-extractor"
        json:
          field_name: "model"
          header_name: "X-Gateway-Model-Name"
    ```

-   Install the body-based router by using Helm:

    ```shell
    helm install body-based-router \
        oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing \
        --version v1.4.0 \
        --set provider.name=gke \
        --set inferenceGateway.name=ray-multi-model-gateway \
        --values helm-values.yaml
    ```
Deploy RayServices
To deploy your models, apply the RayService manifests. Each manifest defines a Ray cluster that runs a specific LLM.

-   Create a file named gemma-2b-it.yaml with the following content:

    ```yaml
    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: gemma-2b-it
    spec:
      serveConfigV2: |
        applications:
        - name: llm_app
          route_prefix: "/"
          import_path: ray.serve.llm:build_openai_app
          args:
            llm_configs:
            - model_loading_config:
                model_id: gemma-2b-it
                model_source: google/gemma-2b-it
              accelerator_type: L4
              log_engine_metrics: true
              deployment_config:
                autoscaling_config:
                  min_replicas: 2
                  max_replicas: 2
                health_check_period_s: 600
                health_check_timeout_s: 300
      rayClusterConfig:
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
            num-cpus: "0"
          template:
            spec:
              containers:
              - name: ray-head
                image: rayproject/ray-llm:2.54.0-py311-cu128
                resources:
                  limits:
                    memory: "8Gi"
                    ephemeral-storage: "32Gi"
                  requests:
                    cpu: "2"
                    memory: "8Gi"
                    ephemeral-storage: "32Gi"
                ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
        rayVersion: 2.54.0
        workerGroupSpecs:
        - replicas: 2
          minReplicas: 2
          maxReplicas: 2
          groupName: gpu-group
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: llm
                image: rayproject/ray-llm:2.54.0-py311-cu128
                env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
                  requests:
                    cpu: "6"
                    memory: "24Gi"
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
              nodeSelector:
                cloud.google.com/gke-accelerator: nvidia-l4
    ```

-   Create a file named qwen2.5-3b.yaml with the following content:

    ```yaml
    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: qwen-25-3b
    spec:
      serveConfigV2: |
        applications:
        - name: llm_app
          route_prefix: "/"
          import_path: ray.serve.llm:build_openai_app
          args:
            llm_configs:
            - model_loading_config:
                model_id: qwen-2.5-3b
                model_source: Qwen/Qwen2.5-3B
              accelerator_type: L4
              log_engine_metrics: true
              deployment_config:
                autoscaling_config:
                  min_replicas: 2
                  max_replicas: 2
                health_check_period_s: 600
                health_check_timeout_s: 300
      rayClusterConfig:
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
            num-cpus: "0"
          template:
            spec:
              containers:
              - name: ray-head
                image: rayproject/ray-llm:2.54.0-py311-cu128
                resources:
                  limits:
                    memory: "8Gi"
                    ephemeral-storage: "32Gi"
                  requests:
                    cpu: "2"
                    memory: "8Gi"
                    ephemeral-storage: "32Gi"
                ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
        rayVersion: 2.54.0
        workerGroupSpecs:
        - replicas: 2
          minReplicas: 2
          maxReplicas: 2
          groupName: gpu-group
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: llm
                image: rayproject/ray-llm:2.54.0-py311-cu128
                env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
                  requests:
                    cpu: "6"
                    memory: "24Gi"
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
              nodeSelector:
                cloud.google.com/gke-accelerator: nvidia-l4
    ```

-   Deploy the models:

    ```shell
    kubectl apply -f gemma-2b-it.yaml
    kubectl apply -f qwen2.5-3b.yaml
    ```
Configure health checks
To help ensure that the load balancer accurately monitors Ray worker health, apply HealthCheckPolicy resources.

-   Create a file named healthcheck-policy.yaml with the following content:

    ```yaml
    apiVersion: networking.gke.io/v1
    kind: HealthCheckPolicy
    metadata:
      name: gemma-serve-healthcheck
      namespace: default
    spec:
      default:
        checkIntervalSec: 5
        timeoutSec: 5
        healthyThreshold: 2
        unhealthyThreshold: 2
        config:
          type: HTTP
          httpHealthCheck:
            port: 8000
            requestPath: /-/healthz
      targetRef:
        group: ""
        kind: Service
        name: gemma-2b-it-serve-svc
    ---
    apiVersion: networking.gke.io/v1
    kind: HealthCheckPolicy
    metadata:
      name: qwen-serve-healthcheck
      namespace: default
    spec:
      default:
        checkIntervalSec: 5
        timeoutSec: 5
        healthyThreshold: 2
        unhealthyThreshold: 2
        config:
          type: HTTP
          httpHealthCheck:
            port: 8000
            requestPath: /-/healthz
      targetRef:
        group: ""
        kind: Service
        name: qwen-25-3b-serve-svc
    ```

-   Apply the health check policy:

    ```shell
    kubectl apply -f healthcheck-policy.yaml
    ```
Configure routing
To configure routing, apply the Gateway and HTTPRoute manifests. The HTTPRoute contains rules that match the X-Gateway-Model-Name header (populated by the body-based router) to route traffic to the appropriate RayService.

-   Create a file named gateway.yaml with the following content:

    ```yaml
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: ray-multi-model-gateway
      namespace: default
    spec:
      gatewayClassName: gke-l7-rilb
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: ray-multi-model-route
    spec:
      parentRefs:
      - name: ray-multi-model-gateway
      rules:
      - matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: gemma-2b-it # Must match the model named in the JSON request.
          path:
            type: PathPrefix
            value: /
        backendRefs:
        - name: gemma-2b-it-serve-svc # RayService name plus "-serve-svc".
          kind: Service
          port: 8000
      - matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen-2.5-3b # Matches another extracted model name.
          path:
            type: PathPrefix
            value: /
        backendRefs:
        - name: qwen-25-3b-serve-svc # Target RayService.
          kind: Service
          port: 8000
    ```

-   Apply the gateway and route:

    ```shell
    kubectl apply -f gateway.yaml
    ```
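The routing behavior that these HTTPRoute rules express can be summarized in a few lines. The following is an illustrative sketch of the match logic, not GKE's implementation; the `pick_backend` helper and `ROUTES` table are hypothetical, but the names mirror the manifests in this guide:

```python
# Map of extracted model names to (Service, port) backends, mirroring the
# Exact-match header rules in gateway.yaml.
ROUTES = {
    "gemma-2b-it": ("gemma-2b-it-serve-svc", 8000),
    "qwen-2.5-3b": ("qwen-25-3b-serve-svc", 8000),
}

def pick_backend(headers: dict):
    """Select a backend Service based on the X-Gateway-Model-Name header
    that the body-based router populated; None means no rule matched."""
    return ROUTES.get(headers.get("X-Gateway-Model-Name"))

print(pick_backend({"X-Gateway-Model-Name": "qwen-2.5-3b"}))
# ('qwen-25-3b-serve-svc', 8000)
```

Because each rule uses an Exact header match, a typo in the model name in either the request or the manifest means no backend is selected.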
Test the deployment
After the Gateway is provisioned and both Ray clusters are ready, you can test routing by sending requests with different model names in the JSON body.
-   Get the Gateway IP address:

    ```shell
    kubectl get gateways ray-multi-model-gateway
    ```

-   Start a shell in a network that can reach the Gateway address. For example, you can use curl from one of the Ray cluster Pods:

    ```shell
    POD_NAME=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -it $POD_NAME -- bash
    ```

-   Test routing to Gemma:

    ```shell
    curl http://GATEWAY_IP_ADDRESS/v1/chat/completions \
        --header 'Content-Type: application/json' \
        --data '{
          "model": "gemma-2b-it",
          "messages": [{"role": "user", "content": "Tell me about GKE."}]
        }'
    ```

    Replace GATEWAY_IP_ADDRESS with the IP address from the previous step.

    The output is similar to the following:

    ```
    {"id":"chatcmpl-594f7cab-f991-4522-9829-acdbb65d9f67","object":"chat.completion","created":1776379509,"model":"gemma-2b-it","choices":[{"index":0,"message":{"role":"assistant","content":"**Google Kubernetes Engine (GKE)** is a fully managed container orchestration service for Kubernetes [...]
    ```

-   Test routing to Qwen:

    ```shell
    curl http://GATEWAY_IP_ADDRESS/v1/chat/completions \
        --header 'Content-Type: application/json' \
        --data '{
          "model": "qwen-2.5-3b",
          "messages": [{"role": "user", "content": "How does Ray Serve work?"}]
        }'
    ```

    The output is similar to the following:

    ```
    {"id":"chatcmpl-dfe3f3b7-45fc-481c-b53e-2fc09c033cdb","object":"chat.completion","created":1776380249,"model":"qwen-2.5-3b","choices":[{"index":0,"message":{"role":"assistant","content":"Ray Serve facilitates the hosting and deployment of scalable microservices. [...]
    ```
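You can send the same test requests from any client that can reach the Gateway. The following is a minimal sketch using only the Python standard library; the `chat` helper is hypothetical, and GATEWAY_IP is a placeholder you must replace with the address from the earlier step:

```python
import json
import urllib.request

GATEWAY_IP = "GATEWAY_IP_ADDRESS"  # placeholder: use the address from `kubectl get gateways`

def chat(model: str, prompt: str) -> dict:
    """POST an OpenAI-style chat request; the gateway routes on the "model" field."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"http://{GATEWAY_IP}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Requires network access to the Gateway:
# print(chat("gemma-2b-it", "Tell me about GKE.")["model"])
```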
The body-based router automatically extracts the value of the model field and ensures that each request reaches the correct backend service configured in the gateway.yaml file.
Clean up
Delete the cluster:
```shell
gcloud container clusters delete ${CLUSTER} \
    --location ${LOCATION}
```
What's next
- Learn about performance optimizations for Ray Serve.
- Read more about Gateway on GKE.

