This document explains how to manage inference requests across multiple Ray Serve clusters on Google Kubernetes Engine (GKE) by configuring the Kubernetes Gateway API and GKE Inference Gateway. This configuration lets you centralize traffic management for multiple teams, distribute workloads across regions for higher capacity, and implement model-aware routing based on request body content.
Benefits of using GKE Inference Gateway and Ray Serve
Using GKE Inference Gateway and Ray Serve offers the following benefits:
- Path routing: configure each RayService with a path prefix, then serve them all from one Gateway that routes to multiple RayServices. For more information about setting up path prefix rules, see the Gateway API documentation.
- Model-aware routing: choose a RayService to route to based on the request body—for example, by extracting the requested model from an OpenAI-API JSON request.
- Governance: require API keys to use your service, or enforce quota for users by using Apigee for authentication and API management.
- Multi-region: split traffic across multiple GKE clusters with RayServices to attain higher availability or capacity with multi-cluster Gateways.
- Separation of concerns: use separate RayServices, which can be administered by separate teams, follow separate rollouts, and run on different topologies.
- Security: use the Gateway as the SSL terminator to help secure your user traffic over the internet. For more information, see Gateway security.
To configure routing, you need to deploy a Gateway, HTTPRoute, and RayService. A Kubernetes Service for each target Ray cluster is typically created by KubeRay. Ray Serve spreads request load in-cluster, with no need to create an InferencePool or Endpoint Picker.
Model-aware routing for Ray Serve on GKE
Model-aware routing is enabled by a body-based routing extension. Body-based routing lets you direct traffic to different RayServices based on the model named in the request body, so a single endpoint can serve many models hosted across multiple Ray clusters. Your users get simplified access, and your app developers keep control over configuring each Ray endpoint.
To configure model-aware routing, you deploy the following key components:
- A body-based router extension to extract model names from JSON payloads. This router extension is deployed by using Helm.
- A GKE Gateway (L7 regional internal Application Load Balancer) to handle the incoming traffic.
- HTTPRoute rules to direct traffic to the correct Ray Service by using headers populated by the router extension.
- Multiple Ray Serve clusters to manage the lifecycle and autoscaling of siloed models.
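Conceptually, the extraction that the body-based router performs is small: parse the JSON body and surface the `model` field as a routing header. The following is a minimal sketch of that idea, not the extension's actual implementation; the `extract_model_header` function is hypothetical, but the field and header names match the Helm values used in this guide:

```python
import json

def extract_model_header(body: bytes, field="model",
                         header="X-Gateway-Model-Name") -> dict:
    """Sketch of the body-field-to-header idea: pull the requested
    model out of an OpenAI-style JSON payload and expose it as a
    header that HTTPRoute rules can match on."""
    try:
        model = json.loads(body).get(field)
    except json.JSONDecodeError:
        return {}  # not JSON: no header is added, so no rule matches
    return {header: model} if model else {}

request_body = b'{"model": "gemma-2b-it", "messages": [{"role": "user", "content": "Hi"}]}'
print(extract_model_header(request_body))
# {'X-Gateway-Model-Name': 'gemma-2b-it'}
```

Requests without a `model` field produce no header, so they fall through without matching any model-specific route.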
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the `gcloud components update` command. Earlier gcloud CLI versions might not support running the commands in this document.
- Ensure that you have Helm installed.
- Create a Hugging Face account, if you don't already have one.
- Ensure that you have a Hugging Face token.
Prepare your environment
Set up environment variables:
```shell
export CLUSTER=$(whoami)-ray-bbr
export PROJECT_ID=$(gcloud config get-value project)
export LOCATION=us-central1-b
export REGION=us-central1
export HUGGING_FACE_TOKEN=YOUR_HUGGING_FACE_TOKEN
```

Replace YOUR_HUGGING_FACE_TOKEN with your Hugging Face access token.
Prepare your infrastructure
In this section, you set up a Ray-enabled, Gateway-enabled GKE cluster with L4 GPUs.
-   Create a cluster with the Ray Operator and Gateway API enabled:

    ```shell
    gcloud container clusters create ${CLUSTER} \
        --project ${PROJECT_ID} \
        --location ${LOCATION} \
        --cluster-version 1.35 \
        --gateway-api standard \
        --addons HttpLoadBalancing,RayOperator \
        --enable-ray-cluster-logging \
        --enable-ray-cluster-monitoring \
        --machine-type e2-standard-4
    ```

-   Create a GPU node pool for your model workloads:

    ```shell
    gcloud container node-pools create gpu-pool \
        --cluster=${CLUSTER} \
        --location=${LOCATION} \
        --accelerator="type=nvidia-l4,count=1,gpu-driver-version=latest" \
        --machine-type=g2-standard-8 \
        --num-nodes=4
    ```

-   Create a proxy-only subnet for the regional internal Application Load Balancer, which body-based routing requires:

    ```shell
    gcloud compute networks subnets create bbr-proxy-only-subnet \
        --purpose=REGIONAL_MANAGED_PROXY \
        --role=ACTIVE \
        --region=${REGION} \
        --network=default \
        --range=192.168.10.0/24
    ```

-   Deploy your Hugging Face secret:

    ```shell
    kubectl create secret generic hf-secret \
        --from-literal=hf_api_token=${HUGGING_FACE_TOKEN}
    ```
Deploy the body-based router for model-aware routing
The body-based router extension intercepts requests, parses the JSON body, and extracts the model field into an X-Gateway-Model-Name header.

-   Create a file named helm-values.yaml with the following content:

    ```yaml
    bbr:
      plugins:
      - type: "body-field-to-header"
        name: "openai-model-extractor"
        json:
          field_name: "model"
          header_name: "X-Gateway-Model-Name"
    ```

-   Install the body-based router by using Helm:

    ```shell
    helm install body-based-router \
        oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing \
        --version v1.4.0 \
        --set provider.name=gke \
        --set inferenceGateway.name=ray-multi-model-gateway \
        --values helm-values.yaml
    ```
Deploy RayServices
To deploy your models, apply the RayService manifests. Each manifest defines a Ray cluster that runs a specific LLM.

-   Create a file named gemma-2b-it.yaml with the following content:

    ```yaml
    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: gemma-2b-it
    spec:
      serveConfigV2: |
        applications:
        - name: llm_app
          route_prefix: "/"
          import_path: ray.serve.llm:build_openai_app
          args:
            llm_configs:
            - model_loading_config:
                model_id: gemma-2b-it
                model_source: google/gemma-2b-it
              accelerator_type: L4
              log_engine_metrics: true
              deployment_config:
                autoscaling_config:
                  min_replicas: 2
                  max_replicas: 2
                health_check_period_s: 600
                health_check_timeout_s: 300
      rayClusterConfig:
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
            num-cpus: "0"
          template:
            spec:
              containers:
              - name: ray-head
                image: rayproject/ray-llm:2.54.0-py311-cu128
                resources:
                  limits:
                    memory: "8Gi"
                    ephemeral-storage: "32Gi"
                  requests:
                    cpu: "2"
                    memory: "8Gi"
                    ephemeral-storage: "32Gi"
                ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
        rayVersion: 2.54.0
        workerGroupSpecs:
        - replicas: 2
          minReplicas: 2
          maxReplicas: 2
          groupName: gpu-group
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: llm
                image: rayproject/ray-llm:2.54.0-py311-cu128
                env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
                  requests:
                    cpu: "6"
                    memory: "24Gi"
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
              nodeSelector:
                cloud.google.com/gke-accelerator: nvidia-l4
    ```

-   Create a file named qwen2.5-3b.yaml with the following content:

    ```yaml
    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: qwen-25-3b
    spec:
      serveConfigV2: |
        applications:
        - name: llm_app
          route_prefix: "/"
          import_path: ray.serve.llm:build_openai_app
          args:
            llm_configs:
            - model_loading_config:
                model_id: qwen-2.5-3b
                model_source: Qwen/Qwen2.5-3B
              accelerator_type: L4
              log_engine_metrics: true
              deployment_config:
                autoscaling_config:
                  min_replicas: 2
                  max_replicas: 2
                health_check_period_s: 600
                health_check_timeout_s: 300
      rayClusterConfig:
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
            num-cpus: "0"
          template:
            spec:
              containers:
              - name: ray-head
                image: rayproject/ray-llm:2.54.0-py311-cu128
                resources:
                  limits:
                    memory: "8Gi"
                    ephemeral-storage: "32Gi"
                  requests:
                    cpu: "2"
                    memory: "8Gi"
                    ephemeral-storage: "32Gi"
                ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
        rayVersion: 2.54.0
        workerGroupSpecs:
        - replicas: 2
          minReplicas: 2
          maxReplicas: 2
          groupName: gpu-group
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: llm
                image: rayproject/ray-llm:2.54.0-py311-cu128
                env:
                - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                  value: "1"
                - name: RAY_SERVE_ENABLE_HA_PROXY
                  value: "1"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
                  requests:
                    cpu: "6"
                    memory: "24Gi"
                    nvidia.com/gpu: "1"
                    ephemeral-storage: "24Gi"
              nodeSelector:
                cloud.google.com/gke-accelerator: nvidia-l4
    ```

-   Deploy the models:

    ```shell
    kubectl apply -f gemma-2b-it.yaml
    kubectl apply -f qwen2.5-3b.yaml
    ```
Configure health checks
To help ensure that the load balancer accurately monitors Ray worker health, apply HealthCheckPolicy resources.

-   Create a file named healthcheck-policy.yaml with the following content:

    ```yaml
    apiVersion: networking.gke.io/v1
    kind: HealthCheckPolicy
    metadata:
      name: gemma-serve-healthcheck
      namespace: default
    spec:
      default:
        checkIntervalSec: 5
        timeoutSec: 5
        healthyThreshold: 2
        unhealthyThreshold: 2
        config:
          type: HTTP
          httpHealthCheck:
            port: 8000
            requestPath: /-/healthz
      targetRef:
        group: ""
        kind: Service
        name: gemma-2b-it-serve-svc
    ---
    apiVersion: networking.gke.io/v1
    kind: HealthCheckPolicy
    metadata:
      name: qwen-serve-healthcheck
      namespace: default
    spec:
      default:
        checkIntervalSec: 5
        timeoutSec: 5
        healthyThreshold: 2
        unhealthyThreshold: 2
        config:
          type: HTTP
          httpHealthCheck:
            port: 8000
            requestPath: /-/healthz
      targetRef:
        group: ""
        kind: Service
        name: qwen-25-3b-serve-svc
    ```

-   Apply the health check policy:

    ```shell
    kubectl apply -f healthcheck-policy.yaml
    ```
Configure routing
To configure routing, apply the Gateway and HTTPRoute manifests. The HTTPRoute contains rules that match the X-Gateway-Model-Name header (populated by the body-based router) to route traffic to the appropriate RayService.

-   Create a file named gateway.yaml with the following content:

    ```yaml
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: ray-multi-model-gateway
      namespace: default
    spec:
      gatewayClassName: gke-l7-rilb
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: ray-multi-model-route
    spec:
      parentRefs:
      - name: ray-multi-model-gateway
      rules:
      - matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: gemma-2b-it # Must match the model named in the JSON request.
          path:
            type: PathPrefix
            value: /
        backendRefs:
        - name: gemma-2b-it-serve-svc # RayService name plus "-serve-svc".
          kind: Service
          port: 8000
      - matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen-2.5-3b # Matches another extracted model name.
          path:
            type: PathPrefix
            value: /
        backendRefs:
        - name: qwen-25-3b-serve-svc # Target RayService.
          kind: Service
          port: 8000
    ```

-   Apply the gateway and route:

    ```shell
    kubectl apply -f gateway.yaml
    ```
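The routing behavior that these HTTPRoute rules express can be summarized in a few lines. The following is an illustrative sketch of the match logic, not GKE's implementation; the `pick_backend` helper and `ROUTES` table are hypothetical, but the names mirror the manifests in this guide:

```python
# Map of extracted model names to (Service, port) backends, mirroring the
# Exact-match header rules in gateway.yaml.
ROUTES = {
    "gemma-2b-it": ("gemma-2b-it-serve-svc", 8000),
    "qwen-2.5-3b": ("qwen-25-3b-serve-svc", 8000),
}

def pick_backend(headers: dict):
    """Select a backend Service based on the X-Gateway-Model-Name header
    that the body-based router populated; None means no rule matched."""
    return ROUTES.get(headers.get("X-Gateway-Model-Name"))

print(pick_backend({"X-Gateway-Model-Name": "qwen-2.5-3b"}))
# ('qwen-25-3b-serve-svc', 8000)
```

Because each rule uses an Exact header match, a typo in the model name in either the request or the manifest means no backend is selected.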
Test the deployment
After the Gateway is provisioned and both Ray clusters are ready, you can test routing by sending requests with different model names in the JSON body.
-   Get the Gateway IP address:

    ```shell
    kubectl get gateways ray-multi-model-gateway
    ```

-   Start a shell in a network that can reach the Gateway address. For example, you can use curl from one of the Ray cluster Pods:

    ```shell
    POD_NAME=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -it $POD_NAME -- bash
    ```

-   Test routing to Gemma:

    ```shell
    curl http://GATEWAY_IP_ADDRESS/v1/chat/completions \
        --header 'Content-Type: application/json' \
        --data '{
          "model": "gemma-2b-it",
          "messages": [{"role": "user", "content": "Tell me about GKE."}]
        }'
    ```

    Replace GATEWAY_IP_ADDRESS with the IP address from the previous step.

    The output is similar to the following:

    ```
    {"id":"chatcmpl-594f7cab-f991-4522-9829-acdbb65d9f67","object":"chat.completion","created":1776379509,"model":"gemma-2b-it","choices":[{"index":0,"message":{"role":"assistant","content":"**Google Kubernetes Engine (GKE)** is a fully managed container orchestration service for Kubernetes [...]
    ```

-   Test routing to Qwen:

    ```shell
    curl http://GATEWAY_IP_ADDRESS/v1/chat/completions \
        --header 'Content-Type: application/json' \
        --data '{
          "model": "qwen-2.5-3b",
          "messages": [{"role": "user", "content": "How does Ray Serve work?"}]
        }'
    ```

    The output is similar to the following:

    ```
    {"id":"chatcmpl-dfe3f3b7-45fc-481c-b53e-2fc09c033cdb","object":"chat.completion","created":1776380249,"model":"qwen-2.5-3b","choices":[{"index":0,"message":{"role":"assistant","content":"Ray Serve facilitates the hosting and deployment of scalable microservices. [...]
    ```
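You can send the same test requests from any client that can reach the Gateway. The following is a minimal sketch using only the Python standard library; the `chat` helper is hypothetical, and GATEWAY_IP is a placeholder you must replace with the address from the earlier step:

```python
import json
import urllib.request

GATEWAY_IP = "GATEWAY_IP_ADDRESS"  # placeholder: use the address from `kubectl get gateways`

def chat(model: str, prompt: str) -> dict:
    """POST an OpenAI-style chat request; the gateway routes on the "model" field."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"http://{GATEWAY_IP}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Requires network access to the Gateway:
# print(chat("gemma-2b-it", "Tell me about GKE.")["model"])
```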
The body-based router automatically extracts the value of the model field and ensures that each request reaches the correct backend service configured in the gateway.yaml file.
Clean up
Delete the cluster:
```shell
gcloud container clusters delete ${CLUSTER} \
    --location ${LOCATION}
```
What's next
- Learn about performance optimizations for Ray Serve.
- Read more about Gateway on GKE.

