Set up the multi-cluster GKE Inference Gateway

This document describes how to set up the multi-cluster Google Kubernetes Engine (GKE) Inference Gateway to intelligently load-balance your AI/ML inference workloads across multiple GKE clusters, which can span different regions. This setup uses Gateway API, Multi Cluster Ingress, and custom resources like InferencePool and InferenceObjective to improve scalability, help ensure high availability, and optimize resource utilization for your model-serving deployments.

To understand this document, be familiar with the following:

This document is for the following personas:

  • Machine learning (ML) engineers, Platform admins and operators, or Data and AI specialists who want to use GKE's container orchestration capabilities for serving AI/ML workloads.
  • Cloud architects or Networking specialists who interact with GKE networking.

To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks .

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Enable the Compute Engine API, the Google Kubernetes Engine API, the Model Armor API, and the Network Services API.

    Go to Enable access to APIs and follow the instructions.

  • Enable the Autoscaling API.

    Go to Autoscaling API and follow the instructions.

  • Hugging Face prerequisites:

    • Create a Hugging Face account if you don't already have one.
    • Request and get approval for access to the Llama 3.1 model on Hugging Face.
    • Sign the license consent agreement on the model's page on Hugging Face.
    • Generate a Hugging Face access token with at least Read permissions.

Requirements

  • Ensure your project has sufficient quota for H100 GPUs. For more information, see Plan GPU quota and Allocation quotas .
  • Use GKE version 1.34.1-gke.1127000 or later.
  • Use gcloud CLI version 480.0.0 or later.
  • Your node service accounts must have permissions to write metrics to the Autoscaling API.
  • You must have the following IAM roles on the project: roles/container.admin and roles/iam.serviceAccountAdmin .

Set up multi-cluster Inference Gateway

To set up the multi-cluster GKE Inference Gateway, follow these steps:

Create clusters and node pools

To host your AI/ML inference workloads and enable cross-regional load balancing, create two GKE clusters in different regions, each with an H100 GPU node pool.

  1. Create the first cluster:

     gcloud container clusters create CLUSTER_1_NAME \
         --region LOCATION \
         --project=PROJECT_ID \
         --gateway-api=standard \
         --release-channel "rapid" \
         --cluster-version=GKE_VERSION \
         --machine-type="MACHINE_TYPE" \
         --disk-type="DISK_TYPE" \
         --enable-managed-prometheus \
         --monitoring=SYSTEM,DCGM \
         --hpa-profile=performance \
         --async # Allows the command to return immediately

    Replace the following:

    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • LOCATION : the region for the first cluster, for example europe-west3 .
    • PROJECT_ID : your project ID.
    • GKE_VERSION : the GKE version to use, for example 1.34.1-gke.1127000 .
    • MACHINE_TYPE : the machine type for the cluster nodes, for example c2-standard-16 .
    • DISK_TYPE : the disk type for the cluster nodes, for example pd-standard .
  2. Create an H100 node pool for the first cluster:

     gcloud container node-pools create NODE_POOL_NAME \
         --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
         --project=PROJECT_ID \
         --location=CLUSTER_1_ZONE \
         --node-locations=CLUSTER_1_ZONE \
         --cluster=CLUSTER_1_NAME \
         --machine-type=NODE_POOL_MACHINE_TYPE \
         --num-nodes=NUM_NODES \
         --spot \
         --async # Allows the command to return immediately

    Replace the following:

    • NODE_POOL_NAME : the name of the node pool, for example h100 .
    • PROJECT_ID : your project ID.
    • CLUSTER_1_ZONE : the zone for the first cluster, for example europe-west3-c .
    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • NODE_POOL_MACHINE_TYPE : the machine type for the node pool, for example a3-highgpu-2g .
    • NUM_NODES : the number of nodes in the node pool, for example 3 .
  3. Get the credentials:

     gcloud container clusters get-credentials CLUSTER_1_NAME \
         --location CLUSTER_1_ZONE \
         --project=PROJECT_ID

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • CLUSTER_1_ZONE : the zone for the first cluster, for example europe-west3-c .
  4. On the first cluster, create a secret for the Hugging Face token:

     kubectl create secret generic hf-token \
         --from-literal=token=HF_TOKEN

    Replace HF_TOKEN with your Hugging Face access token.

  5. Create the second cluster in a different region from the first cluster:

     gcloud container clusters create gke-east \
         --region LOCATION \
         --project=PROJECT_ID \
         --gateway-api=standard \
         --release-channel "rapid" \
         --cluster-version=GKE_VERSION \
         --machine-type="MACHINE_TYPE" \
         --disk-type="DISK_TYPE" \
         --enable-managed-prometheus \
         --monitoring=SYSTEM,DCGM \
         --hpa-profile=performance \
         --async # Allows the command to return immediately while the cluster is created in the background.

    Replace the following:

    • LOCATION : the region for the second cluster. This must be a different region than the first cluster. For example, us-east4 .
    • PROJECT_ID : your project ID.
    • GKE_VERSION : the GKE version to use, for example 1.34.1-gke.1127000 .
    • MACHINE_TYPE : the machine type for the cluster nodes, for example c2-standard-16 .
    • DISK_TYPE : the disk type for the cluster nodes, for example pd-standard .
  6. Create an H100 node pool for the second cluster:

     gcloud container node-pools create h100 \
         --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
         --project=PROJECT_ID \
         --location=CLUSTER_2_ZONE \
         --node-locations=CLUSTER_2_ZONE \
         --cluster=CLUSTER_2_NAME \
         --machine-type=NODE_POOL_MACHINE_TYPE \
         --num-nodes=NUM_NODES \
         --spot \
         --async # Allows the command to return immediately

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_2_ZONE : the zone for the second cluster, for example us-east4-a .
    • CLUSTER_2_NAME : the name of the second cluster, for example gke-east .
    • NODE_POOL_MACHINE_TYPE : the machine type for the node pool, for example a3-highgpu-2g .
    • NUM_NODES : the number of nodes in the node pool, for example 3 .
  7. For the second cluster, get credentials and create a secret for the Hugging Face token:

     gcloud container clusters get-credentials CLUSTER_2_NAME \
         --location CLUSTER_2_ZONE \
         --project=PROJECT_ID

     kubectl create secret generic hf-token \
         --from-literal=token=HF_TOKEN

    Replace the following:

    • CLUSTER_2_NAME : the name of the second cluster, for example gke-east .
    • CLUSTER_2_ZONE : the zone for the second cluster, for example us-east4-a .
    • PROJECT_ID : your project ID.
    • HF_TOKEN : your Hugging Face access token.
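
    Because the cluster and node pool commands in this section use the --async flag, they return before provisioning finishes. Before you continue, you can optionally confirm that the clusters and their node pools have reached the RUNNING status. The following commands are a quick check that reuses the placeholders from the preceding steps:

     gcloud container clusters list --project=PROJECT_ID

     gcloud container node-pools list \
         --cluster=CLUSTER_1_NAME \
         --location=CLUSTER_1_ZONE \
         --project=PROJECT_ID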

Register clusters to a fleet

To enable multi-cluster capabilities, such as the multi-cluster GKE Inference Gateway, register your clusters to a fleet.

  1. Register both clusters to your project's fleet:

     gcloud container fleet memberships register CLUSTER_1_NAME \
         --gke-cluster CLUSTER_1_ZONE/CLUSTER_1_NAME \
         --location=global \
         --project=PROJECT_ID

     gcloud container fleet memberships register CLUSTER_2_NAME \
         --gke-cluster CLUSTER_2_ZONE/CLUSTER_2_NAME \
         --location=global \
         --project=PROJECT_ID

    Replace the following:

    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • CLUSTER_1_ZONE : the zone for the first cluster, for example europe-west3-c .
    • PROJECT_ID : your project ID.
    • CLUSTER_2_NAME : the name of the second cluster, for example gke-east .
    • CLUSTER_2_ZONE : the zone for the second cluster, for example us-east4-a .
  2. To allow a single Gateway to manage traffic across multiple clusters, enable the multi-cluster Ingress feature and designate a config cluster:

     gcloud container fleet ingress enable \
         --config-membership=projects/PROJECT_ID/locations/global/memberships/CLUSTER_1_NAME

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
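
    To optionally confirm that both clusters are registered and that the multi-cluster Ingress feature is active, you can list the fleet memberships and describe the feature. The exact output depends on your gcloud CLI version:

     gcloud container fleet memberships list --project=PROJECT_ID

     gcloud container fleet ingress describe --project=PROJECT_ID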

Create proxy-only subnets

For an internal gateway, create a proxy-only subnet in each region. The internal Gateway's Envoy proxies use these dedicated subnets to handle traffic within your VPC network.

  1. Create a subnet in the first cluster's region:

     gcloud compute networks subnets create CLUSTER_1_REGION-subnet \
         --purpose=GLOBAL_MANAGED_PROXY \
         --role=ACTIVE \
         --region=CLUSTER_1_REGION \
         --network=default \
         --range=10.0.0.0/23 \
         --project=PROJECT_ID
  2. Create a subnet in the second cluster's region:

     gcloud compute networks subnets create CLUSTER_2_REGION-subnet \
         --purpose=GLOBAL_MANAGED_PROXY \
         --role=ACTIVE \
         --region=CLUSTER_2_REGION \
         --network=default \
         --range=10.5.0.0/23 \
         --project=PROJECT_ID

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_1_REGION : the region for the first cluster, for example europe-west3 .
    • CLUSTER_2_REGION : the region for the second cluster, for example us-east4 .
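
    To optionally verify that both proxy-only subnets were created with the expected purpose and ranges, you can list them:

     gcloud compute networks subnets list \
         --project=PROJECT_ID \
         --filter="purpose=GLOBAL_MANAGED_PROXY" \
         --format="table(name, region, ipCidrRange)"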

Install the required CRDs

The multi-cluster GKE Inference Gateway uses custom resources such as InferencePool and InferenceObjective . The GKE Gateway API controller manages the InferencePool Custom Resource Definition (CRD). However, you must manually install the InferenceObjective CRD, which is in alpha, on your clusters.

  1. Define context variables for your clusters:

     CLUSTER1_CONTEXT="gke_PROJECT_ID_CLUSTER_1_ZONE_CLUSTER_1_NAME"
     CLUSTER2_CONTEXT="gke_PROJECT_ID_CLUSTER_2_ZONE_CLUSTER_2_NAME"

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_1_ZONE : the zone for the first cluster, for example europe-west3-c .
    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • CLUSTER_2_ZONE : the zone for the second cluster, for example us-east4-a .
    • CLUSTER_2_NAME : the name of the second cluster, for example gke-east .
  2. Install the InferenceObjective CRD on both clusters:

     kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER1_CONTEXT

     kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
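
    To optionally confirm that the InferenceObjective CRD is installed on both clusters, check for the CRD by name. The name comes from the manifest applied in the preceding step:

     kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER1_CONTEXT
     kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER2_CONTEXT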

Deploy resources to the target clusters

To make your AI/ML inference workloads available on each cluster, deploy the required resources, such as the model servers and InferenceObjective custom resources.

  1. Deploy the model servers to both clusters:

     kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER1_CONTEXT

     kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
  2. Deploy the InferenceObjective resources to both clusters. Save the following sample manifest to a file named inference-objective.yaml :

     apiVersion: inference.networking.x-k8s.io/v1alpha2
     kind: InferenceObjective
     metadata:
       name: food-review
     spec:
       priority: 10
       poolRef:
         name: llama3-8b-instruct
         group: "inference.networking.k8s.io"
  3. Apply the manifest to both clusters:

     kubectl apply -f inference-objective.yaml --context=CLUSTER1_CONTEXT
     kubectl apply -f inference-objective.yaml --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
  4. Deploy the InferencePool resources to both clusters by using Helm:

     helm install vllm-llama3-8b-instruct \
         --kube-context CLUSTER1_CONTEXT \
         --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
         --set provider.name=gke \
         --version v1.1.0 \
         oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

     helm install vllm-llama3-8b-instruct \
         --kube-context CLUSTER2_CONTEXT \
         --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
         --set provider.name=gke \
         --set inferenceExtension.monitoring.gke.enabled=true \
         --version v1.1.0 \
         oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
  5. Mark the InferencePool resources as exported on both clusters. This annotation makes the InferencePool available for import by the config cluster, which is a required step for multi-cluster routing.

     kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
         --context=CLUSTER1_CONTEXT

     kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
         --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
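
    Before you move on, you can optionally confirm that the model server Deployments are becoming ready and that the export annotation is present on each InferencePool. Model download and GPU node provisioning can take several minutes, so the Deployments might not report ready immediately:

     kubectl get deployments --context=CLUSTER1_CONTEXT
     kubectl get deployments --context=CLUSTER2_CONTEXT

     kubectl get inferencepool vllm-llama3-8b-instruct --context=CLUSTER1_CONTEXT -o jsonpath='{.metadata.annotations}'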

Deploy resources to the config cluster

To define how traffic is routed and load-balanced across the InferencePool resources in all registered clusters, deploy the Gateway , HTTPRoute , and HealthCheckPolicy resources. You deploy these resources only to the designated config cluster, which is gke-west in this document.

  1. Create a file named mcig.yaml with the following content:

     ---
     apiVersion: gateway.networking.k8s.io/v1
     kind: Gateway
     metadata:
       name: cross-region-gateway
       namespace: default
     spec:
       gatewayClassName: gke-l7-cross-regional-internal-managed-mc
       addresses:
       - type: networking.gke.io/ephemeral-ipv4-address/europe-west3
         value: "europe-west3"
       - type: networking.gke.io/ephemeral-ipv4-address/us-east4
         value: "us-east4"
       listeners:
       - name: http
         protocol: HTTP
         port: 80
     ---
     apiVersion: gateway.networking.k8s.io/v1
     kind: HTTPRoute
     metadata:
       name: vllm-llama3-8b-instruct-default
     spec:
       parentRefs:
       - name: cross-region-gateway
         kind: Gateway
       rules:
       - backendRefs:
         - group: networking.gke.io
           kind: GCPInferencePoolImport
           name: vllm-llama3-8b-instruct
     ---
     apiVersion: networking.gke.io/v1
     kind: HealthCheckPolicy
     metadata:
       name: health-check-policy
       namespace: default
     spec:
       targetRef:
         group: "networking.gke.io"
         kind: GCPInferencePoolImport
         name: vllm-llama3-8b-instruct
       default:
         config:
           type: HTTP
           httpHealthCheck:
             requestPath: /health
             port: 8000
  2. Apply the manifest:

     kubectl apply -f mcig.yaml --context=CLUSTER1_CONTEXT

    Replace CLUSTER1_CONTEXT with the context for the first cluster (the config cluster), for example gke_my-project_europe-west3-c_gke-west .
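
    The multi-cluster Gateway can take several minutes to be programmed. To optionally check its status and the addresses assigned to it, run the following commands against the config cluster:

     kubectl get gateway cross-region-gateway -n default --context=CLUSTER1_CONTEXT

     kubectl describe gateway cross-region-gateway -n default --context=CLUSTER1_CONTEXT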

Enable custom metrics reporting

To improve cross-regional load balancing, export KV cache usage metrics from all clusters. The load balancer uses this exported data as a custom load signal, which allows for more intelligent load-balancing decisions based on each cluster's actual workload.

  1. Create a file named metrics.yaml with the following content:

     apiVersion: autoscaling.gke.io/v1beta1
     kind: AutoscalingMetric
     metadata:
       name: gpu-cache
       namespace: default
     spec:
       selector:
         matchLabels:
           app: vllm-llama3-8b-instruct
       endpoints:
       - port: 8000
         path: /metrics
         metrics:
         - name: vllm:kv_cache_usage_perc # For vLLM versions v0.10.2 and newer
           exportName: kv-cache
         - name: vllm:gpu_cache_usage_perc # For vLLM versions v0.6.2 and newer
           exportName: kv-cache-old
  2. Apply the metrics configuration to both clusters:

     kubectl apply -f metrics.yaml --context=CLUSTER1_CONTEXT
     kubectl apply -f metrics.yaml --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
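
    To optionally confirm that the AutoscalingMetric resources were created on both clusters, you can query the applied manifest directly:

     kubectl get -f metrics.yaml --context=CLUSTER1_CONTEXT
     kubectl get -f metrics.yaml --context=CLUSTER2_CONTEXT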

Configure the load balancing policy

To optimize how your AI/ML inference requests are distributed across your GKE clusters, configure a load balancing policy. Choosing the right balancing mode helps ensure efficient resource utilization, prevents overloading individual clusters, and improves the overall performance and responsiveness of your inference services.

Configure timeouts

If your requests are expected to have long durations, configure a longer timeout for the load balancer. In the GCPBackendPolicy , set the timeoutSec field to at least twice your estimated P99 request latency.

For example, the following manifest sets the load balancer timeout to 100 seconds.

  apiVersion: networking.gke.io/v1
  kind: GCPBackendPolicy
  metadata:
    name: my-backend-policy
  spec:
    targetRef:
      group: "networking.gke.io"
      kind: GCPInferencePoolImport
      name: vllm-llama3-8b-instruct
    default:
      timeoutSec: 100
      balancingMode: CUSTOM_METRICS
      trafficDuration: LONG
      customMetrics:
      - name: gke.named_metrics.kv-cache
        dryRun: false

For more information, see multi-cluster Gateway limitations .

Because the Custom metrics and In-flight requests load balancing modes are mutually exclusive, configure only one of these modes in your GCPBackendPolicy .

Choose a load balancing mode for your deployment.

Custom metrics

For optimal load balancing, start with a target utilization of 60%. To achieve this target, set maxUtilization: 60 in your GCPBackendPolicy 's customMetrics configuration.

  1. Create a file named backend-policy.yaml with the following content to enable load balancing based on the kv-cache custom metric:

     apiVersion: networking.gke.io/v1
     kind: GCPBackendPolicy
     metadata:
       name: my-backend-policy
     spec:
       targetRef:
         group: "networking.gke.io"
         kind: GCPInferencePoolImport
         name: vllm-llama3-8b-instruct
       default:
         balancingMode: CUSTOM_METRICS
         trafficDuration: LONG
         customMetrics:
         - name: gke.named_metrics.kv-cache
           dryRun: false
           maxUtilization: 60
  2. Apply the new policy:

     kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
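
    To optionally inspect the policy after you apply it, you can describe it on the config cluster. The fields that appear in the output depend on the GKE Gateway controller version:

     kubectl describe gcpbackendpolicy my-backend-policy --context=CLUSTER1_CONTEXT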

In-flight requests

To use the in-flight balancing mode, estimate the number of in-flight requests each backend can handle and explicitly configure a capacity value.

  1. Create a file named backend-policy.yaml with the following content to enable load balancing based on the number of in-flight requests:

     kind: GCPBackendPolicy
     apiVersion: networking.gke.io/v1
     metadata:
       name: my-backend-policy
     spec:
       targetRef:
         group: "networking.gke.io"
         kind: GCPInferencePoolImport
         name: vllm-llama3-8b-instruct
       default:
         balancingMode: IN_FLIGHT
         trafficDuration: LONG
         maxInFlightRequestsPerEndpoint: 1000
         dryRun: false
  2. Apply the new policy:

     kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .

Verify the deployment

To verify the internal load balancer, you must send requests from within your VPC network because internal load balancers use private IP addresses. Run a temporary Pod inside one of the clusters to send the test requests:

  1. Start an interactive shell session in a temporary Pod:

     kubectl run -it --rm --image=curlimages/curl curly --context=CLUSTER1_CONTEXT -- /bin/sh

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .

  2. From the new shell, get the Gateway IP address and send a test request:

     GW_IP=$(kubectl get gateway/cross-region-gateway -n default -o jsonpath='{.status.addresses[0].value}')

     curl -i -X POST ${GW_IP}:80/v1/completions \
         -H 'Content-Type: application/json' \
         -d '{
         "model": "food-review-1",
         "prompt": "What is the best pizza in the world?",
         "max_tokens": 100,
         "temperature": 0
         }'

    The following is an example of a successful response:

     {
       "id": "cmpl-...",
       "object": "text_completion",
       "created": 1704067200,
       "model": "food-review-1",
       "choices": [
         {
           "text": "The best pizza in the world is subjective, but many argue for Neapolitan pizza...",
           "index": 0,
           "logprobs": null,
           "finish_reason": "length"
         }
       ],
       "usage": {
         "prompt_tokens": 10,
         "completion_tokens": 100,
         "total_tokens": 110
       }
     }

What's next
