Set up the multi-cluster GKE Inference Gateway

This document describes how to set up the multi-cluster Google Kubernetes Engine (GKE) Inference Gateway to intelligently load-balance your AI/ML inference workloads across multiple GKE clusters, which can span different regions. This setup uses Gateway API, Multi Cluster Ingress, and custom resources like InferencePool and InferenceObjective to improve scalability, help ensure high availability, and optimize resource utilization for your model-serving deployments.

To understand this document, be familiar with the following:

This document is for the following personas:

  • Machine learning (ML) engineers, Platform admins and operators, or Data and AI specialists who want to use GKE's container orchestration capabilities for serving AI/ML workloads.
  • Cloud architects or Networking specialists who interact with GKE networking.

To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks .

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Enable the Compute Engine API, the Google Kubernetes Engine API, the Model Armor API, and the Network Services API.

    Go to Enable access to APIs and follow the instructions.

  • Enable the Autoscaling API.

    Go to Autoscaling API and follow the instructions.

  • Hugging Face prerequisites:

    • Create a Hugging Face account if you don't already have one.
    • Request and get approval for access to the Llama 3.1 model on Hugging Face.
    • Sign the license consent agreement on the model's page on Hugging Face.
    • Generate a Hugging Face access token with at least Read permissions.

Requirements

  • Ensure your project has sufficient quota for H100 GPUs. For more information, see Plan GPU quota and Allocation quotas .
  • Use GKE version 1.34.1-gke.1127000 or later.
  • Use gcloud CLI version 480.0.0 or later.
  • Your node service accounts must have permissions to write metrics to the Autoscaling API.
  • You must have the following IAM roles on the project: roles/container.admin and roles/iam.serviceAccountAdmin .

Set up multi-cluster Inference Gateway

To set up the multi-cluster GKE Inference Gateway, follow these steps:

Create clusters and node pools

To host your AI/ML inference workloads and enable cross-regional load balancing, create two GKE clusters in different regions, each with an H100 GPU node pool.

  1. Create the first cluster:

     gcloud container clusters create CLUSTER_1_NAME \
         --region LOCATION \
         --project=PROJECT_ID \
         --gateway-api=standard \
         --release-channel "rapid" \
         --cluster-version=GKE_VERSION \
         --machine-type="MACHINE_TYPE" \
         --disk-type="DISK_TYPE" \
         --enable-managed-prometheus \
         --monitoring=SYSTEM,DCGM \
         --hpa-profile=performance \
         --async # Allows the command to return immediately

    Replace the following:

    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • LOCATION : the region for the first cluster, for example europe-west3 .
    • PROJECT_ID : your project ID.
    • GKE_VERSION : the GKE version to use, for example 1.34.1-gke.1127000 .
    • MACHINE_TYPE : the machine type for the cluster nodes, for example c2-standard-16 .
    • DISK_TYPE : the disk type for the cluster nodes, for example pd-standard .
  2. Create an H100 node pool for the first cluster:

     gcloud container node-pools create NODE_POOL_NAME \
         --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
         --project=PROJECT_ID \
         --location=CLUSTER_1_ZONE \
         --node-locations=CLUSTER_1_ZONE \
         --cluster=CLUSTER_1_NAME \
         --machine-type=NODE_POOL_MACHINE_TYPE \
         --num-nodes=NUM_NODES \
         --spot \
         --async # Allows the command to return immediately

    Replace the following:

    • NODE_POOL_NAME : the name of the node pool, for example h100 .
    • PROJECT_ID : your project ID.
    • CLUSTER_1_ZONE : the zone for the first cluster, for example europe-west3-c .
    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • NODE_POOL_MACHINE_TYPE : the machine type for the node pool, for example a3-highgpu-2g .
    • NUM_NODES : the number of nodes in the node pool, for example 3 .
  3. Get the credentials:

     gcloud container clusters get-credentials CLUSTER_1_NAME \
         --location CLUSTER_1_ZONE \
         --project=PROJECT_ID

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • CLUSTER_1_ZONE : the zone for the first cluster, for example europe-west3-c .
  4. On the first cluster, create a secret for the Hugging Face token:

     kubectl create secret generic hf-token \
         --from-literal=token=HF_TOKEN

    Replace HF_TOKEN with your Hugging Face access token.

  5. Create the second cluster in a different region from the first cluster:

     gcloud container clusters create gke-east \
         --region LOCATION \
         --project=PROJECT_ID \
         --gateway-api=standard \
         --release-channel "rapid" \
         --cluster-version=GKE_VERSION \
         --machine-type="MACHINE_TYPE" \
         --disk-type="DISK_TYPE" \
         --enable-managed-prometheus \
         --monitoring=SYSTEM,DCGM \
         --hpa-profile=performance \
         --async # Allows the command to return immediately while the cluster is created in the background.

    Replace the following:

    • LOCATION : the region for the second cluster. This must be a different region than the first cluster. For example, us-east4 .
    • PROJECT_ID : your project ID.
    • GKE_VERSION : the GKE version to use, for example 1.34.1-gke.1127000 .
    • MACHINE_TYPE : the machine type for the cluster nodes, for example c2-standard-16 .
    • DISK_TYPE : the disk type for the cluster nodes, for example pd-standard .
  6. Create an H100 node pool for the second cluster:

     gcloud container node-pools create h100 \
         --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
         --project=PROJECT_ID \
         --location=CLUSTER_2_ZONE \
         --node-locations=CLUSTER_2_ZONE \
         --cluster=CLUSTER_2_NAME \
         --machine-type=NODE_POOL_MACHINE_TYPE \
         --num-nodes=NUM_NODES \
         --spot \
         --async # Allows the command to return immediately

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_2_ZONE : the zone for the second cluster, for example us-east4-a .
    • CLUSTER_2_NAME : the name of the second cluster, for example gke-east .
    • NODE_POOL_MACHINE_TYPE : the machine type for the node pool, for example a3-highgpu-2g .
    • NUM_NODES : the number of nodes in the node pool, for example 3 .
  7. For the second cluster, get credentials and create a secret for the Hugging Face token:

     gcloud container clusters get-credentials CLUSTER_2_NAME \
         --location CLUSTER_2_ZONE \
         --project=PROJECT_ID

     kubectl create secret generic hf-token \
         --from-literal=token=HF_TOKEN

    Replace the following:

    • CLUSTER_2_NAME : the name of the second cluster, for example gke-east .
    • CLUSTER_2_ZONE : the zone for the second cluster, for example us-east4-a .
    • PROJECT_ID : your project ID.
    • HF_TOKEN : your Hugging Face access token.
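
    Because the cluster and node pool commands in this section use the --async flag, they return before provisioning finishes. Before you continue, you can optionally confirm that the clusters and their node pools have reached the RUNNING status. The following commands are a quick check that reuses the placeholders from the preceding steps:

     gcloud container clusters list --project=PROJECT_ID

     gcloud container node-pools list \
         --cluster=CLUSTER_1_NAME \
         --location=CLUSTER_1_ZONE \
         --project=PROJECT_ID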

Register clusters to a fleet

To enable multi-cluster capabilities, such as the multi-cluster GKE Inference Gateway, register your clusters to a fleet.

  1. Register both clusters to your project's fleet:

     gcloud container fleet memberships register CLUSTER_1_NAME \
         --gke-cluster CLUSTER_1_ZONE/CLUSTER_1_NAME \
         --location=global \
         --project=PROJECT_ID

     gcloud container fleet memberships register CLUSTER_2_NAME \
         --gke-cluster CLUSTER_2_ZONE/CLUSTER_2_NAME \
         --location=global \
         --project=PROJECT_ID

    Replace the following:

    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • CLUSTER_1_ZONE : the zone for the first cluster, for example europe-west3-c .
    • PROJECT_ID : your project ID.
    • CLUSTER_2_NAME : the name of the second cluster, for example gke-east .
    • CLUSTER_2_ZONE : the zone for the second cluster, for example us-east4-a .
  2. To allow a single Gateway to manage traffic across multiple clusters, enable the multi-cluster Ingress feature and designate a config cluster:

     gcloud container fleet ingress enable \
         --config-membership=projects/PROJECT_ID/locations/global/memberships/CLUSTER_1_NAME

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
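
    To optionally confirm that both clusters are registered and that the multi-cluster Ingress feature is active, you can list the fleet memberships and describe the feature. The exact output depends on your gcloud CLI version:

     gcloud container fleet memberships list --project=PROJECT_ID

     gcloud container fleet ingress describe --project=PROJECT_ID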

Create proxy-only subnets

For an internal gateway, create a proxy-only subnet in each region. The internal Gateway's Envoy proxies use these dedicated subnets to handle traffic within your VPC network.

  1. Create a subnet in the first cluster's region:

     gcloud compute networks subnets create CLUSTER_1_REGION-subnet \
         --purpose=GLOBAL_MANAGED_PROXY \
         --role=ACTIVE \
         --region=CLUSTER_1_REGION \
         --network=default \
         --range=10.0.0.0/23 \
         --project=PROJECT_ID
  2. Create a subnet in the second cluster's region:

     gcloud compute networks subnets create CLUSTER_2_REGION-subnet \
         --purpose=GLOBAL_MANAGED_PROXY \
         --role=ACTIVE \
         --region=CLUSTER_2_REGION \
         --network=default \
         --range=10.5.0.0/23 \
         --project=PROJECT_ID

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_1_REGION : the region for the first cluster, for example europe-west3 .
    • CLUSTER_2_REGION : the region for the second cluster, for example us-east4 .
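
    To optionally verify that both proxy-only subnets were created with the expected purpose and ranges, you can list them:

     gcloud compute networks subnets list \
         --project=PROJECT_ID \
         --filter="purpose=GLOBAL_MANAGED_PROXY" \
         --format="table(name, region, ipCidrRange)"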

Install the required CRDs

The multi-cluster GKE Inference Gateway uses custom resources such as InferencePool and InferenceObjective . The GKE Gateway API controller manages the InferencePool Custom Resource Definition (CRD). However, you must manually install the InferenceObjective CRD, which is in alpha, on your clusters.

  1. Define context variables for your clusters:

     CLUSTER1_CONTEXT="gke_PROJECT_ID_CLUSTER_1_ZONE_CLUSTER_1_NAME"
     CLUSTER2_CONTEXT="gke_PROJECT_ID_CLUSTER_2_ZONE_CLUSTER_2_NAME"

    Replace the following:

    • PROJECT_ID : your project ID.
    • CLUSTER_1_ZONE : the zone for the first cluster, for example europe-west3-c .
    • CLUSTER_1_NAME : the name of the first cluster, for example gke-west .
    • CLUSTER_2_ZONE : the zone for the second cluster, for example us-east4-a .
    • CLUSTER_2_NAME : the name of the second cluster, for example gke-east .
  2. Install the InferenceObjective CRD on both clusters:

     kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER1_CONTEXT

     kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
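
    To optionally confirm that the InferenceObjective CRD is installed on both clusters, check for the CRD by name. The name comes from the manifest applied in the preceding step:

     kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER1_CONTEXT
     kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER2_CONTEXT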

Deploy resources to the target clusters

To make your AI/ML inference workloads available on each cluster, deploy the required resources, such as the model servers and InferenceObjective custom resources.

  1. Deploy the model servers to both clusters:

     kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER1_CONTEXT

     kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
  2. Deploy the InferenceObjective resources to both clusters. Save the following sample manifest to a file named inference-objective.yaml :

     apiVersion: inference.networking.x-k8s.io/v1alpha2
     kind: InferenceObjective
     metadata:
       name: food-review
     spec:
       priority: 10
       poolRef:
         name: llama3-8b-instruct
         group: "inference.networking.k8s.io"
  3. Apply the manifest to both clusters:

     kubectl apply -f inference-objective.yaml --context=CLUSTER1_CONTEXT
     kubectl apply -f inference-objective.yaml --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
  4. Deploy the InferencePool resources to both clusters by using Helm:

     helm install vllm-llama3-8b-instruct \
         --kube-context CLUSTER1_CONTEXT \
         --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
         --set provider.name=gke \
         --version v1.1.0 \
         oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

     helm install vllm-llama3-8b-instruct \
         --kube-context CLUSTER2_CONTEXT \
         --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
         --set provider.name=gke \
         --set inferenceExtension.monitoring.gke.enabled=true \
         --version v1.1.0 \
         oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
  5. Mark the InferencePool resources as exported on both clusters. This annotation makes the InferencePool available for import by the config cluster, which is a required step for multi-cluster routing.

     kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
         --context=CLUSTER1_CONTEXT

     kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
         --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
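
    Before you move on, you can optionally confirm that the model server Deployments are becoming ready and that the export annotation is present on each InferencePool. Model download and GPU node provisioning can take several minutes, so the Deployments might not report ready immediately:

     kubectl get deployments --context=CLUSTER1_CONTEXT
     kubectl get deployments --context=CLUSTER2_CONTEXT

     kubectl get inferencepool vllm-llama3-8b-instruct --context=CLUSTER1_CONTEXT -o jsonpath='{.metadata.annotations}'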

Deploy resources to the config cluster

To define how traffic is routed and load-balanced across the InferencePool resources in all registered clusters, deploy the Gateway , HTTPRoute , and HealthCheckPolicy resources. You deploy these resources only to the designated config cluster, which is gke-west in this document.

  1. Create a file named mcig.yaml with the following content:

     ---
     apiVersion: gateway.networking.k8s.io/v1
     kind: Gateway
     metadata:
       name: cross-region-gateway
       namespace: default
     spec:
       gatewayClassName: gke-l7-cross-regional-internal-managed-mc
       addresses:
       - type: networking.gke.io/ephemeral-ipv4-address/europe-west3
         value: "europe-west3"
       - type: networking.gke.io/ephemeral-ipv4-address/us-east4
         value: "us-east4"
       listeners:
       - name: http
         protocol: HTTP
         port: 80
     ---
     apiVersion: gateway.networking.k8s.io/v1
     kind: HTTPRoute
     metadata:
       name: vllm-llama3-8b-instruct-default
     spec:
       parentRefs:
       - name: cross-region-gateway
         kind: Gateway
       rules:
       - backendRefs:
         - group: networking.gke.io
           kind: GCPInferencePoolImport
           name: vllm-llama3-8b-instruct
     ---
     apiVersion: networking.gke.io/v1
     kind: HealthCheckPolicy
     metadata:
       name: health-check-policy
       namespace: default
     spec:
       targetRef:
         group: "networking.gke.io"
         kind: GCPInferencePoolImport
         name: vllm-llama3-8b-instruct
       default:
         config:
           type: HTTP
           httpHealthCheck:
             requestPath: /health
             port: 8000
  2. Apply the manifest:

     kubectl apply -f mcig.yaml --context=CLUSTER1_CONTEXT

    Replace CLUSTER1_CONTEXT with the context for the first cluster (the config cluster), for example gke_my-project_europe-west3-c_gke-west .
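
    The multi-cluster Gateway can take several minutes to be programmed. To optionally check its status and the addresses assigned to it, run the following commands against the config cluster:

     kubectl get gateway cross-region-gateway -n default --context=CLUSTER1_CONTEXT

     kubectl describe gateway cross-region-gateway -n default --context=CLUSTER1_CONTEXT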

Enable custom metrics reporting

To improve cross-regional load balancing, export KV cache usage metrics from all clusters. The load balancer uses this exported data as a custom load signal, which allows for more intelligent load-balancing decisions based on each cluster's actual workload.

  1. Create a file named metrics.yaml with the following content:

     apiVersion: autoscaling.gke.io/v1beta1
     kind: AutoscalingMetric
     metadata:
       name: gpu-cache
       namespace: default
     spec:
       selector:
         matchLabels:
           app: vllm-llama3-8b-instruct
       endpoints:
       - port: 8000
         path: /metrics
         metrics:
         - name: vllm:kv_cache_usage_perc # For vLLM versions v0.10.2 and newer
           exportName: kv-cache
         - name: vllm:gpu_cache_usage_perc # For vLLM versions v0.6.2 and newer
           exportName: kv-cache-old
  2. Apply the metrics configuration to both clusters:

     kubectl apply -f metrics.yaml --context=CLUSTER1_CONTEXT
     kubectl apply -f metrics.yaml --context=CLUSTER2_CONTEXT

    Replace the following:

    • CLUSTER1_CONTEXT : the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
    • CLUSTER2_CONTEXT : the context for the second cluster, for example gke_my-project_us-east4-a_gke-east .
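
    To optionally confirm that the AutoscalingMetric resources were created on both clusters, you can query the applied manifest directly:

     kubectl get -f metrics.yaml --context=CLUSTER1_CONTEXT
     kubectl get -f metrics.yaml --context=CLUSTER2_CONTEXT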

Configure the load balancing policy

To optimize how your AI/ML inference requests are distributed across your GKE clusters, configure a load balancing policy. Choosing the right balancing mode helps ensure efficient resource utilization, prevents overloading individual clusters, and improves the overall performance and responsiveness of your inference services.

Configure timeouts

If your requests are expected to have long durations, configure a longer timeout for the load balancer. In the GCPBackendPolicy , set the timeoutSec field to at least twice your estimated P99 request latency.

For example, the following manifest sets the load balancer timeout to 100 seconds.

  apiVersion: networking.gke.io/v1
  kind: GCPBackendPolicy
  metadata:
    name: my-backend-policy
  spec:
    targetRef:
      group: "networking.gke.io"
      kind: GCPInferencePoolImport
      name: vllm-llama3-8b-instruct
    default:
      timeoutSec: 100
      balancingMode: CUSTOM_METRICS
      trafficDuration: LONG
      customMetrics:
      - name: gke.named_metrics.kv-cache
        dryRun: false

For more information, see multi-cluster Gateway limitations .

Because the Custom metrics and In-flight requests load balancing modes are mutually exclusive, configure only one of these modes in your GCPBackendPolicy .

Choose a load balancing mode for your deployment.

Custom metrics

For optimal load balancing, start with a target utilization of 60%. To achieve this target, set maxUtilization: 60 in your GCPBackendPolicy 's customMetrics configuration.

  1. Create a file named backend-policy.yaml with the following content to enable load balancing based on the kv-cache custom metric:

     apiVersion: networking.gke.io/v1
     kind: GCPBackendPolicy
     metadata:
       name: my-backend-policy
     spec:
       targetRef:
         group: "networking.gke.io"
         kind: GCPInferencePoolImport
         name: vllm-llama3-8b-instruct
       default:
         balancingMode: CUSTOM_METRICS
         trafficDuration: LONG
         customMetrics:
         - name: gke.named_metrics.kv-cache
           dryRun: false
           maxUtilization: 60
  2. Apply the new policy:

     kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .
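
    To optionally inspect the policy after you apply it, you can describe it on the config cluster. The fields that appear in the output depend on the GKE Gateway controller version:

     kubectl describe gcpbackendpolicy my-backend-policy --context=CLUSTER1_CONTEXT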

In-flight requests

To use the in-flight balancing mode, estimate the number of in-flight requests each backend can handle and explicitly configure a capacity value.

  1. Create a file named backend-policy.yaml with the following content to enable load balancing based on the number of in-flight requests:

     kind: GCPBackendPolicy
     apiVersion: networking.gke.io/v1
     metadata:
       name: my-backend-policy
     spec:
       targetRef:
         group: "networking.gke.io"
         kind: GCPInferencePoolImport
         name: vllm-llama3-8b-instruct
       default:
         balancingMode: IN_FLIGHT
         trafficDuration: LONG
         maxInFlightRequestsPerEndpoint: 1000
         dryRun: false
  2. Apply the new policy:

     kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .

Verify the deployment

To verify the internal load balancer, you must send requests from within your VPC network because internal load balancers use private IP addresses. Run a temporary Pod inside one of the clusters to send the test requests:

  1. Start an interactive shell session in a temporary Pod:

     kubectl run -it --rm --image=curlimages/curl curly --context=CLUSTER1_CONTEXT -- /bin/sh

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west .

  2. From the new shell, get the Gateway IP address and send a test request:

     GW_IP=$(kubectl get gateway/cross-region-gateway -n default -o jsonpath='{.status.addresses[0].value}')

     curl -i -X POST ${GW_IP}:80/v1/completions \
         -H 'Content-Type: application/json' \
         -d '{
         "model": "food-review-1",
         "prompt": "What is the best pizza in the world?",
         "max_tokens": 100,
         "temperature": 0
         }'

    The following is an example of a successful response:

     {
       "id": "cmpl-...",
       "object": "text_completion",
       "created": 1704067200,
       "model": "food-review-1",
       "choices": [
         {
           "text": "The best pizza in the world is subjective, but many argue for Neapolitan pizza...",
           "index": 0,
           "logprobs": null,
           "finish_reason": "length"
         }
       ],
       "usage": {
         "prompt_tokens": 10,
         "completion_tokens": 100,
         "total_tokens": 110
       }
     }

What's next
