Deploy GKE Inference Gateway

This page describes how to deploy GKE Inference Gateway.

This page is intended for networking specialists who are responsible for managing GKE infrastructure and for platform administrators who manage AI workloads.

Before reading this page, ensure that you're familiar with the following:

About GKE Inference Gateway.
AI/ML orchestration on GKE.
Generative AI glossary.
Load balancing in Google Cloud, especially how load balancers interact with GKE.
GKE Service Extensions. For more information, see the GKE Gateway controller documentation.
Customize GKE Gateway traffic using Service Extensions.

GKE Inference Gateway enhances Google Kubernetes Engine (GKE) Gateway to optimize the serving of generative AI applications and workloads on GKE. It provides efficient management and scaling of AI workloads, enables workload-specific performance objectives such as latency, and enhances resource utilization, observability, and AI safety.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set the compute/zone instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

Enable the Compute Engine API, the Network Services API, and the Model Armor API if needed.

Go to Enable access to APIs and follow the instructions.
Make sure that you have the following roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin.
Ensure your project has sufficient quota for H100 GPUs. To learn more, see Plan GPU quota and Allocation quotas.
Create a Hugging Face account if you don't already have one. You need this account to access the model resources for this tutorial.
Request access to the Llama 3.1 model and generate an access token. Access to this model requires an approved request on Hugging Face, and the deployment fails if access has not been granted.
- Sign the license consent agreement: You must sign the consent agreement to use the Llama 3.1 model. Go to the model's page on Hugging Face, verify your account, and accept the terms.
- Generate an access token: To access the model, you need a Hugging Face token. In your Hugging Face account, go to Your Profile > Settings > Access Tokens, create a new token with at least Read permissions, and copy it to your clipboard.

GKE Gateway controller requirements

GKE version 1.32.3 or later.
Google Cloud CLI version 407.0.0 or later.
Gateway API is supported on VPC-native clusters only.
You must enable a proxy-only subnet.
Your cluster must have the HttpLoadBalancing add-on enabled.
If you are using Istio, you must upgrade Istio to one of the following versions:
- 1.15.2 or later
- 1.14.5 or later
- 1.13.9 or later
If you are using Shared VPC, then in the host project, you need to assign the Compute Network User role to the GKE Service account for the service project.

Restrictions and limitations

The following restrictions and limitations apply:

Multi-cluster Gateways are not supported.
GKE Inference Gateway is only supported on the gke-l7-regional-external-managed and gke-l7-rilb GatewayClass resources.
Cross-region internal Application Load Balancers are not supported.

Configure GKE Inference Gateway

To configure GKE Inference Gateway, consider this example. A team runs vLLM and Llama3 models and actively experiments with two distinct LoRA fine-tuned adapters: "food-review" and "cad-fabricator".

The high-level workflow for configuring GKE Inference Gateway is as follows:

Prepare your environment: set up the necessary infrastructure and components.
Create an inference pool: define a pool of model servers using the InferencePool Custom Resource.
Specify inference objectives: specify inference objectives using the InferenceObjective Custom Resource.
Create the Gateway: expose the inference service using Gateway API.
Create the HTTPRoute: define how HTTP traffic is routed to the inference service.
Send inference requests: make requests to the deployed model.

Prepare your environment

Install Helm.
Create a GKE cluster:
- Create a GKE Autopilot or Standard cluster with version 1.32.3 or later. For a one-click deployment reference setup, see the cluster-toolkit gke-a3-highgpu sample.
- Configure the nodes with your preferred compute family and accelerator.
- Use GKE Inference Quickstart for pre-configured and tested deployment manifests, based on your selected accelerator, model, and performance needs.

Install needed Custom Resource Definitions (CRDs) in your GKE cluster:

For GKE versions 1.34.0-gke.1626000 or later, install only the alpha InferenceObjective CRD:

    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml

For GKE versions earlier than 1.34.0-gke.1626000, install both the v1 InferencePool and alpha InferenceObjective CRDs:
```
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml
```
For more information, see the compatibility matrix.
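
Because the choice between the two commands depends only on the cluster version, it can be scripted. The following is a minimal sketch rather than part of the official procedure: the helper names version_ge and crd_manifest_for and the sample version 1.33.2-gke.1240000 are hypothetical, and versions are assumed to have the form MAJOR.MINOR.PATCH-gke.BUILD. The sketch prints the kubectl command instead of running it.

```shell
# version_ge A B: exit status 0 when version A is greater than or equal to B.
# Assumes versions look like MAJOR.MINOR.PATCH-gke.BUILD (a leading "v" is
# ignored); missing components are treated as 0.
version_ge() {
  printf '%s %s\n' "${1#v}" "${2#v}" | awk '{
    na = split($1, a, /[^0-9]+/)
    nb = split($2, b, /[^0-9]+/)
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      x = a[i] + 0; y = b[i] + 0
      if (x > y) exit 0
      if (x < y) exit 1
    }
    exit 0
  }'
}

# crd_manifest_for VERSION: prints the CRD manifest URL that this page lists
# for the given GKE version.
crd_manifest_for() {
  if version_ge "$1" "1.34.0-gke.1626000"; then
    echo "https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml"
  else
    echo "https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml"
  fi
}

# Hypothetical cluster version, used only for illustration.
CLUSTER_VERSION="1.33.2-gke.1240000"
echo "kubectl apply -f $(crd_manifest_for "$CLUSTER_VERSION")"
```

Comparing the numeric components one by one avoids treating versions as plain strings, where 1.9 would incorrectly sort after 1.34.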

If you are using a GKE version earlier than v1.32.2-gke.1182001 and you want to use Model Armor with GKE Inference Gateway, you must install the traffic and routing extension CRDs:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcptrafficextensions.yaml
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcproutingextensions.yaml

Create a model server and model deployment

This section shows how to deploy a model server and model. The example uses a vLLM model server with a Llama3 model. The deployment is labeled as app:vllm-llama3-8b-instruct. This deployment also uses two LoRA adapters named food-review and cad-fabricator from Hugging Face.

You can adapt this example with your own model server container and model, serving port, and deployment name. You can also configure LoRA adapters in the deployment, or deploy the base model. The following steps describe how to create the necessary Kubernetes resources.

Create a Kubernetes Secret to store your Hugging Face token. This token is used to access the base model and the LoRA adapters:
```
kubectl create secret generic hf-token --from-literal=token=HF_TOKEN
```
Replace HF_TOKEN with your Hugging Face token.
Deploy the model server and model. The following command applies a manifest that defines a Kubernetes Deployment for a vLLM model server with a Llama3 model:
```
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/release-1.0/config/manifests/vllm/gpu-deployment.yaml
```
Note: If you are using Autopilot, you must manually add a nodeSelector to the default gpu-deployment.yaml manifest to schedule GPU workloads. For more information, see Request GPUs in your containers.

Create an inference pool

The InferencePool Kubernetes custom resource defines a group of Pods with a common base large language model (LLM) and compute configuration. The selector field specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods. The targetPort field defines the ports that the model server uses within the Pods. The extensionRef field references an extension service that provides additional capability for the inference pool. The InferencePool enables GKE Inference Gateway to route traffic to your model server Pods.

Before you create the InferencePool, ensure that the Pods that the InferencePool selects are already running.

To create an InferencePool using Helm, run the following command:

    helm install vllm-llama3-8b-instruct \
      --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
      --set provider.name=gke \
      --set inferenceExtension.monitoring.gke.enabled=true \
      --version v1.0.1 \
      oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

Change the following field to match your Deployment:

inferencePool.modelServers.matchLabels.app: the value of the app label used to select your model server Pods.

For monitoring, metrics scraping for Google Cloud Managed Service for Prometheus is enabled by default.

To disable this feature, add the --set inferenceExtension.monitoring.gke.enabled=false flag to the command.
If you use the default monitoring on a GKE Autopilot cluster, you must also add the --set provider.gke.autopilot=true flag.
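
The two monitoring rules above can be expressed as a small helper that assembles the --set flags. This is a minimal sketch: the function name inferencepool_helm_flags is hypothetical, and it only prints the flags described on this page rather than running helm.

```shell
# inferencepool_helm_flags MONITORING AUTOPILOT
# Prints the --set flags for the inferencepool chart. Both arguments are
# "true" or "false". The function name is hypothetical.
inferencepool_helm_flags() {
  monitoring=$1
  autopilot=$2
  flags="--set provider.name=gke"
  flags="$flags --set inferenceExtension.monitoring.gke.enabled=$monitoring"
  # The Autopilot flag is only needed when monitoring stays enabled.
  if [ "$monitoring" = "true" ] && [ "$autopilot" = "true" ]; then
    flags="$flags --set provider.gke.autopilot=true"
  fi
  printf '%s\n' "$flags"
}

# Print the flags for an Autopilot cluster that keeps the default monitoring.
inferencepool_helm_flags true true
```

The printed flags can be pasted into the helm install command in place of the corresponding --set lines.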

The Helm installation automatically installs the necessary timeout policy, the Endpoint Picker, and the Pods needed for observability.

This creates an InferencePool object named vllm-llama3-8b-instruct that references the model endpoint services within the Pods. It also creates a Deployment of the Endpoint Picker named app:vllm-llama3-8b-instruct-epp for this InferencePool.

Specify inference objectives

The InferenceObjective custom resource lets you specify the priority of requests.

The metadata.name field of the InferenceObjective resource specifies the name of the Inference Objective, the priority field specifies its serving criticality, and the poolRef field specifies the InferencePool on which the model is served.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceObjective
    metadata:
      name: NAME
    spec:
      priority: VALUE
      poolRef:
        name: INFERENCE_POOL_NAME
        group: "inference.networking.k8s.io"

Replace the following:

NAME: the name of your Inference Objective. For example, food-review.
VALUE: the priority for the Inference Objective. This is an integer where a higher value indicates a more critical request. For example, 10.
INFERENCE_POOL_NAME: the name of the InferencePool you created in the previous step. For example, vllm-llama3-8b-instruct.
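
If you create several Inference Objectives, filling in the template by hand gets repetitive. The following minimal sketch renders the template from three arguments and rejects a priority that is not an integer. The function name render_objective is hypothetical, and the sketch only prints the manifest.

```shell
# render_objective NAME PRIORITY POOL: prints an InferenceObjective manifest.
# The function name is hypothetical; the fields mirror the template above.
render_objective() {
  # Reject priorities that are not (optionally negative) integers.
  case "$2" in
    ''|-|*[!0-9-]*|?*-*)
      echo "priority must be an integer: $2" >&2
      return 1
      ;;
  esac
  cat <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: $1
spec:
  priority: $2
  poolRef:
    name: $3
    group: "inference.networking.k8s.io"
EOF
}

# Render the food-review example from this page.
render_objective food-review 10 vllm-llama3-8b-instruct
```

The output can be redirected to a file such as inference-objectives.yaml, or piped to kubectl apply -f - once you have reviewed it.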

To create an InferenceObjective, perform the following steps:

Save the following manifest as inference-objectives.yaml. This manifest creates two InferenceObjective resources. The first configures the food-review Inference Objective on the vllm-llama3-8b-instruct InferencePool with a priority of 10. The second configures the llama3-base-model Inference Objective to be served with a higher priority of 20.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceObjective
    metadata:
      name: food-review
    spec:
      priority: 10
      poolRef:
        name: vllm-llama3-8b-instruct
        group: "inference.networking.k8s.io"
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceObjective
    metadata:
      name: llama3-base-model
    spec:
      priority: 20 # Higher priority
      poolRef:
        name: vllm-llama3-8b-instruct

Apply the sample manifest to your cluster:

    kubectl apply -f inference-objectives.yaml

Create the Gateway

The Gateway resource is the entry point for external traffic into your Kubernetes cluster. It defines the listeners that accept incoming connections.

The GKE Inference Gateway works with the following Gateway Classes:

gke-l7-rilb: for regional internal Application Load Balancers.
gke-l7-regional-external-managed: for regional external Application Load Balancers.

For more information, see the Gateway Classes documentation.

To create a Gateway, perform the following steps:

Save the following sample manifest as gateway.yaml:

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: GATEWAY_NAME
    spec:
      gatewayClassName: GATEWAY_CLASS
      listeners:
      - protocol: HTTP
        port: 80
        name: http

Replace the following:

GATEWAY_NAME: a unique name for your Gateway resource. For example, inference-gateway.
GATEWAY_CLASS: the Gateway Class you want to use. For example, gke-l7-regional-external-managed.
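
Because only two GatewayClass resources are supported, it can be worth checking the value before you write the manifest. The following minimal sketch does that check; the function name is_supported_gateway_class is hypothetical, and the list of classes is taken from this page.

```shell
# is_supported_gateway_class CLASS: exit status 0 for the two GatewayClass
# resources that this page lists as supported. The name is hypothetical.
is_supported_gateway_class() {
  case "$1" in
    gke-l7-rilb|gke-l7-regional-external-managed) return 0 ;;
    *) return 1 ;;
  esac
}

GATEWAY_CLASS="gke-l7-regional-external-managed"
if is_supported_gateway_class "$GATEWAY_CLASS"; then
  echo "GatewayClass $GATEWAY_CLASS is supported"
else
  echo "GatewayClass $GATEWAY_CLASS is not supported by GKE Inference Gateway" >&2
fi
```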

Apply the manifest to your cluster:
```
kubectl apply -f gateway.yaml
```

Note: For more information about configuring TLS to secure your Gateway with HTTPS, see the GKE documentation on TLS configuration.

Create the HTTPRoute

The HTTPRoute resource defines how the GKE Gateway routes incoming HTTP requests to backend services, such as your InferencePool. The HTTPRoute resource specifies matching rules (for example, headers or paths) and the backend to which traffic should be forwarded.

To create an HTTPRoute, save the following sample manifest as httproute.yaml:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: HTTPROUTE_NAME
    spec:
      parentRefs:
      - name: GATEWAY_NAME
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: PATH_PREFIX
        backendRefs:
        - name: INFERENCE_POOL_NAME
          group: "inference.networking.k8s.io"
          kind: InferencePool

Replace the following:

HTTPROUTE_NAME: a unique name for your HTTPRoute resource. For example, my-route.
GATEWAY_NAME: the name of the Gateway resource that you created. For example, inference-gateway.
PATH_PREFIX: the path prefix that you use to match incoming requests. For example, / to match all.
INFERENCE_POOL_NAME: the name of the InferencePool resource that you want to route traffic to. For example, vllm-llama3-8b-instruct.

Apply the manifest to your cluster:

    kubectl apply -f httproute.yaml

Send inference request

After you have configured GKE Inference Gateway, you can send inference requests to your deployed model. This lets you generate text based on your input prompt and specified parameters.

To send inference requests, perform the following steps:

Set the following environment variables:
```
export GATEWAY_NAME=GATEWAY_NAME
export PORT_NUMBER=PORT_NUMBER # Use 80 for HTTP
```
Replace the following:
- GATEWAY_NAME: the name of your Gateway resource.
- PORT_NUMBER: the port number you configured in the Gateway.

To get the Gateway endpoint, run the following command:

    echo "Waiting for the Gateway IP address..."
    IP=""
    while [ -z "$IP" ]; do
      IP=$(kubectl get gateway/${GATEWAY_NAME} -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
      if [ -z "$IP" ]; then
        echo "Gateway IP not found, waiting 5 seconds..."
        sleep 5
      fi
    done

    echo "Gateway IP address is: $IP"
    PORT=${PORT_NUMBER}
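
The loop above retries indefinitely, which can hang a script if the Gateway never receives an address. The following minimal sketch shows a bounded variant. The helper name wait_for_value, the stand-in command fake_lookup, and the address 203.0.113.10 are hypothetical; the stand-in replaces the kubectl call so that the example runs without a cluster.

```shell
# wait_for_value ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it prints a non-empty value or ATTEMPTS is exhausted.
# Prints the value and returns 0 on success; returns 1 on timeout.
wait_for_value() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    value=$("$@" 2>/dev/null) || value=""
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
    # Sleep only when another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Stand-in for the kubectl lookup: prints an address on its third call.
COUNTER_FILE=$(mktemp)
echo 0 > "$COUNTER_FILE"
fake_lookup() {
  n=$(( $(cat "$COUNTER_FILE") + 1 ))
  echo "$n" > "$COUNTER_FILE"
  if [ "$n" -ge 3 ]; then
    echo "203.0.113.10"
  fi
}

IP=$(wait_for_value 5 0 fake_lookup)
echo "Gateway IP address is: $IP"
rm -f "$COUNTER_FILE"
```

Against a real cluster, the stand-in would be replaced by the kubectl get command from the loop above, for example with 60 attempts and a 5-second delay.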

To send a request to the /v1/completions endpoint using curl, run the following command. The Authorization header uses double quotes so that the shell expands the gcloud command substitution:

    curl -i -X POST ${IP}:${PORT}/v1/completions \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
      -d '{
        "model": "MODEL_NAME",
        "prompt": "PROMPT_TEXT",
        "max_tokens": MAX_TOKENS,
        "temperature": TEMPERATURE
      }'

Replace the following:

MODEL_NAME: the name of the model or LoRA adapter to use.
PROMPT_TEXT: the input prompt for the model.
MAX_TOKENS: the maximum number of tokens to generate in the response.
TEMPERATURE: controls the randomness of the output. Use the value 0 for deterministic output, or a higher number for more creative output.

The following example shows you how to send a sample request to GKE Inference Gateway:

    curl -i -X POST ${IP}:${PORT}/v1/completions \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -d '{
        "model": "food-review-1",
        "prompt": "What is the best pizza in the world?",
        "max_tokens": 2048,
        "temperature": 0
      }'

Be aware of the following behaviors:

Request body: the request body can include additional parameters like stop and top_p. Refer to the OpenAI API specification for a complete list of options.
Error handling: implement proper error handling in your client code to handle potential errors in the response. For example, check the HTTP status code in the curl response. A non-200 status code generally indicates an error.
Authentication and authorization: for production deployments, secure your API endpoint with authentication and authorization mechanisms. Include the appropriate headers (for example, Authorization) in your requests.
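
As an illustration of the error-handling point, the following minimal sketch maps an HTTP status code to a coarse verdict. The function name classify_status, the output file response.json, and the sample status 503 are hypothetical. The curl call is shown only as a comment so that the example runs without a live Gateway.

```shell
# classify_status CODE: prints a coarse verdict for an HTTP status code.
# The function name and the verdict labels are hypothetical.
classify_status() {
  case "$1" in
    200) echo "ok" ;;
    4??) echo "client-error" ;;
    5??) echo "server-error" ;;
    *)   echo "unexpected" ;;
  esac
}

# Against a live Gateway, the status code could be captured like this:
#   STATUS=$(curl -s -o response.json -w '%{http_code}' -X POST \
#     "${IP}:${PORT}/v1/completions" -H 'Content-Type: application/json' \
#     -d '{"model": "food-review-1", "prompt": "Hello", "max_tokens": 16}')
STATUS=503 # sample value for illustration
echo "HTTP $STATUS: $(classify_status "$STATUS")"
```

A client could retry on a server-error verdict and fail fast on a client-error verdict, though the right policy depends on your workload.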

Compatibility matrix

The following table outlines the compatibility and support matrix for the Gateway API Inference Extension Custom Resource Definitions (CRDs). It details which CRD versions are supported by GKE compared to the open-source (OSS) Gateway API Inference Extension project, including specific version requirements and installation notes.

CRD name	CRD API version	GKE managed support	OSS (Gateway API Inference Extension) support
v1 InferencePool	inference.networking.k8s.io/v1	Supported on GKE 1.32.3 or later. The CRD is installed by default on GKE 1.34.0-gke.1626000 or later.	Supported starting from Gateway API Inference Extension v1.0.0
Alpha InferencePool (deprecated; start with the v1 InferencePool instead)	inference.networking.x-k8s.io/v1alpha2	Supported on GKE 1.32.3 or later. The CRD is not installed by default on GKE; you must install it manually from Gateway API Inference Extension.	Supported starting from Gateway API Inference Extension v0.2.0
Alpha InferenceObjective	inference.networking.x-k8s.io/v1alpha2	GKE doesn't manage InferenceObjective	Supported starting from Gateway API Inference Extension v1.0.0
Alpha InferenceModel (deprecated; start with InferenceObjective instead)	inference.networking.x-k8s.io/v1alpha2	GKE doesn't manage InferenceModel	Supported starting from Gateway API Inference Extension v0.2.0
