Inference serving is critical in deploying your generative AI models to real-world applications. GKE provides a robust and scalable platform for managing your containerized workloads, making it a compelling choice for serving your models in development or production. With GKE, you can use Kubernetes' capabilities for orchestration, scaling, and high availability to efficiently deploy and manage your inference services.
Recognizing the specific demands of AI/ML inference, Google Cloud has introduced GKE Gen AI capabilities, a suite of features specifically designed to enhance and optimize inference serving on GKE. For more information about specific features, see GKE Gen AI capabilities.
Get started with AI/ML model inference on GKE
You can start exploring AI/ML model inference on GKE in minutes. GKE's free tier lets you get started with Kubernetes without incurring costs for cluster management.
- Try the Deploy Models steps to deploy a containerized model and model server.
- Read Overview of inference best practices on GKE, which has guidance and resources for planning and running your inference workloads on GKE.
Terminology
This page uses the following terminology related to inference on GKE:
- Inference: the process of running a generative AI model, such as a large language model or diffusion model, within a GKE cluster to generate text, embeddings, or other outputs from input data. Model inference on GKE leverages accelerators to efficiently handle complex computations for real-time or batch processing.
- Model: a generative AI model that has learned patterns from data and is used for inference. Models vary in size and architecture, from smaller domain-specific models to massive multi-billion parameter neural networks that are optimized for diverse language tasks.
- Model server: a containerized service responsible for receiving inference requests and returning inferences. This service might be a Python app, or a more robust solution such as vLLM, JetStream, TensorFlow Serving, or Triton Inference Server. The model server loads models into memory and executes computations on accelerators to return inferences efficiently.
- Accelerator: specialized hardware, such as Graphics Processing Units (GPUs) from NVIDIA and Tensor Processing Units (TPUs) from Google, that can be attached to GKE nodes to speed up computations, particularly for training and inference tasks.
- Quantization: a technique used to reduce the size of AI/ML models and improve inference speed by converting model weights and activations from higher-precision data types to lower-precision data types, as shown in the sketch after this list.
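For intuition, the following minimal Python sketch (using NumPy, and not specific to GKE or any particular model server) illustrates symmetric 8-bit quantization: float32 weights are mapped to int8 values and scaled back, trading a small amount of precision for a 4x reduction in size.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 quantization: map float32 weights to [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return quantized, float(scale)

def dequantize_int8(quantized: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 values."""
    return quantized.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)  # toy "model weights"
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
print("max error:", np.abs(weights - approx).max())  # small quantization error
print("size ratio:", weights.nbytes / q.nbytes)      # 4.0 (float32 -> int8)
```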
Benefits of GKE for inference
Inference serving on GKE provides several benefits:
- Efficient price-performance: get value and speed for your inference serving needs. GKE lets you choose from a range of powerful accelerators (GPUs and TPUs), so you only pay for the performance you need.
- Faster deployment: accelerate your time to market with the tailored best practices and qualifications provided by GKE Gen AI capabilities.
- Scalable performance: scale out performance with prebuilt monitoring by using GKE Inference Gateway, Horizontal Pod Autoscaling (HPA), and custom metrics. You can run a range of pre-trained or custom models, from 8 billion parameters up to 671 billion parameters.
- Full portability: benefit from full portability with open standards. Google contributes to key Kubernetes APIs, including Gateway and LeaderWorkerSet, and all APIs are portable across Kubernetes distributions.
- Ecosystem support: build on GKE's robust ecosystem, which supports tools like Kueue for advanced resource queuing and management, and Ray for distributed computing, to facilitate scalable and efficient model training and inference.
How inference on GKE works
This section describes, at a high level, the steps to use GKE for inference serving:
1. Containerize your model: to containerize an application is to create a container image, which is an executable package that includes everything needed to run the application: code, runtime, system tools, system libraries, and settings. A simple application can be containerized as a single unit, while a more complex application can be broken up into multiple containerized components. Deploy a model by containerizing the model server (such as vLLM) and loading model weights from Cloud Storage or a repository like Hugging Face. When you use GKE Inference Quickstart, the containerized image is automatically managed in the manifest for you.
2. Create a GKE cluster: create a GKE cluster to host your deployment. Choose Autopilot for a managed experience or Standard for customization. Configure the cluster size, node types, and accelerators. For an optimized configuration, use Inference Quickstart.
3. Deploy your model as a Kubernetes Deployment: create a Kubernetes Deployment to manage your inference service. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster. Specify the Docker image, replicas, and settings. Kubernetes pulls the image and runs your containers on the GKE cluster nodes. Configure the Pods with your model server and model, including LoRA adapters if needed. (For a sketch of this step, see the Deployment example after this list.)
4. Expose your inference service: make your inference service accessible by creating a Kubernetes Service that provides a network endpoint for your Deployment. Use Inference Gateway for intelligent load balancing and routing tailored specifically for generative AI inference workloads, or see the comparison of load balancing strategies to choose the best option for your needs.
5. Handle inference requests: send data from your application clients to your Service's endpoint in the expected format (JSON, gRPC). If you use a load balancer, it distributes requests to model replicas. The model server processes the request, runs the model, and returns the inference. (For a sketch of this step, see the request example after this list.)
6. Scale and monitor your inference deployment: scale inference with the Horizontal Pod Autoscaler (HPA) to automatically adjust replicas based on metrics such as CPU utilization or latency. The HPA is a Kubernetes controller that automatically increases or decreases the number of Pods in a workload (such as a Deployment) based on observed metrics, including CPU utilization or custom metrics. Use Inference Quickstart to get auto-generated scaling recommendations. To track performance, use Cloud Monitoring and Cloud Logging with prebuilt observability, including dashboards for popular model servers like vLLM. (For a sketch of this step, see the autoscaling example after this list.)
For detailed examples that use specific models, model servers, and accelerators, see Inference examples.
GKE Gen AI capabilities
You can use these capabilities together or individually to address key challenges in serving generative AI models and to improve resource utilization within your GKE environment, at no additional cost.
GKE Inference Quickstart
Analyze the performance and cost-efficiency of your inference workloads. Specify your business needs and get tailored best practices for the combination of accelerators, scaling and storage configurations, and model servers that best meets your needs. You can access this service with the gcloud CLI and the Google Cloud console.
- Saves time by automating the initial steps of choosing and configuring your infrastructure.
- Lets you maintain full control over your Kubernetes setup for further tuning.
For more information, see Analyze model serving performance and costs with GKE Inference Quickstart.
GKE Inference Gateway
Get routing based on metrics, such as KV cache utilization, for lower latency.
- Share fine-tuned models that use LoRA files, with affinity-based endpoint picking for cost-efficiency.
- Achieve high availability by dynamically accessing GPU and TPU capacity across regions.
- Enhance the security of your models with Model Armor add-on policies.
For more information, see About GKE Inference Gateway.
Fast model weight loading
Optimize inference startup time by minimizing model weight loading latency on GKE. You can choose from several storage options:
- Access data in Cloud Storage quickly by using Cloud Storage FUSE with caching and parallel downloads. For deployments with limited node scaling, consider using Cloud Storage FUSE to mount model weights. For more information about using Cloud Storage FUSE for AI/ML workloads, see the reference architecture.
- Google Cloud Managed Lustre is a high-performance, fully managed parallel file system optimized for AI that can attach to 10,000 or more Pods. For inference workloads that demand consistent scale-out performance, Managed Lustre supports high-throughput, low-latency file access from multiple Pods simultaneously. For more information about using Managed Lustre for AI/ML workloads, see the reference architecture.
- Google Cloud Hyperdisk ML is a network-attached disk that can be attached to up to 2,500 Pods. For massive-scale scenarios that demand consistent, low-latency access to large model weights, Hyperdisk ML offers a dedicated block storage solution.
Inference performance metrics
To optimize your inference workloads, it's important to understand how to measure their performance. The following table describes the key metrics for benchmarking inference performance on GKE.
| Metric | Formula |
|---|---|
| Normalized time per output token (NTPOT) | request_latency / total_output_tokens |
| Time per output token (TPOT) | (request_latency - time_to_first_token) / (total_output_tokens - 1) |
| Output tokens per second | total_output_tokens_generated_by_server / elapsed_time_in_seconds |
| Input tokens per second | total_input_tokens_generated_by_server / elapsed_time_in_seconds |
| Tokens per second | total_tokens_generated_by_server / elapsed_time_in_seconds |

Tokens per second counts both input and output tokens, which helps you compare workloads with high prefill versus high decode times.
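As a quick illustration of how these formulas fit together, the following sketch computes the metrics for a single request. The timing and token counts are made-up example values, and for a single request the elapsed time is simply the request latency.

```python
# Illustrative metric calculations for one request (all values are made up).
time_to_first_token = 0.35   # seconds until the first output token
request_latency = 4.85       # seconds from request start to last token
total_input_tokens = 512
total_output_tokens = 180

ntpot = request_latency / total_output_tokens
tpot = (request_latency - time_to_first_token) / (total_output_tokens - 1)
output_tokens_per_sec = total_output_tokens / request_latency
input_tokens_per_sec = total_input_tokens / request_latency
tokens_per_sec = (total_input_tokens + total_output_tokens) / request_latency

print(f"NTPOT: {ntpot * 1000:.1f} ms/token")
print(f"TPOT:  {tpot * 1000:.1f} ms/token")
print(f"Throughput: {tokens_per_sec:.1f} tokens/s (input + output)")
```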
Planning for inference
A successful inference deployment requires careful planning across several key areas, including cost-efficiency, performance, and resource obtainability. For detailed recommendations on how to build a scalable, performant, and cost-effective inference platform, see Overview of inference best practices on GKE.
Try inference examples
Find GKE deployment examples for generative AI models, accelerators, and model servers. If you are just getting started, we recommend exploring the Serve Gemma open models using GPUs on GKE with vLLM tutorial.
Or, search for a tutorial by keyword:
| Accelerator | Model Server | Tutorial |
|---|---|---|
| GPUs | vLLM | Serve LLMs like DeepSeek-R1 671B or Llama 3.1 405B on GKE |
| GPUs | vLLM | Serve Gemma open models using GPUs on GKE with vLLM |
| GPUs | vLLM | Serve an LLM with GKE Inference Gateway |
| GPUs | vLLM | Serve open LLMs on GKE with a pre-configured architecture |
| GPUs | Ray Serve | Serve an LLM on L4 GPUs with Ray |
| GPUs | TGI | Serve an LLM with multiple GPUs in GKE |
| GPUs | TorchServe | Serve T5 on GKE with TorchServe |
| TPUs | vLLM | Serve an LLM using TPU Trillium on GKE with vLLM |
| TPUs | vLLM | Serve an LLM using TPUs on GKE with KubeRay |
| TPUs | MaxDiffusion | Serve Stable Diffusion XL (SDXL) using TPUs on GKE with MaxDiffusion |
| TPUs | vLLM | Serve LLMs using multi-host TPUs |
| TPUs | vLLM | Serve open LLMs on TPUs with a pre-configured architecture |
What's next
- Visit the AI/ML orchestration on GKE portal to explore our official guides, tutorials, and use cases for running AI/ML workloads on GKE.
- Explore experimental samples for leveraging GKE to accelerate your AI/ML initiatives in GKE AI Labs.

