AI Pre-trained Inference GKE cluster and workload

To create an AI application, provision a secure, private Google Kubernetes Engine (GKE) cluster optimized for AI workloads, and then deploy your workload using a Helm chart. This guide describes the following templates, which you can customize to deploy an AI application:

  • AI Pre-trained Inference GKE cluster: create the foundational infrastructure required for high-performance model serving. This template sets up a secure, private GKE cluster optimized for AI inference.

  • AI Pre-trained Inference GKE workload (Preview): deploy a Helm chart that includes the configuration for an AI workload. Use the Helm chart to deploy a pre-trained Gemma model using the vLLM serving engine. The workload is configured with GPU resource requests and a Horizontal Pod Autoscaler (HPA) that scales based on GPU cache usage.

For example, you might deploy the cluster and workload templates to address the following business needs:

  • Real-time video analysis
    Business need: A security firm needs to process video streams from hundreds of cameras to detect anomalies or specific objects in real time.
    Implementation: Deploy video processing models on the GPU-enabled node pool. GPUs can handle the high-throughput, low-latency demands of concurrent video streams.

  • Specialized document processing
    Business need: An insurance company needs to automatically extract information from thousands of daily claim forms that contain varied layouts and handwriting.
    Implementation: Use the GKE cluster to host custom models, and ensure that data never leaves the secure environment during processing.

  • High-volume recommendation engine
    Business need: An ecommerce platform needs to serve personalized product recommendations to users during peak holiday shopping events.
    Implementation: Use the Google Kubernetes Engine Gateway API to route high volumes of user traffic to the recommendation models. The Gateway API can handle sudden traffic spikes without latency degradation.

Architecture

The following image shows the components and connections in the template:

A cluster connected to a node pool in the design canvas

The following describes the component configurations in this template:

  • GKE Standard cluster: a secure and private cluster where your AI workload runs.

    The following list describes the cluster configuration in this template:

      • node_locations is set to ["us-central1-a", "us-central1-b", "us-central1-c"]. Spreading the cluster's nodes across three zones in the us-central1 region ensures high availability and resilience.

      • enable_intranode_visibility is set to true. Pod-to-pod traffic on the same node becomes visible in VPC Flow Logs, which supports network monitoring, troubleshooting, and security analysis.

      • gateway_api_config is enabled with {"channel":"CHANNEL_STANDARD"}. The GKE Gateway API helps you manage ingress traffic to your Kubernetes services, with fine-grained routing, advanced load balancing, and centralized policy attachment.

      • private_cluster_config.enable_private_nodes is set to true and private_cluster_config.enable_private_endpoint is set to false, with control_plane_endpoints_config.dns_endpoint_config.allow_external_traffic set to true. The worker nodes where your AI models run have private IP addresses, which isolates them from the public internet, while the GKE control plane remains publicly accessible so that you can manage the cluster from outside your Virtual Private Cloud (VPC) network.

      • release_channel is set to {"channel":"REGULAR"}. Your GKE cluster receives stable and predictable updates, balancing new features with reliability.
  • GKE node pool: a group of worker nodes that run the application's containers.

    The following list describes the node pool configuration in this template:

      • autoscaling.min_node_count is set to 0 and autoscaling.max_node_count is set to 3 (the default is 100). The node pool can scale down completely when no AI workloads are running, which reduces costs during idle periods, and the upper limit helps control costs and resource consumption.

      • node_config.guest_accelerator is added, with gpu_driver_installation_config.gpu_driver_version set to "LATEST", gpu_sharing_config enabled with TIME_SHARING, and max_shared_clients_per_gpu set to 2. The node pool uses NVIDIA L4 GPUs for AI inference tasks, the necessary GPU drivers are installed automatically, and multiple smaller workloads can share a single GPU.

      • node_config.machine_type is changed to "g2-standard-8". This machine type is designed to complement the L4 GPU, providing 8 vCPUs and 32 GB of memory to support the GPU and run your AI inference applications.

      • node_config.oauth_scopes includes https://www.googleapis.com/auth/cloud-platform. The node's service account has broad access to Google Cloud services, which enables API interaction for tasks like logging, monitoring, and pulling container images.

      • node_config.shielded_instance_config.enable_secure_boot is set to true. Secure Boot helps protect your nodes from boot-level malware by verifying the cryptographic signatures of the bootloader and kernel before they execute.
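The settings above can be sketched in Terraform. The following fragment is illustrative only, not the full template: resource names such as ai-inference-cluster and gpu-node-pool are placeholders, and your project, network, and location values will differ.

```hcl
# Illustrative Terraform sketch of the cluster and node pool settings
# described above. Names and locations are placeholders.
resource "google_container_cluster" "ai_inference" {
  name           = "ai-inference-cluster" # placeholder
  location       = "us-central1"
  node_locations = ["us-central1-a", "us-central1-b", "us-central1-c"]

  enable_intranode_visibility = true

  gateway_api_config {
    channel = "CHANNEL_STANDARD"
  }

  private_cluster_config {
    enable_private_nodes    = true  # nodes get private IP addresses only
    enable_private_endpoint = false # control plane stays reachable externally
  }

  control_plane_endpoints_config {
    dns_endpoint_config {
      allow_external_traffic = true
    }
  }

  release_channel {
    channel = "REGULAR"
  }
}

resource "google_container_node_pool" "gpu_pool" {
  name    = "gpu-node-pool" # placeholder
  cluster = google_container_cluster.ai_inference.id

  autoscaling {
    min_node_count = 0 # scale to zero when idle
    max_node_count = 3 # cap cost and resource consumption
  }

  node_config {
    machine_type = "g2-standard-8" # 8 vCPUs, 32 GB; pairs with the L4 GPU

    guest_accelerator {
      type  = "nvidia-l4"
      count = 1
      gpu_driver_installation_config {
        gpu_driver_version = "LATEST"
      }
      gpu_sharing_config {
        gpu_sharing_strategy       = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }
    }

    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

    shielded_instance_config {
      enable_secure_boot = true
    }
  }
}
```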

Helm chart configuration

The following list describes the Helm chart configuration, which is customized for deploying and scaling an AI inference service on GKE:

  • replicaCount: 1 creates a single initial replica.

  • image.repository: vllm/vllm-openai uses a vLLM image. vLLM is an optimized library for Large Language Model (LLM) inference that exposes an OpenAI-compatible API.

  • model.id: google/gemma-7b-it serves the Gemma 7B instruction-tuned model.

  • model.hfSecret: hf-secret indicates that the model requires authentication through a Kubernetes Secret, which provides secure credential management.

  • resources.limits and resources.requests set nvidia.com/gpu: "1", which ensures that each pod gets a dedicated GPU.

  • nodeSelector.cloud.google.com/gke-accelerator: nvidia-l4 ensures that your AI model pods are scheduled only on GKE Standard nodes equipped with NVIDIA L4 GPUs, which are well suited to cost-effective, high-performance inference.

  • hpa.enabled: true enables the Horizontal Pod Autoscaler, which automatically scales the number of pods between minReplicas: 1 and maxReplicas: 10 based on targetCPUUtilizationPercentage: 80. This maintains performance during peak loads and cost efficiency during low usage.

  • tensorParallelSize: 1 indicates that the model is not split across multiple GPUs within a single pod.

  • maxModelLen: 512 sets the maximum sequence length that the Gemma 7B model can process.

  • service.type: ClusterIP configures the service for internal access within the cluster.

  • pdb.enabled: true and minAvailable: 1 enable a Pod Disruption Budget, which ensures that at least one replica of your AI model remains available during voluntary disruptions like node maintenance.
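Taken together, these settings correspond to a values.yaml similar to the following. This is an illustrative sketch based on the values listed above, not the exact file shipped with the chart; key names and nesting may differ in the published template.

```yaml
# Illustrative values.yaml sketch matching the settings described above.
replicaCount: 1

image:
  repository: vllm/vllm-openai

model:
  id: google/gemma-7b-it
  hfSecret: hf-secret            # Kubernetes Secret holding the access token

resources:
  requests:
    nvidia.com/gpu: "1"          # one dedicated GPU per pod
  limits:
    nvidia.com/gpu: "1"

nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4

hpa:
  enabled: true
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80

tensorParallelSize: 1            # model is not split across GPUs in a pod
maxModelLen: 512                 # maximum sequence length

service:
  type: ClusterIP                # internal-only access

pdb:
  enabled: true
  minAvailable: 1                # keep one replica up during disruptions
```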

Create your AI application

Use the AI Pre-trained Inference GKE cluster and workload templates to deploy your AI application.

Deploy your AI infrastructure

Configure and deploy the AI Pre-trained Inference GKE cluster template to create the foundational infrastructure where your AI workload runs.

  1. Duplicate and deploy the AI Pre-trained Inference GKE cluster template as an application.

    A GKE cluster is created in the deployment project that you choose.

  2. Configure the components.

  3. Click Deploy. The application deploys after several minutes.

  4. In the Application details panel, click the Outputs tab.

  5. Identify the cluster_id for your application. You'll use this information when you deploy your Helm chart.
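After the cluster deploys, you can verify it from your workstation. The following commands are a sketch that requires your own credentials: substitute your cluster name (the cluster_id output), region, and project ID for the placeholders.

```shell
# Fetch kubeconfig credentials for the new cluster (values are placeholders).
gcloud container clusters get-credentials CLUSTER_NAME \
    --region us-central1 \
    --project PROJECT_ID

# Confirm that the cluster is reachable and inspect its node pools.
kubectl get nodes
gcloud container node-pools list --cluster CLUSTER_NAME --region us-central1
```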

Deploy your AI workload

Use the AI Pre-trained Inference GKE workload template to deploy your AI workload into the cluster you created. You'll deploy a Helm chart that includes your AI workload configuration.

  1. From the Google catalog page, on the AI Pre-trained Inference GKE workload template, click Create new application.

  2. In the Name field, enter a unique name for your application.

  3. In the GKE Deployment Target area, do the following:

    1. From the Project list, select the project where you deployed the GKE cluster from your AI Pre-trained Inference GKE cluster application.

    2. From the Region list, select the region where you deployed the GKE cluster.

    3. From the Clusters list, select the deployed GKE cluster.

    4. From the Namespace list, enter the namespace where you want to deploy your workload. If you didn't change the name, enter default.

    5. Click Create application.

    The application is created and the configuration files are displayed.

  4. In the Helm chart panel, do the following:

    1. Review the configuration details.

    2. Optional: customize the configuration to meet your unique needs.

    3. To deploy the Helm chart to your cluster, click Deploy.

      For detailed steps, see Deploy applications.

    After several minutes, the Helm chart configuration deploys to your GKE cluster.
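Once the workload is running, you can smoke-test the OpenAI-compatible endpoint that vLLM exposes. Because the service uses ClusterIP, the sketch below assumes you have port-forwarded it to localhost:8000 (for example, kubectl port-forward svc/SERVICE_NAME 8000:8000, where SERVICE_NAME depends on your Helm release). The function and parameter names here are illustrative, not part of the template.

```python
import json
import urllib.request


def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build a request body for vLLM's OpenAI-compatible /v1/completions API."""
    return {
        "model": "google/gemma-7b-it",  # must match the served model.id
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def query_model(prompt: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a completion request to the port-forwarded vLLM service."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires an active port-forward to the service):
# print(query_model("What is GKE?"))
```

Keep maxModelLen in mind when choosing prompt length and max_tokens: the chart limits sequences to 512 tokens, so longer requests are rejected by the server.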
