Prepare GKE infrastructure for DRA workloads


This page explains how to set up your Google Kubernetes Engine (GKE) infrastructure to support dynamic resource allocation (DRA). On this page, you'll create clusters that can deploy GPU or TPU workloads, and manually install the drivers that you need to enable DRA.

This page is intended for platform administrators who want to reduce the complexity and overhead of setting up infrastructure with specialized hardware devices.

About DRA

DRA is a built-in Kubernetes feature that lets you flexibly request, allocate, and share hardware in your cluster among Pods and containers. For more information, see About dynamic resource allocation.

Limitations

  • Node auto-provisioning isn't supported.
  • Autopilot clusters don't support DRA.
  • Automatic GPU driver installation isn't supported with DRA.
  • You can't use the following GPU sharing features:
    • Time-sharing GPUs
    • Multi-instance GPUs
    • Multi-process Service (MPS)

Requirements

To use DRA, your cluster must run GKE version 1.32.1-gke.1489001 or later.
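If you script cluster creation, it can help to check a candidate version string against this minimum before you use it. The following is a minimal sketch, not part of the gcloud CLI: the function name gke_version_at_least is hypothetical, and it assumes version strings of the form MAJOR.MINOR.PATCH-gke.BUILD.

```shell
# Minimum GKE version for DRA, as stated in the Requirements section.
MIN_GKE_VERSION="1.32.1-gke.1489001"

# gke_version_at_least VERSION MINIMUM
# Hypothetical helper: succeeds (exit status 0) when VERSION >= MINIMUM.
# Assumes both arguments look like MAJOR.MINOR.PATCH-gke.BUILD.
gke_version_at_least() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    # "1.32.1-gke.1489001" becomes "1.32.1.1489001", then four numeric fields.
    gsub(/-gke\./, ".", have); gsub(/-gke\./, ".", want)
    split(have, h, "."); split(want, w, ".")
    for (i = 1; i <= 4; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0  # all four fields are equal
  }'
}

# Example with an illustrative version string.
if gke_version_at_least "1.33.0-gke.1000000" "$MIN_GKE_VERSION"; then
  echo "version supports DRA"
else
  echo "version is too old for DRA"
fi
```

The comparison is numeric per field, so a build number such as 1489001 is compared as a number and not as text.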

You should also be familiar with the following requirements and limitations, depending on the type of hardware that you want to use:

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • If you're not using Cloud Shell, install the Helm CLI:

        curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
        chmod 700 get_helm.sh
        ./get_helm.sh
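
The commands on this page call gcloud, kubectl, helm, and git. As a quick check that these tools are on your PATH before you continue, you could use a small helper such as the following sketch. The function name require_tools is hypothetical and is not provided by any of these tools.

```shell
# require_tools TOOL...
# Hypothetical helper: reports each named command that is missing from PATH
# and returns non-zero if at least one is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (commented out so that sourcing this snippet has no side effects):
# require_tools gcloud kubectl helm git || exit 1
```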
    

Create a GKE Standard cluster

Create a Standard mode cluster that enables the Kubernetes beta APIs for DRA:

   
    gcloud container clusters create CLUSTER_NAME \
        --enable-kubernetes-unstable-apis="resource.k8s.io/v1beta1/deviceclasses,resource.k8s.io/v1beta1/resourceclaims,resource.k8s.io/v1beta1/resourceclaimtemplates,resource.k8s.io/v1beta1/resourceslices" \
        --cluster-version=GKE_VERSION
 

Replace the following:

  • CLUSTER_NAME: a name for your cluster.
  • GKE_VERSION: the GKE version to use for the cluster and nodes. Must be 1.32.1-gke.1489001 or later.
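
The value of --enable-kubernetes-unstable-apis is a comma-separated list of four resources in the resource.k8s.io/v1beta1 API group. If you prefer not to maintain that long string by hand, the following sketch builds it from the resource names. The function name dra_unstable_apis and the variable DRA_API_GROUP are hypothetical names chosen for this example.

```shell
# API group and version that the cluster creation command enables.
DRA_API_GROUP="resource.k8s.io/v1beta1"

# dra_unstable_apis
# Hypothetical helper: prints the comma-separated value for the
# --enable-kubernetes-unstable-apis flag.
dra_unstable_apis() {
  apis=""
  for kind in deviceclasses resourceclaims resourceclaimtemplates resourceslices; do
    # Add a comma before every entry except the first.
    apis="${apis:+$apis,}$DRA_API_GROUP/$kind"
  done
  printf '%s\n' "$apis"
}

# Example (commented out because it creates a real cluster):
# gcloud container clusters create CLUSTER_NAME \
#     --enable-kubernetes-unstable-apis="$(dra_unstable_apis)" \
#     --cluster-version=GKE_VERSION
```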

Create a GKE node pool with GPUs or TPUs

On GKE, you can use DRA with both GPUs and TPUs. The node pool configuration settings—such as machine type, accelerator type, count, node operating system, and node locations—depend on your requirements.

GPU

To use DRA for GPUs, you must do the following when you create the node pool:

  • Disable automatic GPU driver installation with gpu-driver-version=disabled.
  • Disable the GPU device plugin by adding the gke-no-default-nvidia-gpu-device-plugin=true node label.
  • Let the DRA driver DaemonSet run on the nodes by adding the nvidia.com/gpu.present=true node label.

To create a GPU node pool for DRA, follow these steps:

  1. Create a node pool with the required hardware. The following example creates a node pool that has g2-standard-24 instances on Container-Optimized OS with two L4 GPUs.

        gcloud container node-pools create NODEPOOL_NAME \
            --cluster=CLUSTER_NAME \
            --machine-type "g2-standard-24" \
            --accelerator "type=nvidia-l4,count=2,gpu-driver-version=disabled" \
            --num-nodes "1" \
            --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true
     
    

    Replace the following:

    • NODEPOOL_NAME: the name for your node pool.
    • CLUSTER_NAME: the name of your cluster.
  2. Manually install the drivers on your Container-Optimized OS or Ubuntu nodes. For detailed instructions, see Manually install NVIDIA GPU drivers.
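
After the node pool is ready, you might want to confirm that the nodes carry both labels that the GPU node pool command sets. The following sketch wraps a kubectl label selector in a function. The function name list_dra_gpu_nodes is hypothetical, and the sketch assumes that kubectl is already configured with credentials for your cluster.

```shell
# list_dra_gpu_nodes
# Hypothetical helper: prints the names of nodes that have both labels
# required for DRA with GPUs. An empty result means no node matched.
list_dra_gpu_nodes() {
  kubectl get nodes \
    -l 'gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true' \
    -o name
}

# Example (commented out because it needs access to a cluster):
# list_dra_gpu_nodes
```

A comma in a label selector means that a node must match every listed label, so nodes that have only one of the two labels are not listed.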

TPU

To use DRA for TPUs, you must disable the TPU device plugin by adding the gke-no-default-tpu-device-plugin=true node label.

Create a node pool that uses TPUs. The following example creates a TPU Trillium node pool:

    gcloud container node-pools create NODEPOOL_NAME \
        --cluster CLUSTER_NAME --num-nodes 1 \
        --node-labels "gke-no-default-tpu-device-plugin=true,gke-no-default-tpu-dra-plugin=true" \
        --machine-type=ct6e-standard-8t

Replace the following:

  • NODEPOOL_NAME: the name for your node pool.
  • CLUSTER_NAME: the name of your cluster.

Install DRA drivers

GPU

  1. Pull and update the Helm chart that contains the NVIDIA DRA driver:

        helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
            && helm repo update
    
  2. Install the NVIDIA DRA driver with version 25.3.0-rc.4:

        helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.0-rc.4" --create-namespace --namespace nvidia-dra-driver-gpu \
            --set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
            --set gpuResourcesEnabledOverride=true \
            --set resources.computeDomains.enabled=false \
            --set kubeletPlugin.priorityClassName="" \
            --set kubeletPlugin.tolerations[0].key=nvidia.com/gpu \
            --set kubeletPlugin.tolerations[0].operator=Exists \
            --set kubeletPlugin.tolerations[0].effect=NoSchedule
    

    For Ubuntu nodes, use the nvidiaDriverRoot="/opt/nvidia" directory path.
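
If you would rather keep these settings in a file than in a long list of --set flags, Helm also accepts a values file through the -f flag. The following sketch writes a file that mirrors the flags in the install command. The file name dra-values.yaml is an arbitrary choice for this example, and the sketch assumes that the chart reads these keys in the same way that the --set flags imply.

```shell
# Write a values file that mirrors the --set flags from the install command.
# The file name dra-values.yaml is an arbitrary choice for this sketch.
cat > dra-values.yaml <<'EOF'
nvidiaDriverRoot: "/home/kubernetes/bin/nvidia/"  # use "/opt/nvidia" on Ubuntu nodes
gpuResourcesEnabledOverride: true
resources:
  computeDomains:
    enabled: false
kubeletPlugin:
  priorityClassName: ""
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF

# Install with the file instead of the individual --set flags
# (commented out because it needs access to a cluster):
# helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
#     --version="25.3.0-rc.4" \
#     --create-namespace --namespace nvidia-dra-driver-gpu \
#     -f dra-values.yaml
```

A values file can be kept in version control, which makes later changes to the driver configuration easier to review.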

TPU

You can install DRA drivers for TPUs with the provided Helm chart. To get access to the Helm charts, complete the following steps:

  1. Clone the ai-on-gke repository to access the Helm charts that contain the DRA drivers for GPUs and TPUs:

        git clone https://github.com/ai-on-gke/common-infra.git

  2. Navigate to the directory that contains the charts:

        cd common-infra/common/charts

  3. Install the TPU DRA driver:

        ./tpu-dra-driver/install-tpu-dra-driver.sh
    

Verify that your infrastructure is ready for DRA

Verify that the DRA driver Pod is running.

GPU

    kubectl get pods -n nvidia-dra-driver-gpu
    NAME                                         READY   STATUS    RESTARTS   AGE
    nvidia-dra-driver-gpu-kubelet-plugin-52cdm   1/1     Running   0          46s

TPU

    kubectl get pods -n tpu-dra-driver
    NAME                                 READY   STATUS    RESTARTS   AGE
    tpu-dra-driver-kubeletplugin-h6m57   1/1     Running   0          30s
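
In a script, you might prefer to block until the driver Pods are ready instead of reading the Pod list by eye. The following sketch wraps kubectl wait in a function. The function name wait_for_dra_driver and the default timeout of 120 seconds are assumptions made for this example.

```shell
# wait_for_dra_driver NAMESPACE [TIMEOUT]
# Hypothetical helper: blocks until every Pod in NAMESPACE reports the Ready
# condition, or fails when TIMEOUT (default 120s, an arbitrary value) expires.
wait_for_dra_driver() {
  kubectl wait --for=condition=Ready pod --all \
    --namespace "$1" --timeout="${2:-120s}"
}

# Examples (commented out because they need access to a cluster):
# wait_for_dra_driver nvidia-dra-driver-gpu
# wait_for_dra_driver tpu-dra-driver 300s
```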

Confirm that the ResourceSlice lists the hardware devices that you added:

    kubectl get resourceslices -o yaml

If you used the example in the previous section, the ResourceSlice resembles the following, depending on the type of hardware you used:

GPU

The following example shows the output for a g2-standard-24 machine with two L4 GPUs.

    apiVersion: v1
    items:
    - apiVersion: resource.k8s.io/v1beta1
      kind: ResourceSlice
      metadata:
        # lines omitted for clarity
      spec:
        devices:
        - basic:
            attributes:
              architecture:
                string: Ada Lovelace
              brand:
                string: Nvidia
              cudaComputeCapability:
                version: 8.9.0
              cudaDriverVersion:
                version: 12.9.0
              driverVersion:
                version: 575.57.8
              index:
                int: 0
              minor:
                int: 0
              productName:
                string: NVIDIA L4
              type:
                string: gpu
              uuid:
                string: GPU-4d403095-4294-6ddd-66fd-cfe5778ef56e
            capacity:
              memory:
                value: 23034Mi
          name: gpu-0
        - basic:
            attributes:
              architecture:
                string: Ada Lovelace
              brand:
                string: Nvidia
              cudaComputeCapability:
                version: 8.9.0
              cudaDriverVersion:
                version: 12.9.0
              driverVersion:
                version: 575.57.8
              index:
                int: 1
              minor:
                int: 1
              productName:
                string: NVIDIA L4
              type:
                string: gpu
              uuid:
                string: GPU-cc326645-f91d-d013-1c2f-486827c58e50
            capacity:
              memory:
                value: 23034Mi
          name: gpu-1
        driver: gpu.nvidia.com
        nodeName: gke-cluster-gpu-pool-9b10ff37-mf70
        pool:
          generation: 1
          name: gke-cluster-gpu-pool-9b10ff37-mf70
          resourceSliceCount: 1
    kind: List
    metadata:
      resourceVersion: ""
 

TPU

    apiVersion: v1
    items:
    - apiVersion: resource.k8s.io/v1beta1
      kind: ResourceSlice
      metadata:
        # lines omitted for clarity
      spec:
        devices:
        - basic:
            attributes:
              index:
                int: 0
              tpuGen:
                string: v6e
              uuid:
                string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
          name: "0"
        - basic:
            attributes:
              index:
                int: 1
              tpuGen:
                string: v6e
              uuid:
                string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
          name: "1"
        - basic:
            attributes:
              index:
                int: 2
              tpuGen:
                string: v6e
              uuid:
                string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
          name: "2"
        - basic:
            attributes:
              index:
                int: 3
              tpuGen:
                string: v6e
              uuid:
                string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
          name: "3"
        driver: tpu.google.com
        nodeName: gke-tpu-b4d4b61b-fwbg
        pool:
          generation: 1
          name: gke-tpu-b4d4b61b-fwbg
          resourceSliceCount: 1
    kind: List
    metadata:
      resourceVersion: ""
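
The full YAML output is long. When you only need to confirm which node, driver, and devices each ResourceSlice advertises, a custom-columns view can be easier to scan. The following sketch selects the spec.nodeName, spec.driver, and spec.devices[*].name fields that appear in the example output. The function name summarize_resourceslices and the column titles are hypothetical.

```shell
# summarize_resourceslices
# Hypothetical helper: prints one row per ResourceSlice with the node name,
# the DRA driver name, and the names of the devices in the slice.
summarize_resourceslices() {
  kubectl get resourceslices \
    -o 'custom-columns=NODE:.spec.nodeName,DRIVER:.spec.driver,DEVICES:.spec.devices[*].name'
}

# Example (commented out because it needs access to a cluster):
# summarize_resourceslices
```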
 

What's next
