Prepare GKE infrastructure for DRA workloads

This document explains how to set up your Google Kubernetes Engine (GKE) infrastructure to support dynamic resource allocation (DRA). The setup steps include creating node pools that use GPUs or TPUs, and installing DRA drivers in your cluster. This document is intended for platform administrators who want to reduce the complexity and overhead of setting up infrastructure with specialized hardware devices.

Limitations

  • Node auto-provisioning isn't supported.
  • Autopilot clusters don't support DRA.
  • Automatic GPU driver installation isn't supported with DRA.
  • You can't use the following GPU sharing features:
    • Time-sharing GPUs
    • Multi-instance GPUs
    • Multi-process Service (MPS)
  • For TPUs, you must enable the v1beta1 and v1beta2 versions of the DRA API kinds. This limitation doesn't apply to GPUs, which can use v1 API versions.

Requirements

To use DRA, your GKE cluster must run version 1.34 or later.
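
The version requirement can be checked from a script before any node pools are created. The following is a minimal sketch, not an official tool: `version_at_least` is a hypothetical helper name, and the commented `gcloud` call assumes that the control plane version is exposed in the `currentMasterVersion` field of the cluster description.

```shell
# Succeeds (exit status 0) when a GKE version string such as "1.34.1-gke.100"
# is at least MIN_MAJOR.MIN_MINOR. The body runs in a subshell so that the
# helper variables don't leak into the calling shell.
version_at_least() (
  version="$1"; min_major="$2"; min_minor="$3"
  major="${version%%.*}"        # text before the first dot
  rest="${version#*.}"          # text after the first dot
  minor="${rest%%[!0-9]*}"      # leading digits of the remainder
  [ "$major" -gt "$min_major" ] && exit 0
  [ "$major" -eq "$min_major" ] && [ "$minor" -ge "$min_minor" ]
)

# Example use (not run here; requires gcloud and an existing cluster):
#   version="$(gcloud container clusters describe CLUSTER_NAME \
#       --location=CONTROL_PLANE_LOCATION \
#       --format='value(currentMasterVersion)')"
#   version_at_least "$version" 1 34 || echo "Upgrade the cluster before you use DRA."
```

The helper compares only the major and minor components, which is all that the 1.34 requirement needs.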

You should also be familiar with the following requirements and limitations, depending on the type of hardware that you want to use:

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Have a GKE Standard cluster that runs version 1.34 or later. You can also create a regional cluster.

  • If you're not using the Cloud Shell, install the Helm CLI:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
    chmod 700 get_helm.sh
    ./get_helm.sh
    
  • To use DRA for TPUs, enable the v1beta1 and v1beta2 versions of the DRA API kinds:

    gcloud container clusters update CLUSTER_NAME \
        --location=CONTROL_PLANE_LOCATION \
        --enable-kubernetes-unstable-apis="resource.k8s.io/v1beta1/deviceclasses,resource.k8s.io/v1beta1/resourceclaims,resource.k8s.io/v1beta1/resourceclaimtemplates,resource.k8s.io/v1beta1/resourceslices,resource.k8s.io/v1beta2/deviceclasses,resource.k8s.io/v1beta2/resourceclaims,resource.k8s.io/v1beta2/resourceclaimtemplates,resource.k8s.io/v1beta2/resourceslices"
    
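The value passed to --enable-kubernetes-unstable-apis is long and easy to mistype. As a sketch, the same comma-separated list can be generated from the two API versions and four resource names that appear in the command. The function name `dra_unstable_apis` is hypothetical; the versions and resource names are taken from the command in this section.

```shell
# Prints the comma-separated list of beta DRA API resources, in the same
# order as the hand-written value: all v1beta1 entries, then all v1beta2.
# The body runs in a subshell so that loop variables don't leak.
dra_unstable_apis() (
  apis=""
  for version in v1beta1 v1beta2; do
    for resource in deviceclasses resourceclaims resourceclaimtemplates resourceslices; do
      # Add a comma separator only when the list is already non-empty.
      apis="${apis:+${apis},}resource.k8s.io/${version}/${resource}"
    done
  done
  printf '%s\n' "$apis"
)

# Example use (not run here; requires gcloud and an existing cluster):
#   gcloud container clusters update CLUSTER_NAME \
#       --location=CONTROL_PLANE_LOCATION \
#       --enable-kubernetes-unstable-apis="$(dra_unstable_apis)"
```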

Create a GKE node pool with GPUs or TPUs

On GKE, you can use DRA with both GPUs and TPUs. The node pool configuration settings (such as machine type, accelerator type, count, node operating system, and node locations) depend on your requirements. To create a node pool that supports DRA, select one of the following options:

GPU

To use DRA for GPUs, you must do the following when you create the node pool:

  • Disable automatic GPU driver installation by specifying the gpu-driver-version=disabled option in the --accelerator flag when you configure GPUs for a node pool.
  • Disable the GPU device plugin by adding the gke-no-default-nvidia-gpu-device-plugin=true node label.
  • Let the DRA driver DaemonSet run on the nodes by adding the nvidia.com/gpu.present=true node label.

To create a GPU node pool for DRA, follow these steps:

  1. Create a node pool with the required hardware. The following example creates a node pool that has a g2-standard-24 instance on Container-Optimized OS with two L4 GPUs.

    gcloud container node-pools create NODEPOOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=CONTROL_PLANE_LOCATION \
        --machine-type "g2-standard-24" \
        --accelerator "type=nvidia-l4,count=2,gpu-driver-version=disabled" \
        --num-nodes "1" \
        --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true

    Replace the following:

    • NODEPOOL_NAME: the name for your node pool.
    • CLUSTER_NAME: the name of your cluster.
    • CONTROL_PLANE_LOCATION: the region or zone of the cluster control plane, such as us-central1 or us-central1-a.
  2. Manually install the drivers on your Container-Optimized OS or Ubuntu nodes. For detailed instructions, refer to Manually install NVIDIA GPU drivers.
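
Both node labels listed at the start of this section must be present for the DRA driver to take over from the default device plugin. The following sketch reports which of the two labels are missing from a comma-separated label string. The function name `missing_dra_gpu_labels` is hypothetical, and the commented `kubectl` pipeline assumes that the labels are the last column of the `--show-labels` output.

```shell
# Prints each required DRA GPU node label that is absent from the
# comma-separated "key=value" list passed as the first argument.
# Prints nothing when both labels are present.
missing_dra_gpu_labels() (
  labels=",$1,"   # pad with commas so every entry can be matched whole
  for required in \
      "gke-no-default-nvidia-gpu-device-plugin=true" \
      "nvidia.com/gpu.present=true"; do
    case "$labels" in
      *",$required,"*) ;;                    # label found
      *) printf '%s\n' "$required" ;;        # label missing
    esac
  done
)

# Example use (not run here; requires kubectl access to the cluster):
#   labels="$(kubectl get node NODE_NAME --show-labels --no-headers | awk '{print $NF}')"
#   missing_dra_gpu_labels "$labels"
```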

TPU

To use DRA for TPUs, you must disable the TPU device plugin by adding the gke-no-default-tpu-device-plugin=true node label. The following example creates a TPU Trillium node pool with DRA support:

    gcloud container node-pools create NODEPOOL_NAME \
        --cluster CLUSTER_NAME --num-nodes 1 \
        --location=CONTROL_PLANE_LOCATION \
        --node-labels "gke-no-default-tpu-device-plugin=true,gke-no-default-tpu-dra-plugin=true" \
        --machine-type=ct6e-standard-8t

Replace the following:

  • NODEPOOL_NAME: the name for your node pool.
  • CLUSTER_NAME: the name of your cluster.
  • CONTROL_PLANE_LOCATION: the region or zone of the cluster control plane, such as us-central1 or us-central1-a.

Install DRA drivers

GPU

  1. Pull and update the Helm chart that contains the NVIDIA DRA driver:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
        && helm repo update
    
  2. Install the NVIDIA DRA driver with version 25.3.2:

    helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
        --version="25.3.2" --create-namespace --namespace=nvidia-dra-driver-gpu \
        --set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
        --set gpuResourcesEnabledOverride=true \
        --set resources.computeDomains.enabled=false \
        --set kubeletPlugin.priorityClassName="" \
        --set 'kubeletPlugin.tolerations[0].key=nvidia.com/gpu' \
        --set 'kubeletPlugin.tolerations[0].operator=Exists' \
        --set 'kubeletPlugin.tolerations[0].effect=NoSchedule'

    For Ubuntu nodes, use the nvidiaDriverRoot="/opt/nvidia" directory path.
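
Because the driver root differs between the two node operating systems, a script that installs the chart on mixed fleets can select the path from the node image. The following is a sketch only: the function name `nvidia_driver_root` and the `cos` and `ubuntu` argument values are hypothetical labels chosen for this example, and the two paths are the ones given in this section.

```shell
# Prints the nvidiaDriverRoot value for the given node operating system.
# Accepts "cos" (Container-Optimized OS) or "ubuntu"; any other value is
# reported on stderr and returns a non-zero status.
nvidia_driver_root() {
  case "$1" in
    cos)    printf '%s\n' "/home/kubernetes/bin/nvidia/" ;;
    ubuntu) printf '%s\n' "/opt/nvidia" ;;
    *)      echo "unknown node OS: $1" >&2; return 1 ;;
  esac
}

# Example use (not run here; requires helm and cluster access). The
# remaining --set flags from the install command are omitted for brevity:
#   helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
#       --version="25.3.2" --create-namespace --namespace=nvidia-dra-driver-gpu \
#       --set nvidiaDriverRoot="$(nvidia_driver_root ubuntu)"
```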

TPU

  1. Clone the ai-on-gke repository to access the Helm charts that contain the DRA drivers for GPUs and TPUs:

    git clone https://github.com/ai-on-gke/common-infra.git
    
  2. Navigate to the directory that contains the charts:

    cd common-infra/common/charts
    
  3. Install the TPU DRA driver:

     ./tpu-dra-driver/install-tpu-dra-driver.sh 
    

Verify that your infrastructure is ready for DRA

  1. To verify that your DRA driver Pods are running, select one of the following options:

    GPU

    kubectl get pods -n nvidia-dra-driver-gpu

    The output is similar to the following:

    NAME                                         READY   STATUS    RESTARTS   AGE
    nvidia-dra-driver-gpu-kubelet-plugin-52cdm   1/1     Running   0          46s

    TPU

    kubectl get pods -n tpu-dra-driver

    The output is similar to the following:

    NAME                                         READY   STATUS    RESTARTS   AGE
    tpu-dra-driver-kubeletplugin-h6m57           1/1     Running   0          30s
    
  2. Confirm that the ResourceSlice lists the hardware devices that you added:

    kubectl get resourceslices -o yaml

    If you used the example in the previous section, the output is similar to the following, depending on whether you configured GPUs or TPUs:

    GPU

    apiVersion: v1
    items:
    - apiVersion: resource.k8s.io/v1
      kind: ResourceSlice
      metadata:
      # Multiple lines are omitted here.
      spec:
        devices:
        - attributes:
            architecture:
              string: Ada Lovelace
            brand:
              string: Nvidia
            cudaComputeCapability:
              version: 8.9.0
            cudaDriverVersion:
              version: 13.0.0
            driverVersion:
              version: 580.65.6
            index:
              int: 0
            minor:
              int: 0
            pcieBusID:
              string: "0000:00:03.0"
            productName:
              string: NVIDIA L4
            resource.kubernetes.io/pcieRoot:
              string: pci0000:00
            type:
              string: gpu
            uuid:
              string: GPU-ccc19e5e-e3cd-f911-65c8-89bcef084e3f
          capacity:
            memory:
              value: 23034Mi
          name: gpu-0
        - attributes:
            architecture:
              string: Ada Lovelace
            brand:
              string: Nvidia
            cudaComputeCapability:
              version: 8.9.0
            cudaDriverVersion:
              version: 13.0.0
            driverVersion:
              version: 580.65.6
            index:
              int: 1
            minor:
              int: 1
            pcieBusID:
              string: "0000:00:04.0"
            productName:
              string: NVIDIA L4
            resource.kubernetes.io/pcieRoot:
              string: pci0000:00
            type:
              string: gpu
            uuid:
              string: GPU-f783198d-42f9-7cef-9ea1-bb10578df978
          capacity:
            memory:
              value: 23034Mi
          name: gpu-1
        driver: gpu.nvidia.com
        nodeName: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
        pool:
          generation: 1
          name: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
          resourceSliceCount: 1
    kind: List
    metadata:
      resourceVersion: ""

    TPU

    apiVersion: v1
    items:
    - apiVersion: resource.k8s.io/v1beta1
      kind: ResourceSlice
      metadata:
      # lines omitted for clarity
      spec:
        devices:
        - basic:
            attributes:
              index:
                int: 0
              tpuGen:
                string: v6e
              uuid:
                string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
          name: "0"
        - basic:
            attributes:
              index:
                int: 1
              tpuGen:
                string: v6e
              uuid:
                string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
          name: "1"
        - basic:
            attributes:
              index:
                int: 2
              tpuGen:
                string: v6e
              uuid:
                string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
          name: "2"
        - basic:
            attributes:
              index:
                int: 3
              tpuGen:
                string: v6e
              uuid:
                string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
          name: "3"
        driver: tpu.google.com
        nodeName: gke-tpu-b4d4b61b-fwbg
        pool:
          generation: 1
          name: gke-tpu-b4d4b61b-fwbg
          resourceSliceCount: 1
    kind: List
    metadata:
      resourceVersion: ""
    
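The Pod check in step 1 can also be scripted, for example as a gate in a provisioning pipeline. The following sketch reads the tabular output of kubectl get pods on standard input and succeeds only when at least one Pod is listed and every listed Pod is in the Running status with all of its containers ready. The function name `all_pods_running` is hypothetical, and the parsing assumes the default five-column output format shown in step 1.

```shell
# Reads "kubectl get pods" table output on stdin. Exits 0 only if at least
# one Pod row exists and every row has STATUS "Running" and a READY column
# of the form N/N.
all_pods_running() {
  awk 'NR > 1 {
         seen = 1
         split($2, ready, "/")            # READY column, for example "1/1"
         if ($3 != "Running" || ready[1] != ready[2]) bad = 1
       }
       END { exit (seen && !bad) ? 0 : 1 }'
}

# Example use (not run here; requires kubectl access to the cluster):
#   kubectl get pods -n nvidia-dra-driver-gpu | all_pods_running \
#       && echo "DRA driver Pods are ready."
```

Substitute the tpu-dra-driver namespace to run the same check for TPUs.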

What's next
