Prepare GKE infrastructure for DRA workloads

This document explains how to manually set up your Google Kubernetes Engine (GKE) infrastructure to support dynamic resource allocation (DRA). The setup steps include creating node pools that use GPUs and installing DRA drivers.

This document is intended for platform administrators who want to create infrastructure with specialized hardware devices that application operators can claim in workloads.

Limitations

The following limitations apply:

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.

Create a GKE node pool with GPUs

This section shows you how to create a GPU node pool and install the corresponding DRA drivers. The steps in this section apply only to node pools that you manually create. To create a GPU node pool that supports DRA, you must do the following:

  • Disable automatic GPU driver installation: specify the gpu-driver-version=disabled option in the --accelerator flag.
  • Disable the GPU device plugin: add the gke-no-default-nvidia-gpu-device-plugin=true node label to the node pool.
  • Run the DRA driver DaemonSet: add the nvidia.com/gpu.present=true node label to the node pool.
  • Configure autoscaling: to use the cluster autoscaler in your node pool, add the cloud.google.com/gke-nvidia-gpu-dra-driver=true node label to the node pool. The cluster autoscaler uses this node label to identify nodes that run the DRA driver for GPUs.

To create and configure GPU node pools, follow these steps:

  1. Create a GPU node pool. The following example commands create node pools with different configurations:

    • Create a node pool with a g2-standard-24 instance that has two L4 GPUs:

        gcloud container node-pools create NODEPOOL_NAME \
            --cluster=CLUSTER_NAME \
            --location=CONTROL_PLANE_LOCATION \
            --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
            --machine-type="g2-standard-24" \
            --accelerator="type=nvidia-l4,count=2,gpu-driver-version=disabled" \
            --num-nodes="1" \
            --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true

      Replace the following:

      • NODEPOOL_NAME: a name for your node pool.
      • CLUSTER_NAME: the name of your cluster.
      • CONTROL_PLANE_LOCATION: the region or zone of the cluster control plane, such as us-central1 or us-central1-a.
      • NODE_LOCATION1,NODE_LOCATION2,...: a comma-separated list of zones, in the same region as the control plane, to create nodes in. Choose zones that have GPU availability.
    • Create an autoscaled node pool with a2-ultragpu-1g instances that have one NVIDIA A100 (80 GB) GPU in each instance:

        gcloud container node-pools create NODEPOOL_NAME \
            --cluster=CLUSTER_NAME \
            --location=CONTROL_PLANE_LOCATION \
            --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
            --enable-autoscaling \
            --max-nodes=5 \
            --machine-type="a2-ultragpu-1g" \
            --accelerator="type=nvidia-a100-80gb,count=1,gpu-driver-version=disabled" \
            --num-nodes="1" \
            --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true,cloud.google.com/gke-nvidia-gpu-dra-driver=true
  2. Manually install NVIDIA GPU drivers.

  3. Install DRA drivers.

Install DRA drivers

  1. Pull and update the Helm chart that contains the NVIDIA DRA driver:

     helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
         && helm repo update
  2. Install the NVIDIA DRA GPU driver with version 25.8.0 or later:

     helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
         --version="25.8.0" \
         --create-namespace --namespace=nvidia-dra-driver-gpu \
         --set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
         --set gpuResourcesEnabledOverride=true \
         --set resources.computeDomains.enabled=false \
         --set kubeletPlugin.priorityClassName="" \
         --set 'kubeletPlugin.tolerations[0].key=nvidia.com/gpu' \
         --set 'kubeletPlugin.tolerations[0].operator=Exists' \
         --set 'kubeletPlugin.tolerations[0].effect=NoSchedule'

    For Ubuntu nodes, specify the "/opt/nvidia" directory path instead: --set nvidiaDriverRoot="/opt/nvidia".

Verify that your infrastructure is ready for DRA

  1. Verify that your DRA driver Pods are running:

     kubectl get pods -n nvidia-dra-driver-gpu

    The output is similar to the following:

     NAME                                         READY   STATUS    RESTARTS   AGE
    nvidia-dra-driver-gpu-kubelet-plugin-52cdm   1/1     Running   0          46s 
    
  2. Confirm that the ResourceSlice lists the hardware devices that you added:

     kubectl get resourceslices -o yaml

    The output is similar to the following:

      apiVersion: v1
      items:
      - apiVersion: resource.k8s.io/v1
        kind: ResourceSlice
        metadata:
          # Multiple lines are omitted here.
        spec:
          devices:
          - attributes:
              architecture:
                string: Ada Lovelace
              brand:
                string: Nvidia
              cudaComputeCapability:
                version: 8.9.0
              cudaDriverVersion:
                version: 13.0.0
              driverVersion:
                version: 580.65.6
              index:
                int: 0
              minor:
                int: 0
              pcieBusID:
                string: "0000:00:03.0"
              productName:
                string: NVIDIA L4
              resource.kubernetes.io/pcieRoot:
                string: pci0000:00
              type:
                string: gpu
              uuid:
                string: GPU-ccc19e5e-e3cd-f911-65c8-89bcef084e3f
            capacity:
              memory:
                value: 23034Mi
            name: gpu-0
          - attributes:
              architecture:
                string: Ada Lovelace
              brand:
                string: Nvidia
              cudaComputeCapability:
                version: 8.9.0
              cudaDriverVersion:
                version: 13.0.0
              driverVersion:
                version: 580.65.6
              index:
                int: 1
              minor:
                int: 1
              pcieBusID:
                string: "0000:00:04.0"
              productName:
                string: NVIDIA L4
              resource.kubernetes.io/pcieRoot:
                string: pci0000:00
              type:
                string: gpu
              uuid:
                string: GPU-f783198d-42f9-7cef-9ea1-bb10578df978
            capacity:
              memory:
                value: 23034Mi
            name: gpu-1
          driver: gpu.nvidia.com
          nodeName: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
          pool:
            generation: 1
            name: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
            resourceSliceCount: 1
      kind: List
      metadata:
        resourceVersion: ""
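    If you want to check the advertised devices programmatically rather than reading the YAML by hand, you can parse the JSON form of the same command (kubectl get resourceslices -o json). The following is a minimal sketch; the SAMPLE data is a trimmed stand-in that mirrors the output shown above, and the list_devices helper is a hypothetical name, not part of any GKE or Kubernetes library.

    ```python
    import json

    # Trimmed stand-in for `kubectl get resourceslices -o json`,
    # mirroring the ResourceSlice output shown in this document.
    SAMPLE = json.loads("""
    {
      "items": [
        {
          "kind": "ResourceSlice",
          "spec": {
            "driver": "gpu.nvidia.com",
            "nodeName": "gke-cluster-1-dra-gpu-pool-b56c4961-7vnm",
            "devices": [
              {
                "name": "gpu-0",
                "attributes": {"productName": {"string": "NVIDIA L4"}},
                "capacity": {"memory": {"value": "23034Mi"}}
              },
              {
                "name": "gpu-1",
                "attributes": {"productName": {"string": "NVIDIA L4"}},
                "capacity": {"memory": {"value": "23034Mi"}}
              }
            ]
          }
        }
      ]
    }
    """)


    def list_devices(slices):
        """Return (node, device, product, memory) tuples from a ResourceSlice list."""
        rows = []
        for item in slices.get("items", []):
            spec = item.get("spec", {})
            node = spec.get("nodeName", "")
            for dev in spec.get("devices", []):
                product = dev.get("attributes", {}).get("productName", {}).get("string", "")
                memory = dev.get("capacity", {}).get("memory", {}).get("value", "")
                rows.append((node, dev["name"], product, memory))
        return rows


    for row in list_devices(SAMPLE):
        print(row)
    ```

    A check like this is convenient in cluster-bootstrap scripts: if list_devices returns fewer devices than the node pool should expose, the DRA driver or the GPU drivers are likely not installed correctly.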

What's next
