Create a custom AI-optimized GKE cluster which uses A4X

This page shows you how to use A4X virtual machines (VMs) to create an AI-optimized Google Kubernetes Engine (GKE) cluster that uses Cluster Director for GKE to support your AI and ML workloads. For more information about A4X, see A4X series.

Cluster Director for GKE lets you deploy and manage large AI-optimized clusters of accelerated VMs with features such as targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. For more information, see Cluster Director overview.

GKE provides a single platform surface to run a diverse set of workloads for your organizations, reducing the operational burden of managing multiple platforms. You can run workloads such as high-performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services.

On this page, you learn how to create a GKE cluster with the Google Cloud CLI for maximum flexibility in configuring your cluster based on the needs of your workload. Alternatively, you can choose to use Cluster Toolkit to quickly deploy your cluster with default settings that reflect best practices for many use cases. For more information, see Create an AI-optimized GKE cluster with default configuration. To create a cluster which uses A4 or A3 Ultra, see Create a custom AI-optimized GKE cluster which uses A4 or A3 Ultra.

Cluster configuration options with GPUDirect RDMA

To create your cluster with the Google Cloud CLI, you can choose one of the following cluster configuration options:

  • If you plan to run distributed AI workloads: create a GKE cluster with GPUDirect RDMA using the instructions on this page.
  • If you don't plan to run distributed AI workloads: create a GKE cluster without using GPUDirect RDMA. For more information, see Create a cluster without GPUDirect RDMA.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API. A command sketch for enabling the API follows this list.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
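
The following is a minimal sketch for enabling the API with the gcloud CLI; it assumes the GKE service name container.googleapis.com and that your project is already configured:

   # Enable the GKE API for the current project.
   gcloud services enable container.googleapis.com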

Obtain capacity

You can obtain capacity for A4X VMs by creating a future reservation. For more information about future reservations, see the Future reservations in AI Hypercomputer column in the table for Choose a consumption option.

To obtain capacity with a future reservation, see the Future reservations in AI Hypercomputer row in the table for How to obtain capacity.

Requirements

The following requirements apply to an AI-optimized GKE cluster with A4X VMs:

  • Verify that you use the minimum GPU driver version. The GB200 GPUs in A4X VMs require GPU driver version R580 or later. Install R580 by running GKE version 1.32.8-gke.1108000 or later and using the gpu-driver-version=latest flag.
  • The GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.
  • Your GKE workload must use all available GPUs and your Pod must use all available secondary NICs on a single GKE node. Multiple Pods cannot share RDMA on a single GKE node.
  • This setup runs an NCCL test. To run the test, you need a VM quota of at least 2 (4 GPUs each when using a4x-highgpu-4g or a4x-highgpu-4g-nolssd).
  • You must use the reservation-bound provisioning model to create clusters with A4X. Other provisioning models are not supported.

Considerations for creating a cluster

When you create a cluster, consider the following information:

  • Choose a cluster location:
    • Verify that you use a location that has availability for the machine type that you choose. For more information, see GPU availability by regions and zones.
    • For dense reservations, you can create a zonal cluster. In this case, replace the --region flag with the --zone=COMPUTE_ZONE flag, where COMPUTE_ZONE is the zone of your control plane.
    • When you create node pools in a regional cluster, you can use the --node-locations flag to specify the zones for your GKE nodes.
  • Choose a driver version:
    • The driver version can be one of the following values:
      • default: install the default driver version for your GKE node version. For more information about the requirements for default driver versions, see the Requirements section.
      • latest: install the latest available driver version for your GKE version. This option is available only for nodes that use Container-Optimized OS.
      • disabled: skip automatic driver installation. You must manually install a driver after you create the node pool.
    • For more information about the default and latest GPU driver versions for GKE node versions, see the table in the section Manually install NVIDIA GPU drivers.
  • Choose a reservation affinity:

    • You can find information about your reservation, such as the name of your reservation or the name of a specific block in your reservation. To find these values, see View future reservation requests. You can also inspect a reservation with the gcloud CLI, as shown in the sketch after this list.
    • The --reservation-affinity flag can take the values of specific or any. However, for high-performance distributed AI workloads, we recommend that you use a specific reservation.
    • When you use a specific reservation, including shared reservations, specify the value of the --reservation flag in the following format:

       projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME


      Replace the following values:

      • PROJECT_ID : your Google Cloud project ID.
      • RESERVATION_NAME : the name of your reservation.
      • BLOCK_NAME : the name of a specific block within the reservation.

      To use a sub-block targeted reservation so that VMs are placed on a single sub-block within the BLOCK_NAME, add the following to the end of the path:

       /reservationSubBlocks/SUB_BLOCK_NAME


      Replace SUB_BLOCK_NAME with the name of the sub-block.
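
Before you reference a reservation in later commands, you can confirm that it exists and inspect its properties with the gcloud CLI. This is a minimal sketch; the block and sub-block names themselves come from View future reservation requests:

 # Show the details of an existing reservation in a given zone.
 gcloud compute reservations describe RESERVATION_NAME \
     --zone=COMPUTE_ZONE \
     --project=PROJECT_ID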

Create an AI-optimized GKE cluster which uses A4X and GPUDirect RDMA

For distributed AI workloads, multiple GPU nodes are often linked together to work as a single computer. A4X is an exascale platform based on the NVIDIA GB200 NVL72 rack-scale architecture. This machine type enables scaling and collaboration across multiple GPUs by delivering a high-performance cloud experience for AI workloads. For more information about the network architecture for A4X, including the network bandwidth and NIC arrangement, see A4X machine types.

To create your GKE Standard clusters with A4X and with GPUDirect RDMA, complete the following steps, which are described in the next sections:

  1. Create VPCs and subnets
  2. Create the GKE cluster with multi-networking
  3. Create the GKE network objects
  4. Create a workload policy
  5. Create a node pool with A4X
  6. Install the RDMA binary and configure NCCL
  7. Install the NVIDIA Compute Domain CRD and DRA driver

Create VPCs and subnets

A4X VMs have the following configuration:

  • Four NVIDIA B200 GPUs per virtual machine connected with NVLink
  • Two Arm-based NVIDIA Grace CPUs
  • Four 400 Gbps CX-7 network interface cards (NICs) for GPU-to-GPU networking
  • Two 200 Gbps Google Titanium network interface cards (NICs) for external services

AI and ML workloads, such as distributed training, require powerful acceleration to optimize performance by reducing job completion times. For workloads that require high performance, high throughput, and low latency, GPUDirect RDMA reduces the network hops that are required to transfer payloads to and from GPUs. This approach more efficiently uses the network bandwidth that's available.

One of the Google Titanium NICs that is associated with the CPU uses the default network in GKE, so you don't have to create a new VPC for this NIC as long as you have enough IP address ranges for the default network.

You can create one VPC for the second CPU Titanium NIC (gVNIC) and another VPC for the four CX-7 RDMA NICs by using the following commands.

To maximize network bandwidth, the command to create a VPC for the additional gVNIC sets the maximum transmission unit (MTU) to 8896. The RDMA VPC defaults to the recommended setting of 8896. For more information, see MTU settings and GPU machine types.

  1. Set environment variables to match your deployment:

     export REGION="COMPUTE_REGION"
     export ZONE="COMPUTE_ZONE"
     export PROJECT="PROJECT_ID"
     export GVNIC_NETWORK_PREFIX="GVNIC_NETWORK_PREFIX"
     export RDMA_NETWORK_PREFIX="RDMA_NETWORK_PREFIX"

    Replace the following variables:

    • COMPUTE_REGION : the region of your cluster.
    • COMPUTE_ZONE : the zone of your node pool.
    • PROJECT_ID : your Google Cloud project ID.
    • GVNIC_NETWORK_PREFIX : the GVNIC network prefix (for example, a4x-gvnic).
    • RDMA_NETWORK_PREFIX : the RDMA network prefix (for example, a4x-rdma).
  2. Create two VPC networks:

     # Create a VPC for the additional GVNIC
     gcloud compute --project=${PROJECT} \
       networks create \
       ${GVNIC_NETWORK_PREFIX}-net \
       --subnet-mode=custom \
       --mtu=8896

     gcloud compute --project=${PROJECT} \
       networks subnets create \
       ${GVNIC_NETWORK_PREFIX}-sub \
       --network=${GVNIC_NETWORK_PREFIX}-net \
       --region=${REGION} \
       --range=192.168.0.0/24

     gcloud compute --project=${PROJECT} \
       firewall-rules create \
       ${GVNIC_NETWORK_PREFIX}-internal \
       --network=${GVNIC_NETWORK_PREFIX}-net \
       --action=ALLOW \
       --rules=tcp:0-65535,udp:0-65535,icmp \
       --source-ranges=192.168.0.0/16

     # Create HPC VPC for the RDMA NICs with 4 subnets.
     gcloud compute --project=${PROJECT} \
       networks create ${RDMA_NETWORK_PREFIX}-net \
       --network-profile=${ZONE}-vpc-roce \
       --subnet-mode=custom

     # Create subnets for the HPC VPC.
     for N in $(seq 0 3); do
       gcloud compute --project=${PROJECT} \
         networks subnets create \
         ${RDMA_NETWORK_PREFIX}-sub-$N \
         --network=${RDMA_NETWORK_PREFIX}-net \
         --region=${REGION} \
         --range=192.168.$((N+1)).0/24 &  # offset to avoid overlap with gvnics
     done

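    Optionally, verify that both VPCs and their subnets exist before you continue. This check is a sketch and not a required step; it assumes the environment variables from step 1 are still set:

     # List the two networks and the subnets of the RDMA VPC.
     gcloud compute networks list --project=${PROJECT} \
         --filter="name:(${GVNIC_NETWORK_PREFIX}-net OR ${RDMA_NETWORK_PREFIX}-net)"
     gcloud compute networks subnets list --project=${PROJECT} \
         --filter="network:${RDMA_NETWORK_PREFIX}-net"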
Create the GKE cluster with multi-networking

  1. Create a GKE Standard cluster with multi-networking:

     gcloud container clusters create CLUSTER_NAME \
         --enable-dataplane-v2 --enable-ip-alias --location=COMPUTE_REGION \
         --enable-multi-networking --cluster-version=CLUSTER_VERSION \
         --enable-kubernetes-unstable-apis=resource.k8s.io/v1beta1/deviceclasses,resource.k8s.io/v1beta1/resourceclaims,resource.k8s.io/v1beta1/resourceclaimtemplates,resource.k8s.io/v1beta1/resourceslices \
         [--services-ipv4-cidr=SERVICE_CIDR \
         --cluster-ipv4-cidr=POD_CIDR]

    Replace the following:

    • CLUSTER_NAME : the name of your cluster.
    • CLUSTER_VERSION : the version of your new cluster, which must be 1.32.4-gke.1533000 or later for A4X.
    • COMPUTE_REGION : the name of the compute region.

    Optionally, you can explicitly provide the secondary CIDR ranges for services and Pods. If you use these optional flags, replace the following variables:

    • SERVICE_CIDR : the secondary CIDR range for services.
    • POD_CIDR : the secondary CIDR range for Pods.

    When you use these flags, you must verify that the CIDR ranges don't overlap with subnet ranges for additional node networks. For example, SERVICE_CIDR=10.65.0.0/19 and POD_CIDR=10.64.0.0/19. For more information, see Adding Pod IPv4 address ranges.
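
    After creation completes, you can confirm that the cluster runs the expected version. A minimal check:

     gcloud container clusters describe CLUSTER_NAME \
         --location=COMPUTE_REGION \
         --format="value(currentMasterVersion)"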

Create the GKE network objects

You must configure the VPC networks that you created in the previous section through GKE network parameter sets. Specifically, the second CPU Titanium NIC (gVNIC) must be configured in NetDevice mode, and each of the four CX-7 RDMA NICs must be configured in RDMA mode.

This command uses the following names:

  • CPU Titanium NIC (gVNIC) VPC is named GVNIC_NETWORK_PREFIX-net with the subnet named GVNIC_NETWORK_PREFIX-sub
  • CX-7 RDMA NICs VPC is named RDMA_NETWORK_PREFIX-net with the subnets named RDMA_NETWORK_PREFIX-sub-[0…3]

Create the GKE network objects by running the following command:

kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: gvnic-1
spec:
  vpc: GVNIC_NETWORK_PREFIX-net
  vpcSubnet: GVNIC_NETWORK_PREFIX-sub
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: gvnic-1
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: gvnic-1
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-0
spec:
  vpc: RDMA_NETWORK_PREFIX-net
  vpcSubnet: RDMA_NETWORK_PREFIX-sub-0
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-0
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-0
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-1
spec:
  vpc: RDMA_NETWORK_PREFIX-net
  vpcSubnet: RDMA_NETWORK_PREFIX-sub-1
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-1
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-1
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-2
spec:
  vpc: RDMA_NETWORK_PREFIX-net
  vpcSubnet: RDMA_NETWORK_PREFIX-sub-2
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-2
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-2
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-3
spec:
  vpc: RDMA_NETWORK_PREFIX-net
  vpcSubnet: RDMA_NETWORK_PREFIX-sub-3
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-3
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-3
EOF
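
To confirm that the objects were created, you can list them by their CRD names. A minimal check, assuming the multi-networking CRDs that GKE installs:

 kubectl get gkenetworkparamsets.networking.gke.io
 kubectl get networks.networking.gke.io

Both lists should include gvnic-1 and rdma-0 through rdma-3.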

Create a workload policy

A workload policy is required to create a partition. For more information, see Workload policy for MIGs.

Create a HIGH_THROUGHPUT workload policy with the accelerator_topology field set to 1x72:

 gcloud beta compute resource-policies create workload-policy WORKLOAD_POLICY_NAME \
     --type HIGH_THROUGHPUT \
     --accelerator-topology 1x72 \
     --project PROJECT \
     --region COMPUTE_REGION

Replace the following:

  • WORKLOAD_POLICY_NAME : the name of your workload policy.
  • PROJECT : the name of your project.
  • COMPUTE_REGION : the name of the compute region.
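
To confirm the policy before you reference it from a node pool, you can describe it. A minimal check:

 gcloud beta compute resource-policies describe WORKLOAD_POLICY_NAME \
     --project PROJECT \
     --region COMPUTE_REGION

The output should show the HIGH_THROUGHPUT type and the 1x72 accelerator topology.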

Create a node pool with A4X

We recommend that you create a node pool which uses the GKE GPU device plugin. This plugin provides GKE-managed GPU resource management. The approach has the following benefits:

  • Ease of deployment and upgrades
  • Driver auto-installation
  • GKE-managed GPU features, such as metrics and partitioned GPUs
  • Essential security vulnerability fixes

Alternatively, you can use the NVIDIA GPU Operator, if required by your use case. For more information, see Why use the NVIDIA GPU Operator?

Create an A4X node pool with the GKE GPU device plugin

Create an A4X node pool which uses the GKE GPU device plugin:

 gcloud container node-pools create NODE_POOL_NAME \
     --zone COMPUTE_ZONE \
     --cluster CLUSTER_NAME \
     --num-nodes=NODE_COUNT \
     --machine-type MACHINE_TYPE \
     --accelerator type=nvidia-gb200,count=4,gpu-driver-version=DRIVER_VERSION \
     --additional-node-network network=GVNIC_NETWORK_PREFIX-net,subnetwork=GVNIC_NETWORK_PREFIX-sub \
     --additional-node-network network=RDMA_NETWORK_PREFIX-net,subnetwork=RDMA_NETWORK_PREFIX-sub-0 \
     --additional-node-network network=RDMA_NETWORK_PREFIX-net,subnetwork=RDMA_NETWORK_PREFIX-sub-1 \
     --additional-node-network network=RDMA_NETWORK_PREFIX-net,subnetwork=RDMA_NETWORK_PREFIX-sub-2 \
     --additional-node-network network=RDMA_NETWORK_PREFIX-net,subnetwork=RDMA_NETWORK_PREFIX-sub-3 \
     --scopes "https://www.googleapis.com/auth/cloud-platform" \
     --reservation-affinity=specific \
     --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME \
     --placement-policy=WORKLOAD_POLICY_NAME

Replace the following:

  • NODE_POOL_NAME : the name of the node pool.
  • COMPUTE_ZONE : the zone of your node pool.
  • CLUSTER_NAME : the name of your cluster.
  • NODE_COUNT : the number of nodes for the node pool, which must be 18 nodes or fewer. We recommend using 18 nodes to obtain the GPU topology of 1x72 in one sub-block using an NVLink domain.
  • MACHINE_TYPE : a4x-highgpu-4g or a4x-highgpu-4g-nolssd, depending on whether you want Local SSDs.
  • DRIVER_VERSION : the NVIDIA driver version to install. It can be one of the following values: default, latest, or disabled.
  • RESERVATION_NAME : the name of your reservation. To find this value, see View future reservation requests.
  • BLOCK_NAME : the name of a specific block within the reservation. To find this value, see View future reservation requests.
  • WORKLOAD_POLICY_NAME : the name of the workload policy that you created previously.
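
After the node pool is ready, you can confirm that the A4X nodes registered with their GPUs. A minimal check, assuming the nvidia-gb200 accelerator label that GKE applies to these nodes:

 kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-gb200

Each listed node should eventually report an allocatable nvidia.com/gpu count of 4 after driver installation finishes.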

Create an A4X node pool with the NVIDIA GPU Operator

Alternatively, to use the NVIDIA GPU Operator, do the following steps:

  1. Run the gcloud container node-pools create command from the previous section with the following changes:

    • Change gpu-driver-version=latest to gpu-driver-version=disabled. This modification skips automatic GPU driver installation because it's not supported when using the NVIDIA GPU Operator.
    • Set --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" to disable the GKE-managed GPU device plugin DaemonSet.
  2. Apply the GKE GPU driver installer DaemonSet manifest. This manifest deploys a GPU driver installer Pod on each A4X node:

      kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
  3. Manage the GPU Stack with the NVIDIA GPU Operator on Google Kubernetes Engine (GKE):

    1. In the section to create and set up the GPU node pool, follow the instructions starting from the step to get authentication credentials.
    2. Install the NVIDIA GPU Operator. Complete all the steps, but replace the command in the referenced section that installs the NVIDIA GPU Operator using Helm. Use the following command instead:

helm install --wait --generate-name \
    -n gpu-operator \
    nvidia/gpu-operator \
    --version="25.3.0" \
    -f <(cat <<EOF
hostPaths:
  driverInstallDir: /home/kubernetes/bin/nvidia
toolkit:
  installDir: /home/kubernetes/bin/nvidia
cdi:
  enabled: true
  default: true
driver:
  enabled: false
daemonsets:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
  - key: kubernetes.io/arch
    operator: Equal
    value: arm64
    effect: NoSchedule

node-feature-discovery:
  worker:
    tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
)

Install the RDMA binary and configure NCCL

Apply the following DaemonSet to install the RDMA binaries and the NCCL library on each node. On each underlying VM, the RDMA binaries are installed in the /home/kubernetes/bin/gib directory, and the NCCL library is installed in the /home/kubernetes/bin/nvidia/lib64 directory.

   
 kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-rdma/nccl-rdma-installer-a4x.yaml
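
To confirm that the installer rolled out, you can check for its Pods. This sketch assumes that the DaemonSet keeps the nccl-rdma-installer naming from the manifest and runs in the kube-system namespace:

 kubectl get pods -n kube-system | grep nccl-rdma-installer

One installer Pod per A4X node should reach the Running status.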

Install the NVIDIA Compute Domain CRD and DRA driver

Install the NVIDIA Compute Domain CRD and DRA driver. For more information, see NVIDIA DRA Driver for GPUs.

  1. Verify that you have Helm installed in your development environment. Helm comes pre-installed on Cloud Shell.

    Although there is no specific Helm version requirement, you can use the following command to verify that you have Helm installed.

     helm version

    If the output is similar to Command helm not found, then you can install the Helm CLI by running this command:

     curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
       && chmod 700 get_helm.sh \
       && ./get_helm.sh
    
  2. Add the NVIDIA Helm repository:

     helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
       && helm repo update
    
  3. Create a ResourceQuota for the DRA Driver:

export POD_QUOTA=POD_QUOTA
kubectl create ns nvidia-dra-driver-gpu

kubectl apply -n nvidia-dra-driver-gpu -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nvidia-dra-driver-gpu-quota
spec:
  hard:
    pods: ${POD_QUOTA}
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
EOF
    

    Replace POD_QUOTA with a number at least 2 times the number of A4X nodes in the cluster plus 1. For example, you must set the variable to at least 37 if you have 18 A4X nodes in your cluster.
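
    Instead of setting the value by hand, you can compute it in your shell. A small sketch, assuming 18 A4X nodes:

      # POD_QUOTA must be at least (2 * node count) + 1.
      NODE_COUNT=18
      export POD_QUOTA=$((2 * NODE_COUNT + 1))  # 37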

  4. Install the DRA driver:

helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="DRIVER_VERSION" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    -f <(cat <<EOF
nvidiaDriverRoot: /home/kubernetes/bin/nvidia
resources:
  gpus:
    enabled: false
controller:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu"
            operator: "DoesNotExist"
kubeletPlugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-gb200
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
  - key: kubernetes.io/arch
    operator: Equal
    value: arm64
    effect: NoSchedule
EOF
)

    Replace DRIVER_VERSION with version 25.3.1 or later.
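
    To confirm that the driver components are running, you can list the Pods in the namespace. A minimal check:

      kubectl get pods -n nvidia-dra-driver-gpu

    You should see the controller Pod plus one kubelet plugin Pod per A4X node.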

Configure your workload manifest for multi-networking, RDMA, and the IMEX domain

  1. Add the following annotations to the Pod metadata:

      metadata:
        annotations:
          networking.gke.io/default-interface: 'eth0'
          networking.gke.io/interfaces: |
            [
              {"interfaceName":"eth0","network":"default"},
              {"interfaceName":"eth2","network":"rdma-0"},
              {"interfaceName":"eth3","network":"rdma-1"},
              {"interfaceName":"eth4","network":"rdma-2"},
              {"interfaceName":"eth5","network":"rdma-3"}
            ]
  2. Add a node affinity rule to schedule on Arm nodes:

      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                  - arm64

    For more information, see Schedule workload to a single architecture.

  3. Add the following volumes to the Pod specification:

      spec:
        volumes:
        - name: library-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
        - name: gib
          hostPath:
            path: /home/kubernetes/bin/gib
  4. Add the following volume mounts, environment variable, and resource to the container that requests GPUs. Your workload container must request all four GPUs:

      containers:
      - name: my-container
        volumeMounts:
        - name: library-dir-host
          mountPath: /usr/local/nvidia
        - name: gib
          mountPath: /usr/local/gib
        env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
        resources:
          limits:
            nvidia.com/gpu: 4
  5. Create the ComputeDomain resource for the workload:

      apiVersion: resource.nvidia.com/v1beta1
      kind: ComputeDomain
      metadata:
        name: a4x-compute-domain
      spec:
        numNodes: NUM_NODES
        channel:
          resourceClaimTemplate:
            name: a4x-compute-domain-channel

    Replace NUM_NODES with the number of nodes the workload requires.

  6. Specify the resourceClaimTemplate that the Pod will use:

      spec:
        ...
        volumes:
        ...
        containers:
        - name: my-container
          ...
          resources:
            limits:
              nvidia.com/gpu: 4
            claims:
            - name: compute-domain-channel
        ...
        resourceClaims:
        - name: compute-domain-channel
          resourceClaimTemplateName: a4x-compute-domain-channel
  7. Set all the required environment variables to configure NCCL. Use the following shell script from the workload container:

      source /usr/local/gib/scripts/set_nccl_env.sh

A completed Pod specification looks like the following:

  apiVersion: resource.nvidia.com/v1beta1
  kind: ComputeDomain
  metadata:
    name: a4x-compute-domain
  spec:
    numNodes: NUM_NODES
    channel:
      resourceClaimTemplate:
        name: a4x-compute-domain-channel
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: my-pod
    labels:
      k8s-app: my-pod
    annotations:
      networking.gke.io/default-interface: 'eth0'
      networking.gke.io/interfaces: |
        [
          {"interfaceName":"eth0","network":"default"},
          {"interfaceName":"eth2","network":"rdma-0"},
          {"interfaceName":"eth3","network":"rdma-1"},
          {"interfaceName":"eth4","network":"rdma-2"},
          {"interfaceName":"eth5","network":"rdma-3"}
        ]
  spec:
    ...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values:
              - arm64
    volumes:
    - name: library-dir-host
      hostPath:
        path: /home/kubernetes/bin/nvidia
    - name: gib
      hostPath:
        path: /home/kubernetes/bin/gib
    containers:
    - name: my-container
      volumeMounts:
      - name: library-dir-host
        mountPath: /usr/local/nvidia
      - name: gib
        mountPath: /usr/local/gib
      env:
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64
      resources:
        limits:
          nvidia.com/gpu: 4
        claims:
        - name: compute-domain-channel
    ...
    resourceClaims:
    - name: compute-domain-channel
      resourceClaimTemplateName: a4x-compute-domain-channel
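
After you apply the manifests, you can check that the ComputeDomain resource was accepted. A minimal check, assuming the computedomains resource name that the NVIDIA DRA driver CRD registers:

 kubectl get computedomains.resource.nvidia.com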

Deploy and run an NCCL test

To validate the functionality of the provisioned cluster which uses GPUDirect RDMA, you can run the following NCCL tests:

Test on two nodes

  1. Deploy an NCCL test workload of two test Pods running on two A4X nodes by applying the following manifest:

     kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-imex-a4x.yaml
    
  2. Check that both Pods are scheduled and running:

     kubectl get pods nccl-test-host-1 nccl-test-host-2
    

    If the two Pods have the Running status, you can proceed to the next step.

  3. Trigger an all-gather test for the A4X nodes:

     kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
    

    The output is similar to the following:

     #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            32     float    none      -1    21.20    0.05    0.04      0    20.56    0.05    0.04      0
            2048            64     float    none      -1    21.03    0.10    0.09      0    20.82    0.10    0.09      0
            4096           128     float    none      -1    21.11    0.19    0.17      0    20.98    0.20    0.17      0
            8192           256     float    none      -1    21.51    0.38    0.33      0    21.15    0.39    0.34      0
           16384           512     float    none      -1    21.85    0.75    0.66      0    21.72    0.75    0.66      0
           32768          1024     float    none      -1    24.08    1.36    1.19      0    23.73    1.38    1.21      0
           65536          2048     float    none      -1    24.68    2.66    2.32      0    24.02    2.73    2.39      0
          131072          4096     float    none      -1    24.93    5.26    4.60      0    24.30    5.40    4.72      0
          262144          8192     float    none      -1    24.86   10.55    9.23      0    24.33   10.78    9.43      0
          524288         16384     float    none      -1    25.10   20.89   18.28      0    24.48   21.41   18.74      0
         1048576         32768     float    none      -1    25.43   41.24   36.09      0    24.82   42.25   36.97      0
         2097152         65536     float    none      -1    32.30   64.93   56.81      0    31.28   67.04   58.66      0
         4194304        131072     float    none      -1    45.92   91.34   79.92      0    44.22   94.84   82.99      0
         8388608        262144     float    none      -1    71.38  117.52  102.83      0    68.98  121.61  106.41      0
        16777216        524288     float    none      -1    74.17  226.20  197.93      0    72.37  231.83  202.85      0
        33554432       1048576     float    none      -1    116.6  287.84  251.86      0    112.7  297.75  260.54      0
        67108864       2097152     float    none      -1    188.9  355.27  310.86      0    184.0  364.71  319.12      0
       134217728       4194304     float    none      -1    309.6  433.56  379.36      0    299.7  447.83  391.85      0
       268435456       8388608     float    none      -1    559.0  480.23  420.20      0    540.3  496.85  434.75      0
       536870912      16777216     float    none      -1   1053.7  509.52  445.83      0   1021.4  525.64  459.93      0
      1073741824      33554432     float    none      -1   2087.4  514.39  450.10      0   2013.8  533.19  466.54      0
      2147483648      67108864     float    none      -1   4154.7  516.88  452.27      0   3987.4  538.57  471.25      0
      4294967296     134217728     float    none      -1   8289.2  518.14  453.37      0   7907.4  543.16  475.26      0
      8589934592     268435456     float    none      -1    16556  518.85  453.99      0    15726  546.24  477.96      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 175.233
    # 
    

Test with TAS

To validate the functionality of the provisioned cluster, you can run the following NCCL test with TAS.

Configure Kueue with TAS enabled

  1. Install Kueue with TAS enabled.
  2. Configure Kueue with TAS enabled by creating the following file with the name a4x-kueue-config.yaml:

      apiVersion: kueue.x-k8s.io/v1alpha1
      kind: Topology
      metadata:
        name: "a4x-default"
      spec:
        levels:
        - nodeLabel: "cloud.google.com/gce-topology-block"
        - nodeLabel: "cloud.google.com/gce-topology-subblock"
        - nodeLabel: "cloud.google.com/gke-nodepool"
        - nodeLabel: "cloud.google.com/gce-topology-host"
        - nodeLabel: "kubernetes.io/hostname"
      ---
      kind: ResourceFlavor
      apiVersion: kueue.x-k8s.io/v1beta1
      metadata:
        name: "a4x"
      spec:
        nodeLabels:
          cloud.google.com/gke-accelerator: nvidia-gb200
        topologyName: "a4x-default"
        tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: NoSchedule
        - key: "kubernetes.io/arch"
          operator: "Exists"
          effect: NoSchedule
      ---
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: ClusterQueue
      metadata:
        name: "a4x"
      spec:
        namespaceSelector: {} # match all.
        resourceGroups:
        - coveredResources: ["nvidia.com/gpu"]
          flavors:
          - name: "a4x"
            resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 1_000_000_000
      ---
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: LocalQueue
      metadata:
        namespace: "default"
        name: "a4x"
      spec:
        clusterQueue: "a4x"
  3. Apply the configuration:

     kubectl apply -f a4x-kueue-config.yaml
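
    To confirm that Kueue accepted the configuration, you can list the objects that it created. A minimal check, assuming the standard Kueue resource names:

     kubectl get topologies.kueue.x-k8s.io
     kubectl get resourceflavors,clusterqueues
     kubectl get localqueues -n default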

Schedule a topology-aware NCCL test with Kueue with TAS enabled

The following workload must be placed within a single NVLink Domain sub-block.

  1. Install JobSet, a Kubernetes-native API for managing a group of Kubernetes Jobs as a unit. Ensure that your non-GPU node pools have enough resources to schedule the JobSet controllers.
  2. Create the following file with the name nccl-tas-test.yaml. Replace NUM_NODES with the intended number of nodes to run the NCCL test, up to 18:

      apiVersion: resource.nvidia.com/v1beta1
      kind: ComputeDomain
      metadata:
        name: nccl-test-compute-domain
      spec:
        numNodes: NUM_NODES
        channel:
          resourceClaimTemplate:
            name: nccl-test-compute-domain-channel
      ---
      apiVersion: jobset.x-k8s.io/v1alpha2
      kind: JobSet
      metadata:
        name: kueue-tas-nccl-all-gather
        labels:
          kueue.x-k8s.io/queue-name: a4x
      spec:
        ttlSecondsAfterFinished: 1200
        network:
          enableDNSHostnames: true
        replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock"
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth2","network":"rdma-0"},
                        {"interfaceName":"eth3","network":"rdma-1"},
                        {"interfaceName":"eth4","network":"rdma-2"},
                        {"interfaceName":"eth5","network":"rdma-3"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-gb200
                  tolerations:
                  - key: nvidia.com/gpu
                    operator: Equal
                    value: present
                    effect: NoSchedule
                  - key: kubernetes.io/arch
                    operator: Equal
                    value: arm64
                    effect: NoSchedule
                  setHostnameAsFQDN: true
                  volumes:
                  - name: gib
                    hostPath:
                      path: /home/kubernetes/bin/gib
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  resourceClaims:
                  - name: compute-domain-channel
                    resourceClaimTemplateName: nccl-test-compute-domain-channel
                  containers:
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.0.4
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      set -x
                      echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"

                      # Install ping
                      apt update -y
                      apt install -y iputils-ping

                      # Start sshd
                      /scripts/container_entry.sh daemon &

                      # Get helper variables to form all hostnames
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX

                      # For every worker, wait till online and add to hostfile
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          echo Waiting for ${OTHER}...
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=4 | tee -a /tmp/hostfile;
                      done

                      cat /tmp/hostfile

                      # Launch from head node
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
                        # World Level = 0x0, Rail Aligned = 0x7
                        export NCCL_TESTS_SPLIT_MASK="0x0";

                        # Force use of libnccl-gib
                        export NCCL_NET=gIB

                        # Set all the correct libnccl-gib environment variables
                        source /usr/local/gib/scripts/set_nccl_env.sh

                        # Get all relevant NCCL / env vars to pass to all workers
                        ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')

                        mpirun --hostfile /tmp/hostfile \
                          -x $ENV_VARS  \
                          -mca plm_rsh_no_tree_spawn 1 \
                          --mca orte_keep_fqdn_hostnames 1 \
                          --mca btl self,tcp \
                          --mca btl_tcp_if_include eth0 \
                          --bind-to none \
                          --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                          /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                      else
                        while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                          sleep 5
                        done
                      fi
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: gib
                      mountPath: /usr/local/gib
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        nvidia.com/gpu: 4
                      requests:
                        nvidia.com/gpu: 4
                      claims:
                      - name: compute-domain-channel
  3. Run the test:

     kubectl apply -f nccl-tas-test.yaml
    
  4. Check the test result by reviewing the logs:

     kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep kueue-tas-nccl-all-gather-worker-0-0)
    

    The output should be similar to the following:

     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
             1024             8     float    none      -1    56.72    0.02    0.02      0    56.12    0.02    0.02      0
             2048            16     float    none      -1    56.85    0.04    0.03      0    56.87    0.04    0.03      0
             4096            32     float    none      -1    57.53    0.07    0.07      0    57.47    0.07    0.07      0
             8192            64     float    none      -1    58.43    0.14    0.14      0    58.27    0.14    0.14      0
            16384           128     float    none      -1    59.29    0.28    0.27      0    58.87    0.28    0.27      0
            32768           256     float    none      -1    60.02    0.55    0.53      0    59.60    0.55    0.53      0
            65536           512     float    none      -1    61.83    1.06    1.03      0    61.64    1.06    1.03      0
           131072          1024     float    none      -1    70.99    1.85    1.79      0    70.82    1.85    1.79      0
           262144          2048     float    none      -1    71.56    3.66    3.55      0    71.07    3.69    3.57      0
           524288          4096     float    none      -1    72.62    7.22    6.99      0    71.90    7.29    7.06      0
          1048576          8192     float    none      -1    72.80   14.40   13.95      0    72.31   14.50   14.05      0
          2097152         16384     float    none      -1    73.40   28.57   27.68      0    72.96   28.74   27.85      0
          4194304         32768     float    none      -1    73.86   56.78   55.01      0    73.44   57.12   55.33      0
          8388608         65536     float    none      -1    102.5   81.86   79.30      0    101.4   82.69   80.11      0
         16777216        131072     float    none      -1    158.3  105.97  102.66      0    156.8  107.02  103.68      0
         33554432        262144     float    none      -1    158.4  211.89  205.26      0    157.5  212.99  206.33      0
         67108864        524288     float    none      -1    250.7  267.68  259.32      0    248.7  269.81  261.38      0
        134217728       1048576     float    none      -1    417.7  321.29  311.25      0    414.1  324.13  314.01      0
        268435456       2097152     float    none      -1    728.8  368.32  356.81      0    721.5  372.08  360.45      0
        536870912       4194304     float    none      -1   1226.5  437.72  424.04      0   1216.1  441.46  427.66      0
       1073741824       8388608     float    none      -1   2268.4  473.35  458.56      0   2247.0  477.86  462.93      0
       2147483648      16777216     float    none      -1   4330.6  495.88  480.39      0   4291.6  500.39  484.76      0
       4294967296      33554432     float    none      -1   8640.9  497.05  481.52      0   8544.0  502.69  486.98      0
       8589934592      67108864     float    none      -1    17258  497.75  482.19      0    17052  503.75  488.00      0
     # Out of bounds values : 0 OK
     # Avg bus bandwidth    : 157.091 
    
