Create a custom AI-optimized GKE cluster

This page shows you how to create an AI-optimized Google Kubernetes Engine (GKE) cluster that uses Cluster Director for GKE to support your AI and ML workloads with A4 or A3 Ultra virtual machines (VMs).

Cluster Director for GKE lets you deploy and manage large AI-optimized clusters of accelerated VMs with features such as targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. For more information, see Cluster Director .

GKE provides a single platform surface to run a diverse set of workloads for your organization, reducing the operational burden of managing multiple platforms. You can run workloads such as high-performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services.

On this page, you learn how to create a GKE cluster with the Google Cloud CLI for maximum flexibility in configuring your cluster based on the needs of your workload. Alternatively, you can choose to use Cluster Toolkit to quickly deploy your cluster with default settings that reflect best practices for many use cases. For instructions on how to do this, see Create an AI-optimized GKE cluster with default configuration .

Cluster configuration options with GPUDirect RDMA

To create your cluster with the Google Cloud CLI, you can choose one of the following cluster configuration options:

  • If you don't plan to run distributed AI workloads: create a GKE cluster without using GPUDirect RDMA.
  • If you plan to run distributed AI workloads: create a GKE cluster with GPUDirect RDMA.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update .
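
    For example, to update an existing installation, you can run:

        gcloud components update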

Choose a consumption option and obtain capacity

  1. Choose a consumption option. Make your choice based on how you want to get and use GPU resources. For more information, see Choose a consumption option .

    For GKE, consider the following additional information when you choose a consumption option:

  2. Obtain capacity. Learn how to obtain capacity for your consumption option.

Requirements

The following requirements apply to an AI-optimized GKE cluster:

  • To use the flex-start provisioning model, you must use GKE version 1.32.2-gke.1652000 or later.
  • Ensure you use the minimum GPU driver version, depending on the machine type:

    • A4: the B200 GPUs in A4 VMs require GPU driver version 570 or later. By default, GKE automatically installs this driver version on all A4 nodes that run the minimum required GKE version for A4, which is 1.32.1-gke.1729000 or later.
    • A3 Ultra: the H200 GPUs in A3 Ultra VMs require GPU driver version 550 or later, which is available in GKE version 1.31 as the latest driver version. With GKE version 1.31, you must set gpu-driver-version=latest for A3 Ultra VMs. With GKE version 1.31.5-gke.1169000 or later, GKE installs the 550 GPU driver version on A3 Ultra nodes by default, including when you omit the gpu-driver-version flag.
  • To use GPUDirect RDMA, the following additional requirements apply:

    • Use the following minimum versions, depending on the machine type:
      • A4: use version 1.32.2-gke.1475000 or later.
      • A3 Ultra: use version 1.31.4-gke.1183000 or later.
    • The GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.
    • Your GKE workload must use all available GPUs and your Pod must use all available secondary network interface cards (NICs) on a single GKE node. Multiple Pods can't share RDMA on a single GKE node.
    • This setup runs a NCCL test. To run this NCCL test, you must have a minimum VM quota of 2 (that is, 16 GPUs if you use the a4-highgpu-8g or a3-ultragpu-8g machine types).
  • Ensure that you use a location which has availability for the machine type that you choose. For more information, see GPU availability by regions and zones .
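
For example, you can list the zones that offer a given accelerator type with the gcloud CLI. This is a minimal check; the filter value is illustrative (use nvidia-h200-141gb for A3 Ultra):

    # List the zones that offer B200 GPUs (A4 VMs).
    gcloud compute accelerator-types list --filter="name=nvidia-b200"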

Create an AI-optimized GKE cluster

Follow the instructions in this section to create a GKE cluster that meets the requirements for AI-optimized GKE clusters. You can choose between creating a cluster with or without GPUDirect RDMA.

Considerations for creating a cluster

When you create a cluster, consider the following information:

  • Choose a cluster location:
    • Ensure that you use a location which has availability for the machine type that you choose. For more information, see GPU availability by regions and zones .
    • For dense reservations, you can create a zonal cluster. In this case, replace the --region flag with the --zone= COMPUTE_ZONE flag, where COMPUTE_ZONE is the zone of your control plane.
    • When you create node pools in a regional cluster, you can use the --node-locations flag to specify the zones for your GKE nodes.
  • Choose a driver version:
    • The driver version can be one of the following values:
      • default : install the default driver version for your GKE node version. For more information about the requirements for default driver versions, see the Requirements section.
      • latest : install the latest available driver version for your GKE version. This option is available only for nodes that use Container-Optimized OS.
      • disabled : skip automatic driver installation. You must manually install a driver after you create the node pool.
    • For more information about the default and latest GPU driver versions for GKE node versions, see Manually install NVIDIA GPU drivers .
  • Choose a reservation affinity:

    • You can find information about your reservation, such as the name of your reservation or the name of a specific block in your reservation, by querying your reservation .
    • The --reservation-affinity flag can take the values of specific or any .
    • For high performance distributed AI workloads, we recommend that you use a specific reservation.
    • When you use a specific reservation, including shared reservations , specify the value of the --reservation flag in the following format:

       projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME


      Replace the following:

      • PROJECT_ID : your Google Cloud project ID.
      • RESERVATION_NAME : the name of your reservation.
      • BLOCK_NAME : the name of a specific block within the reservation.
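
For example, you can look up a reservation with the gcloud CLI and then assemble the flag value from its project, reservation, and block names. The names my-project, my-reservation, my-block, and the zone in the following sketch are hypothetical:

    # Describe a reservation to find its details (hypothetical names).
    gcloud compute reservations describe my-reservation \
        --project=my-project \
        --zone=us-central1-a

    # Resulting value for a specific shared reservation block:
    # --reservation=projects/my-project/reservations/my-reservation/reservationBlocks/my-block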

Create a cluster without GPUDirect RDMA

To create a cluster without GPUDirect RDMA, use the following instructions to create a cluster with a CPU-based default node pool and additional node pools with GPUs. This approach allows the default node pool to run other services.

  1. Create the cluster:

       
    gcloud container clusters create CLUSTER_NAME \
        --cluster-version=CLUSTER_VERSION \
        --region=COMPUTE_REGION

    Replace the following:

    • CLUSTER_NAME : the name of your new cluster.
    • CLUSTER_VERSION : the version of your new cluster. For more information about which version of GKE supports your configuration, see the Requirements section.
    • COMPUTE_REGION : the region of your new cluster. If you plan to create a zonal cluster , use the --zone flag instead of the --region flag, for example: --zone= COMPUTE_ZONE . Replace COMPUTE_ZONE with the zone of the control plane.
  2. Create the GPU-based node pool with one of the following commands. The command that you need to run depends on the consumption option that you use for your deployment. Select the tab that corresponds to your consumption option's provisioning model.

    Reservation-bound

    For reservation-bound provisioning, run the following command:

       
    gcloud container node-pools create NODE_POOL_NAME \
        --region COMPUTE_REGION \
        --cluster CLUSTER_NAME \
        --node-locations COMPUTE_ZONE \
        --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
        --machine-type MACHINE_TYPE \
        --num-nodes=NUM_NODES \
        --reservation-affinity=specific \
        --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME

    Replace the following:

    • NODE_POOL_NAME : the name of the node pool.
    • COMPUTE_REGION : the region of your new cluster.
    • CLUSTER_NAME : the name of your new cluster.
    • COMPUTE_ZONE : the zone of your node pool.
    • GPU_TYPE : the type of GPU accelerator :

      • A4 VMs: enter nvidia-b200 .
      • A3 Ultra VMs: enter nvidia-h200-141gb .
    • AMOUNT : the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8 .

    • DRIVER_VERSION : the NVIDIA driver version to install. It can be one of the following values: default , latest , or disabled .

    • MACHINE_TYPE : the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.

    • NUM_NODES : the number of nodes for the node pool.

    • RESERVATION_NAME : the name of your reservation. To find this value, you can query your reservation .

    • BLOCK_NAME : the name of a specific block within the reservation. To find this value, you can query your reservation .

    Flex-start

    Preview

    This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms . Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions .

    For flex-start provisioning, run the following command:

       
    gcloud container node-pools create NODE_POOL_NAME \
        --region COMPUTE_REGION \
        --cluster CLUSTER_NAME \
        --node-locations COMPUTE_ZONE \
        --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
        --machine-type MACHINE_TYPE \
        --flex-start \
        --enable-autoscaling \
        --num-nodes=0 \
        --total-max-nodes TOTAL_MAX_NODES \
        --no-enable-autorepair \
        --location-policy=ANY \
        --reservation-affinity=none \
        [--enable-queued-provisioning]

    Replace the following:

    • NODE_POOL_NAME : the name of the node pool.
    • COMPUTE_REGION : the region of your new cluster.
    • CLUSTER_NAME : the name of your new cluster.
    • COMPUTE_ZONE : the zone of your node pool.
    • GPU_TYPE : the type of GPU accelerator :

      • A4 VMs: enter nvidia-b200 .
      • A3 Ultra VMs: enter nvidia-h200-141gb .
    • AMOUNT : the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8 .

    • DRIVER_VERSION : the NVIDIA driver version to install. It can be one of the following values: default , latest , or disabled .

    • MACHINE_TYPE : the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.

    • TOTAL_MAX_NODES : the maximum number of nodes to automatically scale for the entire node pool.

      If you want to use flex-start with queued provisioning, include the --enable-queued-provisioning flag.

      For more information about using flex-start, see Run large-scale workload with flex-start with queued provisioning .

    Spot

    For spot provisioning, run the following command:

       
    gcloud container node-pools create NODE_POOL_NAME \
        --region COMPUTE_REGION \
        --cluster CLUSTER_NAME \
        --node-locations COMPUTE_ZONE \
        --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
        --machine-type MACHINE_TYPE \
        --num-nodes=NUM_NODES \
        --spot

    Replace the following:

    • NODE_POOL_NAME : the name of the node pool.
    • COMPUTE_REGION : the region of your new cluster.
    • CLUSTER_NAME : the name of your new cluster.
    • COMPUTE_ZONE : the zone of your node pool.
    • GPU_TYPE : the type of GPU accelerator :

      • A4 VMs: enter nvidia-b200 .
      • A3 Ultra VMs: enter nvidia-h200-141gb .
    • AMOUNT : the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8 .

    • DRIVER_VERSION : the NVIDIA driver version to install. It can be one of the following values: default , latest , or disabled .

    • MACHINE_TYPE : the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.

    • NUM_NODES : the number of nodes for the node pool.

      For more information about creating clusters with Spot VMs, see Run fault-tolerant workloads at lower costs with Spot VMs .

  3. Connect to your cluster, so that you can run the kubectl commands in the next sections:

       
    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=COMPUTE_REGION

    Replace the following:

    • CLUSTER_NAME : the name of your cluster.
    • COMPUTE_REGION : the name of the compute region.
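
    Optionally, to confirm that your credentials work, you can list the cluster's nodes:

        kubectl get nodes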

Create a cluster with GPUDirect RDMA

For distributed AI workloads, multiple GPU nodes are often linked together to work as a single computer. The A4 VMs and A3 Ultra VMs come with the Titanium ML network adapter, which is built on NVIDIA ConnectX-7 (CX7) NICs. Both A4 VMs and A3 Ultra VMs deliver non-blocking 3.2 Tbps of inter-node GPU-to-GPU traffic by using RDMA over Converged Ethernet (RoCE). RoCE enables scaling and collaboration across multiple GPUs by delivering a high-performance cloud experience for AI workloads.

For more information about how to create your GKE clusters by using the Google Cloud CLI and GPUDirect TCPX (A3 High VMs) or TCPXO (A3 Mega VMs), see maximize GPU network bandwidth in Autopilot mode clusters , or maximize GPU network bandwidth in Standard mode clusters .

To create your GKE clusters in Autopilot or Standard mode with GPUDirect RDMA, complete the following steps, which are described in the next sections:

  1. Create VPCs and subnets
  2. Create the GKE cluster with multi-networking
  3. Create GKE network objects
  4. Install the RDMA binary and configure NCCL
  5. Deploy and run a NCCL test
  6. Configure your Pod manifests for GPUDirect-RDMA

Create VPCs and subnets

Both A4 VMs and A3 Ultra VMs have the following configuration:

  • Eight NVIDIA B200 (A4) or H200 (A3 Ultra) GPUs per virtual machine connected with NVLink
  • Two Intel Emerald Rapids CPUs
  • Eight 400 Gbps CX-7 NICs for GPU-to-GPU networking
  • Two 200 Gbps Google Titanium NICs for external services

AI and ML workloads, such as distributed training, require powerful acceleration to optimize performance by reducing job completion times. For workloads that require high performance, high throughput, and low latency, GPUDirect RDMA reduces the network hops that are required to transfer payloads to and from GPUs, which more efficiently uses the network bandwidth that's available. GPUDirect RDMA is designed to significantly improve throughput at scale compared to GPUs that don't use GPUDirect.

One of the Google Titanium NICs that's associated with the CPU uses the default network in GKE. You don't need to create a new VPC for this NIC if you have enough IP address ranges for the default network.

You can create one VPC for the second CPU Titanium NIC (gVNIC) and another VPC for the eight CX-7 RDMA NICs by using these commands.

  1. Set environment variables to match your deployment:

     export REGION="COMPUTE_REGION"
     export ZONE="COMPUTE_ZONE"
     export PROJECT="PROJECT_ID"
     export GVNIC_NETWORK_PREFIX="GVNIC_NETWORK_PREFIX"
     export RDMA_NETWORK_PREFIX="RDMA_NETWORK_PREFIX"

    Replace the following variables:

    • COMPUTE_REGION : the region of your cluster.
    • COMPUTE_ZONE : the zone of your node pool.
    • PROJECT_ID : your Google Cloud project ID.
    • GVNIC_NETWORK_PREFIX : either a4high-gvnic for A4 VMs, or a3ultra-gvnic for A3 Ultra VMs.
    • RDMA_NETWORK_PREFIX : either a4high-rdma for A4 VMs, or a3ultra-rdma for A3 Ultra VMs.
  2. Create two VPC networks:

     # Create a VPC for the additional Google Titanium CPU NIC
     gcloud compute --project=${PROJECT} \
       networks create \
       ${GVNIC_NETWORK_PREFIX}-net \
       --subnet-mode=custom

     gcloud compute --project=${PROJECT} \
       networks subnets create \
       ${GVNIC_NETWORK_PREFIX}-sub \
       --network=${GVNIC_NETWORK_PREFIX}-net \
       --region=${REGION} \
       --range=192.168.0.0/24

     gcloud compute --project=${PROJECT} \
       firewall-rules create \
       ${GVNIC_NETWORK_PREFIX}-internal \
       --network=${GVNIC_NETWORK_PREFIX}-net \
       --action=ALLOW \
       --rules=tcp:0-65535,udp:0-65535,icmp \
       --source-ranges=192.168.0.0/16

     # Create HPC VPC for the RDMA NICs with 8 subnets.
     gcloud beta compute --project=${PROJECT} \
       networks create ${RDMA_NETWORK_PREFIX}-net \
       --network-profile=${ZONE}-vpc-roce \
       --subnet-mode=custom

     # Create subnets for the HPC VPC.
     for N in $(seq 0 7); do
       gcloud compute --project=${PROJECT} \
         networks subnets create \
         ${RDMA_NETWORK_PREFIX}-sub-$N \
         --network=${RDMA_NETWORK_PREFIX}-net \
         --region=${REGION} \
         --range=192.168.$((N + 1)).0/24 &  # offset to avoid overlap with gvnics
     done
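
     Optionally, you can verify that the subnets were created. This is a minimal check; the filter expression is illustrative:

       gcloud compute networks subnets list \
         --regions=${REGION} \
         --filter="name ~ ${RDMA_NETWORK_PREFIX}"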
     
    

Create the GKE cluster with multi-networking

Autopilot

  1. Create a GKE Autopilot cluster with multi-networking:

    gcloud container clusters create-auto CLUSTER_NAME \
        --enable-multi-networking \
        --cluster-version=CLUSTER_VERSION \
        --region=COMPUTE_REGION

    Replace the following:

    • CLUSTER_NAME : the name of your cluster.
    • CLUSTER_VERSION : the version of your new cluster. To find out which version of GKE supports your configuration, see the Requirements section.
    • COMPUTE_REGION : the name of the compute region.
  2. Connect to your cluster, so that you can run the kubectl commands in the next sections:

    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=COMPUTE_REGION

    Replace the following:

    • CLUSTER_NAME : the name of your cluster.
    • COMPUTE_REGION : the name of the compute region.

    For more information, see Install kubectl and configure cluster access .

Standard

Create a GKE Standard cluster and GPU node pool with multi-networking:

  1. Create the cluster:

    gcloud container clusters create CLUSTER_NAME \
        --region=COMPUTE_REGION \
        --cluster-version=CLUSTER_VERSION \
        --enable-dataplane-v2 \
        --enable-ip-alias \
        --enable-multi-networking \
        [--services-ipv4-cidr=SERVICE_CIDR \
        --cluster-ipv4-cidr=POD_CIDR]

    Replace the following:

    • CLUSTER_NAME : the name of your cluster.
    • CLUSTER_VERSION : the version of your new cluster. To find out which version of GKE supports your configuration, see the Requirements section.
    • COMPUTE_REGION : the name of the compute region.

    Optionally, you can explicitly provide the secondary CIDR ranges for services and Pods. If you use these optional flags, replace the following variables:

    • SERVICE_CIDR : the secondary CIDR range for services.
    • POD_CIDR : the secondary CIDR range for Pods.

    When you use these flags, you must verify that the CIDR ranges don't overlap with the subnet ranges for additional node networks. For example, SERVICE_CIDR=10.65.0.0/19 and POD_CIDR=10.64.0.0/19 don't overlap with each other.

  2. Create the node pool. The command that you need to run depends on the consumption option that you use for your deployment. Select the tab that corresponds to your consumption option's provisioning model.

    Reservation-bound

    For reservation-bound provisioning, run the following command:

    gcloud container node-pools create NODE_POOL_NAME \
        --region COMPUTE_REGION \
        --cluster CLUSTER_NAME \
        --node-locations COMPUTE_ZONE \
        --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
        --machine-type MACHINE_TYPE \
        --num-nodes=NUM_NODES \
        --reservation-affinity=specific \
        --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME \
        --additional-node-network network=${GVNIC_NETWORK_PREFIX}-net,subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-0 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-1 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-2 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-3 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-4 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-5 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-6 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-7

    Replace the following:

    • NODE_POOL_NAME : the name of the node pool.
    • COMPUTE_REGION : the region of your new cluster.
    • CLUSTER_NAME : the name of your new cluster.
    • COMPUTE_ZONE : the zone of your node pool.
    • GPU_TYPE : the type of GPU accelerator :

      • A4 VMs: enter nvidia-b200 .
      • A3 Ultra VMs: enter nvidia-h200-141gb .
    • AMOUNT : the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8 .

    • DRIVER_VERSION : the NVIDIA driver version to install. It can be one of the following values: default , latest , or disabled .

    • MACHINE_TYPE : the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.

    • NUM_NODES : the number of nodes for the node pool.

    • RESERVATION_NAME : the name of your reservation. To find this value, you can query your reservation .

    • BLOCK_NAME : the name of a specific block within the reservation. To find this value, you can query your reservation .

    Flex-start

    Preview

    This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms . Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions .

    For flex-start provisioning, run the following command:

    gcloud container node-pools create NODE_POOL_NAME \
        --region COMPUTE_REGION \
        --cluster CLUSTER_NAME \
        --node-locations COMPUTE_ZONE \
        --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
        --machine-type MACHINE_TYPE \
        --flex-start \
        --num-nodes=0 \
        --enable-autoscaling \
        --total-max-nodes TOTAL_MAX_NODES \
        --no-enable-autorepair \
        --location-policy=ANY \
        --reservation-affinity=none \
        [--enable-queued-provisioning] \
        --additional-node-network network=${GVNIC_NETWORK_PREFIX}-net,subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-0 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-1 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-2 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-3 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-4 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-5 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-6 \
        --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-7

    Replace the following:

    • NODE_POOL_NAME : the name of the node pool.
    • COMPUTE_REGION : the region of your new cluster.
    • CLUSTER_NAME : the name of your new cluster.
    • COMPUTE_ZONE : the zone of your node pool.
    • GPU_TYPE : the type of GPU accelerator :

      • A4 VMs: enter nvidia-b200 .
      • A3 Ultra VMs: enter nvidia-h200-141gb .
    • AMOUNT : the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the amount of GPUs is 8 .

    • DRIVER_VERSION : the NVIDIA driver version to install. It can be one of the following values: default , latest , or disabled .

    • MACHINE_TYPE : the Compute Engine machine type for the nodes. For example, use a4-highgpu-8g for A4 VMs, and a3-ultragpu-8g for A3 Ultra VMs.

    • TOTAL_MAX_NODES : the maximum number of nodes to automatically scale for the entire node pool.

    If you want to use flex-start with queued provisioning, include the --enable-queued-provisioning flag.

    For more information about using flex-start, see Run large-scale workload with flex-start with queued provisioning .

  3. Connect to your cluster, so that you can run the kubectl commands in the next sections:

    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=COMPUTE_REGION

    Replace the following:

    • CLUSTER_NAME : the name of your cluster.
    • COMPUTE_REGION : the name of the compute region.

    For more information, see Install kubectl and configure cluster access .

Create the GKE network objects

The VPC networks that you created in the previous section need to be configured through GKE network parameter sets. Specifically, the second CPU Titanium NIC (gVNIC) needs to be configured in NetDevice mode, and each of the eight CX-7 RDMA NICs needs to be configured in RDMA mode.

This command uses the following names:

  • CPU Titanium NIC (gVNIC) VPC is named ${GVNIC_NETWORK_PREFIX}-net with subnet named ${GVNIC_NETWORK_PREFIX}-sub
  • CX-7 RDMA NICs VPC is named ${RDMA_NETWORK_PREFIX}-net with subnets named ${RDMA_NETWORK_PREFIX}-sub-[0…7]

Create the GKE network objects by running the following command:

kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: gvnic-1
spec:
  vpc: ${GVNIC_NETWORK_PREFIX}-net
  vpcSubnet: ${GVNIC_NETWORK_PREFIX}-sub
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: gvnic-1
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: gvnic-1
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-0
spec:
  vpc: ${RDMA_NETWORK_PREFIX}-net
  vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-0
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-0
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-0
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-1
spec:
  vpc: ${RDMA_NETWORK_PREFIX}-net
  vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-1
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-1
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-1
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-2
spec:
  vpc: ${RDMA_NETWORK_PREFIX}-net
  vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-2
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-2
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-2
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-3
spec:
  vpc: ${RDMA_NETWORK_PREFIX}-net
  vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-3
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-3
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-3
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-4
spec:
  vpc: ${RDMA_NETWORK_PREFIX}-net
  vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-4
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-4
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-4
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-5
spec:
  vpc: ${RDMA_NETWORK_PREFIX}-net
  vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-5
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-5
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-5
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-6
spec:
  vpc: ${RDMA_NETWORK_PREFIX}-net
  vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-6
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-6
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-6
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: rdma-7
spec:
  vpc: ${RDMA_NETWORK_PREFIX}-net
  vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-7
  deviceMode: RDMA
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: rdma-7
spec:
  type: "Device"
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: rdma-7
EOF
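
Optionally, you can confirm that the objects were created by listing them; for example:

 kubectl get gkenetworkparamsets.networking.gke.io
 kubectl get networks.networking.gke.io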

Install the RDMA binary and configure NCCL

Apply the following DaemonSet to install the RDMA binaries and the NCCL library on each node. On each underlying VM, the RDMA binaries are installed in the /home/kubernetes/bin/gib directory, and the NCCL library is installed in the /home/kubernetes/bin/nvidia/lib64 directory.

Autopilot

For GKE Autopilot mode, run the following command:

 kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer-autopilot.yaml

Standard

For GKE Standard mode, run the following command:

 kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer.yaml
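
Before you schedule workloads, you can optionally check that the installer Pods created by the DaemonSet reach the Running state in the kube-system namespace. The grep pattern below is illustrative; the exact Pod names come from the manifest that you applied:

 kubectl get pods --namespace=kube-system | grep -i rdma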

Run NCCL tests

To validate the functionality of the provisioned cluster, you can run a NCCL test . For instructions, see Deploy and run a NCCL test .

Configure your Pod manifests for GPUDirect RDMA

To run your workloads by using GPUDirect RDMA, configure your Pod manifests with the following steps:

  1. Add the following annotations to the Pod metadata.

    Autopilot

    Use the following annotation for GKE Autopilot mode:

     metadata:
       annotations:
         networking.gke.io/default-interface: 'eth0'
         networking.gke.io/interfaces: |
           [
             {"interfaceName":"eth0","network":"default"},
             {"interfaceName":"eth1","network":"gvnic-1"},
             {"interfaceName":"eth2","network":"rdma-0"},
             {"interfaceName":"eth3","network":"rdma-1"},
             {"interfaceName":"eth4","network":"rdma-2"},
             {"interfaceName":"eth5","network":"rdma-3"},
             {"interfaceName":"eth6","network":"rdma-4"},
             {"interfaceName":"eth7","network":"rdma-5"},
             {"interfaceName":"eth8","network":"rdma-6"},
             {"interfaceName":"eth9","network":"rdma-7"}
           ]

    Standard

    The following annotation for GKE Standard mode doesn't include a gvnic-1 specification, but you can add it if your workloads require it.

    Use the following annotation for GKE Standard mode:

     metadata:
       annotations:
         networking.gke.io/default-interface: 'eth0'
         networking.gke.io/interfaces: |
           [
             {"interfaceName":"eth0","network":"default"},
             {"interfaceName":"eth2","network":"rdma-0"},
             {"interfaceName":"eth3","network":"rdma-1"},
             {"interfaceName":"eth4","network":"rdma-2"},
             {"interfaceName":"eth5","network":"rdma-3"},
             {"interfaceName":"eth6","network":"rdma-4"},
             {"interfaceName":"eth7","network":"rdma-5"},
             {"interfaceName":"eth8","network":"rdma-6"},
             {"interfaceName":"eth9","network":"rdma-7"}
           ]
  2. Specify the chosen GPU type and specific reservation by using node selectors:

     spec:
       nodeSelector:
         cloud.google.com/gke-accelerator: ACCELERATOR
         cloud.google.com/reservation-name: RESERVATION_NAME
         cloud.google.com/reservation-affinity: "specific"

    Replace the following:

    • ACCELERATOR : the accelerator that you reserved in the Compute Engine capacity reservation. You must use one of the following values:
      • nvidia-b200 : NVIDIA B200 (180GB) for A4 VMs
      • nvidia-h200-141gb : NVIDIA H200 (141GB) for A3 Ultra VMs
    • RESERVATION_NAME : the name of the Compute Engine capacity reservation.

    To consume shared reservations, or specific blocks and sub-blocks of reservations, see the respective sections in Consuming reserved zonal resources.

  3. Add the following volumes to the Pod spec:

     spec:
       volumes:
       - name: library-dir-host
         hostPath:
           path: /home/kubernetes/bin/nvidia
       - name: gib
         hostPath:
           path: /home/kubernetes/bin/gib
  4. Add the following volume mounts, environment variables, and resources to the container that requests GPUs. Your workload container must request all eight GPUs:

    Autopilot

    For GKE Autopilot mode, configure the following resources:

     containers:
     - name: my-container
       volumeMounts:
       - name: library-dir-host
         mountPath: /usr/local/nvidia
         readOnly: true
       - name: gib
         mountPath: /usr/local/gib
         readOnly: true
       env:
       - name: LD_LIBRARY_PATH
         value: /usr/local/nvidia/lib64
       resources:
         limits:
           nvidia.com/gpu: 8

    Standard

    For GKE Standard mode, configure the following resources:

     containers:
     - name: my-container
       volumeMounts:
       - name: library-dir-host
         mountPath: /usr/local/nvidia
       - name: gib
         mountPath: /usr/local/gib
       env:
       - name: LD_LIBRARY_PATH
         value: /usr/local/nvidia/lib64
       resources:
         limits:
           nvidia.com/gpu: 8
  5. Set all the required environment variables to configure NCCL by using the following shell script from the workload container:

     source /usr/local/gib/scripts/set_nccl_env.sh

The following tabs include examples of completed Pod manifests.

Autopilot

For GKE Autopilot mode, a completed Pod manifest should look similar to the following:

 apiVersion: v1
 kind: Pod
 metadata:
   name: my-pod
   labels:
     k8s-app: my-pod
   annotations:
     networking.gke.io/default-interface: 'eth0'
     networking.gke.io/interfaces: |
       [
         {"interfaceName":"eth0","network":"default"},
         {"interfaceName":"eth1","network":"gvnic-1"},
         {"interfaceName":"eth2","network":"rdma-0"},
         {"interfaceName":"eth3","network":"rdma-1"},
         {"interfaceName":"eth4","network":"rdma-2"},
         {"interfaceName":"eth5","network":"rdma-3"},
         {"interfaceName":"eth6","network":"rdma-4"},
         {"interfaceName":"eth7","network":"rdma-5"},
         {"interfaceName":"eth8","network":"rdma-6"},
         {"interfaceName":"eth9","network":"rdma-7"}
       ]
 spec:
   ...
   volumes:
   - name: library-dir-host
     hostPath:
       path: /home/kubernetes/bin/nvidia
   - name: gib
     hostPath:
       path: /home/kubernetes/bin/gib
   containers:
   - name: my-container
     volumeMounts:
     - name: library-dir-host
       mountPath: /usr/local/nvidia
       readOnly: true
     - name: gib
       mountPath: /usr/local/gib
       readOnly: true
     env:
     - name: LD_LIBRARY_PATH
       value: /usr/local/nvidia/lib64
     resources:
       limits:
         nvidia.com/gpu: 8
   ...

Standard

For GKE Standard mode, a completed Pod manifest should look similar to the following:

 apiVersion: v1
 kind: Pod
 metadata:
   name: my-pod
   labels:
     k8s-app: my-pod
   annotations:
     networking.gke.io/default-interface: 'eth0'
     networking.gke.io/interfaces: |
       [
         {"interfaceName":"eth0","network":"default"},
         {"interfaceName":"eth2","network":"rdma-0"},
         {"interfaceName":"eth3","network":"rdma-1"},
         {"interfaceName":"eth4","network":"rdma-2"},
         {"interfaceName":"eth5","network":"rdma-3"},
         {"interfaceName":"eth6","network":"rdma-4"},
         {"interfaceName":"eth7","network":"rdma-5"},
         {"interfaceName":"eth8","network":"rdma-6"},
         {"interfaceName":"eth9","network":"rdma-7"}
       ]
 spec:
   ...
   volumes:
   - name: library-dir-host
     hostPath:
       path: /home/kubernetes/bin/nvidia
   - name: gib
     hostPath:
       path: /home/kubernetes/bin/gib
   containers:
   - name: my-container
     volumeMounts:
     - name: library-dir-host
       mountPath: /usr/local/nvidia
     - name: gib
       mountPath: /usr/local/gib
     env:
     - name: LD_LIBRARY_PATH
       value: /usr/local/nvidia/lib64
     resources:
       limits:
         nvidia.com/gpu: 8
   ...

Deploy and run a NCCL test for clusters with GPUDirect RDMA

To validate the functionality of the provisioned cluster which uses GPUDirect RDMA, you can run a NCCL test . You can run a basic test on two nodes , which you must use for nodes that are provisioned with flex-start ( Preview ). Or, if you have a larger number of nodes that are not provisioned with flex-start, you can use a NCCL test with Topology Aware Scheduling .

Test on two nodes

Run the two node test:

A4

  1. To deploy a NCCL test workload of two test Pods that are running on two A4 nodes, apply one of the following manifests:

    • For an Autopilot cluster:

        kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-a4-autopilot.yaml
    • For a Standard cluster:

        kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-a4.yaml
  2. Check if the Pods are scheduled to and running on some nodes:

     kubectl get pods nccl-test-host-1 nccl-test-host-2

    If the two Pods have the Running status, you can proceed to the next step. For nodes that are provisioned by flex-start, it might take a few minutes before the nodes are created and the Pods are scheduled on those nodes.

  3. Trigger a NCCL all-gather test for the nodes:

     kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2

    The output should be similar to the following:

     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            16     float    none      -1    48.17    0.02    0.02      0    47.21    0.02    0.02      0
            2048            32     float    none      -1    47.23    0.04    0.04      0    47.17    0.04    0.04      0
            4096            64     float    none      -1    47.43    0.09    0.08      0    47.48    0.09    0.08      0
            8192           128     float    none      -1    47.93    0.17    0.16      0    47.98    0.17    0.16      0
           16384           256     float    none      -1    48.90    0.34    0.31      0    48.75    0.34    0.32      0
           32768           512     float    none      -1    50.10    0.65    0.61      0    49.59    0.66    0.62      0
           65536          1024     float    none      -1    51.70    1.27    1.19      0    51.66    1.27    1.19      0
          131072          2048     float    none      -1    52.23    2.51    2.35      0    55.60    2.36    2.21      0
          262144          4096     float    none      -1    53.89    4.86    4.56      0    53.39    4.91    4.60      0
          524288          8192     float    none      -1    56.80    9.23    8.65      0    57.66    9.09    8.52      0
         1048576         16384     float    none      -1    87.85   11.94   11.19      0    88.47   11.85   11.11      0
         2097152         32768     float    none      -1    92.52   22.67   21.25      0    93.22   22.50   21.09      0
         4194304         65536     float    none      -1    97.41   43.06   40.37      0    96.15   43.62   40.90      0
         8388608        131072     float    none      -1    110.0   76.27   71.51      0    110.9   75.66   70.93      0
        16777216        262144     float    none      -1    141.3  118.77  111.35      0    140.7  119.27  111.81      0
        33554432        524288     float    none      -1    203.2  165.14  154.82      0    202.3  165.90  155.53      0
        67108864       1048576     float    none      -1    303.3  221.25  207.42      0    301.9  222.27  208.38      0
       134217728       2097152     float    none      -1    513.2  261.56  245.21      0    509.3  263.56  247.08      0
       268435456       4194304     float    none      -1    842.4  318.64  298.72      0    832.3  322.54  302.38      0
       536870912       8388608     float    none      -1   1511.8  355.12  332.92      0   1502.5  357.31  334.98      0
      1073741824      16777216     float    none      -1   2976.7  360.72  338.17      0   2923.2  367.32  344.36      0
      2147483648      33554432     float    none      -1   5888.9  364.66  341.87      0   5766.2  372.43  349.15      0
      4294967296      67108864     float    none      -1    11722  366.39  343.49      0    11457  374.88  351.45      0
      8589934592     134217728     float    none      -1    23379  367.43  344.46      0    22818  376.45  352.92      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 120.845 
    

A3 Ultra

  1. To deploy a NCCL test workload of two test Pods that are running on two A3 Ultra nodes, apply one of the following manifests:

    • For an Autopilot cluster:

        kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-autopilot.yaml
    • For a Standard cluster:

        kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test.yaml
  2. Check if the Pods are scheduled to and running on some nodes:

     kubectl get pods nccl-test-host-1 nccl-test-host-2

    If the two Pods have the Running status, you can proceed to the next step. For nodes that are provisioned by flex-start, it might take a few minutes before the nodes are created and the Pods are scheduled on those nodes.

  3. Trigger a NCCL all-gather test for the nodes:

     kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2

    The output should be similar to the following:

     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            16     float    none      -1    56.00    0.02    0.02      0    55.59    0.02    0.02      0
            2048            32     float    none      -1    55.79    0.04    0.03      0    55.57    0.04    0.03      0
            4096            64     float    none      -1    56.29    0.07    0.07      0    57.35    0.07    0.07      0
            8192           128     float    none      -1    56.44    0.15    0.14      0    56.32    0.15    0.14      0
           16384           256     float    none      -1    57.57    0.28    0.27      0    57.60    0.28    0.27      0
           32768           512     float    none      -1    57.92    0.57    0.53      0    59.35    0.55    0.52      0
           65536          1024     float    none      -1    59.92    1.09    1.03      0    60.15    1.09    1.02      0
          131072          2048     float    none      -1    59.21    2.21    2.08      0    61.82    2.12    1.99      0
          262144          4096     float    none      -1    63.58    4.12    3.87      0    63.34    4.14    3.88      0
          524288          8192     float    none      -1    64.89    8.08    7.57      0    65.09    8.06    7.55      0
         1048576         16384     float    none      -1    80.90   12.96   12.15      0    77.49   13.53   12.69      0
         2097152         32768     float    none      -1    80.22   26.14   24.51      0    79.88   26.25   24.61      0
         4194304         65536     float    none      -1    82.86   50.62   47.45      0    82.47   50.86   47.68      0
         8388608        131072     float    none      -1    95.83   87.53   82.06      0    93.27   89.94   84.32      0
        16777216        262144     float    none      -1    122.8  136.58  128.04      0    121.7  137.86  129.24      0
        33554432        524288     float    none      -1    180.6  185.75  174.14      0    179.2  187.19  175.49      0
        67108864       1048576     float    none      -1    279.7  239.90  224.90      0    277.0  242.26  227.12      0
       134217728       2097152     float    none      -1    507.5  264.46  247.93      0    485.1  276.66  259.37      0
       268435456       4194304     float    none      -1    866.3  309.88  290.51      0    864.0  310.70  291.28      0
       536870912       8388608     float    none      -1   1576.1  340.62  319.33      0   1558.2  344.54  323.01      0
      1073741824      16777216     float    none      -1   3096.6  346.75  325.08      0   3047.5  352.33  330.31      0
      2147483648      33554432     float    none      -1   6148.0  349.30  327.47      0   6034.3  355.88  333.64      0
      4294967296      67108864     float    none      -1    12226  351.29  329.33      0    12000  357.92  335.55      0
      8589934592     134217728     float    none      -1    24391  352.17  330.16      0    23920  359.11  336.67      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 120.94 
    

Test with Topology Aware Scheduling (TAS)

If you have more than two nodes, we recommend using the following test, which uses TAS. Follow the steps in the next sections to prepare and run the test on your cluster.

Set up your cluster with JobSet and the TAS plugin

  1. Install JobSet .

  2. Install the TAS plugin:

    1. Clone the container-engine-accelerators git repository:

       cd ~
       git clone https://github.com/GoogleCloudPlatform/container-engine-accelerators.git
    2. Apply the TAS plugin:

       cd container-engine-accelerators/gke-topology-scheduler
       kubectl create configmap topology-scheduler-scripts \
           --namespace kube-system \
           --from-file=schedule-daemon.py=schedule-daemon.py \
           --from-file=label-nodes-daemon.py=label-nodes-daemon.py
       kubectl apply -f service-account.yaml
       kubectl apply -f schedule-daemon.yaml
       kubectl apply -f label-nodes-daemon.yaml
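
      Optionally, you can confirm that the scheduler components defined in those manifests are running. The following is a generic check; the exact resource names come from the YAML files in the repository:

        kubectl get pods --namespace=kube-system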
      

Deploy a NCCL test workload with TAS

A4

  1. Create the following nccl-jobset-test.yaml manifest:

      apiVersion: jobset.x-k8s.io/v1alpha2
      kind: JobSet
      metadata:
        name: nccl-allgather
      spec:
        ttlSecondsAfterFinished: 1200
        suspend: False
        network:
          enableDNSHostnames: true
        replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth2","network":"rdma-0"},
                        {"interfaceName":"eth3","network":"rdma-1"},
                        {"interfaceName":"eth4","network":"rdma-2"},
                        {"interfaceName":"eth5","network":"rdma-3"},
                        {"interfaceName":"eth6","network":"rdma-4"},
                        {"interfaceName":"eth7","network":"rdma-5"},
                        {"interfaceName":"eth8","network":"rdma-6"},
                        {"interfaceName":"eth9","network":"rdma-7"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-b200
                  tolerations:
                  - key: cloud.google.com/gke-queued
                    effect: NoSchedule
                    value: "true"
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: gib
                    hostPath:
                      path: /home/kubernetes/bin/gib
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  schedulingGates:
                  - name: "gke.io/topology-aware-auto-nccl-test"
                  containers:
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic:v1.0.6
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      set -x
                      echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"
                      # Install ping
                      apt update -y
                      apt install -y iputils-ping
                      # Start sshd
                      /scripts/container_entry.sh daemon &
                      # Get helper variables to form all hostnames
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX
                      # For every worker, wait till online and add to hostfile
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          echo Waiting for ${OTHER}...
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                      done
                      cat /tmp/hostfile
                      # Launch from head node
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
                        # World Level = 0x0, Rail Aligned = 0x7
                        export NCCL_TESTS_SPLIT_MASK="0x0";
                        # Force use of libnccl-gib
                        export NCCL_NET=gIB
                        # Set all the correct libnccl-gib environment variables
                        source /usr/local/gib/scripts/set_nccl_env.sh
                        # Get all relevant NCCL / env vars to pass to all workers
                        ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
                        mpirun --hostfile /tmp/hostfile \
                          -x $ENV_VARS \
                          -mca plm_rsh_no_tree_spawn 1 \
                          --mca mtl ^ofi \
                          --mca orte_keep_fqdn_hostnames 1 \
                          --mca btl self,tcp \
                          --mca btl_tcp_if_include eth0 \
                          --bind-to none \
                          --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                          /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                      else
                        while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                          sleep 5
                        done
                      fi
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: gib
                      mountPath: /usr/local/gib
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        nvidia.com/gpu: 8
                      requests:
                        nvidia.com/gpu: 8

    Replace NUM_NODES with the number of nodes in the node pool.

    Make sure that you understand the following about this manifest:

     • The JobSet creates a headless Service with the same name as the JobSet, in this case, nccl-allgather.
     • The gke.io/topology-aware-auto-nccl-test scheduling gate verifies that the Pods are scheduled for colocation.
    • The parallelism and completions fields are both set to the number of nodes that you want to use to run the NCCL test.
  2. Apply the manifest:

     kubectl apply -f nccl-jobset-test.yaml
  3. Confirm that the workload is admitted:

     kubectl get jobsets

    The output is similar to the following:

     NAME            RESTARTS   COMPLETED   AGE
    nccl-allgather                         3s 
    
  4. Confirm that the workload is in the Completed state:

     kubectl get pods

    The output is similar to the following:

     NAME                          READY   STATUS      RESTARTS   AGE
    nccl-allgather-worker-0-0-n9s6j   0/1     Completed   0          9m34s
    nccl-allgather-worker-0-1-rsf7r   0/1     Completed   0          9m34s
    ... 
    
  5. The logs of the Pod whose name matches the pattern nccl-allgather-worker-0-0-.* contain the results of the test.

    Fetch the logs for this Pod:

       
     kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-allgather-worker-0-0)

    The output should be similar to the following:

     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            16     float    none      -1    54.07    0.02    0.02      0    55.80    0.02    0.02      0
            2048            32     float    none      -1    55.46    0.04    0.03      0    55.31    0.04    0.03      0
            4096            64     float    none      -1    55.59    0.07    0.07      0    55.38    0.07    0.07      0
            8192           128     float    none      -1    56.05    0.15    0.14      0    55.92    0.15    0.14      0
           16384           256     float    none      -1    57.08    0.29    0.27      0    57.75    0.28    0.27      0
           32768           512     float    none      -1    57.49    0.57    0.53      0    57.22    0.57    0.54      0
           65536          1024     float    none      -1    59.20    1.11    1.04      0    59.20    1.11    1.04      0
          131072          2048     float    none      -1    59.58    2.20    2.06      0    63.57    2.06    1.93      0
          262144          4096     float    none      -1    63.87    4.10    3.85      0    63.61    4.12    3.86      0
          524288          8192     float    none      -1    64.83    8.09    7.58      0    64.40    8.14    7.63      0
         1048576         16384     float    none      -1    79.74   13.15   12.33      0    76.66   13.68   12.82      0
         2097152         32768     float    none      -1    78.41   26.74   25.07      0    79.05   26.53   24.87      0
         4194304         65536     float    none      -1    83.21   50.41   47.26      0    81.25   51.62   48.39      0
         8388608        131072     float    none      -1    94.35   88.91   83.35      0    99.07   84.68   79.38      0
        16777216        262144     float    none      -1    122.9  136.55  128.02      0    121.7  137.83  129.21      0
        33554432        524288     float    none      -1    184.2  182.19  170.80      0    178.1  188.38  176.60      0
        67108864       1048576     float    none      -1    294.7  227.75  213.51      0    277.7  241.62  226.52      0
       134217728       2097152     float    none      -1    495.4  270.94  254.00      0    488.8  274.60  257.43      0
       268435456       4194304     float    none      -1    877.5  305.92  286.80      0    861.3  311.65  292.17      0
       536870912       8388608     float    none      -1   1589.8  337.71  316.60      0   1576.2  340.61  319.33      0
      1073741824      16777216     float    none      -1   3105.7  345.74  324.13      0   3069.2  349.85  327.98      0
      2147483648      33554432     float    none      -1   6161.7  348.52  326.74      0   6070.7  353.75  331.64      0
      4294967296      67108864     float    none      -1    12305  349.03  327.22      0    12053  356.35  334.08      0
      8589934592     134217728     float    none      -1    24489  350.77  328.85      0    23991  358.05  335.67      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 120.248 
    
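    The # Avg bus bandwidth line summarizes the result of the run. As an optional convenience that isn't part of the preceding steps, you can print only that summary line by reusing the same Pod selection as the previous step:

     # Print only the NCCL summary line from the head worker Pod
     # (the Pod that matches nccl-allgather-worker-0-0-.*).
     kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-allgather-worker-0-0) | grep "Avg bus bandwidth"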

A3 Ultra

  1. Create the following nccl-jobset-test.yaml manifest:

      apiVersion: jobset.x-k8s.io/v1alpha2
      kind: JobSet
      metadata:
        name: nccl-allgather
      spec:
        ttlSecondsAfterFinished: 1200
        suspend: False
        network:
          enableDNSHostnames: true
        replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth2","network":"rdma-0"},
                        {"interfaceName":"eth3","network":"rdma-1"},
                        {"interfaceName":"eth4","network":"rdma-2"},
                        {"interfaceName":"eth5","network":"rdma-3"},
                        {"interfaceName":"eth6","network":"rdma-4"},
                        {"interfaceName":"eth7","network":"rdma-5"},
                        {"interfaceName":"eth8","network":"rdma-6"},
                        {"interfaceName":"eth9","network":"rdma-7"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-h200-141gb
                  tolerations:
                  - key: cloud.google.com/gke-queued
                    effect: NoSchedule
                    value: "true"
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: gib
                    hostPath:
                      path: /home/kubernetes/bin/gib
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  schedulingGates:
                  - name: "gke.io/topology-aware-auto-nccl-test"
                  containers:
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic:v1.0.6
                    securityContext:
                      privileged: true
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      set -x
                      echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"
                      # Install ping
                      apt update -y
                      apt install -y iputils-ping
                      # Start sshd
                      /scripts/container_entry.sh daemon &
                      # Get helper variables to form all hostnames
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX
                      # For every worker, wait till online and add to hostfile
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          echo Waiting for ${OTHER}...
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                      done
                      cat /tmp/hostfile
                      # Launch from head node
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
                        # World Level = 0x0, Rail Aligned = 0x7
                        export NCCL_TESTS_SPLIT_MASK="0x0";
                        # Force use of libnccl-gib
                        export NCCL_NET=gIB
                        # Set all the correct libnccl-gib environment variables
                        source /usr/local/gib/scripts/set_nccl_env.sh
                        # Get all relevant NCCL / env vars to pass to all workers
                        ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
                        mpirun --hostfile /tmp/hostfile \
                          -x $ENV_VARS \
                          -mca plm_rsh_no_tree_spawn 1 \
                          --mca orte_keep_fqdn_hostnames 1 \
                          --mca btl self,tcp \
                          --mca btl_tcp_if_include eth0 \
                          --bind-to none \
                          --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                          /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                      else
                        while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                          sleep 5
                        done
                      fi
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: gib
                      mountPath: /usr/local/gib
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        nvidia.com/gpu: 8
                      requests:
                        nvidia.com/gpu: 8

    Replace NUM_NODES with the number of nodes in the node pool.

    Make sure that you understand the following about this manifest:

     • The JobSet creates a headless Service with the same name as the JobSet, in this case, nccl-allgather.
     • The gke.io/topology-aware-auto-nccl-test scheduling gate verifies that the Pods are scheduled for colocation.
    • The parallelism and completions fields are both set to the number of nodes that you want to use to run the NCCL test.
  2. Apply the manifest:

     kubectl apply -f nccl-jobset-test.yaml
  3. Confirm that the workload is admitted:

     kubectl get jobsets

    The output is similar to the following:

     NAME            RESTARTS   COMPLETED   AGE
    nccl-allgather                         3s 
    
  4. Confirm that the workload is in the Completed state:

     kubectl get pods

    The output is similar to the following:

     NAME                          READY   STATUS      RESTARTS   AGE
    nccl-allgather-worker-0-0-n9s6j   0/1     Completed   0          9m34s
    nccl-allgather-worker-0-1-rsf7r   0/1     Completed   0          9m34s
    ... 
    
  5. The logs of the Pod whose name matches the pattern nccl-allgather-worker-0-0-.* contain the results of the test.

    Fetch the logs for this Pod:

       
     kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-allgather-worker-0-0)

    The output should be similar to the following:

     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
       #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
              1024            16     float    none      -1    54.07    0.02    0.02      0    55.80    0.02    0.02      0
              2048            32     float    none      -1    55.46    0.04    0.03      0    55.31    0.04    0.03      0
              4096            64     float    none      -1    55.59    0.07    0.07      0    55.38    0.07    0.07      0
              8192           128     float    none      -1    56.05    0.15    0.14      0    55.92    0.15    0.14      0
             16384           256     float    none      -1    57.08    0.29    0.27      0    57.75    0.28    0.27      0
             32768           512     float    none      -1    57.49    0.57    0.53      0    57.22    0.57    0.54      0
             65536          1024     float    none      -1    59.20    1.11    1.04      0    59.20    1.11    1.04      0
            131072          2048     float    none      -1    59.58    2.20    2.06      0    63.57    2.06    1.93      0
            262144          4096     float    none      -1    63.87    4.10    3.85      0    63.61    4.12    3.86      0
            524288          8192     float    none      -1    64.83    8.09    7.58      0    64.40    8.14    7.63      0
           1048576         16384     float    none      -1    79.74   13.15   12.33      0    76.66   13.68   12.82      0
           2097152         32768     float    none      -1    78.41   26.74   25.07      0    79.05   26.53   24.87      0
           4194304         65536     float    none      -1    83.21   50.41   47.26      0    81.25   51.62   48.39      0
           8388608        131072     float    none      -1    94.35   88.91   83.35      0    99.07   84.68   79.38      0
          16777216        262144     float    none      -1    122.9  136.55  128.02      0    121.7  137.83  129.21      0
          33554432        524288     float    none      -1    184.2  182.19  170.80      0    178.1  188.38  176.60      0
          67108864       1048576     float    none      -1    294.7  227.75  213.51      0    277.7  241.62  226.52      0
         134217728       2097152     float    none      -1    495.4  270.94  254.00      0    488.8  274.60  257.43      0
         268435456       4194304     float    none      -1    877.5  305.92  286.80      0    861.3  311.65  292.17      0
         536870912       8388608     float    none      -1   1589.8  337.71  316.60      0   1576.2  340.61  319.33      0
        1073741824      16777216     float    none      -1   3105.7  345.74  324.13      0   3069.2  349.85  327.98      0
        2147483648      33554432     float    none      -1   6161.7  348.52  326.74      0   6070.7  353.75  331.64      0
        4294967296      67108864     float    none      -1    12305  349.03  327.22      0    12053  356.35  334.08      0
        8589934592     134217728     float    none      -1    24489  350.77  328.85      0    23991  358.05  335.67      0
      # Out of bounds values : 0 OK
      # Avg bus bandwidth    : 120.248
    
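When you finish testing, you can delete the JobSet to clean up its Jobs, Pods, and headless Service. This is an optional cleanup step and assumes that you kept the JobSet name nccl-allgather from the manifest; note that the manifest's ttlSecondsAfterFinished: 1200 setting also removes the JobSet automatically 20 minutes after it finishes.

     # Delete the NCCL test JobSet and its Pods.
     kubectl delete jobset nccl-allgather

     # Alternatively, delete by using the manifest that you applied.
     kubectl delete -f nccl-jobset-test.yaml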

What's next
