Create a fully managed Slurm cluster with two A4 VMs

This quickstart explains how to create and connect to a Slurm cluster by using Cluster Director. The cluster that you create uses two A4 virtual machine (VM) instances, which are engineered to help your Slurm cluster efficiently handle large-scale model training and inference workloads.

Cluster Director is a managed service that simplifies and automates cluster deployment, reducing operational overhead and letting you focus on running your workload. If you want more control over the deployment and management of your cluster, then create a Slurm cluster by using Cluster Toolkit.




Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Hypercompute Cluster API, Compute Engine API, Filestore API, Google Cloud Managed Lustre API, Cloud Logging API, and Cloud Monitoring API:

    Enable the APIs
  5. Verify that your project and the Compute Engine default service account have the following Identity and Access Management (IAM) roles:
  6. If the organization in which your project exists has a trusted image policy (constraints/compute.trustedImageProjects), then verify that the clusterdirector-public-images project is included in the list of allowed projects. To view the trusted image policies for your organization, see Set image access constraints.
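
The API enablement step can also be done from a terminal with `gcloud services enable`. The following is a minimal sketch that only prints the command so that you can review it first. The service names are assumptions inferred from the API names listed above (this quickstart doesn't state them), and `my-project-id` is a placeholder, so confirm both before you run the printed command.

```shell
#!/bin/sh
# Service names are ASSUMPTIONS inferred from the API names listed above;
# verify each one in the API Library before enabling.
APIS="hypercomputecluster.googleapis.com
compute.googleapis.com
file.googleapis.com
lustre.googleapis.com
logging.googleapis.com
monitoring.googleapis.com"

# Print (do not run) the command that would enable the APIs for a project.
# $1: project ID (placeholder; supply your own).
print_enable_apis_cmd() {
  # Join the newline-separated list into one space-separated argument list.
  printf 'gcloud services enable %s --project=%s\n' \
    "$(printf '%s' "$APIS" | tr '\n' ' ')" "$1"
}

print_enable_apis_cmd "my-project-id"
```

Printing instead of executing keeps the sketch safe to run anywhere; copy the printed line into a shell where the gcloud CLI is authenticated once the names are confirmed.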

Costs

This quickstart uses the following billable Google Cloud resources:

  • Compute Engine:

    • Two VMs with A4 machine types

    • One Persistent Disk volume for the Slurm login node with 100 GB

    • One Google Cloud Hyperdisk Balanced volume with 100 GB for the A4 VMs

  • Filestore: a Filestore instance with 10 TiB (10,240 GiB)

To generate a cost estimate based on your projected usage, use the pricing calculator .

Create a Slurm cluster

To create a Slurm cluster, complete the following steps:

  1. In the Google Cloud console, go to the Cluster Director page.

    Go to Cluster Director

  2. Click Create a cluster.

  3. In the dialog that appears, click Step-by-step configuration. The Create cluster page appears.

  4. In the Cluster name field, enter cluster001.

  5. In the Compute section, click Configure resources. In the Add resource configuration pane that appears, complete the following steps:

    1. In the GPU type list, select NVIDIA B200 180GB.

    2. In the Number of instances field, enter 2.

    3. In the Consumption options section, select the consumption option that you want to use to obtain resources.

    4. In the Location section, specify the Region and Zone where you want to create your A4 VMs, or where the reservation that you want to use to create your VMs exists.

    5. Click Done.

  6. In the navigation menu, click Storage.

  7. In the Storage section, click Edit storage configuration. In the Add storage configuration pane that appears, complete the following steps:

    1. In the Capacity section, select 10-100 TiB, with increments of 2.5 TiB.

    2. Click Done.

  8. Click Create. The Clusters page appears.

    Creating the cluster can take some time to complete. The completion time depends on the number of VMs that you request and resource availability in the VMs' zone. If your requested resources are unavailable, then Cluster Director maintains the creation request until resources become available.

View the cluster creation request

To review the cluster creation request, complete the following steps:

  1. In the Clusters table, in the Name column, click cluster001. A page that gives the details of the cluster appears, and the Details tab is selected.

  2. In the Compute section, locate the Status row. When AI Hypercomputer sets its value to Ready, you can proceed to the next section.
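
If you would rather script the wait than refresh the page, the same check can be expressed as a polling loop. This is a sketch under assumptions: this quickstart doesn't describe a command-line way to read the cluster status, so the status command is a hypothetical placeholder that you supply, and the loop only assumes that it prints `Ready` when the cluster is ready.

```shell
#!/bin/sh
# Poll a status command until it prints "Ready" or the attempts run out.
# $1: command that prints the current status (HYPOTHETICAL; supply your own)
# $2: maximum number of attempts
# $3: seconds to sleep between attempts
# The body runs in a subshell, so its variables don't leak to the caller.
wait_for_ready() (
  status_cmd=$1
  max_attempts=$2
  interval=$3
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(sh -c "$status_cmd")
    if [ "$status" = "Ready" ]; then
      echo "Ready after $attempt attempt(s)"
      exit 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "Timed out; last status: $status" >&2
  exit 1
)

# Stand-in status command for demonstration only.
wait_for_ready "echo Ready" 3 0
```

Bounding the number of attempts matters here because, as noted earlier, a creation request can stay pending until capacity becomes available.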

Connect to your cluster through SSH

To connect to your cluster through SSH, complete the following steps:

  1. Click the Nodestab.

  2. In the Login nodes table, find the row that contains the cluster001-login-001 node. In that row, in the Connect column, click the SSH button. The SSH-in-browser window appears.

  3. If prompted, then click Authorize. Connecting to your cluster can take some time to complete. When the terminal is ready, proceed to the next section.
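
As an alternative to SSH-in-browser, a login node that is a regular Compute Engine VM can usually be reached with `gcloud compute ssh`. The sketch below prints such a command for the login node named above. Whether your cluster's network and access settings allow this is an assumption, and the zone and project ID shown are placeholders.

```shell
#!/bin/sh
# Print a gcloud command for opening an SSH session to the login node.
# $1: zone of the login node, $2: project ID (both are placeholders).
print_login_ssh_cmd() {
  printf 'gcloud compute ssh %s --zone=%s --project=%s\n' \
    "cluster001-login-001" "$1" "$2"
}

print_login_ssh_cmd "us-central1-b" "my-project-id"
```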

Run sample jobs

In the SSH-in-browser window, complete the following steps:

  1. To verify that Slurm is running, run the following command:

     sinfo 
    
  2. To submit a test job that returns the hostname of the node, run the following command:

     srun hostname
    
  3. To submit a batch job that sleeps for 30 seconds, run the following command:

     sbatch --wrap="sleep 30"
    
  4. To check the status of jobs in the queue, run the following command:

     squeue 
    
  5. To view accounting data for jobs, run the following command:

     sacct 
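
The submit, check, and accounting steps above can be chained into one script. The following is a minimal sketch, not part of the official quickstart: the function names are illustrative, and the Slurm commands run only if `sbatch` is present, so the script is harmless outside the cluster.

```shell
#!/bin/sh
# Extract the numeric job ID from sbatch's "Submitted batch job <id>" line.
parse_job_id() {
  sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Submit the 30-second sleep job, wait for it to leave the queue,
# then print its accounting record.
run_sleep_job() (
  job_id=$(sbatch --wrap="sleep 30" | parse_job_id)
  if [ -z "$job_id" ]; then
    echo "sbatch did not return a job ID" >&2
    exit 1
  fi
  echo "Submitted job $job_id"
  # squeue -h suppresses the header, so empty output means the job
  # is no longer pending or running.
  while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
    sleep 5
  done
  sacct -j "$job_id" --format=JobID,JobName,State,Elapsed
)

# Only run against a real cluster; elsewhere the functions are just defined.
if command -v sbatch >/dev/null 2>&1; then
  run_sleep_job
fi
```

Capturing the job ID lets `squeue` and `sacct` report on that one job instead of everything in the queue.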
    

You've successfully created a Slurm cluster, connected to it, and run sample jobs. If AI Hypercomputer still hasn't created the A4 VMs, then you can wait for the cluster to create the VMs, modify the cluster to add or remove VMs, or delete the cluster to avoid incurring any unnecessary charges.
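
To check from the login node whether both A4 VMs are available to Slurm, you can count idle nodes in the `sinfo` output. This sketch assumes the `sinfo -h -o "%D %T"` format (node count, then state) and matches the state exactly, so nodes that Slurm reports with a suffix, such as `idle~` for nodes in power-saving mode, aren't counted. The function name is illustrative.

```shell
#!/bin/sh
# Sum the node counts for a given state from `sinfo -h -o "%D %T"` output,
# where each input line is "<node count> <state>".
count_nodes_in_state() {
  awk -v want="$1" '$2 == want { total += $1 } END { print total + 0 }'
}

# Only query Slurm when sinfo is available (that is, on the cluster).
if command -v sinfo >/dev/null 2>&1; then
  idle=$(sinfo -h -o "%D %T" | count_nodes_in_state idle)
  echo "Idle nodes: $idle (this quickstart requests 2 A4 VMs)"
fi
```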

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete your project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
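
The gcloud CLI equivalent of these steps is `gcloud projects delete`. The sketch below mirrors the console's type-to-confirm dialog: it prints the deletion command only when the confirmation text matches the project ID. The function name and `my-project-id` are illustrative placeholders, and the command is printed rather than run so that nothing is deleted by accident.

```shell
#!/bin/sh
# Print the project-deletion command only when the confirmation text matches
# the project ID, mirroring the console dialog.
# $1: project ID (placeholder), $2: text typed to confirm.
print_project_delete_cmd() {
  if [ "$1" != "$2" ]; then
    echo "Confirmation does not match project ID; nothing to do" >&2
    return 1
  fi
  printf 'gcloud projects delete %s\n' "$1"
}

print_project_delete_cmd "my-project-id" "my-project-id"
```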

Delete your cluster

To delete the cluster that you created in this quickstart, along with its associated resources, complete the following steps:

  1. On the page that contains the details of your cluster, click Delete.

  2. In the dialog that appears, enter cluster001, and then click Delete to confirm.
