Create a fully managed Slurm cluster with two A4 VMs
This quickstart explains how to create and connect to a Slurm cluster by using Cluster Director. The cluster that you create uses two A4 virtual machine (VM) instances, which are engineered to help your Slurm cluster efficiently handle large-scale model training and inference workloads.
Cluster Director is a managed service that simplifies and automates cluster deployment, reducing operational overhead and letting you focus on running your workload. If you want more control over the deployment and management of your cluster, then create a Slurm cluster by using Cluster Toolkit.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Hypercompute Cluster API, Compute Engine API, Filestore API, Google Cloud Managed Lustre API, Cloud Logging API, and Cloud Monitoring API.
- Verify that your project and the Compute Engine default service account have the following Identity and Access Management (IAM) roles:
  - To get the permissions that you need to complete this quickstart, ask your administrator to grant you the following IAM roles on your project:
    - To create and manage a cluster: Cluster Director Editor (roles/hypercomputecluster.editor)
    - To create and manage VMs in a cluster: Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1)
    - To connect to the login node in a cluster:
      - Compute OS Login (roles/compute.osLogin)
      - IAP-Secured Tunnel User (roles/iap.tunnelResourceAccessor)
    For more information about granting roles, see Manage access to projects, folders, and organizations.
    You might also be able to get the required permissions through custom roles or other predefined roles.
  - To get the permissions that you need to complete this quickstart, ask your administrator to grant you the following IAM roles on the Compute Engine default service account:
    - To create a cluster: Service Account User (roles/iam.serviceAccountUser)
    - To manage resources in a cluster:
      - Logs Writer (roles/logging.logWriter)
      - Monitoring Metric Writer (roles/monitoring.metricWriter)
      - Storage Object Viewer (roles/storage.objectViewer)
- If the organization in which your project exists has a trusted image policy (constraints/compute.trustedImageProjects), then verify that the clusterdirector-public-images project is included in the list of allowed projects. To view the trusted image policies for your organization, see Set image access constraints.
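If you prefer the command line to the console for the API and IAM prerequisites above, the following is a minimal sketch rather than an official procedure. It assumes that the gcloud CLI is installed and authenticated. The values of PROJECT_ID, PROJECT_NUMBER, and USER_EMAIL are hypothetical placeholders, and the six API service names are assumptions that you should verify for your project. It also assumes that Service Account User is granted to your user on the default service account, and that the remaining three roles are granted to that service account on the project. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Hypothetical placeholders: replace with your own values.
PROJECT_ID="${PROJECT_ID:-my-project}"
PROJECT_NUMBER="${PROJECT_NUMBER:-123456789012}"
USER_EMAIL="${USER_EMAIL:-user@example.com}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = run them

# Assumed service names for the six APIs named in this quickstart.
APIS="hypercomputecluster.googleapis.com
compute.googleapis.com
file.googleapis.com
lustre.googleapis.com
logging.googleapis.com
monitoring.googleapis.com"

# Roles for your user account on the project.
USER_ROLES="roles/hypercomputecluster.editor
roles/compute.instanceAdmin.v1
roles/compute.osLogin
roles/iap.tunnelResourceAccessor"

# Roles for the Compute Engine default service account on the project.
SA_ROLES="roles/logging.logWriter
roles/monitoring.metricWriter
roles/storage.objectViewer"

SA_EMAIL="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $APIS is intentionally unquoted so that each service name is one argument.
run gcloud services enable $APIS --project="$PROJECT_ID"

for role in $USER_ROLES; do
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="user:$USER_EMAIL" --role="$role"
done

# Let your user act as the default service account.
run gcloud iam service-accounts add-iam-policy-binding "$SA_EMAIL" \
  --member="user:$USER_EMAIL" --role="roles/iam.serviceAccountUser" \
  --project="$PROJECT_ID"

for role in $SA_ROLES; do
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:$SA_EMAIL" --role="$role"
done
```

Review the printed commands before running them with DRY_RUN=0. Granting roles requires that your own account is allowed to change IAM policy on the project.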
Costs
This quickstart uses the following billable Google Cloud resources:
- Compute Engine:
  - Two VMs with A4 machine types
  - One 100 GB Persistent Disk volume for the Slurm login node
  - One 100 GB Google Cloud Hyperdisk Balanced volume for the A4 VMs
- Filestore: a Filestore instance with 10 TiB (10,240 GiB)
To generate a cost estimate based on your projected usage, use the pricing calculator.
Create a Slurm cluster
To create a Slurm cluster, complete the following steps:
- In the Google Cloud console, go to the Cluster Director page.
- Click Create a cluster.
- In the dialog that appears, click Step-by-step configuration. The Create cluster page appears.
- In the Cluster name field, enter cluster001.
- In the Compute section, click Configure resources. In the Add resource configuration pane that appears, complete the following steps:
  - In the GPU type list, select NVIDIA B200 180GB.
  - In the Number of instances field, enter 2.
  - In the Consumption options section, select the consumption option that you want to use to obtain resources.
  - In the Location section, specify the Region and Zone where you want to create your A4 VMs, or where the reservation that you want to use to create your VMs exists.
  - Click Done.
- In the navigation menu, click Storage.
- In the Storage section, click Edit storage configuration. In the Add storage configuration pane that appears, complete the following steps:
  - In the Capacity section, select 10-100 TiB, with increments of 2.5 TiB.
  - Click Done.
- Click Create. The Clusters page appears.
Creating the cluster can take some time to complete. The completion time depends on the number of VMs that you request and resource availability in the VMs' zone. If your requested resources are unavailable, then Cluster Director maintains the creation request until resources become available.
View the cluster creation request
To review the cluster creation request, complete the following steps:
- In the Clusters table, in the Name column, click cluster001. A page that gives the details of the cluster appears, and the Details tab is selected.
- In the Compute section, locate the Status row. When AI Hypercomputer sets its value to Ready, you can proceed to the next section.
Connect to your cluster through SSH
To connect to your cluster through SSH, complete the following steps:
- Click the Nodes tab.
- In the Login nodes table, find the row that contains the cluster001-login-001 node. In that row, in the Connect column, click the SSH button. The SSH-in-browser window appears.
- If prompted, click Authorize. Connecting to your cluster can take some time to complete. When the terminal is ready, proceed to the next section.
Run sample jobs
In the SSH-in-browser window, complete the following steps:
- To verify that Slurm is running, run the following command:
  sinfo
- To submit a test job that returns the hostname of the node, run the following command:
  srun hostname
- To submit a batch job that sleeps for 30 seconds, run the following command:
  sbatch --wrap="sleep 30"
- To check the status of jobs in the queue, run the following command:
  squeue
- To view accounting data for jobs, run the following command:
  sacct
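The one-line jobs above can also be expressed as a batch script, which is closer to how longer workloads are usually submitted. The following is a minimal sketch, not part of the official quickstart: the file name and job name are hypothetical, and it assumes that both A4 nodes are in the default partition. It writes a script that runs hostname once on each of the two nodes, and submits it only if sbatch is available in the current shell.

```shell
# Hypothetical location for the batch script.
JOB_SCRIPT="${JOB_SCRIPT:-${TMPDIR:-/tmp}/hostname-test.sbatch}"

# Write the batch script. The quoted 'EOF' keeps the content literal.
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x-%j.out

# Print the hostname of each allocated node (one task per node).
srun hostname
EOF

# Submit only when Slurm is present, for example on the login node.
if command -v sbatch >/dev/null 2>&1; then
  sbatch "$JOB_SCRIPT"
else
  echo "sbatch not found; wrote $JOB_SCRIPT without submitting"
fi
```

In the output pattern, %x expands to the job name and %j to the job ID, so each run writes its own output file in the directory from which the job was submitted. You can then follow the job with squeue and sacct as described in the steps above.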
You've successfully created a Slurm cluster, connected to it, and run sample jobs. If AI Hypercomputer still hasn't created the A4 VMs, then you can wait for the cluster to create the VMs, modify the cluster to add or remove VMs, or delete the cluster to avoid incurring any unnecessary charges.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete your project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete your cluster
To delete the cluster that you created in this quickstart, along with its associated resources, complete the following steps:
- On the page that contains the details of your cluster, click Delete.
- In the dialog that appears, enter cluster001, and then click Delete to confirm.

