This page describes how to create a High Performance Computing (HPC) Slurm cluster with enhanced cluster management capabilities that uses H4D VMs with remote direct memory access (RDMA) networking. You use the gcloud CLI and Cluster Toolkit to configure the cluster.
The H4D machine series is specifically designed to meet the needs of demanding HPC workloads. H4D offers instances with improved workload scalability through Cloud RDMA networking with 200 Gbps throughput. For more information about H4D compute-optimized machine types on Google Cloud, see H4D machine series.
Before you begin
Before creating a Slurm cluster, if you haven't already done so, complete the following steps:
- Choose a consumption option: the option that you pick determines how you want to obtain and use vCPU resources.
- Obtain capacity: obtain capacity for the selected consumption option.
- Ensure that you have enough Filestore quota: you need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity.
  - To check quota, see View API-specific quota.
  - If you don't have enough quota, request a quota increase.
- Install Cluster Toolkit: to provision Slurm clusters, you must use Cluster Toolkit version v1.62.0 or later. To install Cluster Toolkit, see Set up Cluster Toolkit.

To learn more, see Choose a consumption option and obtain capacity.
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
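Optionally, if Cloud Shell isn't already set to the project that you want to use, you can set it with the gcloud CLI. PROJECT_ID here is your own project ID, the same value that you use later in the deployment file:

gcloud config set project PROJECT_ID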
Set up a storage bucket
Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a file with versioning enabled. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create this bucket and enable versioning from the CLI, run the following commands:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD \
    --location=BUCKET_REGION \
    --uniform-bucket-level-access

gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following:
- BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
- PROJECT_ID: your project ID.
- BUCKET_REGION: any available location.
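Optionally, to confirm that the bucket was created with versioning enabled, you can describe it and check that versioning is reported as enabled in the output:

gcloud storage buckets describe gs://BUCKET_NAME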
Open the Cluster Toolkit directory
Ensure that you are in the Cluster Toolkit directory by running the following command:
cd cluster-toolkit
This cluster deployment requires Cluster Toolkit v1.70.0 or later. To check your version, run the following command:
./gcluster --version
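If your build is older than the required version and you installed Cluster Toolkit from the Git repository, one way to update is to check out a newer release tag and rebuild the gcluster binary. This is a sketch that assumes a git-based install built with make, and that the release you need is tagged v1.70.0:

git fetch --tags
git checkout v1.70.0
make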
Create a deployment file
Create a deployment file to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project ID, region, and zone.
To create a deployment file, follow the steps for the H4D machine type:
The parameters that you need to add to your deployment file depend on the consumption option that you're using for your deployment. Select the tab that corresponds to the consumption option that you want to use.
Reservation-bound
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_reservation_name: RESERVATION_NAME
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region that has the reserved machines.
- ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, the region and zone information was provided by your account team when the capacity was delivered.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
- RESERVATION_NAME: the name of your reservation.
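Optionally, before you deploy, you can confirm that the reservation exists in the target zone and review its reserved instance count. RESERVATION_NAME and ZONE are the same values that you use in the deployment file:

gcloud compute reservations describe RESERVATION_NAME --zone=ZONE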
Flex-start
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_dws_flex_enabled: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
This deployment provisions static compute nodes, which means that the cluster has a set number of nodes at all times. If you want to enable your cluster to autoscale instead, use the examples/h4d/hpc-slurm-h4d.yaml file and edit the values of node_count_static and node_count_dynamic_max to match the following:
node_count_static: 0
node_count_dynamic_max: $(vars.h4d_cluster_size)
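For orientation, the following is a hypothetical excerpt showing where those two settings live inside a nodeset module's settings block in the blueprint; the module id and any surrounding fields are illustrative, not taken from the actual file:

  - id: h4d_nodeset
    settings:
      node_count_static: 0
      node_count_dynamic_max: $(vars.h4d_cluster_size)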
Spot
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_enable_spot_vm: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
Provision an H4D Slurm cluster
Cluster Toolkit provisions the cluster based on the deployment file that you created in the previous step and the default cluster blueprint. For more information about the software that the blueprint installs, see Slurm custom images.
Using Cloud Shell, from the directory where you installed Cluster Toolkit and created the deployment file, provision the cluster by running the following command, which uses the H4D Slurm blueprint file. This step takes approximately 20-30 minutes.
./gcluster deploy -d h4d-slurm-deployment.yaml examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml --auto-approve
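While the deployment runs, Terraform state for the cluster is stored in the bucket that you configured in terraform_backend_defaults. If you want to confirm that remote state is being written, you can list the bucket contents; the exact object paths vary by deployment:

gcloud storage ls --recursive gs://BUCKET_NAME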
Connect to the Slurm cluster
To access your cluster, you must sign in to the Slurm login node. To sign in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
- Go to the Compute Engine > VM instances page.
- Locate the login node. It should have a name with the pattern DEPLOYMENT_NAME-login-001.
- From the Connect column of the login node, click SSH.
gcloud
To connect to the login node, complete the following steps:
- Identify the login node by using the gcloud compute instances list command:

  gcloud compute instances list \
      --zones=ZONE \
      --filter="name ~ login" \
      --format="value(name)"

  If the output lists multiple Slurm clusters, you can identify your login node by the DEPLOYMENT_NAME that you specified.

- Use the gcloud compute ssh command to connect to the login node:

  gcloud compute ssh LOGIN_NODE \
      --zone=ZONE \
      --tunnel-through-iap

  Replace the following:

  - ZONE: the zone where the VMs for your cluster are located.
  - LOGIN_NODE: the name of the login node, which you identified in the previous step.
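After you sign in to the login node, you can run standard Slurm commands to confirm that the cluster is healthy. For example, sinfo lists the partitions and node states, and a short srun job verifies that jobs can be scheduled; with dynamic (autoscaled) nodes, the first job can take a few minutes while VMs are provisioned. These are generic Slurm commands, not specific to this blueprint:

sinfo
srun -N 1 hostname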
Redeploy the Slurm cluster
If you need to increase the number of compute nodes or add new partitions to your cluster, you can update your Slurm cluster's configuration by redeploying it.
To redeploy the cluster using an existing image, do the following:
- Run the following command:
./gcluster deploy -d h4d-slurm-deployment.yaml examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml --only cluster-env,cluster --auto-approve -w
This command is only for redeployments where an image already exists; it only redeploys the cluster and its infrastructure.
Destroy the Slurm cluster
To remove the Slurm cluster and the instances within it, complete the following steps:
- Disconnect from the cluster if you haven't already.
- Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory.
- To destroy the cluster, run the following command:

  ./gcluster destroy DEPLOYMENT_FOLDER --auto-approve
Replace the following:
- DEPLOYMENT_FOLDER: the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.
- When the cluster removal is complete, you should see a message similar to the following:

  Destroy complete! Resources: xx destroyed.
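Optionally, to confirm that no cluster VMs remain, you can list Compute Engine instances that match your deployment name; an empty result means the nodes were deleted. DEPLOYMENT_NAME is the value from your deployment file:

gcloud compute instances list --filter="name ~ DEPLOYMENT_NAME"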
To learn how to cleanly destroy infrastructure and for advanced manual deployment instructions, see the following file in the deployment folder located at the root of the Cluster Toolkit directory: DEPLOYMENT_FOLDER/instructions.txt

