This page describes how to create a High Performance Computing (HPC) Slurm cluster with enhanced cluster management capabilities that uses H4D VMs with remote direct memory access (RDMA) networking. You use the gcloud CLI and Cluster Toolkit to configure the cluster.
The H4D machine series is specifically designed to meet the needs of demanding HPC workloads. H4D offers instances with improved workload scalability through Cloud RDMA networking with 200 Gbps throughput. For more information about H4D compute-optimized machine types on Google Cloud, see H4D machine series.
Before you begin
Before creating a Slurm cluster, if you haven't already done so, complete the following steps:
- Choose a consumption option: the option that you pick determines how you want to obtain and use vCPU resources.
- Obtain capacity: obtain capacity for the selected consumption option.
- Ensure that you have enough Filestore quota: you need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity.
  - To check quota, see View API-specific quota.
  - If you don't have enough quota, request a quota increase.
- Install Cluster Toolkit: to provision Slurm clusters, you must use Cluster Toolkit version v1.62.0 or later. To install Cluster Toolkit, see Set up Cluster Toolkit.

To learn more, see Choose a consumption option and obtain capacity.
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
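Optionally, if Cloud Shell isn't already set to the project that you want to use, you can set it with the gcloud CLI. PROJECT_ID here is your own project ID, the same value that you use later in the deployment file:

gcloud config set project PROJECT_ID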
Set up a storage bucket
Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a file with versioning enabled. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create this bucket and enable versioning from the CLI, run the following commands:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD \
    --location=BUCKET_REGION \
    --uniform-bucket-level-access

gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following:
- BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
- PROJECT_ID: your project ID.
- BUCKET_REGION: any available location.
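Optionally, to confirm that the bucket was created with versioning enabled, you can describe it and check that versioning is reported as enabled in the output:

gcloud storage buckets describe gs://BUCKET_NAME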
Open the Cluster Toolkit directory
Ensure that you are in the Cluster Toolkit directory by running the following command:
cd cluster-toolkit
This cluster deployment requires Cluster Toolkit v1.70.0 or later. To check your version, run the following command:
./gcluster --version
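If your build is older than the required version and you installed Cluster Toolkit from the Git repository, one way to update is to check out a newer release tag and rebuild the gcluster binary. This is a sketch that assumes a git-based install built with make, and that the release you need is tagged v1.70.0:

git fetch --tags
git checkout v1.70.0
make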
Create a deployment file
Create a deployment file to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project ID, region, and zone.
To create a deployment file, follow the steps for the H4D machine type:
The parameters that you need to add to your deployment file depend on the consumption option that you're using for your deployment. Select the tab that corresponds to the consumption option that you want to use.
Reservation-bound
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_reservation_name: RESERVATION_NAME
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region that has the reserved machines.
- ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, the region and zone information was provided by your account team when the capacity was delivered.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
- RESERVATION_NAME: the name of your reservation.
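Optionally, before you deploy, you can confirm that the reservation exists in the target zone and review its reserved instance count. RESERVATION_NAME and ZONE are the same values that you use in the deployment file:

gcloud compute reservations describe RESERVATION_NAME --zone=ZONE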
Flex-start
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_dws_flex_enabled: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
This deployment provisions static compute nodes, which means that the cluster has a set number of nodes at all times. If you want to enable your cluster to autoscale instead, use the examples/h4d/hpc-slurm-h4d.yaml file and edit the values of node_count_static and node_count_dynamic_max to match the following:
node_count_static: 0
node_count_dynamic_max: $(vars.h4d_cluster_size)
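For orientation, the following is a hypothetical excerpt showing where those two settings live inside a nodeset module's settings block in the blueprint; the module id and any surrounding fields are illustrative, not taken from the actual file:

  - id: h4d_nodeset
    settings:
      node_count_static: 0
      node_count_dynamic_max: $(vars.h4d_cluster_size)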
Spot
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_enable_spot_vm: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
Provision an H4D Slurm cluster
Cluster Toolkit provisions the cluster based on the deployment file that you created in the previous step and the default cluster blueprint. For more information about the software that the blueprint installs, see Slurm custom images.
Using Cloud Shell, from the directory where you installed Cluster Toolkit and created the deployment file, provision the cluster by running the following command, which uses the H4D Slurm blueprint file. This step takes approximately 20-30 minutes.
./gcluster deploy -d h4d-slurm-deployment.yaml examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml --auto-approve
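While the deployment runs, Terraform state for the cluster is stored in the bucket that you configured in terraform_backend_defaults. If you want to confirm that remote state is being written, you can list the bucket contents; the exact object paths vary by deployment:

gcloud storage ls --recursive gs://BUCKET_NAME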
Connect to the Slurm cluster
To access your cluster, you must sign in to the Slurm login node. To sign in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
- Go to the Compute Engine > VM instances page.
- Locate the login node. It should have a name with the pattern DEPLOYMENT_NAME-login-001.
- From the Connect column of the login node, click SSH.
gcloud
To connect to the login node, complete the following steps:
- Identify the login node by using the gcloud compute instances list command:

  gcloud compute instances list \
      --zones=ZONE \
      --filter="name ~ login" \
      --format="value(name)"

  If the output lists multiple Slurm clusters, you can identify your login node by the DEPLOYMENT_NAME that you specified.

- Use the gcloud compute ssh command to connect to the login node:

  gcloud compute ssh LOGIN_NODE \
      --zone=ZONE \
      --tunnel-through-iap

  Replace the following:

  - ZONE: the zone where the VMs for your cluster are located.
  - LOGIN_NODE: the name of the login node, which you identified in the previous step.
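After you sign in to the login node, you can run standard Slurm commands to confirm that the cluster is healthy. For example, sinfo lists the partitions and node states, and a short srun job verifies that jobs can be scheduled; with dynamic (autoscaled) nodes, the first job can take a few minutes while VMs are provisioned. These are generic Slurm commands, not specific to this blueprint:

sinfo
srun -N 1 hostname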
Redeploy the Slurm cluster
If you need to increase the number of compute nodes or add new partitions to your cluster, you can update your Slurm cluster's configuration by redeploying it.
To redeploy the cluster using an existing image, do the following:
- Run the following command:
./gcluster deploy -d h4d-slurm-deployment.yaml examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml --only cluster-env,cluster --auto-approve -w
This command is only for redeployments where an image already exists; it only redeploys the cluster and its infrastructure.
Destroy the Slurm cluster
To remove the Slurm cluster and the instances within it, complete the following steps:
- Disconnect from the cluster if you haven't already.
- Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory.
- To destroy the cluster, run the following command:

  ./gcluster destroy DEPLOYMENT_FOLDER --auto-approve
Replace the following:
- DEPLOYMENT_FOLDER: the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.
- When the cluster removal is complete, you should see a message similar to the following:

  Destroy complete! Resources: xx destroyed.
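Optionally, to confirm that no cluster VMs remain, you can list Compute Engine instances that match your deployment name; an empty result means the nodes were deleted. DEPLOYMENT_NAME is the value from your deployment file:

gcloud compute instances list --filter="name ~ DEPLOYMENT_NAME"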
To learn how to cleanly destroy infrastructure and for advanced manual deployment instructions, see the following file in the deployment folder located at the root of the Cluster Toolkit directory: DEPLOYMENT_FOLDER/instructions.txt

