Quickstart: Deploy a Slurm cluster on GKE

This document explains how to quickly deploy and configure a basic Slurm cluster on Google Kubernetes Engine (GKE) by using the open-source Slurm Helm chart and the Slurm Operator add-on for GKE. This setup includes a Slurm controller (slurmctld), a REST API (slurmrestd), a login node for user access, and a single worker node (slurmd) managed by the Slurm Operator.

This document is for Data administrators, Operators, and Developers who want to enable and configure the Slurm cluster on GKE.

Before reading this document, ensure that you're familiar with the Slurm Operator add-on for GKE.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Ensure you have already generated an SSH key pair. This key pair is required only if you want to set up OS Login.

  • Ensure you have a running GKE cluster with Slurm Operator enabled. If not, create one:

     gcloud container clusters create CLUSTER_NAME \
         --cluster-version=VERSION \
         --location=LOCATION \
         --project=PROJECT_ID \
         --addons=SlurmOperator

    Replace the following:

    • CLUSTER_NAME: the name of the new cluster.
    • VERSION: the GKE version, which must be 1.35.2-gke.1842000 or later. You can also use the --release-channel option to select a release channel. The release channel must have a default version of 1.35.2-gke.1842000 or later.
    • LOCATION: the location of the cluster.
    • PROJECT_ID: the ID of the project.

    For more information, see Enable the Slurm Operator add-on for GKE.
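
    To confirm that the add-on is enabled on an existing cluster, you can inspect the cluster's add-on configuration. This check is a sketch; the exact shape of the addonsConfig output can vary by GKE version:

     gcloud container clusters describe CLUSTER_NAME \
         --location=LOCATION \
         --format="value(addonsConfig)"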

(Optional) Set up OS Login

OS Login simplifies SSH access management by linking your Linux user account to your IAM identity. This configuration lets you manage access to Slurm nodes by using IAM permissions.

  1. Grant necessary IAM roles. Ensure your user account has the necessary IAM roles in the project:

    • roles/compute.osLogin: lets you manage your own OS Login profile.
    • roles/compute.instanceAdmin.v1: provides permissions to manage compute instances.
    • roles/iam.serviceAccountUser: lets you act as a service account, which is often needed for node operations.

    For more information about the required roles, see the guide to set up OS Login.

  2. Add your SSH key to OS Login by uploading your public SSH key:

     gcloud compute os-login ssh-keys add \
         --key-file=PATH_TO_PUBLIC_KEY \
         --project=PROJECT_ID

    Replace PATH_TO_PUBLIC_KEY with the path to your public SSH key and PROJECT_ID with the ID of the project.

    Alternatively, you can add a key that's already loaded in your ssh-agent:

     gcloud compute os-login ssh-keys add \
         --key="$(ssh-add -L | grep publickey | head -n 1)" \
         --project=PROJECT_ID
  3. Enable OS Login in your project metadata:

     gcloud compute project-info add-metadata \
         --metadata enable-oslogin=TRUE \
         --project=PROJECT_ID
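
  4. Optionally, verify your configuration. The following command prints your OS Login profile, including your POSIX username and the SSH keys that you registered:

     gcloud compute os-login describe-profile --project=PROJECT_ID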
Best practice:

For managing OS Login across multiple projects in an organization, consider enforcing OS Login by using an Organization Policy Service constraint (compute.requireOsLogin). This is a recommended security best practice. For more information, see Enable and configure OS Login in GKE.
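
For example, the following sketch enforces the constraint at the organization level. ORGANIZATION_ID is a placeholder for the numeric ID of your organization:

 gcloud resource-manager org-policies enable-enforce compute.requireOsLogin \
     --organization=ORGANIZATION_ID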

(Optional) Add a compute node pool

If you want to run Slurm compute workloads on separate nodes, you can create a dedicated node pool for them.

 gcloud container node-pools create NODE_POOL_NAME \
     --cluster=CLUSTER_NAME \
     --machine-type=MACHINE_TYPE \
     --num-nodes=NUM_NODES \
     --node-taints=slurm-worker=true:NoSchedule
Replace the following:

  • NODE_POOL_NAME: the name of the new node pool.
  • CLUSTER_NAME: the name of your cluster.
  • MACHINE_TYPE: the machine type for the nodes (for example: n2-standard-4).
  • NUM_NODES: the number of nodes in the node pool.
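
To confirm that the taint was applied, you can describe the new node pool. This check assumes that your gcloud configuration has a default location set; otherwise, add the --location flag:

 gcloud container node-pools describe NODE_POOL_NAME \
     --cluster=CLUSTER_NAME \
     --format="value(config.taints)"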

Deploy Slurm by using Helm

This section guides you through deploying the Slurm cluster components by using the Slurm Helm chart. The Helm chart deploys slurmctld , slurmrestd , and slurmd components within the GKE cluster.

  1. Configure kubectl to communicate with your cluster:

     gcloud container clusters get-credentials CLUSTER_NAME

    Replace CLUSTER_NAME with your cluster name.

  2. Verify that you're running Helm 3.8.0 or later:

     helm version

    The output is similar to the following:

     version.BuildInfo{Version:"v3.17.3", GitCommit:"e4da49785aa6e6ee2b86efd5dd9e43400318262b", GitTreeState:"clean", GoVersion:"go1.23.7"} 
    

    If needed, you can install Helm by following the official Helm documentation.

  3. Find an available image tag:

    1. In the Google Cloud console, go to the Artifact Registry repository page that includes the slinky/slurmd package.

      Go to Artifact Registry repository

    2. Note one of the image tag values, for example 25.11-ubuntu24.04-gke.4. You use this tag for the IMAGE_TAG placeholder in the following configuration file.

  4. Save the following configuration to a new file named values.yaml :

      controller:
        slurmctld:
          image:
            repository: gcr.io/gke-release/slinky/slurmctld
            tag: IMAGE_TAG
        reconfigure:
          image:
            repository: gcr.io/gke-release/slinky/slurmctld
            tag: IMAGE_TAG
      restapi:
        replicas: 1
        slurmrestd:
          image:
            repository: gcr.io/gke-release/slinky/slurmrestd
            tag: IMAGE_TAG
      nodesets:
        slinky:
          replicas: 1
          slurmd:
            image:
              repository: gcr.io/gke-release/slinky/slurmd
              tag: IMAGE_TAG
          # The podSpec block is optional and only required when using
          # a dedicated node pool for compute nodes.
          podSpec:
            nodeSelector:
              cloud.google.com/gke-nodepool: NODE_POOL_NAME
            tolerations:
              - key: "slurm-worker"
                operator: "Equal"
                value: "true"
                effect: "NoSchedule"
      loginsets:
        slinky:
          enabled: true
          replicas: 1
          login:
            image:
              repository: gcr.io/gke-release/slinky/login
              tag: IMAGE_TAG

    Replace IMAGE_TAG with the tag that you noted in the previous step, for example 25.11-ubuntu24.04-gke.4. If you created a dedicated compute node pool, also replace NODE_POOL_NAME with its name; otherwise, remove the podSpec block.

  5. Install the Slurm Helm chart by using the values.yaml file:

     helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
         --namespace=slurm \
         --create-namespace \
         --version 1.0.2 \
         -f values.yaml
    
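
    To confirm that the release installed, you can list the Helm releases in the namespace; the slurm release should report a deployed status:

     helm list --namespace slurm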

Verify the Slurm installation

You can verify that Slurm is deployed on the cluster by using kubectl .

  1. Check Pod status:

     kubectl get pods --namespace slurm

    The output should be similar to the following, and show the Running status for all Pods:

     NAME                                  READY   STATUS    RESTARTS   AGE
    slurm-controller-0                    3/3     Running   0          60s
    slurm-login-slinky-5d79cd755c-mf62z   1/1     Running   0          60s
    slurm-restapi-6b4ccb479f-njlp9        1/1     Running   0          60s
    slurm-worker-slinky-0                 2/2     Running   0          60s 
    
  2. To see the registered nodes, execute the sinfo command on the login node:

     kubectl exec -it deployment/slurm-login-slinky -n slurm -- sinfo

    The output should list the slinky partition and the worker node.
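
    For reference, the output is similar to the following. The partition and node names come from the nodesets key in values.yaml, so they differ if you renamed the slinky node set:

     PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
     slinky*      up   infinite      1   idle slurm-worker-slinky-0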

Run a Slurm job

  1. To run a job, you need to access the Slurm login node. The way you access the login node depends on whether you have configured OS Login in the previous section.

    1. If you configured OS Login in the preceding section, access the login node by using SSH. To do this, get the external IP address of the slurm-login-slinky Service:

       kubectl get service --namespace slurm slurm-login-slinky

      The output looks like this:

       NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
      slurm-login-slinky   LoadBalancer   10.X.X.X        X.X.X.X        22:30171/TCP   5m 
      

      Copy the value of the EXTERNAL-IP column, and then connect to the login node:

       ssh OSLOGIN_USERNAME@EXTERNAL_IP

      Replace the following:

      • EXTERNAL_IP: the IP address that you obtained in the previous step.
      • OSLOGIN_USERNAME: your OS Login username.
    2. If you did not configure OS Login, you can still access the login node by using the kubectl exec command:

       kubectl exec -it deployment/slurm-login-slinky -n slurm -- bash
  2. Run an interactive job: After you're on the login node, you can run a command on a compute node by using the srun command-line utility.

     srun hostname

    The output includes the hostname of the slurm-worker-slinky-0 Pod.
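
  3. Optionally, submit a batch job. The following script is a minimal sketch that uses only standard Slurm directives; save it as hello.sbatch on the login node:

     #!/bin/bash
     #SBATCH --job-name=hello
     #SBATCH --output=hello_%j.out
     #SBATCH --ntasks=1

     srun hostname

    Then submit the script and monitor the queue:

     sbatch hello.sbatch
     squeue

    After the job completes, the output file (for example, hello_1.out) contains the hostname of the worker Pod.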

Clean up

To avoid incurring charges, clean up the resources created in this document.

  1. Uninstall the Helm deployment: This command removes all Kubernetes resources deployed by the Helm chart.

     helm uninstall slurm --namespace slurm
  2. Delete the Slurm namespace:

     kubectl delete namespace slurm
  3. Delete the GKE cluster:

     gcloud container clusters delete CLUSTER_NAME \
         --location=LOCATION

    Replace the following:

    • CLUSTER_NAME: the name of the cluster.
    • LOCATION: the location of the cluster.
