Quickstart: Deploy a Slurm cluster on GKE

This document explains how to quickly deploy and configure a basic Slurm cluster on Google Kubernetes Engine (GKE) by using the open-source Slurm Helm chart and the Slurm Operator add-on for GKE. This setup includes a Slurm controller (slurmctld), a REST API (slurmrestd), a login node for user access, and a single worker node (slurmd) managed by the Slurm Operator.

This document is for Data administrators, Operators, and Developers who want to enable and configure the Slurm cluster on GKE.

Before reading this document, ensure that you're familiar with the Slurm Operator add-on for GKE.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Ensure you have already generated an SSH key pair. This key pair is required only if you want to set up OS Login.

  • Ensure you have a running GKE cluster with Slurm Operator enabled. If not, create one:

     gcloud container clusters create CLUSTER_NAME \
         --cluster-version=VERSION \
         --location=LOCATION \
         --project=PROJECT_ID \
         --addons=SlurmOperator

    Replace the following:

    • CLUSTER_NAME: the name of the new cluster.
    • VERSION: the GKE version, which must be 1.35.2-gke.1842000 or later. You can also use the --release-channel option to select a release channel. The release channel must have a default version of 1.35.2-gke.1842000 or later.
    • LOCATION: the location of the cluster.
    • PROJECT_ID: the ID of the project.

    For more information, see Enable the Slurm Operator add-on for GKE.
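
    To confirm that the add-on is enabled on an existing cluster, you can inspect the cluster's add-on configuration. This check is a sketch; the exact shape of the addonsConfig output can vary by GKE version:

     gcloud container clusters describe CLUSTER_NAME \
         --location=LOCATION \
         --format="value(addonsConfig)"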

(Optional) Set up OS Login

OS Login simplifies SSH access management by linking your Linux user account to your IAM identity. This configuration lets you manage access to Slurm nodes by using IAM permissions.

  1. Grant necessary IAM roles. Ensure your user account has the necessary IAM roles in the project:

    • roles/compute.osLogin: lets you manage your own OS Login profile.
    • roles/compute.instanceAdmin.v1: provides permissions to manage compute instances.
    • roles/iam.serviceAccountUser: lets you act as a service account, which is often needed for node operations.

    For more information about the required roles, see the guide to set up OS Login.

  2. Add your SSH key to OS Login by uploading your public SSH key:

     gcloud compute os-login ssh-keys add \
         --key-file=PATH_TO_PUBLIC_KEY \
         --project=PROJECT_ID

    Replace PATH_TO_PUBLIC_KEY with the path to your public SSH key and PROJECT_ID with the ID of the project.

    Alternatively, you can add a key that's already loaded in your ssh-agent:

     gcloud compute os-login ssh-keys add \
         --key="$(ssh-add -L | grep publickey | head -n 1)" \
         --project=PROJECT_ID
  3. Enable OS Login in your project metadata:

     gcloud compute project-info add-metadata \
         --metadata enable-oslogin=TRUE \
         --project=PROJECT_ID
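
  4. Optionally, verify your configuration. The following command prints your OS Login profile, including your POSIX username and the SSH keys that you registered:

     gcloud compute os-login describe-profile --project=PROJECT_ID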
Best practice:

For managing OS Login across multiple projects in an organization, consider enforcing OS Login by using an Organization Policy Service constraint (compute.requireOsLogin). This is a recommended security best practice. For more information, see Enable and configure OS Login in GKE.
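
For example, the following sketch enforces the constraint at the organization level. ORGANIZATION_ID is a placeholder for the numeric ID of your organization:

 gcloud resource-manager org-policies enable-enforce compute.requireOsLogin \
     --organization=ORGANIZATION_ID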

(Optional) Add a compute node pool

If you want to run Slurm compute workloads on separate nodes, you can create a dedicated node pool for them.

 gcloud container node-pools create NODE_POOL_NAME \
     --cluster=CLUSTER_NAME \
     --machine-type=MACHINE_TYPE \
     --num-nodes=NUM_NODES \
     --node-taints=slurm-worker=true:NoSchedule
Replace the following:

  • NODE_POOL_NAME: the name of the new node pool.
  • CLUSTER_NAME: the name of your cluster.
  • MACHINE_TYPE: the machine type for the nodes (for example: n2-standard-4).
  • NUM_NODES: the number of nodes in the node pool.
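
To confirm that the taint was applied, you can describe the new node pool. This check assumes that your gcloud configuration has a default location set; otherwise, add the --location flag:

 gcloud container node-pools describe NODE_POOL_NAME \
     --cluster=CLUSTER_NAME \
     --format="value(config.taints)"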

Deploy Slurm by using Helm

This section guides you through deploying the Slurm cluster components by using the Slurm Helm chart. The Helm chart deploys slurmctld , slurmrestd , and slurmd components within the GKE cluster.

  1. Configure kubectl to communicate with your cluster:

     gcloud container clusters get-credentials CLUSTER_NAME

    Replace CLUSTER_NAME with your cluster name.

  2. Verify that you're running Helm 3.8.0 or later:

     helm version

    The output is similar to the following:

     version.BuildInfo{Version:"v3.17.3", GitCommit:"e4da49785aa6e6ee2b86efd5dd9e43400318262b", GitTreeState:"clean", GoVersion:"go1.23.7"} 
    

    If needed, you can install Helm by following the official Helm documentation.

  3. Find an available image tag:

    1. In the Google Cloud console, go to the Artifact Registry repository page that includes the slinky/slurmd package.

      Go to Artifact Registry repository

    2. Note one of the image tag values, for example 25.11-ubuntu24.04-gke.4. You use this tag for the IMAGE_TAG placeholder in the following configuration file.

  4. Save the following configuration to a new file named values.yaml :

      controller:
        slurmctld:
          image:
            repository: gcr.io/gke-release/slinky/slurmctld
            tag: IMAGE_TAG
        reconfigure:
          image:
            repository: gcr.io/gke-release/slinky/slurmctld
            tag: IMAGE_TAG
      restapi:
        replicas: 1
        slurmrestd:
          image:
            repository: gcr.io/gke-release/slinky/slurmrestd
            tag: IMAGE_TAG
      nodesets:
        slinky:
          replicas: 1
          slurmd:
            image:
              repository: gcr.io/gke-release/slinky/slurmd
              tag: IMAGE_TAG
          # The podSpec block is optional and only required when using
          # a dedicated node pool for compute nodes.
          podSpec:
            nodeSelector:
              cloud.google.com/gke-nodepool: NODE_POOL_NAME
            tolerations:
              - key: "slurm-worker"
                operator: "Equal"
                value: "true"
                effect: "NoSchedule"
      loginsets:
        slinky:
          enabled: true
          replicas: 1
          login:
            image:
              repository: gcr.io/gke-release/slinky/login
              tag: IMAGE_TAG

    Replace IMAGE_TAG with the tag that you noted in the previous step, for example 25.11-ubuntu24.04-gke.4. If you created a dedicated compute node pool, also replace NODE_POOL_NAME with its name; otherwise, remove the podSpec block.

  5. Install the Slurm Helm chart by using the values.yaml file:

     helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
         --namespace=slurm \
         --create-namespace \
         --version 1.0.2 \
         -f values.yaml
    
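
    To confirm that the release installed, you can list the Helm releases in the namespace; the slurm release should report a deployed status:

     helm list --namespace slurm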

Verify the Slurm installation

You can verify that Slurm is deployed on the cluster by using kubectl .

  1. Check Pod status:

     kubectl get pods --namespace slurm

    The output should be similar to the following, and show the Running status for all Pods:

     NAME                                  READY   STATUS    RESTARTS   AGE
    slurm-controller-0                    3/3     Running   0          60s
    slurm-login-slinky-5d79cd755c-mf62z   1/1     Running   0          60s
    slurm-restapi-6b4ccb479f-njlp9        1/1     Running   0          60s
    slurm-worker-slinky-0                 2/2     Running   0          60s 
    
  2. To see the registered nodes, execute the sinfo command on the login node:

     kubectl exec -it deployment/slurm-login-slinky -n slurm -- sinfo

    The output should list the slinky partition and the worker node.
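
    For reference, the output is similar to the following. The partition and node names come from the nodesets key in values.yaml, so they differ if you renamed the slinky node set:

     PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
     slinky*      up   infinite      1   idle slurm-worker-slinky-0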

Run a Slurm job

  1. To run a job, you need to access the Slurm login node. The way you access the login node depends on whether you have configured OS Login in the previous section.

    1. If you configured OS Login in the preceding section, access the login node by using SSH. To do this, get the external IP address of the slurm-login-slinky Service:

       kubectl get service --namespace slurm slurm-login-slinky

      The output looks like this:

       NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
      slurm-login-slinky   LoadBalancer   10.X.X.X        X.X.X.X        22:30171/TCP   5m 
      

      Copy the value of the EXTERNAL-IP column, and then connect to the login node:

       ssh OSLOGIN_USERNAME@EXTERNAL_IP

      Replace the following:

      • EXTERNAL_IP: the IP address that you obtained in the previous step.
      • OSLOGIN_USERNAME: your OS Login username.
    2. If you did not configure OS Login, you can still access the login node by using the kubectl exec command:

       kubectl exec -it deployment/slurm-login-slinky -n slurm -- bash
  2. Run an interactive job: After you're on the login node, you can run a command on a compute node by using the srun command-line utility.

     srun hostname

    The output includes the hostname of the slurm-worker-slinky-0 Pod.
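
  3. Optionally, submit a batch job. The following script is a minimal sketch that uses only standard Slurm directives; save it as hello.sbatch on the login node:

     #!/bin/bash
     #SBATCH --job-name=hello
     #SBATCH --output=hello_%j.out
     #SBATCH --ntasks=1

     srun hostname

    Then submit the script and monitor the queue:

     sbatch hello.sbatch
     squeue

    After the job completes, the output file (for example, hello_1.out) contains the hostname of the worker Pod.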

Clean up

To avoid incurring charges, clean up the resources created in this document.

  1. Uninstall the Helm deployment: This command removes all Kubernetes resources deployed by the Helm chart.

     helm uninstall slurm --namespace slurm
  2. Delete the Slurm namespace:

     kubectl delete namespace slurm
  3. Delete the GKE cluster:

     gcloud container clusters delete CLUSTER_NAME \
         --location=LOCATION

    Replace the following:

    • CLUSTER_NAME: the name of the cluster.
    • LOCATION: the location of the cluster.
