Build custom Slurm Docker images

This document explains how to build custom Docker images for your Slurm clusters on Google Kubernetes Engine (GKE). You can extend the base Slurm images provided by GKE to include additional tools, libraries, or configurations required for your high performance computing (HPC) workloads.

Before reading this document, ensure that you're familiar with the Slurm Operator add-on for GKE.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.

Prerequisites

This document assumes that you already have a Slurm cluster running on GKE with the Slurm Operator add-on for GKE installed. Complete the procedures on the following pages:

  1. Complete the Quickstart: Deploy a Slurm cluster on GKE.
  2. Configure an Artifact Registry repository in your project to store your custom images.

Slurm base images

GKE provides base Slurm images in the gcr.io/gke-release/ Artifact Registry repository. GKE updates these images frequently for security and performance. These images come in variants that include the latest Slurm versions and two Linux distributions, Ubuntu and Rocky Linux.

You can customize the following base images:

  • gcr.io/gke-release/slinky/slurmd: used for Slurm compute nodes.
  • gcr.io/gke-release/slinky/login: used for login nodes.

Build a custom image

The following example demonstrates how to build a custom Slurm compute image that includes a Python virtual environment with JAX installed. You also build a corresponding login image that mirrors the compute image's PATH environment variable without actually installing the JAX libraries.

Select the image version

When you select a base image, ensure that it meets the following conditions:

  • The version matches the Slurm version used by other components in your Slurm cluster.
  • For a specific Slurm version, choose the tag of the newest available image, which includes the latest security updates and bug fixes.

For example, if the default Slurm version in your cluster is 25.11, you should choose a tag that starts with 25.11-, for example 25.11-ubuntu24.04-gke.6.
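As a minimal sketch, you can guard a build against a mismatched tag with a quick shell check before running docker build. The SLURM_VERSION and CANDIDATE_TAG values below are assumptions; substitute your cluster's default Slurm version and the tag you copied from Artifact Registry:

```shell
# Assumed values: replace with your cluster's default Slurm version and
# the image tag you selected from Artifact Registry.
SLURM_VERSION="25.11"
CANDIDATE_TAG="25.11-ubuntu24.04-gke.6"

# The tag is usable only if it starts with the cluster's Slurm version.
case "${CANDIDATE_TAG}" in
  "${SLURM_VERSION}-"*)
    echo "OK: ${CANDIDATE_TAG} matches Slurm ${SLURM_VERSION}"
    ;;
  *)
    echo "Mismatch: pick a tag that starts with ${SLURM_VERSION}-" >&2
    exit 1
    ;;
esac
```

Running this check in a build script fails fast, before any image layers are pulled, if the tag drifts from the cluster's Slurm version.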

Create a Dockerfile

  1. Select an Ubuntu-based slurmd image tag:

    1. In the Google Cloud console, go to the Artifact Registry repository page that includes the slinky/slurmd package.

      Go to Artifact Registry repository

    2. Find an image with a tag that includes ubuntu and matches your Slurm version, for example 25.11-ubuntu24.04-gke.6.

    3. Copy the tag. You use this tag to replace the VERSION_TAG placeholder in the following configuration file.

  2. Create a file named Dockerfile with the following content:

      # --- Target 1: The Worker Node (slurmd) ---
      FROM gcr.io/gke-release/slinky/slurmd:VERSION_TAG AS slurmd-custom
      USER root

      # Install minimal requirements for venv
      RUN apt-get update && apt-get install -y --no-install-recommends \
          python3-pip \
          python3-venv \
          && rm -rf /var/lib/apt/lists/*

      # Create and populate the virtual environment
      ENV VIRTUAL_ENV=/opt/custom_venv
      RUN python3 -m venv ${VIRTUAL_ENV}
      ENV PATH="${VIRTUAL_ENV}/bin:$PATH"

      # Install JAX (CPU version for general compatibility) and dependencies
      RUN pip install --no-cache-dir jax[cpu] numpy

      # --- Target 2: The Login Node ---
      FROM gcr.io/gke-release/slinky/login:VERSION_TAG AS login-custom
      USER root

      # Mirror the PATH exactly so that the srun command captures it.
      # Note: You don't need to install the JAX libs here,
      # but the binary path must exist for the shell to recognize it.
      ENV VIRTUAL_ENV=/opt/custom_venv
      ENV PATH="${VIRTUAL_ENV}/bin:$PATH"

      # Create the directory structure so the PATH is valid on the login node
      RUN mkdir -p ${VIRTUAL_ENV}/bin

    Replace the VERSION_TAG with the Slurm version tag that matches your cluster's default Slurm version.

  3. Build the images by using the docker build command:

      docker build --target=slurmd-custom \
          -t AR_PATH/slinky/slurmd:CUSTOM_SLURMD_TAG \
          -f Dockerfile .
      docker build --target=login-custom \
          -t AR_PATH/slinky/login:CUSTOM_LOGIN_TAG \
          -f Dockerfile .
    

    Replace the following:

    • AR_PATH: the path to your Artifact Registry repository, for example gcr.io/my-project.
    • CUSTOM_SLURMD_TAG: a tag of your choice for the slurmd-custom image.
    • CUSTOM_LOGIN_TAG: a tag of your choice for the login-custom image.
  4. Push the custom images to your repository:

      docker push AR_PATH/slinky/slurmd:CUSTOM_SLURMD_TAG
      docker push AR_PATH/slinky/login:CUSTOM_LOGIN_TAG
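Before pushing, a quick sanity check can catch a malformed image reference, for example a missing registry host, which would make docker push target Docker Hub instead of Artifact Registry. The values below are hypothetical placeholders, not values from this guide:

```shell
# Assumed placeholder values; substitute your repository path and tag.
AR_PATH="gcr.io/my-project"
CUSTOM_SLURMD_TAG="25.11-jax.1"
IMAGE_REF="${AR_PATH}/slinky/slurmd:${CUSTOM_SLURMD_TAG}"

# A pushable reference needs a registry host (contains a dot), a
# repository path, and an explicit tag.
case "${IMAGE_REF}" in
  *.*/*/*:*) echo "OK: ${IMAGE_REF}" ;;
  *) echo "Malformed image reference: ${IMAGE_REF}" >&2; exit 1 ;;
esac
```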
     
    

Use the custom images in GKE

To use your custom images, complete the following steps:

  1. Update the image repository and tag for the slurmd nodeset and the login loginset in your values.yaml file, as shown in the following example:

      nodesets:
        slinky:
          replicas: 1
          slurmd:
            image:
              repository: AR_PATH/slinky/slurmd
              tag: CUSTOM_SLURMD_TAG
      loginsets:
        slinky:
          enabled: true
          replicas: 1
          login:
            image:
              repository: AR_PATH/slinky/login
              tag: CUSTOM_LOGIN_TAG
  2. Upgrade the existing deployment:

      helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
          --namespace slurm \
          --version=1.0.2 \
          -f values.yaml
    
  3. Test the new capabilities of your compute node by signing in to the login node and running the following srun command:

      srun python3 -c "
      import sys
      import jax
      import jax.numpy as jnp
      print(f'Python Executable: {sys.executable}')
      print(f'Using JAX backend: {jax.devices()[0].platform}')
      key = jax.random.PRNGKey(42)
      x = jax.random.normal(key, (5000, 5000))
      result = jnp.dot(x, x)
      print(f'Matrix multiplication successful. Shape: {result.shape}')
      "
     
    

    The output is similar to the following:

     Python Executable: /opt/custom_venv/bin/python3
    Using JAX backend: cpu
    Matrix multiplication successful. Shape: (5000, 5000) 
    

    This output confirms that Slurm executes the script on worker Pods running your custom image, and the image contains the required Python and JAX capabilities.
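You can also confirm the PATH mirroring on the login node itself. The sketch below reproduces that check locally: the /opt/custom_venv path comes from the Dockerfile in this guide, but the PATH assignment is included here only so the snippet is self-contained; on a real login node the image's ENV instruction sets it already, so you would run only the case test:

```shell
# The venv path set by ENV VIRTUAL_ENV in both custom images.
VIRTUAL_ENV=/opt/custom_venv
# Set locally so this sketch is self-contained; the login image's
# ENV PATH line does this for you on the actual node.
PATH="${VIRTUAL_ENV}/bin:${PATH}"

# Verify the venv bin directory is the first PATH entry, so shells
# launched through srun resolve python3 from the virtual environment.
case "${PATH}" in
  "${VIRTUAL_ENV}/bin:"*) echo "PATH is mirrored correctly" ;;
  *) echo "PATH does not start with ${VIRTUAL_ENV}/bin" >&2; exit 1 ;;
esac
```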

Clean up

To clean up the resources that you used in this tutorial, do the following:

  1. Uninstall the Helm deployment:

      helm uninstall slurm --namespace slurm

    This command removes all Kubernetes resources deployed by the Helm chart.

  2. Delete the Slurm namespace:

      kubectl delete namespace slurm
    
  3. Delete the GKE cluster:

      gcloud container clusters delete CLUSTER_NAME
    

    Replace CLUSTER_NAME with your cluster name.

  4. Delete the custom images from Artifact Registry:

      gcloud container images delete AR_PATH/slinky/slurmd:CUSTOM_SLURMD_TAG \
          --force-delete-tags
      gcloud container images delete AR_PATH/slinky/login:CUSTOM_LOGIN_TAG \
          --force-delete-tags
    
  5. Remove the custom images from your local Docker environment:

      docker rmi AR_PATH/slinky/slurmd:CUSTOM_SLURMD_TAG
      docker rmi AR_PATH/slinky/login:CUSTOM_LOGIN_TAG
     
    

What's next
