This document explains how to build custom Docker images for your Slurm clusters on Google Kubernetes Engine (GKE). You can extend the base Slurm images provided by GKE to include additional tools, libraries, or configurations required for your high performance computing (HPC) workloads.
Before reading this document, ensure that you're familiar with the Slurm Operator add-on for GKE.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the `gcloud components update` command. Earlier gcloud CLI versions might not support running the commands in this document.
Prerequisites
This document assumes that you already have a Slurm cluster running on GKE with the Slurm Operator add-on for GKE installed. Complete the procedures on the following pages:
- Complete the Quickstart: Deploy a Slurm cluster on GKE.
- Configure an Artifact Registry repository in your project to store your custom images.
Slurm base images
GKE provides base Slurm images in the gcr.io/gke-release/ Artifact Registry repository. GKE updates these images frequently for security and performance. Image variants are available for the latest Slurm versions on two Linux distributions: Ubuntu and Rocky Linux.
You can customize the following base images:
- `gcr.io/gke-release/slinky/slurmd`: used for Slurm compute nodes.
- `gcr.io/gke-release/slinky/login`: used for login nodes.
Build a custom image
The following example demonstrates how to build a custom Slurm compute image that includes a Python virtual environment with JAX installed. You also build a corresponding login image that mirrors the compute image's `PATH` environment variable without actually installing the JAX libraries.
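The mirroring trick can be demonstrated locally. On the real login node the venv directory is empty; in this sketch a `python3` symlink is added (an assumption made only for illustration, with a hypothetical `/tmp` path) so that the shell's `PATH` lookup can be observed end to end:

```shell
# Why mirroring PATH works: the login shell only needs the directory on its
# PATH; the interpreter itself lives on the compute node.
VIRTUAL_ENV=/tmp/custom_venv
mkdir -p "${VIRTUAL_ENV}/bin"

# Illustration only: link the system python3 so the lookup resolves here too.
ln -sf "$(command -v python3)" "${VIRTUAL_ENV}/bin/python3"

# With the venv directory first on PATH, python3 resolves inside it.
PATH="${VIRTUAL_ENV}/bin:$PATH" command -v python3
```

When `srun` runs from a login shell with this `PATH`, the same lookup happens on the worker, where the venv actually contains the interpreter and the JAX packages.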
Select the image version
When you select a base image, ensure that it meets the following conditions:
- The version matches the Slurm version used by other components in your Slurm cluster.
- For a specific Slurm version, choose the tag of the newest available image, which includes the latest security updates and bug fixes.
For example, if the default Slurm version in your cluster is 25.11, choose a tag that starts with `25.11-`, for example `25.11-ubuntu24.04-gke.6`.
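As a sketch of this selection, given a hypothetical list of tags (as shown in the console, or as returned by `gcloud container images list-tags gcr.io/gke-release/slinky/slurmd`), standard shell tools can filter for the distribution and Slurm version and pick the newest `-gke.N` revision:

```shell
# Hypothetical tag list for illustration; real tags come from Artifact Registry.
tags="25.11-ubuntu24.04-gke.4
25.11-rockylinux9-gke.6
24.11-ubuntu24.04-gke.9
25.11-ubuntu24.04-gke.6"

# Keep Ubuntu tags for Slurm 25.11, then pick the highest gke revision
# (the fourth dot-separated field), which carries the latest fixes.
echo "$tags" | grep '^25\.11-ubuntu' | sort -t. -k4 -n | tail -n 1
```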
Create a Dockerfile
- Select an Ubuntu-based `slurmd` image tag:
  - In the Google Cloud console, go to the Artifact Registry repository page that includes the `slinky/slurmd` package.
  - Find an image with a tag that includes `ubuntu` and matches your Slurm version, for example `25.11-ubuntu24.04-gke.6`.
  - Copy the tag. You use this tag to replace the `VERSION_TAG` placeholder in the following configuration file.
- Create a file named `Dockerfile` with the following content:

  ```dockerfile
  # --- Target 1: The Worker Node (slurmd) ---
  FROM gcr.io/gke-release/slinky/slurmd:VERSION_TAG AS slurmd-custom

  USER root

  # Install minimal requirements for venv
  RUN apt-get update && apt-get install -y --no-install-recommends \
      python3-pip \
      python3-venv \
      && rm -rf /var/lib/apt/lists/*

  # Create and populate the virtual environment
  ENV VIRTUAL_ENV=/opt/custom_venv
  RUN python3 -m venv ${VIRTUAL_ENV}
  ENV PATH="${VIRTUAL_ENV}/bin:$PATH"

  # Install JAX (CPU version for general compatibility) and dependencies
  RUN pip install --no-cache-dir jax[cpu] numpy

  # --- Target 2: The Login Node ---
  FROM gcr.io/gke-release/slinky/login:VERSION_TAG AS login-custom

  USER root

  # Mirror the PATH exactly so that the srun command captures it.
  # Note: You don't need to install the JAX libs here,
  # but the binary path must exist for the shell to recognize it.
  ENV VIRTUAL_ENV=/opt/custom_venv
  ENV PATH="${VIRTUAL_ENV}/bin:$PATH"

  # Create the directory structure so the PATH is valid on the login node
  RUN mkdir -p ${VIRTUAL_ENV}/bin
  ```

  Replace `VERSION_TAG` with the Slurm version tag that matches your cluster's default Slurm version.
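One way to fill in the placeholder without editing the file by hand is a `sed` substitution; the tag below is only an example, and you would run `sed` over the whole `Dockerfile` the same way:

```shell
VERSION_TAG="25.11-ubuntu24.04-gke.6"  # example tag; use your cluster's version

# Substitute the placeholder in a sample FROM line.
echo "FROM gcr.io/gke-release/slinky/slurmd:VERSION_TAG AS slurmd-custom" \
    | sed "s/VERSION_TAG/${VERSION_TAG}/"
```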
- Build the images by using the `docker build` command:

  ```shell
  docker build --target=slurmd-custom \
      -t AR_PATH/slinky/slurmd:CUSTOM_SLURMD_TAG \
      -f Dockerfile .
  docker build --target=login-custom \
      -t AR_PATH/slinky/login:CUSTOM_LOGIN_TAG \
      -f Dockerfile .
  ```

  Replace the following:

  - `AR_PATH`: the path to your Artifact Registry repository, for example `gcr.io/my-project`.
  - `CUSTOM_SLURMD_TAG`: a `slurmd-custom` tag name of your choice.
  - `CUSTOM_LOGIN_TAG`: a `login-custom` tag name of your choice.
- Push the custom images to your repository:

  ```shell
  docker push AR_PATH/slinky/slurmd:CUSTOM_SLURMD_TAG
  docker push AR_PATH/slinky/login:CUSTOM_LOGIN_TAG
  ```
Use the custom images in GKE
To use your custom images, complete the following steps:
- As shown in the following example, update the image repository and tag for the `slurmd` nodeset and `login` loginset by modifying the `values.yaml` file:

  ```yaml
  nodesets:
    slinky:
      replicas: 1
      slurmd:
        image:
          repository: AR_PATH/slinky/slurmd
          tag: CUSTOM_SLURMD_TAG
  loginsets:
    slinky:
      enabled: true
      replicas: 1
      login:
        image:
          repository: AR_PATH/slinky/login
          tag: CUSTOM_LOGIN_TAG
  ```
- Upgrade the existing deployment:

  ```shell
  helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
      --namespace slurm \
      --version=1.0.2 \
      -f values.yaml
  ```
- Test the new capabilities of your compute node by signing in to the login node and running the following `srun` command:

  ```shell
  srun python3 -c "
  import sys
  import jax
  import jax.numpy as jnp
  print(f'Python Executable: {sys.executable}')
  print(f'Using JAX backend: {jax.devices()[0].platform}')
  key = jax.random.PRNGKey(42)
  x = jax.random.normal(key, (5000, 5000))
  result = jnp.dot(x, x)
  print(f'Matrix multiplication successful. Shape: {result.shape}')
  "
  ```

  The output is similar to the following:

  ```
  Python Executable: /opt/custom_venv/bin/python3
  Using JAX backend: cpu
  Matrix multiplication successful. Shape: (5000, 5000)
  ```

  This output confirms that Slurm executes the script on worker Pods running your custom image, and that the image contains the required Python and JAX capabilities.
Clean up
To clean up the resources that you used in this tutorial, do the following:
- Uninstall the Helm deployment:

  ```shell
  helm uninstall slurm --namespace slurm
  ```

  This command removes all Kubernetes resources deployed by the Helm chart.
- Delete the Slurm namespace:

  ```shell
  kubectl delete namespace slurm
  ```
- Delete the GKE cluster:

  ```shell
  gcloud container clusters delete CLUSTER_NAME
  ```

  Replace `CLUSTER_NAME` with your cluster name.
- Delete the custom images from Artifact Registry:

  ```shell
  gcloud container images delete AR_PATH/slinky/slurmd:CUSTOM_SLURMD_TAG --force-delete-tags
  gcloud container images delete AR_PATH/slinky/login:CUSTOM_LOGIN_TAG --force-delete-tags
  ```
- Remove the custom images from your local Docker environment:

  ```shell
  docker rmi AR_PATH/slinky/slurmd:CUSTOM_SLURMD_TAG
  docker rmi AR_PATH/slinky/login:CUSTOM_LOGIN_TAG
  ```
What's next
- Explore the Slurm Project on GitHub.
- Learn how to enable the Slurm Operator add-on for GKE.

