This document helps you understand the benefits and the shared responsibility model for deploying and managing Slurm clusters on Google Kubernetes Engine (GKE). Slurm is a highly scalable, open-source workload manager and job scheduler focused on high performance computing (HPC) workloads.
This document focuses on the Slurm Operator add-on for GKE, which integrates the Slurm interface for batch jobs with GKE's scalability and efficient resource management.
This document is for Data administrators, Operators, and Developers who want to enable and configure Slurm with the Slurm Operator add-on for GKE.
Why use Slurm on GKE
Running Slurm on GKE lets you set up environments where Slurm training jobs can run in parallel with other Kubernetes workloads, such as Ray jobs or serving inference, on the same shared compute pool.
Google Cloud offers the following ways to run Slurm on GKE:
- Cluster Director: a managed solution that provides an opinionated, pre-configured experience. We recommend that you use Cluster Director if you want a managed experience with minimal configuration, where Google manages the Slurm control plane, software versions, and configurations. For more information, see Cluster Director.
- Slurm Operator add-on for GKE: a managed installation of the Slurm Operator, an open-source technology developed as part of the Slinky project. This managed installation follows Google best practices. The Slurm Operator add-on for GKE helps platform engineers who want to build custom platforms that run artificial intelligence (AI), machine learning (ML), and HPC workloads on accelerator-optimized machines. We recommend that you use the Slurm Operator add-on for GKE if you need full control over your Slurm configurations, want to integrate custom containers, or need to run Slurm alongside other workloads, like Ray or inference, on the same cluster. The Slurm Operator add-on for GKE has the following benefits:
- Unified infrastructure: you can manage a single GKE cluster for both HPC batch jobs and microservices, which reduces operational silos.
- Efficient scaling: the Slurm Operator add-on for GKE uses GKE's rapid node provisioning and efficient bin-packing to optimize resource usage.
- High-performance hardware: you can access Google Cloud's latest accelerators, such as TPUs and GPUs, by using Slurm commands.
The following sections of this document focus on the Slurm Operator add-on.
Understand the infrastructure layer
Typically, Slurm is deployed directly on bare-metal servers or VMs. In these setups, Slurm assumes a relatively static set of nodes that it manages for the duration of its lifecycle.
GKE, on the other hand, is a managed service that abstracts away the underlying VMs by presenting them as nodes in a dynamic pool. GKE handles the scheduling and scaling of these nodes automatically.
With the Slurm Operator add-on for GKE, Slurm runs as a workload on top of GKE. In this model, GKE and Slurm provide the following capabilities:
- GKE as the infrastructure manager: GKE provides and manages the underlying nodes, networking, and security.
- Slurm as the workload manager: Slurm provides the interface for users to submit and manage batch jobs.
This convergence creates a heterogeneous stack where Slurm jobs can share the same underlying hardware (like GPUs and TPUs) with Kubernetes applications, such as Ray or inference serving.
How the Slurm Operator add-on works
When you enable the Slurm Operator add-on for GKE in your GKE clusters, GKE performs the following steps:
- GKE installs and hosts the Slurm Operator. This operator runs in the GKE control plane and manages the lifecycle of the Slurm components deployed in your cluster, and it automatically handles prerequisites like certificate generation and custom resource management.
- After the operator is running, you define your Slurm cluster topology by using Kubernetes custom resources.
- The operator deploys the necessary Pods—such as controller, login, and workers—to match the specifications of your workload.
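For illustration, a minimal cluster topology definition might look like the following sketch. The `apiVersion` and field names are assumptions based on the open-source Slinky custom resource definitions, not the add-on's authoritative schema; check the add-on reference for the exact fields.

```yaml
# Illustrative sketch only: the apiVersion and field names below are
# assumptions based on the open-source Slinky CRDs, not the exact
# schema that the Slurm Operator add-on for GKE installs.
apiVersion: slinky.slurm.net/v1beta1
kind: Controller
metadata:
  name: slurm
  namespace: slurm
spec:
  # The operator creates a Pod that runs the slurmctld daemon for this
  # Controller resource; NodeSet and LoginSet resources in the same
  # namespace attach to it.
  clusterName: slurm   # hypothetical field name
```

After you apply a manifest like this with kubectl, the operator reconciles the resource and creates the corresponding Pods.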
Custom resources
The Slurm Operator add-on for GKE uses the following custom resources to manage the Slurm cluster:
- NodeSet: defines a set of homogeneous worker nodes. You can map Slurm workers to specific GKE node pools or hardware types. For example, you can create a NodeSet for H100 GPUs and another for TPUs.
- Controller: defines the core component of the Slurm cluster. This resource creates the slurmctld component in a Pod, which manages all other components, such as a NodeSet or LoginSet.
- LoginSet: defines the login nodes where users sign in to submit jobs.
- Restapi: defines the REST API component of the Slurm cluster.
- Accounting: when configured, deploys the slurmdbd component to handle job accounting and usage tracking, typically connected to a Cloud SQL database.
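As a sketch of the NodeSet-to-node-pool mapping described above, the following manifest pins a set of Slurm workers to a dedicated GKE node pool. The `apiVersion` and spec fields are assumptions based on the open-source Slinky CRDs; the nodeSelector key `cloud.google.com/gke-nodepool` is the standard GKE node pool label, and `h100-pool` is a placeholder name.

```yaml
# Illustrative sketch: field names are assumptions based on the
# open-source Slinky CRDs, not the authoritative add-on schema.
apiVersion: slinky.slurm.net/v1beta1
kind: NodeSet
metadata:
  name: gpu-workers
  namespace: slurm
spec:
  replicas: 4          # four slurmd worker Pods
  template:
    spec:
      # Standard GKE label that pins these worker Pods to the node
      # pool with the accelerators; "h100-pool" is a placeholder.
      nodeSelector:
        cloud.google.com/gke-nodepool: h100-pool
```

A second NodeSet with a different nodeSelector can target a TPU node pool, which is how a single Slurm cluster spans heterogeneous hardware.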
The shared responsibilities of the Slurm Operator add-on
When you choose to run Slurm on GKE with the Slurm Operator add-on for GKE, Google Cloud and you share responsibilities. This integration gives you the flexibility to customize the environment while Google manages the orchestration layer.
Google's responsibilities
- Operator lifecycle: install the Slurm Operator.
- GKE control plane: manage the reliability and uptime of the GKE control plane.
- Kubernetes CustomResourceDefinitions: manage the custom resources required for Slurm.
- Base images: provide optimized, Google-owned container images for Slurm components. The use of these images is optional; when you configure your Slurm cluster, you can use these Google-provided images or use your own images.
Customer's responsibilities
- Slurm configuration: define the Slurm cluster topology, partitions, limits, and plugins by using YAML files.
- Job submission: manage user access and job submission workflows.
- Custom images: maintain the custom container images used for login or worker nodes if the default Google images don't meet your requirements.
- External dependencies: manage external resources, such as Cloud SQL for accounting or Filestore for shared storage.
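To make the external-dependencies split concrete, the following sketch shows an Accounting resource that points slurmdbd at a database outside the cluster, such as a Cloud SQL instance reached through the Cloud SQL Auth Proxy. The field names here are hypothetical; only the division of labor (you manage the database, the operator manages slurmdbd) comes from this document.

```yaml
# Illustrative sketch with hypothetical field names: you provision and
# manage the database (for example, Cloud SQL); the operator deploys
# and manages the slurmdbd Pod that connects to it.
apiVersion: slinky.slurm.net/v1beta1
kind: Accounting
metadata:
  name: slurm-accounting
  namespace: slurm
spec:
  storageHost: 127.0.0.1        # e.g. a Cloud SQL Auth Proxy sidecar
  storagePort: 3306
  storagePassSecretRef:
    name: slurmdbd-db-password  # Kubernetes Secret that you manage
```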
For more information, see GKE shared responsibility.
What's next
To get started with the Slurm Operator add-on for GKE, do the following:
- Learn how to deploy a Slurm cluster on GKE.
- Learn how to enable the Slurm Operator add-on.

