Create a partition for a Slurm controller

Use this module to create a compute partition to use as input when you define the slurm-controller module. A compute partition defines a logical group of compute nodes for your Slurm cluster. This module works alongside the nodeset module to group resources together in a single partition.

This module lets you configure partition-level settings such as job exclusivity, power management, timeouts, and default status.

For the complete list of inputs and outputs, see the schedmd-slurm-gcp-v6-partition module in the Cluster Toolkit GitHub repository.

Before you begin

Before you begin, verify that you meet the following requirements:

  • You have installed and configured Cluster Toolkit. For installation instructions, see Set up Cluster Toolkit.
  • You have an existing cluster blueprint. You can use and modify an existing blueprint or create one from scratch. For a working example of a blueprint configured for the slurm-partition module, see the examples/hpc-slurm.yaml file. For more information about creating and customizing blueprints, see Cluster blueprint.
  • To view a complete list of blueprints that support the slurm-partition module, go to the Cluster blueprint catalog page, click the Select scheduler menu, and then select Slurm.
  • In your blueprint, you must have defined at least one slurm-nodeset module to include in the partition.

Required roles

To get the permissions that you need to deploy the partition, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create a compute partition

To create a compute partition, add the slurm-partition module to your blueprint and specify the nodeset modules that the partition contains.

The following example creates a partition module with the following configuration:

  • Adds two nodesets to the partition by using the use field:
    • The first nodeset uses the c2-standard-30 machine type.
    • The second nodeset uses the c2-standard-60 machine type.
    • Both nodesets specify a maximum of 200 dynamic nodes.
  • Sets the partition name to compute.
  • Connects each nodeset to the network module by using the use field.
  • Mounts homefs by using the use field, which connects the partition to the shared file system module that hosts user home directories.

  - id: nodeset_1
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      name: c30
      node_count_dynamic_max: 200
      machine_type: c2-standard-30

  - id: nodeset_2
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      name: c60
      node_count_dynamic_max: 200
      machine_type: c2-standard-60

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - homefs
    - nodeset_1
    - nodeset_2
    settings:
      partition_name: compute
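
The partition only takes effect once a controller module consumes it. As a minimal sketch, the following fragment adds a controller that lists compute_partition in its use field. The module ID slurm_controller is a hypothetical name, and the source path is an assumption based on the Cluster Toolkit repository layout; check both against the examples/hpc-slurm.yaml file before you use them.

```yaml
  # Hypothetical controller module that consumes the partition defined above.
  # The source path is an assumption; verify it in the Cluster Toolkit repository.
  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network            # Same network module that the nodesets use
    - compute_partition  # Partition ID from the preceding example
    - homefs             # Shared file system for home directories
```

Listing the partition under use, rather than copying its outputs by hand, keeps the controller in sync if you later rename or add nodesets.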
 

Set the default partition

You can set a specific partition as the default for jobs that don't explicitly request one. To do this, set the is_default setting to true.

   
    settings:
      partition_name: compute
      is_default: true
 

Configure job exclusivity

By default, the slurm-partition module configures nodes to execute a single job at a time by using the exclusive: true setting. When a job completes, the node is automatically deleted to ensure a clean environment for the next job.

To let multiple jobs share a node, set the exclusive setting to false:

   
    settings:
      partition_name: shared-compute
      exclusive: false
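
The default and exclusivity settings are often combined across several partitions. The following sketch defines one default partition that keeps the exclusive behavior and a second partition that lets jobs share nodes. The ID shared_partition and the partition name shared are hypothetical, and assigning one nodeset to each partition is an assumption made only to keep the example small.

```yaml
  # Default partition: relies on the default exclusive: true behavior,
  # so each node runs one job at a time.
  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - nodeset_1
    settings:
      partition_name: compute
      is_default: true

  # Hypothetical second partition: multiple jobs can share a node.
  - id: shared_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - nodeset_2
    settings:
      partition_name: shared
      exclusive: false
```

With this layout, jobs that don't request a partition land on compute, and users opt in to node sharing by requesting the shared partition explicitly.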
 

Configure node power management

Slurm on Google Cloud automatically suspends idle nodes to save costs. You can configure this behavior by using the following input settings:

  • suspend_time: controls how many seconds a node remains idle before it turns off. The default is 300 seconds (5 minutes). Set this to -1 to prevent nodes from suspending.
  • suspend_timeout: controls the maximum time allowed (in seconds) between when a request to suspend a node is issued and when the node fully shuts down.
  • resume_timeout: controls the maximum time allowed (in seconds) between when a request to start a node is issued and when the node is ready for use.
   
    settings:
      partition_name: compute
      suspend_time: 600  # Wait 10 minutes before scaling down
      suspend_timeout: 120
      resume_timeout: 300
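
For workloads where node startup latency matters more than idle cost, the -1 value described for suspend_time can be used instead. The following fragment is a minimal sketch of that case; it assumes the same compute partition as the preceding example.

```yaml
    settings:
      partition_name: compute
      suspend_time: -1  # Never suspend idle nodes in this partition
```

Nodes that never suspend keep running, and keep incurring charges, while idle, so this setting suits small or latency-sensitive partitions rather than large burst capacity.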
 

What's next
