Create a Slurm controller node

The schedmd-slurm-gcp-v6-controller module lets you create a Slurm controller node. By using this module, you establish the central management node for your Slurm cluster. The controller node manages job queues, schedules workloads, and monitors compute resources.

For the complete list of inputs and outputs for this module, see the schedmd-slurm-gcp-v6-controller module page in the Cluster Toolkit GitHub repository.

Before you begin

Before you begin, verify that you meet the following requirements:

  • You have installed and configured Cluster Toolkit. For installation instructions, see Set up Cluster Toolkit.
  • You have an existing cluster blueprint. You can use and modify an existing blueprint or create one from scratch. For a working example of a blueprint configured for Slurm, see the examples/hpc-slurm.yaml file. For more information about creating and customizing blueprints, see Cluster blueprint.
  • To view a complete list of blueprints that support Slurm, go to the Cluster blueprint catalog page, click the Select scheduler menu and then select Slurm.
  • Note that the schedmd-slurm-gcp-v6-controller module does not execute your individual jobs or provide direct user login capability. It provides the central management plane that monitors compute resources and orchestrates the scheduling of your workloads.

Required roles

To get the permissions that you need to deploy the Slurm controller node, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create a Slurm controller

The following example creates a controller node that has the following attributes:

  • Connects to the primary subnetwork of the network module.
  • Mounts the file system that has the homefs identifier.
  • Configures one partition that has the compute_partition identifier.
  • Upgrades the machine type from the default c2-standard-4 type to the c2-standard-8 type.

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network
    - homefs
    - compute_partition
    settings:
      machine_type: c2-standard-8
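
The identifiers in the use block refer to other modules defined elsewhere in the blueprint. For illustration only, companion nodeset and partition modules that this controller example could consume might resemble the following sketch; the module IDs and settings shown here are assumptions, not part of the example above:

```yaml
# Hypothetical companion modules for the controller example.
# The IDs (compute_nodeset, compute_partition) and the settings
# values are illustrative; adjust them to match your blueprint.
- id: compute_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use: [network]
  settings:
    machine_type: c2-standard-4
    node_count_dynamic_max: 20

- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v6-partition
  use: [compute_nodeset]
  settings:
    partition_name: compute
```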

Update your live Slurm cluster

The schedmd-slurm-gcp-v6-controller module supports the reconfiguration of partitions and the Slurm configuration in a running, active cluster. To reconfigure a running cluster, do the following:

  1. Edit the blueprint with the configuration changes that you want.
  2. To overwrite the deployment directory, run the gcluster create command with the -w flag.
  3. Deploy the changes by following the terminal instructions.
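
For example, assuming a blueprint file named hpc-slurm.yaml and a deployment named hpc-slurm (both names are illustrative), the redeployment flow might look like the following:

```shell
# Regenerate the deployment directory, overwriting the existing one (-w).
./gcluster create hpc-slurm.yaml -w

# Apply the changes; gcluster prints the exact follow-up
# instructions to run in your terminal.
./gcluster deploy hpc-slurm
```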

The following list provides examples of updates that you can apply to a running cluster:

  • Add or remove a partition.
  • Resize an existing partition.
  • Attach new network storage to an existing partition.
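
For example, to resize an existing partition, you can raise the node count on its nodeset and then redeploy. The following sketch assumes a nodeset with the ID compute_nodeset and a previous maximum of 20 nodes; both are assumptions for illustration:

```yaml
# Hypothetical nodeset edit: raise the maximum dynamic node count,
# then re-run gcluster create with -w and redeploy the cluster.
- id: compute_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use: [network]
  settings:
    machine_type: c2-standard-4
    node_count_dynamic_max: 40  # previously 20
```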

Configure custom images

For more information about how to create valid custom images for the controller VM instance or for custom instance templates, see the Slurm on Google Cloud custom images documentation on GitHub.
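
After you build a custom image, you can point the controller module at it. The following sketch assumes an image family named my-slurm-image in a project named my-project-id; both names are hypothetical, and you should verify the image-related setting names against the module page for your Toolkit version:

```yaml
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
  settings:
    instance_image:
      family: my-slurm-image   # hypothetical image family
      project: my-project-id   # hypothetical project ID
    instance_image_custom: true
```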

Turn on scheduled maintenance reservations

A maintenance event occurs when Google Cloud stops a VM instance to perform a hardware or software update. The VM's host maintenance policy determines how these events are handled. Maintenance events can disrupt your running jobs.

To protect jobs from termination during maintenance, you can turn on the scheduled maintenance reservation feature for your compute nodeset. When you turn on this feature, Slurm reserves your node for maintenance during the maintenance window. Slurm does not schedule any jobs that overlap with the maintenance reservation.

To turn on the maintenance reservation feature, configure your blueprint to resemble the following example:

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      enable_maintenance_reservation: true

When you run a job on the Slurm cluster, specify the job's maximum run time by using the -t flag, in minutes or minutes:seconds format. Slurm uses this time limit to avoid scheduling the job where it would overlap the maintenance reservation. The following command sets a time limit of 10 minutes:

srun -n1 -pcompute -t 10:00 JOB_FILE.sh
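
Equivalently, in a batch script you can set the same limit with an #SBATCH directive. The script contents below are illustrative; the workload command is a hypothetical placeholder:

```shell
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --ntasks=1
#SBATCH --time=10:00   # 10-minute limit, same as srun -t 10:00

./my_job_step.sh       # hypothetical workload command
```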

Only the Compute Engine API alpha version supports the maintenance notification feature. To update the API version in your blueprint, configure the endpoint_versions setting to resemble the following example:

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    settings:
      endpoint_versions:
        compute: "alpha"

Opportunistic Compute Engine API maintenance in Slurm

You can turn on opportunistic Compute Engine API maintenance to perform planned maintenance early, as a Slurm job. When the system detects that a node has upcoming maintenance, Slurm creates a maintenance job and places it in the queue.

If you use the backfill scheduler, Slurm backfills the maintenance job whenever a time window is available. If you use the built-in scheduler instead, Slurm executes the maintenance job in strict priority order. If the maintenance job does not execute before the scheduled window, forced maintenance occurs at that window.

To turn on this feature at the nodeset level, configure your blueprint to look like the following:

  - id: debug_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      enable_opportunistic_maintenance: true

Use placement policies

When you turn on the enable_placement setting with Slurm, Google Cloud attempts to place the VM instances as physically close together as possible. Capacity constraints at the time of VM creation might still force the system to spread the VM instances across multiple racks.

Google provides the max-distance flag to control the maximum allowed spreading. For more information about the max-distance flag, see Reduce latency by using compact placement policies.

You can use the placement_max_distance setting on the schedmd-slurm-gcp-v6-nodeset module to control the max-distance behavior. The following example demonstrates how to configure this setting:

  - id: nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [ network ]
    settings:
      machine_type: c2-standard-4
      node_count_dynamic_max: 30
      enable_placement: true
      placement_max_distance: 1

You must set the enable_placement variable to true for the placement_max_distance setting to take effect.

In the preceding example, a value of 1 restricts the VM instances to the same rack. To confirm that the system applied the max-distance value, execute the following command while jobs are running:

gcloud beta compute resource-policies list \
  --format='yaml(name,groupPlacementPolicy.maxDistance)'

Configure tree width and node communication

Slurm uses a fan-out mechanism to communicate with large groups of nodes. The TreeWidth configuration variable determines the shape of this fan-out tree.

In the cloud, this fan-out mechanism can become unstable when nodes restart with new IP addresses. To ensure that all nodes communicate directly with the controller, set the tree_width value to a number greater than or equal to the size of the largest partition.

If the largest partition contains 200 nodes, then configure the blueprint to resemble the following example:

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    settings:
      cloud_parameters:
        tree_width: 200

The default value is 128. Values greater than 128 might cause congestion on the controller.

Configure resume rate and node resumption

The ResumeRate parameter in the slurm.conf file controls the maximum number of nodes that Slurm attempts to resume per minute. This parameter is important in cloud environments because autoscaling can cause a large number of nodes to start concurrently.

When many nodes start simultaneously, they place a heavy load on shared resources, such as shared file systems. To reduce the peak load on these shared resources and improve overall cluster stability during scaling events, you can limit the ResumeRate parameter to stagger the node startup process.

The following example demonstrates how to limit the node resumption rate to 100 nodes per minute by using the cloud_parameters setting:

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    settings:
      cloud_parameters:
        resume_rate: 100

Adjust this value based on the capabilities of your shared file system and the expected scaling behavior of your cluster.

What's next