Submit and monitor jobs with Slurm

This document explains how to submit and monitor jobs by using Slurm in Cluster Director. By using hardware-optimized submission flags and checking node states, you can achieve the following:

  • Align job scheduling with the underlying physical hardware for A4X virtual machine (VM) instances to minimize network latency.

  • Verify job allocations across cluster partitions and forcefully stop unneeded nodes.

For a high-level overview of how Slurm orchestrates jobs in Cluster Director, see Slurm orchestration in Cluster Director.

Before you begin

If you haven't already, then connect to a login node in your cluster. For instructions, see Connect to a cluster's login node.

Requirements for submitting jobs for A4X VMs

For cluster partitions that use A4X VMs, Slurm uses block topology to align job scheduling with the physical hardware structure of A4X machines, specifically their NVLink domains. This configuration minimizes network latency by ensuring that tasks run on VMs that are physically close to each other.

When you run the salloc, sbatch, or srun commands in Slurm, you can control how the job interacts with blocks of A4X VMs by using the following flags:

  • --segment=SEGMENT_SIZE: this flag groups nodes into segments of a specific size. This configuration lets Slurm fit your job into the available capacity by bypassing nodes that are drained or unavailable. The value must be between 1 and 18. If your job can't start because there aren't enough adjacent nodes to match your segment size, then we recommend using a value of 1.

  • --exclusive=topo: this flag reserves an entire sub-block for a job. This isolation helps ensure that no other jobs share the NVLink domain, preventing interference.
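As an illustration of how these flags might appear on a command line, the following sketch prints sbatch commands without submitting anything. The build_submit_cmd function, the node counts, and the job.sh script name are hypothetical and are not part of Slurm or Cluster Director. Combining both flags in a single command is an assumption made for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or Cluster Director): prints an
# sbatch command line after checking that the segment size is in 1..18.
# Usage: build_submit_cmd NODES SEGMENT_SIZE SCRIPT [exclusive]
build_submit_cmd() {
  nodes="$1"
  segment="$2"
  script="$3"
  mode="${4:-shared}"

  # Reject empty or non-numeric segment sizes.
  case "$segment" in
    ''|*[!0-9]*)
      echo "segment size must be an integer between 1 and 18" >&2
      return 1
      ;;
  esac

  # Enforce the documented range for --segment.
  if [ "$segment" -lt 1 ] || [ "$segment" -gt 18 ]; then
    echo "segment size must be an integer between 1 and 18" >&2
    return 1
  fi

  cmd="sbatch --nodes=${nodes} --segment=${segment}"
  # Optionally reserve the NVLink domain for this job only.
  if [ "$mode" = "exclusive" ]; then
    cmd="${cmd} --exclusive=topo"
  fi
  printf '%s %s\n' "$cmd" "$script"
}

# Dry run: print the commands instead of submitting anything.
build_submit_cmd 4 2 job.sh
build_submit_cmd 4 1 job.sh exclusive
```

To submit a job, run the printed command on the login node. The same flags can be passed to salloc and srun.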

Monitor job allocations

The following sections explain how to monitor job allocations and forcefully stop nodes in your cluster.

Verify job allocations in your cluster

To verify whether your nodes have jobs scheduled on them, check the node states that Slurm reports, including any symbol attached to a state. To do so, complete the following steps:

  1. If you haven't already, then connect to a login node in your cluster.

  2. To view information about the nodes and partitions in your cluster, use the sinfo command:

    sinfo
    

    In the output, you can view the state for each node:

    • alloc: Slurm has assigned all vCPUs on the node to one or more jobs.

    • idle: the node has obtained capacity and is preparing to run jobs.

    • #idle: Cluster Director is provisioning capacity to run your job on the node. If the node is a Flex-start VM, then this state also indicates that Cluster Director is attempting to gain capacity.

    • idle~: the node isn't running or has stopped running.

    • %idle: Cluster Director is deleting the node.

    • ~idle: Cluster Director is stopping the node.

    • mix: Slurm has allocated jobs to some, but not all, vCPUs on the node.

    • #mix: Slurm has allocated jobs to some vCPUs on the node; however, Cluster Director is still looking for capacity to run the jobs.
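To narrow the sinfo output to only the nodes that have jobs, one option is to print one node per line and filter on the base state. The following is a minimal sketch. The nodes_with_jobs function is a hypothetical helper, and it assumes that a node has jobs when its state is alloc or mix once any attached symbol is removed.

```shell
#!/bin/sh
# Hypothetical helper: reads "NODE STATE" lines on stdin and prints the
# names of nodes whose base state is alloc or mix. Symbols attached to
# the state (for example # or ~) are stripped before the comparison.
nodes_with_jobs() {
  awk '{
    state = $2
    gsub(/[^a-z]/, "", state)   # keep only the letters of the state
    if (state == "alloc" || state == "mix") print $1
  }'
}

# Run against the live cluster only when sinfo is available.
# -N prints one line per node, -h omits the header, and the format
# string prints the node name (%N) and the compact state (%t).
if command -v sinfo >/dev/null 2>&1; then
  sinfo -N -h -o "%N %t" | nodes_with_jobs | sort -u
fi
```

The sort -u step removes duplicates, because a node that belongs to several partitions appears once per partition in node-oriented output.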

Forcefully stop nodes

To stop specific nodes, for example to free up stuck resources or to scale down your cluster, use the scontrol command:

scontrol update nodename=NODE_NAMES state=power_down_force

Replace NODE_NAMES with a comma-separated list of the nodes that you want to stop. For example, node-1,node-2.

When the stop operation starts, Slurm sets the node states to ~idle. When the stop operation completes, the node states change to idle~.
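When the node names come from a script or another command, building the comma-separated list by hand is error-prone. The following sketch joins its arguments into that list and prints the resulting scontrol command without running it. The power_down_cmd function is a hypothetical helper, and the node names are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: joins node names with commas and prints the
# scontrol command that would forcefully stop them (dry run only).
# Usage: power_down_cmd NODE [NODE...]
power_down_cmd() {
  if [ "$#" -eq 0 ]; then
    echo "usage: power_down_cmd NODE [NODE...]" >&2
    return 1
  fi

  # "$*" joins the arguments with the first character of IFS.
  old_ifs="$IFS"
  IFS=,
  node_list="$*"
  IFS="$old_ifs"

  printf 'scontrol update nodename=%s state=power_down_force\n' "$node_list"
}

# Dry run: prints the command for two placeholder nodes.
power_down_cmd node-1 node-2
```

After you run the printed command, a call such as sinfo -N -n node-1,node-2 -o "%N %t" shows the state of only those nodes, which can help confirm that the stop operation has completed.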

What's next