This document explains the architecture and core capabilities of the managed Slurm environment that Cluster Director uses to handle cluster management, job scheduling, and workload orchestration.
Slurm is a highly scalable, fault-tolerant, open-source cluster manager and job scheduler, and it is a standard orchestrator for artificial intelligence (AI), machine learning (ML), and high performance computing (HPC) workloads. Cluster Director integrates and manages Slurm for you as part of the platform.
Slurm architecture in Cluster Director
When you create and deploy a cluster, Cluster Director automatically sets up a pre-configured Slurm environment for you. This environment consists of the following nodes:
Controller node: the primary control plane component that monitors resources, manages system state, and processes all submitted jobs. The Slurm controller is deployed in a fault-tolerant configuration to help ensure high availability and reliability.
Login nodes: the primary entry points for users to interact with the cluster. You connect to a login node by using SSH to submit jobs, review job statuses, and manage your workflows. The login nodes are pre-configured with the necessary Slurm commands and have access to the shared file systems. For an example of connecting to a login node and inspecting the cluster, see the sketch after this list.
Compute nodes: these nodes are responsible for executing your jobs. Slurm manages these nodes in partitions, which are logical groupings of nodes with specific characteristics (such as machine type or GPU model). You can configure compute nodes within each partition as follows:
Static nodes: you specify a minimum number of virtual machine (VM) instances to always be running in the cluster, which provides a baseline of available resources.
Dynamic nodes: you specify a maximum number of additional VMs that the cluster can automatically create to handle increases in demand.
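For example, after the cluster is deployed, you can connect to a login node and inspect its partitions and compute nodes with standard Slurm commands. The following is a minimal sketch; the instance name, zone, and partition name are hypothetical placeholders, and the exact way you connect to a login node depends on how your cluster is configured.

```bash
# Connect to a login node over SSH (hypothetical instance name and zone).
gcloud compute ssh my-cluster-login-001 --zone=us-central1-a

# List all partitions and the state of their compute nodes.
sinfo

# Show per-node details for a single partition (hypothetical partition name).
sinfo --partition=a3mega --Node --long
```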
Key Slurm capabilities
Slurm provides several capabilities that are essential for running large-scale AI, ML, and HPC workloads:
Job scheduling and queuing: Slurm manages a queue of jobs submitted by users. It allocates resources to these jobs based on priority, policies, and resource availability. This approach helps ensure that your cluster efficiently uses resources.
Resource management: Slurm has a detailed awareness of the cluster's resources, including vCPUs, memory, and GPUs. When you submit a job, you can request the specific resources that your job needs, and Slurm helps ensure that it runs on nodes that meet those requirements. For a sample job script that requests specific resources, see the sketch after this list.
Topology-aware workload placement: Cluster Director uses Slurm's topology-aware scheduling capabilities to run your workloads. By using information about the physical network layout, Slurm can colocate the tasks of a job on VMs that are close to each other in the network. This approach minimizes latency and is critical for the performance of tightly coupled distributed training workloads. For job-level options related to placement, see the job options sketch after this list.
Fault tolerance: the managed Slurm environment is designed for resilience. Together with the autohealing features in Cluster Director, Slurm handles host errors by automatically rescheduling affected jobs and replacing failed nodes, which keeps long-running and critical workloads on track.
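To illustrate how job scheduling and resource requests come together, the following batch script is a minimal sketch of a GPU job submission. The partition name, resource amounts, and training command are hypothetical; adjust them to match the partitions and machine types in your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=train-model     # Name shown in the job queue.
#SBATCH --partition=a3mega         # Hypothetical partition name.
#SBATCH --nodes=2                  # Number of compute nodes to allocate.
#SBATCH --gpus-per-node=8          # GPUs requested on each node.
#SBATCH --cpus-per-task=12         # vCPUs allocated to each task.
#SBATCH --mem=0                    # Request all memory on each allocated node.
#SBATCH --time=04:00:00            # Wall-clock time limit.
#SBATCH --output=%x-%j.log         # Log file named after the job name and job ID.

# Launch one task per node; train.py is a hypothetical workload.
srun --ntasks-per-node=1 python3 train.py
```

You submit the script from a login node and then monitor it from the queue:

```bash
sbatch train.sh      # Submit the job; Slurm prints the assigned job ID.
squeue -u $USER      # Show your pending and running jobs.
```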
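Topology-aware placement and fault tolerance are largely handled for you, but Slurm also exposes job-level options that relate to them. The following lines are a hedged sketch of two such options that you could add to a batch script; whether they are appropriate depends on your cluster's topology configuration and scheduling policies.

```bash
#SBATCH --switches=1@02:00:00   # Prefer an allocation that spans at most one
                                # network switch; wait up to 2 hours for one.
#SBATCH --requeue               # Make the job eligible for requeuing, for example
                                # after a node failure.
```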