Create a fully managed Slurm cluster for AI workloads

This document explains how to configure and deploy a fully managed Slurm cluster that uses A4X, A4, A3 Ultra, A3 Mega, or A3 High machine types. To learn more about these accelerator-optimized machine types, see GPU machine types.

The steps in this document show you how to create a Slurm cluster by using Cluster Director. Cluster Director is a Google Cloud product that automates the setup and configuration of Slurm clusters. It's designed for IT administrators and AI researchers who want to avoid the overhead of managing a cluster and focus on running their workloads. If you want more control over the deployment and management of your cluster, then create your cluster by using Cluster Toolkit.

Limitations

Depending on the machine type that the Compute Engine instances in your cluster use, the following limitations apply:

A4X

A4

  • You don't receive sustained use discounts and flexible committed use discounts for instances that use an A4 machine type.
  • You can only use an A4 machine type in certain regions and zones.
  • You can't use Persistent Disk (regional or zonal). You can only use Google Cloud Hyperdisk.
  • The A4 machine type is only available on the Emerald Rapids CPU platform.
  • You can't change the machine type of an instance to or from the A4 machine type. Instead, you must create a new instance that uses this machine type.
  • A4 machine types don't support sole-tenancy.
  • You can't run Windows operating systems on an A4 machine type.
  • For A4 instances, when you use ethtool -S to monitor GPU networking, physical port counters that end in _phy don't update. This is expected behavior for instances that use the MRDMA Virtual Function (VF) architecture. For more information, see MRDMA functions and network monitoring tools.
  • You can't attach Hyperdisk ML disks that were created before February 4, 2026 to A4 machine types.
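The `_phy` counter behavior described above can be handled by filtering those counters out before you feed `ethtool -S` output to a monitoring tool. A minimal sketch; the interface counters shown here are a made-up sample, not real A4 output:

```shell
# Sample in the shape of `ethtool -S` output (hypothetical counter names).
# On A4 and A3 Ultra instances, counters ending in _phy don't update, so
# exclude them from monitoring.
sample='rx_packets: 1024
tx_packets: 2048
rx_bytes_phy: 0
tx_bytes_phy: 0'

# Drop the physical-port counters before scraping the rest:
printf '%s\n' "$sample" | grep -v '_phy:'
```

On a real instance, you would pipe `ethtool -S INTERFACE_NAME` through the same filter instead of the sample literal.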

A3 Ultra

  • You don't receive sustained use discounts and flexible committed use discounts for instances that use an A3 Ultra machine type.
  • You can only use an A3 Ultra machine type in certain regions and zones.
  • You can't use Persistent Disk (regional or zonal). You can only use Google Cloud Hyperdisk.
  • The A3 Ultra machine type is only available on the Emerald Rapids CPU platform.
  • Machine type changes aren't supported for the A3 Ultra machine type. To switch to or from this machine type, you must create a new instance.
  • You can't run Windows operating systems on an A3 Ultra machine type.
  • A3 Ultra machine types don't support sole-tenancy.
  • For A3 Ultra instances, when you use ethtool -S to monitor GPU networking, physical port counters that end in _phy don't update. This is expected behavior for instances that use the MRDMA Virtual Function (VF) architecture. For more information, see MRDMA functions and network monitoring tools.

A3 Mega

  • You don't receive sustained use discounts and flexible committed use discounts for instances that use an A3 Mega machine type.
  • You can only use an A3 Mega machine type in certain regions and zones.
  • You can't use regional Persistent Disk on an instance that uses an A3 Mega machine type.
  • The A3 Mega machine type is only available on the Sapphire Rapids CPU platform.
  • Machine type changes aren't supported for the A3 Mega machine type. To switch to or from this machine type, you must create a new instance.
  • You can't run Windows operating systems on an A3 Mega machine type.

A3 High

  • You don't receive sustained use discounts and flexible committed use discounts for instances that use an A3 High machine type.
  • You can only use an A3 High machine type in certain regions and zones.
  • You can't use regional Persistent Disk on an instance that uses an A3 High machine type.
  • The A3 High machine type is only available on the Sapphire Rapids CPU platform.
  • Machine type changes aren't supported for the A3 High machine type. To switch to or from this machine type, you must create a new instance.
  • You can't run Windows operating systems on an A3 High machine type.
  • You can only use a3-highgpu-8g. A3 High machine types with fewer than 8 GPUs aren't supported.

Before you begin

Before you create a Slurm cluster, complete the following steps if you haven't already done so:

  1. Choose a consumption option: your choice of consumption option determines how you get and use GPU resources. To learn more, see Choose a consumption option.
  2. Obtain capacity: the process to obtain capacity differs for each consumption option. To learn about the process to obtain capacity for your chosen consumption option, see Capacity overview.
  3. Verify that you have enough Filestore capacity quota: you need to have enough Filestore quota in your target region before deploying. The required minimum capacity depends on the machine types in your cluster:
    • A4X Max, A4X, A4, A3 Ultra, and A3 Mega: requires a minimum of 10 TiB (10,240 GiB) of HIGH_SCALE_SSD (zonal) capacity.
    • A3 High: requires a minimum of 2.5 TiB (2,560 GiB) of BASIC_SSD (standard) capacity.

    To check quota or request a quota increase, see the following:

  4. Verify trusted image policy: if the organization in which your project exists has a trusted image policy (constraints/compute.trustedImageProjects), then verify that the clusterdirector-public-images project is included in the list of allowed projects. To learn more, see Setting up trusted image policies.
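The minimum Filestore capacities listed in the preceding steps can be sanity-checked as GiB values (1 TiB = 1,024 GiB). A small sketch; the dictionary keys mirror the Filestore service tiers named above, and the helper function is illustrative, not part of any API:

```python
# Minimum Filestore capacity quota per machine family, in GiB,
# as listed in the "Before you begin" steps above.
MIN_CAPACITY_GIB = {
    "HIGH_SCALE_SSD": 10 * 1024,   # A4X Max, A4X, A4, A3 Ultra, A3 Mega: 10 TiB
    "BASIC_SSD": int(2.5 * 1024),  # A3 High: 2.5 TiB
}

def has_enough_quota(tier: str, available_gib: int) -> bool:
    """Return True if the available quota meets the documented minimum."""
    return available_gib >= MIN_CAPACITY_GIB[tier]

print(MIN_CAPACITY_GIB["HIGH_SCALE_SSD"])   # 10240
print(has_enough_quota("BASIC_SSD", 2560))  # True
```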

Required roles

To create a Slurm cluster, you need the following IAM roles and permissions:

Create a Slurm cluster

To create an AI-optimized cluster by using Cluster Director, complete the following steps:

  1. Configure compute resource configurations

  2. Configure network

  3. Configure storage resources

  4. Configure the Slurm environment

Configure compute resource configurations

To configure compute resource configurations when creating a cluster, complete the following steps:

  1. In the Google Cloud console, go to the Cluster Director page.

    Go to Cluster Director

  2. Click Create cluster.

  3. In the dialog that appears, click Reference architecture. The Create a cluster page opens.

  4. Click one of the available templates. You can optionally edit the template to adapt it to your workload's needs.

  5. Click Customize.

  6. In the Compute section, in the Cluster name field, enter a name for your cluster. The name can contain up to 10 characters, and it can only use numbers or lowercase letters (a-z).

  7. To add information to the pre-configured compute resource configuration, or to edit the number and type of compute instances that the configuration specifies, do the following:

    1. In the Compute section, click Edit resource configuration. The Add resource configuration pane appears.

    2. Optional: To change the compute resource configuration name, in the Name field, enter a new name.

    3. Optional: To change the number and type of compute instances that your cluster uses, in the Machine configuration section, follow the prompts to update the compute resources.

    4. In the Consumption options section, specify the consumption option that you want to use to obtain resources:

      • To create compute instances by using a reservation, do the following:

        1. Click the Use reservation tab.

        2. Click Select reservation. The Choose a reservation pane appears. If you want to use a reservation of A4X VMs, then you can optionally choose the block or sub-block to control the placement of your VMs.

        3. Select the reservation that you want to use. Then, click Choose. This action automatically sets the Region and Zone of your compute resources.

      • To create Flex-start VMs, do the following:

        1. Click the Flex start tab.

        2. In the Time limit for the VM section, specify the run duration for the compute instances. The value must be between 10 minutes and 7 days.

        3. In the Location section, select the region where you want to create Flex-start VMs. The Google Cloud console automatically filters the available regions to only show those regions that support Flex-start VMs for your selected machine type.

      • To create Spot VMs, do the following:

        1. Click the Use spot tab.

        2. In the On VM termination list, select one of the following options:

          • To delete Spot VMs on preemption, select Delete.

          • To stop Spot VMs on preemption, select Stop.

        3. In the Location section, select the Region and Zone where you want to create Spot VMs. The Google Cloud console automatically filters the available regions to only show those regions that support Spot VMs for your selected machine type.

    5. Click Done.

    6. Optional: To create additional compute resource configurations for a partition, click Add resource configuration, and then follow the prompts to specify the compute resources.

  8. Click Continue.

Configure network

To configure the network that your cluster uses, complete the following steps:

  1. In the Choose a Virtual Private Cloud (VPC) network section, do one of the following:

    • Recommended: To let AI Hypercomputer automatically create a pre-configured VPC network for your cluster, do the following:

      1. Select Create a new VPC network.

      2. In the Network name field, enter a name for the VPC network.

    • To use an existing VPC or Shared VPC network, do the following:

      1. Select Use a VPC network in the current project or Use a Shared VPC network hosted in another project.

      2. In the Select VPC network or Shared VPC network list, select a VPC or Shared VPC network that meets the required configurations.

      3. In the Select subnetwork list, select an existing subnetwork.

  2. Click Continue.

Configure storage resources

To configure the storage resources that your cluster uses, in the Storage section, complete the following steps:

  1. Optional: To edit a storage resource, click Edit storage plan, and then follow the prompts to update the configuration of the storage resource.

  2. Optional: To add storage resources to your cluster, click Add storage configuration, and then follow the prompts to specify the configuration for the storage resources.

  3. Click Continue.

Configure the Slurm environment

To configure the Slurm environment in your cluster, complete the following steps:

  1. Optional: To edit the number and type of compute instances that the login node uses, expand the Login node section, and then follow the prompts to update the compute resources.

  2. Optional: To edit partitions of your cluster to organize your compute resources, expand the Partitions section, and then do one of the following:

    • To add a partition, click Add partition, and then do the following:

      1. In the Partition name field, enter a name for the partition.

      2. To edit a nodeset, click Toggle nodeset. Otherwise, to add a nodeset, click Add nodeset.

      3. In the Nodeset name field, enter a name for your nodeset.

      4. In the Resource configuration field, select a compute resource configuration that you created in the previous steps.

      5. In the Source image list, select one of the supported OS images for AI Hypercomputer.

      6. In the Static node count field, enter the minimum number of compute instances that must always be running in the cluster.

      7. In the Dynamic node count field, enter the maximum number of compute instances that AI Hypercomputer can scale the cluster to during periods of increased traffic.

      8. In the Boot disk type list and the Boot disk size field, specify the type and size of the boot disk for the compute instances to use.

      9. Click Done.

    • To remove a partition, click Delete partition.

  3. Optional: To add prolog or epilog scripts to your Slurm environment, do the following:

    1. Expand the Advanced orchestration settings section.

    2. In the Scripts section, follow the prompts to add scripts.

  4. Click Create. The Clusters page appears. Creating the cluster can take some time to complete. The completion time depends on the number of compute instances that you request and resource availability in the compute instances' zone. If your requested resources are unavailable, then AI Hypercomputer maintains the creation request until resources become available. To view the status of the cluster create operation, view your cluster's details.
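If you add prolog or epilog scripts, a minimal prolog might simply record each job start. A sketch, assuming an arbitrary log path (overridable here through a hypothetical `PROLOG_LOG` variable) and relying on the `SLURM_JOB_ID` environment variable that Slurm exports to prolog scripts:

```shell
#!/bin/bash
# Minimal prolog sketch: append a timestamped line for each job start.
# SLURM_JOB_ID is set by Slurm when the prolog runs; the log path is an
# arbitrary choice for this example.
LOG_FILE="${PROLOG_LOG:-/tmp/slurm-prolog.log}"
echo "$(date -Is) prolog: starting job ${SLURM_JOB_ID:-unknown}" >> "$LOG_FILE"
```

An epilog script follows the same pattern and typically cleans up per-job state instead of logging a start.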

Connect to the Slurm cluster

When AI Hypercomputer creates your login node, the cluster state changes to Ready. You can then connect to your cluster; however, you can run workloads only after AI Hypercomputer creates the compute nodes in the cluster.

To connect to a cluster's login node through SSH by using the Google Cloud console, complete the following steps:

  1. In the Google Cloud console, go to the Clusters page.

    Go to Clusters

  2. In the Clusters table, in the Name column, click the name of the cluster that you created in the previous section. A page that gives the details of the cluster appears, and the Details tab is selected.

  3. Click the Nodes tab.

  4. In the Login nodes section, in the Connect column, locate the cluster's login node, whose name is CLUSTER_NAME-login-001.

  5. In the login node's row, in the Connect column, click SSH. The SSH-in-browser window opens.

  6. If you're prompted, click Authorize. Connecting to your node can take up to a minute to complete.
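As an alternative to SSH-in-browser, you can connect from a terminal. A sketch that builds the login node name from the `CLUSTER_NAME-login-001` pattern described above; the cluster name is hypothetical, and the `gcloud` command is left commented because it requires authentication and your node's actual zone:

```shell
CLUSTER_NAME=demo01   # assumption: replace with your cluster's name
LOGIN_NODE="${CLUSTER_NAME}-login-001"
echo "$LOGIN_NODE"

# Then connect with the gcloud CLI, substituting your node's zone:
# gcloud compute ssh "$LOGIN_NODE" --zone=ZONE
```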

Verify Slurm cluster health

Before you run a job on a compute node, Slurm automatically runs a quick GPU health check on the node. If the node fails the check, then Slurm drains the node and prevents scheduling new jobs on it.
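From the login node, you can list any drained nodes with `sinfo`. A sketch that parses output in the shape of `sinfo -N -h -o "%N %t"` (node name and state); the sample is a made-up literal, since on a real cluster you would capture the command's stdout instead:

```python
# Sample in the shape of `sinfo -N -h -o "%N %t"` output (node, state);
# the node names are hypothetical.
sample = """\
a3node-0 idle
a3node-1 drain
a3node-2 alloc
"""

# Nodes in the "drain" state failed a health check (or were drained
# manually) and won't accept new jobs until repaired and resumed.
drained = [node for node, state in
           (line.split() for line in sample.splitlines())
           if state == "drain"]
print(drained)  # ['a3node-1']
```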

To more thoroughly test GPU health and network bandwidth across the compute nodes in a cluster partition, you can manually run NVIDIA Collective Communications Library (NCCL) tests. If an NCCL test identifies any unhealthy nodes, then you can repair the nodes or modify your cluster. NCCL tests help you verify a cluster's health before you run critical workloads. For more information, see Verify cluster health.
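When you interpret NCCL test output, note that the reported bus bandwidth (busbw) applies a correction factor to the algorithm bandwidth so that results are comparable across collectives; for all-reduce, the nccl-tests performance documentation gives the factor as 2(n-1)/n, where n is the number of ranks. A small sketch of that arithmetic:

```python
def allreduce_busbw(algbw_gbs: float, nranks: int) -> float:
    """Bus bandwidth for all-reduce, per the nccl-tests performance docs:
    busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks."""
    return algbw_gbs * 2 * (nranks - 1) / nranks

# Example: 100 GB/s algorithm bandwidth across 8 ranks -> 175 GB/s busbw.
print(allreduce_busbw(100.0, 8))  # 175.0
```

A node whose busbw falls well below its peers in an otherwise uniform partition is a candidate for repair.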

Delete the Slurm cluster

To delete a Slurm cluster in your project, complete the following steps:

  1. In the Google Cloud console, go to the Clusters page.

    Go to Clusters

  2. In the Clusters table, in the Name column, click the name of the cluster that you want to delete. A page that gives the details of the cluster appears, and the Details tab is selected.

  3. Click Delete.

  4. In the dialog that appears, enter the name of your cluster, and then click Delete to confirm. The Clusters page appears. Deleting your cluster can take some time to complete.

What's next