This document outlines the steps to configure and deploy Slurm clusters that use A4X, A4, or A3 Ultra accelerator-optimized machine types.
Before you begin
Before creating a Slurm cluster, if you haven't already done so, complete the following steps:
- Choose a consumption option: your choice of consumption option determines how you get
    and use GPU resources.
    To learn more, see Choose a consumption option. 
- Obtain capacity: the process to obtain capacity differs for each consumption option.
    To learn about the process to obtain capacity for your chosen consumption option, see Capacity overview. 
- Ensure that you have enough Filestore quota: you need a minimum of
    10,240 GiB of zonal (also known as high scale SSD) capacity.
    - To check quota, see View API-specific quota.
    - If you don't have enough quota, request a quota increase.
- Install Cluster Toolkit: to provision Slurm clusters, you must use
    Cluster Toolkit version v1.62.0 or later.
    To install Cluster Toolkit, see Set up Cluster Toolkit.
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Set up a storage bucket
Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a version-enabled file. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create this bucket and enable versioning from the CLI, run the following commands:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD --location=BUCKET_REGION \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following:
- BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
- PROJECT_ID: your project ID.
- BUCKET_REGION: any Google Cloud region of your choice.
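Bucket creation fails if BUCKET_NAME doesn't meet the naming requirements. As a minimal sketch, the following shell function pre-checks a candidate name against the basic rules: 3 to 63 characters, only lowercase letters, numbers, hyphens, and underscores, and a letter or number at both ends. The function name is_valid_bucket_name and the sample name are hypothetical, and the check is deliberately partial (it rejects dotted names and doesn't test reserved prefixes), so treat the bucket naming requirements as authoritative.

```shell
#!/bin/sh
# Hypothetical helper: partial pre-check of a Cloud Storage bucket name.
# Covers only: length 3-63, lowercase letters, digits, hyphens, underscores,
# and a letter or digit at both ends. Dotted names are rejected for simplicity.
is_valid_bucket_name() {
  # LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
  printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$'
}

# Example with an illustrative name.
if is_valid_bucket_name "my-slurm-tf-state"; then
  echo "name passes the basic checks"
else
  echo "name violates the basic naming rules"
fi
```

A name that passes this check can still be rejected, for example because another bucket already uses it.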
Open the Cluster Toolkit directory
To use Slurm with Google Cloud, you must install Cluster Toolkit. After you install the toolkit, ensure that you are in the Cluster Toolkit directory by running the following command:
cd cluster-toolkit
This cluster deployment requires Cluster Toolkit v1.62.0 or
later. To check your version, you can run the following command:
./gcluster --version
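To compare a reported version against the v1.62.0 minimum in a script, a sketch like the following might help. It assumes GNU sort with the -V (version sort) option, which is typically available in Cloud Shell. The function name version_at_least and the sample value are hypothetical; the exact output format of ./gcluster --version isn't specified here, so the version string is passed in by hand.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
  have="${1#v}"   # strip an optional leading "v"
  need="${2#v}"
  # The smaller of the two versions sorts first.
  lowest="$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)"
  [ "$lowest" = "$need" ]
}

# Illustrative value; in practice, copy the version that
# `./gcluster --version` reports.
installed="v1.63.0"
if version_at_least "$installed" "v1.62.0"; then
  echo "Cluster Toolkit $installed meets the v1.62.0 minimum"
else
  echo "Cluster Toolkit $installed is older than v1.62.0"
fi
```

Version sort matters here because a plain string comparison would rank v1.9.0 above v1.62.0.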
Create a deployment file
Create a deployment file that you can use to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project ID, region, and zone.
To create a deployment file, follow the steps for your required machine type and consumption option.
A4X machines
To create your deployment file, use a text editor to create a YAML file named
a4x-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME
vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4x_cluster_size: NUMBER_OF_VMS
  a4x_reservation_name: RESERVATION_NAME
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region that has the reserved machines.
- ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, the region and zone information was provided by your account team when the capacity was delivered.
- NUMBER_OF_VMS: the number of A4X VMs in your cluster. You can specify any number of VMs. However, A4X VMs are physically interconnected by a multi-node NVLink system in groups of 18 VMs (72 GPUs) to form an NVLink domain.
    For optimal network performance, we recommend that you specify a value that is a multiple of 18 VMs (for example, 18, 36, or 54). When you create an A4X cluster, the A4X blueprint automatically creates and applies a compact placement policy with a GPU topology of 1x72 for each group of 18 VMs. For more information about A4X topology, see A4X fundamentals.
- RESERVATION_NAME: the name of your reservation.
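The NVLink domain arithmetic for NUMBER_OF_VMS can be sketched as follows. The only inputs are the figures stated above: 18 VMs (72 GPUs) per NVLink domain, which works out to 4 GPUs per VM. The function name a4x_size_summary is hypothetical and isn't part of Cluster Toolkit.

```shell
#!/bin/sh
# Hypothetical helper: summarize how an A4X VM count maps onto NVLink domains.
# From the text: 18 VMs (72 GPUs) form one NVLink domain, so 4 GPUs per VM.
a4x_size_summary() {
  vms="$1"
  full_domains=$((vms / 18))   # completely filled NVLink domains
  leftover_vms=$((vms % 18))   # VMs that don't fill a whole domain
  gpus=$((vms * 4))
  echo "vms=$vms gpus=$gpus full_domains=$full_domains leftover_vms=$leftover_vms"
}

a4x_size_summary 36   # multiple of 18: no partially filled domain
a4x_size_summary 40   # 4 VMs fall outside the two full domains
```

A nonzero leftover_vms value indicates a cluster size that isn't a multiple of 18, which is the case that the recommendation above suggests avoiding.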
A4 machines
The parameters that you need to add to your deployment file depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.
Reservation-bound
To create your deployment file, use a text editor to create a YAML file named
a4high-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME
vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4h_cluster_size: NUMBER_OF_VMS
  a4h_reservation_name: RESERVATION_NAME
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region that has the reserved machines.
- ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, the region and zone information was provided by your account team when the capacity was delivered.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
- RESERVATION_NAME: the name of your reservation.
Flex-start
To create your deployment file, use a text editor to create a YAML file named
a4high-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME
vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4h_cluster_size: NUMBER_OF_VMS
  a4h_dws_flex_enabled: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
This deployment provisions static compute nodes,
  which means that the cluster has a set number of nodes at all times. If you want to enable your
  cluster to autoscale instead, use the examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml file and edit the values of
  node_count_static and node_count_dynamic_max to match the following:
      node_count_static: 0
      node_count_dynamic_max: $(vars.a4h_cluster_size)
Spot
To create your deployment file, use a text editor to create a YAML file named
a4high-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME
vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4h_cluster_size: NUMBER_OF_VMS
  a4h_enable_spot_vm: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
A3 Ultra machines
The parameters that you need to add to your deployment file depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.
Reservation-bound
To create your deployment file, use a text editor to create a YAML file named
a3ultra-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME
vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_cluster_size: NUMBER_OF_VMS
  a3u_reservation_name: RESERVATION_NAME
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region that has the reserved machines.
- ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, the region and zone information was provided by your account team when the capacity was delivered.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
- RESERVATION_NAME: the name of your reservation.
Flex-start
To create your deployment file, use a text editor to create a YAML file named
a3ultra-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME
vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_cluster_size: NUMBER_OF_VMS
  a3u_dws_flex_enabled: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
This deployment provisions static compute nodes,
  which means that the cluster has a set number of nodes at all times. If you want to enable your
  cluster to autoscale instead, use the examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml file and edit the values of
  node_count_static and node_count_dynamic_max to match the following:
      node_count_static: 0
      node_count_dynamic_max: $(vars.a3u_cluster_size)
Spot
To create your deployment file, use a text editor to create a YAML file named
a3ultra-slurm-deployment.yaml and add the following content.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME
vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_cluster_size: NUMBER_OF_VMS
  a3u_enable_spot_vm: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
Provision a Slurm cluster
Cluster Toolkit provisions the cluster based on the deployment file you created in the previous step and the default cluster blueprint. For more information about the software that the blueprint installs, including NVIDIA drivers and CUDA, see Slurm custom images.
To provision the cluster, run the command for your machine type from the Cluster Toolkit directory. This step takes approximately 20-30 minutes.
A4X machines
./gcluster deploy -d a4x-slurm-deployment.yaml examples/machine-learning/a4x-highgpu-4g/a4x-slurm-blueprint.yaml --auto-approve
A4 machines
./gcluster deploy -d a4high-slurm-deployment.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --auto-approve
A3 Ultra machines
./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml --auto-approve
Connect to the Slurm cluster
To access your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
- Go to the Compute Engine > VM instances page. 
- Locate the login node. It should have a name with the pattern DEPLOYMENT_NAME + login-001.
- From the Connect column of the login node, click SSH. 
gcloud
To connect to the login node, complete the following steps:
- Identify the login node by using the gcloud compute instances list command.
    gcloud compute instances list \
        --zones=ZONE \
        --filter="name ~ login" --format "value(name)"
    If the output lists multiple Slurm clusters, you can identify your login node by the DEPLOYMENT_NAME that you specified.
- Use the gcloud compute ssh command to connect to the login node.
    gcloud compute ssh LOGIN_NODE \
        --zone=ZONE --tunnel-through-iap
    Replace the following:
    - ZONE: the zone where the VMs for your cluster are located.
    - LOGIN_NODE: the name of the login node, which you identified in the previous step.
Test network performance on the Slurm cluster
To test NCCL communication, complete the steps for your machine type.
A4X and A4 machines
The following test uses Ramble, which is an open-source, multi-platform experimentation framework written in Python that is used to coordinate the running of NCCL tests. Ramble and its dependencies are compatible with the ARM64 architecture used by A4X machines.
The run scripts used for this test are staged in the
/opt/apps/system_benchmarks directory on the Slurm controller node and are
available to all nodes in the cluster. Running this test installs Ramble
to /opt/apps/ramble.
- From the login node in the ${HOME} directory, run the following command. Because the test can take approximately 10 minutes, or longer if other jobs are in the queue, the following command uses nohup and redirects stdout and stderr to a log file.
    nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &
    This command creates a folder called nccl-tests_$(date +%s) that stores all of the test results. The date tag ensures that a unique folder is created based on each current timestamp.
    For example, if your cluster has 16 nodes, then NCCL tests are run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.
- Review the results. The nccl.log file contains the logs from setting up and running the test. To view the logs, run the following command:
    tail -f nccl.log
    You can use Ctrl+C to stop tailing the output at any time. At the end of nccl.log, your output should resemble the following:
    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload        n_nodes  msg_size    busbw
    all-gather      2        1073741824  XXX.XX
    all-gather      2        2147483648  XXX.XX
    all-gather      2        4294967296  XXX.XX
    all-gather      2        8589934592  XXX.XX
    ...
    all-reduce      2        1073741824  XXX.XX
    ...
    reduce-scatter  2        1073741824  XXX.XX
    ...
    -------- Benchmarking Complete -------
    All of the Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in nccl-tests_$(date +%s)/summary.tsv.
    Removing nccl-tests_$(date +%s)/ removes all of the files generated during these tests.
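Because the test runs in the background, a script can wait for it by watching the log for the completion banner shown in the sample output. The following is a minimal sketch under that assumption: the function names nccl_run_finished and wait_for_nccl_run are hypothetical, the default path nccl.log matches the redirect in the nohup command, and the 30-second polling interval is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: succeeds once the log contains the completion banner
# ("Benchmarking Complete") that appears at the end of a finished run.
nccl_run_finished() {
  grep -q "Benchmarking Complete" "$1" 2>/dev/null
}

# Hypothetical helper: poll the log until the banner appears.
# Default log path and polling interval are assumptions.
wait_for_nccl_run() {
  log="${1:-nccl.log}"
  until nccl_run_finished "$log"; do
    sleep 30
  done
  echo "NCCL benchmark finished"
}
```

For example, running wait_for_nccl_run nccl.log from the directory where you started the test blocks until the banner is written. If the run fails before it prints the banner, the loop doesn't exit on its own, so inspect the log with tail if the wait takes much longer than expected.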
A3 Ultra machines
- Download the script needed to build the NCCL test. Run this step and the following steps from the shared directory of the login node, which is usually located at ${HOME}.
    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
- After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
    sbatch build-nccl-tests.sh
    The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.
- Verify that the NCCL test is built. To verify this, run the following command:
    sacct -a
    If successfully completed, the output is similar to the following:
    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    1            build-ncc+    a3ultra                   112  COMPLETED      0:0
    If the build is successful, you should also have a file named nvidia+pytorch+24.09-py3.sqsh in the directory where you ran the command, along with a directory named nccl-tests.
- Check that the nccl-tests/build folder contains several binaries, including all_gather_perf.
- Download the NCCL test script.
    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh
    To run any job on an A3 Ultra cluster, several environment variables must be set to enable high performance networking with GPUDirect-RDMA. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment as opposed to the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.
- Run the NCCL test script. The test can take approximately 15 minutes, or longer.
    sbatch run-nccl-tests.sh
- Review the results. The script outputs a slurm-XX.out file that contains the result of the nccl all_gather_perf benchmark.
    The output is similar to the following:
    #
    #                                                  out-of-place                       in-place
    #       size         count    type   redop    root    time   algbw   busbw  #wrong    time   algbw   busbw  #wrong
    #        (B)    (elements)                            (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       268435456       4194304   float    none      -1   XXXXX  XXX.XX  XXX.XX     N/A  XXXXXX  XXX.XX  XXX.XX       0
       536870912       8388608   float    none      -1   XXXXX  XXX.XX  XXX.XX     N/A  XXXXXX  XXX.XX  XXX.XX       0
      1073741824      16777216   float    none      -1   XXXXX  XXX.XX  XXX.XX     N/A  XXXXXX  XXX.XX  XXX.XX       0
      2147483648      33554432   float    none      -1   XXXXX  XXX.XX  XXX.XX     N/A  XXXXXX  XXX.XX  XXX.XX       0
      4294967296      67108864   float    none      -1   XXXXX  XXX.XX  XXX.XX     N/A  XXXXXX  XXX.XX  XXX.XX       0
      8589934592     134217728   float    none      -1   XXXXX  XXX.XX  XXX.XX     N/A  XXXXXX  XXX.XX  XXX.XX       0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : XXX.XX
    #
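To pull the headline number out of the results file without reading the whole table, a sketch like the following might be useful. It assumes the summary line has the form shown in the sample output ("# Avg bus bandwidth : value") and prints the text after the colon. The function name avg_bus_bandwidth is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the average bus bandwidth from an nccl-tests log.
# Matches the "# Avg bus bandwidth : <value>" summary line and prints the
# text after the colon with spaces and tabs removed.
avg_bus_bandwidth() {
  awk -F':' '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}
```

For example, run avg_bus_bandwidth slurm-XX.out, where slurm-XX.out is the output file from your job. The function prints one line per matching summary line, so a file that contains several benchmark runs yields several values.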
Redeploy the Slurm cluster
If you need to increase the number of compute nodes or add new partitions to
your cluster, you might need to update configurations for your Slurm cluster by
redeploying. Redeployment can be sped up by using an existing image from a
previous deployment. To avoid creating new images during a redeploy, specify the
--only flag.
To redeploy the cluster using an existing image, do the following:
- Run the command for your required machine type:
    A4X machines
    ./gcluster deploy -d a4x-slurm-deployment.yaml examples/machine-learning/a4x-highgpu-4g/a4x-slurm-blueprint.yaml --only cluster-env,cluster --auto-approve -w
    A4 machines
    ./gcluster deploy -d a4high-slurm-deployment.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --only cluster-env,cluster --auto-approve -w
    A3 Ultra machines
    ./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml --only cluster-env,cluster --auto-approve -w
    This command is only for redeployments where an image already exists. It redeploys only the cluster and its infrastructure.
Destroy the Slurm cluster
By default, the A4X, A4, and A3 Ultra blueprints enable deletion protection on the Filestore instance. To delete the Filestore instance when you destroy the Slurm cluster, disable deletion protection before running the destroy command. For instructions, see Set or remove deletion protection on an existing instance.
- Disconnect from the cluster if you haven't already. 
- Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory. 
- To destroy the cluster, run: 
./gcluster destroy DEPLOYMENT_FOLDER --auto-approve
Replace the following:
- DEPLOYMENT_FOLDER: the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.
When destruction is complete, you should see a message similar to the following:
Destroy complete! Resources: xx destroyed.
To learn how to cleanly destroy infrastructure and for advanced manual
deployment instructions, see the deployment folder located at the root of
the Cluster Toolkit directory: DEPLOYMENT_FOLDER/instructions.txt
What's next
- Verify reservation consumption
- View VMs topology
- Learn how to manage host events:
- Monitor VMs in your Slurm cluster
- Report faulty host