This document explains how to manage cluster health and perform maintenance tasks in Cluster Director.
Cluster Director provides several tools to help you proactively manage cluster resilience and ensure optimal operation. By using the Google Cloud console, you can do one or more of the following:
Manually start scheduled maintenance events for your cluster.
Start the automated repair process for unhealthy nodes.
Run health checks across the nodes in your cluster.
Before you begin
When you access Google Cloud services and APIs by using the Google Cloud console, you don't need to set up authentication; the console authenticates you automatically.
Required roles
To get the permissions that you need to view clusters, ask your administrator to grant you the Hypercompute Cluster Editor (roles/hypercomputecluster.editor) IAM role on the project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Perform cluster management tasks
The following sections describe how you can manage the health of your cluster.
Manually start cluster maintenance
If a maintenance event is scheduled for your cluster, you can start the maintenance immediately rather than wait for the scheduled time.
To manually start cluster maintenance by using the Google Cloud console, complete the following steps:
In the Google Cloud console, go to the Cluster Director page.
In the navigation menu, click Clusters. The Clusters page appears.
In the Clusters table, in the Name column, click the name of the cluster that you want to view. The cluster's details page appears, and the Details tab is selected.
Click the Topology tab. Then, if a maintenance event is scheduled for your cluster, click Start maintenance now.
Report and repair unhealthy cluster nodes
Cluster Director automatically monitors the health of your nodes. If Cluster Director identifies a node as unhealthy due to host errors, then you can optionally report the node to start a repair operation. This action can help you minimize workload disruption and quickly restore the cluster to an optimal state.
To report and repair an unhealthy node in your cluster by using the Google Cloud console, complete the following steps:
In the Google Cloud console, go to the Cluster Director page.
In the navigation menu, click Clusters. The Clusters page appears.
In the Clusters table, in the Name column, click the name of the cluster that you want to view. The cluster's details page appears, and the Details tab is selected.
Click the Topology tab. Then, if one or more nodes are unhealthy, click Report all unhealthy nodes.
Run health checks across cluster nodes
In addition to the automated health monitoring that Cluster Director performs, you can manually run a health check across the compute nodes in a cluster partition by using NVIDIA Collective Communications Library (NCCL) tests. If a test identifies one or more unhealthy nodes, then you can start a repair operation or modify your cluster. This approach helps minimize the number of unhealthy nodes in a cluster partition.
Based on the machine type that the compute nodes in a cluster partition use, select one of the following options:
A4X
If you haven't already, then connect to a login node in your cluster.
In the $HOME directory, download the following NCCL script:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a4-highgpu-8g/system_benchmarks/run-nccl-tests-via-ramble.sh

To view a list of the network interface names that the compute nodes in a cluster partition use, run the following command:
    srun -N 1 --partition=PARTITION_NAME ip addr show

Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:
Open the run-nccl-tests-via-ramble.sh file with a text editor of your choice.

Compare the network interface names in the run-nccl-tests-via-ramble.sh file with the network interface names in the output of the previous command. If they don't match, then, in the env_vars section of the run-nccl-tests-via-ramble.sh file, edit the OMPI_MCA_btl_tcp_if_include, UCX_NET_DEVICES, and NCCL_SOCKET_IFNAME variables as follows:

    env_vars:
      set:
        OMPI_MCA_btl_tcp_if_include: TCP_INTERFACE_NAME
        PMIX_MCA_gds: ^ds12
        UCX_NET_DEVICES: UCX_NET_DEVICES_LIST
        PMIX_MCA_psec: native
        UCX_IB_FORK_INIT: n
        NCCL_NET: gIB
        NCCL_SOCKET_IFNAME: NCCL_SOCKET_IFNAMES
        LD_LIBRARY_PATH: /usr/local/gib/lib64:/usr/local/nvidia/lib

Replace the following:

TCP_INTERFACE_NAME: the name of the Transmission Control Protocol (TCP) interface to include. For example, enp0s1.
UCX_NET_DEVICES_LIST: a comma-separated list of Unified Communication X (UCX) network devices. For example, gpu0rdma0,gpu1rdma0,gpu2rdma0,gpu3rdma0.
NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names. For example, enp0s1,enp192s1.
If the partition that contains the nodes that you want to test isn't the default partition, then, at the end of the sbatch section, add the following line:

    #SBATCH --partition=PARTITION_NAME

By default, the test runs the AllGather, AllReduce, and ReduceScatter benchmarks across an increasing number of nodes, up to the number of nodes that are available in the cluster partition. The test submits a separate job for each combination of benchmark and number of nodes. To edit the benchmarks to run or the numbers of nodes to test, in the applications section, edit the workload and n_nodes variables as follows:

    applications:
      nccl-tests:
        workloads:
          '{workload}':
            experiments:
              '{workload}-{n_nodes}':
                variables:
                  workload: [NCCL_BENCHMARK]
                  n_nodes: [NUMBER_OF_NODES]
                matrix:
                - n_nodes
                - workload

Replace the following:

NCCL_BENCHMARK: a comma-separated list of the NCCL benchmarks to run on the nodes. For example, all-gather, all-reduce, reduce-scatter.
NUMBER_OF_NODES: a comma-separated list of the numbers of nodes to run the test against. Specify values between 1 and the number of nodes in your cluster partition. For example, 2, 4, 8, 16, 32.
To run the NCCL test script, run the following command:

    nohup bash ./run-nccl-tests-via-ramble.sh "$HOME" >& nccl-$(date -Iseconds).log & tail -f nccl-*.log

The NCCL test can take some time to complete. When it completes, the output is similar to the following:

    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload        n_nodes  msg_size     busbw
    all-gather      2        1073741824   XXX.XX
    all-gather      2        2147483648   XXX.XX
    all-gather      2        4294967296   XXX.XX
    all-gather      2        8589934592   XXX.XX
    ...
    all-reduce      2        1073741824   XXX.XX
    ...
    reduce-scatter  2        1073741824   XXX.XX
    ...
    -------- Benchmarking Complete -------
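To spot low-performing runs quickly, you can post-process the summary table instead of reading it by hand. The following sketch extracts the lowest bus bandwidth per workload from a log in the format shown in the preceding output; the log contents and file path here are fabricated for illustration.

```shell
# Build a small sample log in the summary format shown above
# (fabricated values, for illustration only).
cat > /tmp/nccl-sample.log <<'EOF'
---- SUMMARY for >1GB Message Sizes ----
workload n_nodes msg_size busbw
all-gather 2 1073741824 310.50
all-gather 2 2147483648 305.20
all-reduce 2 1073741824 290.75
EOF

# Keep only data rows (four fields with a numeric busbw column) and
# track the minimum bus bandwidth observed for each workload.
awk '
  NF == 4 && $4 ~ /^[0-9]+([.][0-9]+)?$/ {
    if (!($1 in min) || $4 + 0 < min[$1]) min[$1] = $4 + 0
  }
  END { for (w in min) printf "%s min_busbw=%.2f\n", w, min[w] }
' /tmp/nccl-sample.log
```

On a login node, point the awk filter at your real nccl-*.log file instead of the sample; an unusually low minimum for a workload suggests that one of the node combinations in that sweep is unhealthy.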
A4
If you haven't already, then connect to a login node in your cluster.
In the $HOME directory, download the following NCCL script:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a4-highgpu-8g/system_benchmarks/run-nccl-tests-via-ramble.sh

To view a list of the network interface names that the compute nodes in a cluster partition use, run the following command:
    srun -N 1 --partition=PARTITION_NAME ip addr show

Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:
Open the run-nccl-tests-via-ramble.sh file with a text editor of your choice.

Compare the network interface names in the run-nccl-tests-via-ramble.sh file with the network interface names in the output of the previous command. If they don't match, then, in the env_vars section of the run-nccl-tests-via-ramble.sh file, edit the OMPI_MCA_btl_tcp_if_include and NCCL_SOCKET_IFNAME variables as follows:

    env_vars:
      set:
        OMPI_MCA_btl_tcp_if_include: TCP_INTERFACE_NAME
        PMIX_MCA_gds: ^ds12
        NCCL_NET: gIB
        NCCL_SOCKET_IFNAME: NCCL_SOCKET_IFNAMES
        LD_LIBRARY_PATH: /usr/local/gib/lib64:/usr/local/nvidia/lib

Replace the following:

TCP_INTERFACE_NAME: the name of the Transmission Control Protocol (TCP) interface to include. For example, enp0s1.
NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names. For example, enp0s1,enp192s1.
If the partition that contains the nodes that you want to test isn't the default partition, then, at the end of the sbatch section, add the following line:

    #SBATCH --partition=PARTITION_NAME

By default, the test runs the AllGather, AllReduce, and ReduceScatter benchmarks across an increasing number of nodes, up to the number of nodes in the cluster partition. The test submits a separate job for each combination of benchmark and number of nodes. To edit the benchmarks to run or the numbers of nodes to test, in the applications section, edit the workload and n_nodes variables as follows:

    applications:
      nccl-tests:
        workloads:
          '{workload}':
            experiments:
              '{workload}-{n_nodes}':
                variables:
                  workload: [NCCL_BENCHMARK]
                  n_nodes: [NUMBER_OF_NODES]
                matrix:
                - n_nodes
                - workload

Replace the following:

NCCL_BENCHMARK: a comma-separated list of the NCCL benchmarks to run on the nodes. For example, all-gather, all-reduce, reduce-scatter.
NUMBER_OF_NODES: a comma-separated list of the numbers of nodes to run the test against. Specify values between 1 and the number of nodes in your cluster partition. For example, 2, 4, 8, 16, 32.
To run the NCCL test script, run the following command:

    nohup bash ./run-nccl-tests-via-ramble.sh "$HOME" >& nccl-$(date -Iseconds).log & tail -f nccl-*.log

The NCCL test can take some time to complete. When it completes, the output is similar to the following:

    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload        n_nodes  msg_size     busbw
    all-gather      2        1073741824   XXX.XX
    all-gather      2        2147483648   XXX.XX
    all-gather      2        4294967296   XXX.XX
    all-gather      2        8589934592   XXX.XX
    ...
    all-reduce      2        1073741824   XXX.XX
    ...
    reduce-scatter  2        1073741824   XXX.XX
    ...
    -------- Benchmarking Complete -------
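Comparing interface names by hand is error-prone, so it can help to reduce the ip output to bare names first. The following sketch filters interface names out of one-line ip output; the sample output below is fabricated, and on a real login node you would pipe the srun command from the preceding steps (with ip -o addr show) into the same filter.

```shell
# Fabricated sample of `ip -o addr show` output (one line per address).
cat > /tmp/ip-sample.txt <<'EOF'
1: lo    inet 127.0.0.1/8 scope host lo
2: enp0s1    inet 10.0.0.5/24 brd 10.0.0.255 scope global enp0s1
3: enp192s1    inet 10.0.1.5/24 brd 10.0.1.255 scope global enp192s1
EOF

# Print each interface name once, skipping the loopback device.
# On a cluster, feed in real output instead:
#   srun -N 1 --partition=PARTITION_NAME ip -o addr show | awk '$2 != "lo" { print $2 }' | sort -u
awk '$2 != "lo" { print $2 }' /tmp/ip-sample.txt | sort -u
```

You can then check each name that appears in the script's NCCL_SOCKET_IFNAME value against this list.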
A3 Ultra
If you haven't already, then connect to a login node in your cluster.
In the $HOME directory, download the following NCCL scripts:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/import_pytorch_container.sh
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh

To view a list of the network interface names that the compute nodes in a cluster partition use, run the following command:
    srun -N 1 --partition=PARTITION_NAME ip addr show

Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:
Open the build-nccl-tests.sh and run-nccl-tests.sh files with a text editor of your choice.

Compare the network interface names in the run-nccl-tests.sh file with the network interface names in the output of the previous command. If they don't match, then, in the run-nccl-tests.sh file, edit the NCCL_SOCKET_IFNAME variable as follows:

    source /usr/local/gib/scripts/set_nccl_env.sh
    export NCCL_NET=gIB
    export NCCL_SOCKET_IFNAME=NCCL_SOCKET_IFNAMES

Replace NCCL_SOCKET_IFNAMES with a comma-separated list of NCCL socket interface names. For example, enp0s1,enp192s1.

If the partition that contains your nodes isn't the cluster's default partition, then, in the sbatch section of the build-nccl-tests.sh and run-nccl-tests.sh files, add the following line:

    #SBATCH --partition=PARTITION_NAME

Optional: To run a different NCCL benchmark than the default AllGather benchmark, edit the last line of the run-nccl-tests.sh file as follows:

    /nccl/nccl-tests/build/NCCL_BENCHMARK -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200;

Replace NCCL_BENCHMARK with the NCCL benchmark that you want to run on the cluster nodes. For example, all_reduce_perf.
To import the squash container image, run the following command:

    bash ./import_pytorch_container.sh

Importing the squash container image takes approximately 10 minutes.

To build the test and grant it access to all of a node's resources by using the --exclusive flag, run the following command:

    sbatch --partition=PARTITION_NAME --exclusive build-nccl-tests.sh

Building the test takes approximately five minutes.
To run the NCCL test, run the following command:

    sbatch -N TEST_SIZE --partition=PARTITION_NAME --exclusive run-nccl-tests.sh

Replace TEST_SIZE with the number of nodes that you want to test in your nodeset. Specify a value between 1 and the number of nodes in your cluster partition.

Running the NCCL test takes approximately five minutes. When the test completes, the system creates a slurm-JOB_ID.out file in the $HOME directory that contains the results of your test.
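If you only need the headline result, you can pull it out of the Slurm output file directly. This sketch assumes the standard nccl-tests summary line ("# Avg bus bandwidth"); the sample file and its values are fabricated for illustration.

```shell
# Fabricated slurm-JOB_ID.out fragment containing the standard
# nccl-tests summary line (placeholder values).
cat > /tmp/slurm-1234.out <<'EOF'
#       size      count     type redop   time   algbw   busbw
  8589934592 2147483648 float    sum    1234  100.00  187.50
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 187.50
EOF

# On a login node you would use the newest real output file instead:
#   LATEST=$(ls -t "$HOME"/slurm-*.out | head -n 1)
LATEST=/tmp/slurm-1234.out

# Print just the average bus bandwidth value.
grep 'Avg bus bandwidth' "$LATEST" | awk -F': *' '{ print $2 }'
```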
A3 Mega
If you haven't already, then connect to a login node in your cluster.
In the $HOME directory, download the following NCCL scripts:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/import_pytorch_container.sh
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/build-nccl-tests.sh
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/run-nccl-tests.sh

To view a list of the network interface names that the compute nodes in a cluster partition use, run the following command:
    srun -N 1 --partition=PARTITION_NAME ip addr show

Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:
Open the build-nccl-tests.sh and run-nccl-tests.sh files with a text editor of your choice.

Compare the network interface names in the run-nccl-tests.sh file with the network interface names in the output of the previous command. If they don't match, then, in the run-nccl-tests.sh file, edit the NCCL_FASTRAK_CTRL_DEV, NCCL_FASTRAK_IFNAME, and NCCL_SOCKET_IFNAME variables as follows:

    NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
    export NCCL_FASTRAK_CTRL_DEV=NCCL_FASTRAK_CTRL_DEV
    export NCCL_FASTRAK_IFNAME=NCCL_FASTRAK_IFNAME
    export NCCL_SOCKET_IFNAME=NCCL_SOCKET_IFNAMES
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices

Replace the following:

NCCL_FASTRAK_CTRL_DEV: the name of the NCCL fastrak control device interface. For example, enp0s12.
NCCL_FASTRAK_IFNAME: a comma-separated list of NCCL fastrak interface names. For example, enp6s0,enp7s0,enp13s0.
NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names. For example, enp0s1.
If the partition that contains your nodes isn't the cluster's default partition, then, in the sbatch section of the build-nccl-tests.sh and run-nccl-tests.sh files, add the following line at the end of the section:

    #SBATCH --partition=PARTITION_NAME

Optional: To run a different NCCL benchmark than the default AllGather benchmark, edit the last line of the run-nccl-tests.sh file as follows:

    /nccl/nccl-tests/build/NCCL_BENCHMARK -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200;

Replace NCCL_BENCHMARK with the NCCL benchmark that you want to run on the cluster nodes. For example, all_reduce_perf.
To import the squash container image, run the following command:

    bash ./import_pytorch_container.sh

Importing the squash container image takes approximately 10 minutes.

To build the test and grant it access to all of a node's resources by using the --exclusive flag, run the following command:

    sbatch --partition=PARTITION_NAME --exclusive build-nccl-tests.sh

Building the test takes approximately five minutes.
To run the NCCL test, run the following command:

    sbatch -N TEST_SIZE --partition=PARTITION_NAME --exclusive run-nccl-tests.sh

Replace TEST_SIZE with the number of nodes that you want to test in your nodeset. Specify a value between 1 and the number of nodes in your cluster partition.

Running the NCCL test takes approximately five minutes. When the test completes, the system creates a slurm-JOB_ID.out file in the $HOME directory that contains the results of your test.
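Unlike the Ramble-based A4X and A4 test, run-nccl-tests.sh tests one node count per submission. If you want a similar sweep across node counts, you can submit the job once per size. The sketch below only prints the sbatch commands so that you can review them first (remove the echo to submit them); the partition name and node counts are illustrative assumptions.

```shell
# Print one sbatch command per node count (dry run; remove `echo` to submit).
# PARTITION and the node counts are assumptions; adjust them for your cluster.
PARTITION="a3mega"
for N in 2 4 8; do
  echo sbatch -N "$N" --partition="$PARTITION" --exclusive run-nccl-tests.sh
done
```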