Manage cluster health

This document explains how to manage cluster health and perform maintenance tasks in Cluster Director.

Cluster Director provides several tools to help you proactively manage cluster resilience and help ensure optimal operation. By using the Google Cloud console, you can do one or more of the following:

  • Manually start scheduled maintenance events for your cluster.

  • Start the automated repair process for unhealthy nodes.

  • Run health checks across the nodes in your cluster.

Before you begin

When you access Google Cloud services and APIs by using the Google Cloud console, you don't need to set up authentication. You authenticate with your Google Account when you sign in to the console.

Required roles

To get the permissions that you need to manage cluster health, ask your administrator to grant you the Hypercompute Cluster Editor (roles/hypercomputecluster.editor) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Perform cluster management tasks

The following sections describe how you can manage the health of your cluster.

Manually start cluster maintenance

If a maintenance event is scheduled for your cluster, then you can immediately start maintenance rather than waiting for the scheduled time.

To manually start cluster maintenance by using the Google Cloud console, complete the following steps:

  1. In the Google Cloud console, go to the Cluster Director page.

    Go to Cluster Director

  2. In the navigation menu, click Clusters. The Clusters page appears.

  3. In the Clusters table, in the Name column, click the name of the cluster that you want to manage. The cluster details page appears, and the Details tab is selected.

  4. Click the Topology tab. Then, if a maintenance event is scheduled for your cluster, click Start maintenance now.

Report and repair unhealthy cluster nodes

Cluster Director automatically monitors the health of your nodes. If Cluster Director identifies a node as unhealthy due to host errors, then you can report the node to start a repair operation. This action can help you minimize workload disruption and quickly restore the cluster to an optimal state.

To report and repair an unhealthy node in your cluster by using the Google Cloud console, complete the following steps:

  1. In the Google Cloud console, go to the Cluster Director page.

    Go to Cluster Director

  2. In the navigation menu, click Clusters. The Clusters page appears.

  3. In the Clusters table, in the Name column, click the name of the cluster that you want to manage. The cluster details page appears, and the Details tab is selected.

  4. Click the Topology tab. Then, if one or more nodes are unhealthy, click Report all unhealthy nodes.

Run health checks across cluster nodes

In addition to the automated health monitoring that Cluster Director performs, you can manually run a health check across the compute nodes in a cluster partition by using NVIDIA Collective Communications Library (NCCL) tests. If a test identifies one or more unhealthy nodes, then you can start a repair operation or modify your cluster. This approach helps minimize the number of unhealthy nodes in a cluster partition.

Based on the machine type that the compute nodes in a cluster partition use, select one of the following options:

A4X

  1. If you haven't already, then connect to a login node in your cluster.

  2. In the $HOME directory, download the following NCCL test script:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a4-highgpu-8g/system_benchmarks/run-nccl-tests-via-ramble.sh
    
  3. To view a list of network interface names that the compute nodes in a cluster partition use, run the following command:

    srun -N 1 --partition=PARTITION_NAME ip addr show
    

    Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.
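
    Alternatively, you can list the partitions directly from a login node. The following is a minimal sketch that uses Slurm's sinfo command:

    # Summarize all partitions; the default partition is marked with an
    # asterisk (*) in the PARTITION column.
    sinfo --summarize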

  4. To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:

    1. Open the run-nccl-tests-via-ramble.sh file with a text editor of your choice.

    2. Compare the network interface names in the run-nccl-tests-via-ramble.sh file with the network interface names that you viewed in the output of the previous command. If they don't match, then, in the env_vars section within the run-nccl-tests-via-ramble.sh file, edit the OMPI_MCA_btl_tcp_if_include, UCX_NET_DEVICES, and NCCL_SOCKET_IFNAME variables as follows:

      env_vars:
        set:
          OMPI_MCA_btl_tcp_if_include: TCP_INTERFACE_NAME
          PMIX_MCA_gds: ^ds12
          UCX_NET_DEVICES: UCX_NET_DEVICES_LIST
          PMIX_MCA_psec: native
          UCX_IB_FORK_INIT: n
          NCCL_NET: gIB
          NCCL_SOCKET_IFNAME: NCCL_SOCKET_IFNAMES
          LD_LIBRARY_PATH: /usr/local/gib/lib64:/usr/local/nvidia/lib
      

      Replace the following:

      • TCP_INTERFACE_NAME: the name of the Transmission Control Protocol (TCP) interface to include—for example, enp0s1.

      • UCX_NET_DEVICES_LIST: a comma-separated list of Unified Communication X (UCX) network devices—for example, gpu0rdma0,gpu1rdma0,gpu2rdma0,gpu3rdma0.

      • NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names—for example, enp0s1,enp192s1.
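
      To verify your edits before you run the test, you can print the edited variables back out. The following is a minimal sketch, assuming that the file is in your current directory:

      # Show the edited network-related variables with their line numbers.
      grep -nE 'OMPI_MCA_btl_tcp_if_include|UCX_NET_DEVICES|NCCL_SOCKET_IFNAME' run-nccl-tests-via-ramble.sh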

    3. If the partition that contains the nodes that you want to test isn't the default partition, then, at the end of the sbatch section, add the following line:

      #SBATCH --partition=PARTITION_NAME
      
    4. By default, the test runs the AllGather, AllReduce, and ReduceScatter benchmarks across an increasing number of nodes, up to the number of nodes that are available in the cluster partition. The test submits a separate job for each combination of benchmark and node count. To edit the benchmarks to run or the number of nodes to test, in the applications section, edit the workload and n_nodes variables as follows:

      applications:
        nccl-tests:
          workloads:
            '{workload}':
              experiments:
                '{workload}-{n_nodes}':
                  variables:
                    workload: [NCCL_BENCHMARK]
                    n_nodes: [NUMBER_OF_NODES]
                  matrix:
                  - n_nodes
                  - workload
      

      Replace the following:

      • NCCL_BENCHMARK: a comma-separated list of NCCL benchmarks to run on the nodes—for example, all-gather, all-reduce, reduce-scatter.

      • NUMBER_OF_NODES: a comma-separated list of the node counts to run the test against. Specify values between 1 and the number of nodes in your cluster partition—for example, 2, 4, 8, 16, 32.
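
      To help you choose values for NUMBER_OF_NODES, you can check how many nodes the partition contains. The following is a minimal sketch that uses Slurm's sinfo command; replace PARTITION_NAME as before:

      # Print the total number of nodes in the partition.
      sinfo --partition=PARTITION_NAME --noheader --format="%D"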

  5. To run the NCCL test script, run the following command:

    nohup bash ./run-nccl-tests-via-ramble.sh "$HOME" >& nccl-$(date -Iseconds).log & tail -f nccl-*.log
    

    The NCCL test can take some time to complete. When it completes, the output is similar to the following:

    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload        n_nodes msg_size        busbw
    all-gather      2       1073741824      XXX.XX
    all-gather      2       2147483648      XXX.XX
    all-gather      2       4294967296      XXX.XX
    all-gather      2       8589934592      XXX.XX
    ...
    all-reduce      2       1073741824      XXX.XX
    ...
    reduce-scatter  2       1073741824      XXX.XX
    ...
    
    -------- Benchmarking Complete -------
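
    Because the script submits a separate Slurm job for each benchmark and node-count combination, you can monitor progress from another shell on a login node. The following is a minimal sketch:

    # List your queued and running test jobs.
    squeue --user="$USER"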
    

A4

  1. If you haven't already, then connect to a login node in your cluster.

  2. In the $HOME directory, download the following NCCL test script:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a4-highgpu-8g/system_benchmarks/run-nccl-tests-via-ramble.sh
    
  3. To view a list of network interface names that the compute nodes in a cluster partition use, run the following command:

    srun -N 1 --partition=PARTITION_NAME ip addr show
    

    Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

  4. To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:

    1. Open the run-nccl-tests-via-ramble.sh file with a text editor of your choice.

    2. Compare the network interface names in the run-nccl-tests-via-ramble.sh file with the network interface names that you viewed in the output of the previous command. If they don't match, then, in the env_vars section within the run-nccl-tests-via-ramble.sh file, edit the OMPI_MCA_btl_tcp_if_include and NCCL_SOCKET_IFNAME variables as follows:

      env_vars:
        set:
          OMPI_MCA_btl_tcp_if_include: TCP_INTERFACE_NAME
          PMIX_MCA_gds: ^ds12
          NCCL_NET: gIB
          NCCL_SOCKET_IFNAME: NCCL_SOCKET_IFNAMES
          LD_LIBRARY_PATH: /usr/local/gib/lib64:/usr/local/nvidia/lib
      

      Replace the following:

      • TCP_INTERFACE_NAME: the name of the Transmission Control Protocol (TCP) interface to include—for example, enp0s1.

      • NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names—for example, enp0s1,enp192s1.
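
      To verify your edits before you run the test, you can print the edited variables back out. The following is a minimal sketch, assuming that the file is in your current directory:

      # Show the edited network-related variables with their line numbers.
      grep -nE 'OMPI_MCA_btl_tcp_if_include|NCCL_SOCKET_IFNAME' run-nccl-tests-via-ramble.sh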

    3. If the partition that contains the nodes that you want to test isn't the default partition, then, at the end of the sbatch section, add the following line:

      #SBATCH --partition=PARTITION_NAME
      
    4. By default, the test runs the AllGather, AllReduce, and ReduceScatter benchmarks across an increasing number of nodes, up to the number of nodes in the cluster partition. The test submits a separate job for each combination of benchmark and node count. To edit the benchmarks to run or the number of nodes to test, in the applications section, edit the workload and n_nodes variables as follows:

      applications:
        nccl-tests:
          workloads:
            '{workload}':
              experiments:
                '{workload}-{n_nodes}':
                  variables:
                    workload: [NCCL_BENCHMARK]
                    n_nodes: [NUMBER_OF_NODES]
                  matrix:
                  - n_nodes
                  - workload
      

      Replace the following:

      • NCCL_BENCHMARK: a comma-separated list of NCCL benchmarks to run on the nodes—for example, all-gather, all-reduce, reduce-scatter.

      • NUMBER_OF_NODES: a comma-separated list of the node counts to run the test against. Specify values between 1 and the number of nodes in your cluster partition—for example, 2, 4, 8, 16, 32.

  5. To run the NCCL test script, run the following command:

    nohup bash ./run-nccl-tests-via-ramble.sh "$HOME" >& nccl-$(date -Iseconds).log & tail -f nccl-*.log
    

    The NCCL test can take some time to complete. When it completes, the output is similar to the following:

    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload        n_nodes msg_size        busbw
    all-gather      2       1073741824      XXX.XX
    all-gather      2       2147483648      XXX.XX
    all-gather      2       4294967296      XXX.XX
    all-gather      2       8589934592      XXX.XX
    ...
    all-reduce      2       1073741824      XXX.XX
    ...
    reduce-scatter  2       1073741824      XXX.XX
    ...
    
    -------- Benchmarking Complete -------
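
    After the run completes, you can extract just the summary from the log file. The following is a minimal sketch, assuming the log-name pattern from the command in the previous step:

    # Print everything from the summary header to the end of the most
    # recent log file.
    sed -n '/---- SUMMARY/,$p' "$(ls -t nccl-*.log | head -n 1)"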
    

A3 Ultra

  1. If you haven't already, then connect to a login node in your cluster.

  2. In the $HOME directory, download the following NCCL scripts:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/import_pytorch_container.sh
    
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
    
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh
    
  3. To view a list of network interface names that the compute nodes in a cluster partition use, run the following command:

    srun -N 1 --partition=PARTITION_NAME ip addr show
    

    Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

  4. To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:

    1. Open the build-nccl-tests.sh and run-nccl-tests.sh files with a text editor of your choice.

    2. Compare the network interface names in the run-nccl-tests.sh file with the network interface names that you viewed in the output of the previous command. If they don't match, then, in the run-nccl-tests.sh file, edit the NCCL_SOCKET_IFNAME variable as follows:

      source /usr/local/gib/scripts/set_nccl_env.sh
      export NCCL_NET=gIB
      export NCCL_SOCKET_IFNAME=NCCL_SOCKET_IFNAMES
      

      Replace NCCL_SOCKET_IFNAMES with a comma-separated list of NCCL socket interface names—for example, enp0s1,enp192s1.
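
      To verify the edit, you can print the variable back out. The following is a minimal sketch, assuming that the file is in your current directory:

      # Show the edited NCCL socket interface setting with its line number.
      grep -n 'NCCL_SOCKET_IFNAME' run-nccl-tests.sh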

    3. If the partition that contains your nodes isn't the cluster's default partition, then, at the end of the sbatch section in the build-nccl-tests.sh and run-nccl-tests.sh files, add the following line:

      #SBATCH --partition=PARTITION_NAME
      
    4. Optional: To run a different NCCL benchmark than the default AllGather benchmark, edit the last line of the run-nccl-tests.sh file as follows:

      /nccl/nccl-tests/build/NCCL_BENCHMARK -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200;
      

      Replace NCCL_BENCHMARK with the type of NCCL benchmark that you want to run on the cluster nodes—for example, all_reduce_perf.

  5. To import the squash container image, run the following command:

    bash ./import_pytorch_container.sh
    

    Importing the squash container image takes approximately 10 minutes to complete.

  6. To build the test and, by using the --exclusive flag, grant it access to all of a node's resources, run the following command:

    sbatch --partition=PARTITION_NAME --exclusive build-nccl-tests.sh
    

    Building the test takes approximately five minutes to complete.

  7. To run the NCCL test, run the following command:

    sbatch -N TEST_SIZE --partition=PARTITION_NAME --exclusive run-nccl-tests.sh
    

    Replace TEST_SIZE with the number of nodes that you want to test in your nodeset. Specify a value between 1 and the number of nodes in your cluster partition.

    Running the NCCL test takes approximately five minutes to complete. When the test completes, the system creates a slurm-JOB_ID.out file in the $HOME directory that contains the results of your test, where JOB_ID is the ID of the test job.
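
    For example, to view the results of the most recent run, you can use a sketch like the following:

    # Open the newest Slurm output file in a pager.
    less "$(ls -t "$HOME"/slurm-*.out | head -n 1)"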

A3 Mega

  1. If you haven't already, then connect to a login node in your cluster.

  2. In the $HOME directory, download the following NCCL scripts:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/import_pytorch_container.sh
    
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/build-nccl-tests.sh
    
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/tags/v1.65.0/examples/machine-learning/a3-megagpu-8g/nccl-tests/run-nccl-tests.sh
    
  3. To view a list of network interface names that the compute nodes in a cluster partition use, run the following command:

    srun -N 1 --partition=PARTITION_NAME ip addr show
    

    Replace PARTITION_NAME with the name of the cluster partition that you want to test. To review the partitions in your cluster, view the details of your cluster.

  4. To help ensure that the network interface names and the name of the partition to test are correct, complete the following steps:

    1. Open the build-nccl-tests.sh and run-nccl-tests.sh files with a text editor of your choice.

    2. Compare the network interface names in the run-nccl-tests.sh file with the network interface names that you viewed in the output of the previous command. If they don't match, then, in the run-nccl-tests.sh file, edit the NCCL_FASTRAK_CTRL_DEV, NCCL_FASTRAK_IFNAME, and NCCL_SOCKET_IFNAME variables as follows:

      NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
      export NCCL_FASTRAK_CTRL_DEV=NCCL_FASTRAK_CTRL_DEV
      export NCCL_FASTRAK_IFNAME=NCCL_FASTRAK_IFNAME
      export NCCL_SOCKET_IFNAME=NCCL_SOCKET_IFNAMES
      export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices
      

      Replace the following:

      • NCCL_FASTRAK_CTRL_DEV: the NCCL fastrak control device interface name—for example, enp0s12.

      • NCCL_FASTRAK_IFNAME: a comma-separated list of NCCL fastrak interface names—for example, enp6s0,enp7s0,enp13s0.

      • NCCL_SOCKET_IFNAMES: a comma-separated list of NCCL socket interface names—for example, enp0s1.
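
      To verify your edits, you can print the variables back out. The following is a minimal sketch, assuming that the file is in your current directory:

      # Show the edited NCCL_FASTRAK and socket interface settings.
      grep -nE 'NCCL_FASTRAK_CTRL_DEV|NCCL_FASTRAK_IFNAME|NCCL_SOCKET_IFNAME' run-nccl-tests.sh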

    3. If the partition that contains your nodes isn't the cluster's default partition, then, at the end of the sbatch section in the build-nccl-tests.sh and run-nccl-tests.sh files, add the following line:

      #SBATCH --partition=PARTITION_NAME
      
    4. Optional: To run a different NCCL benchmark than the default AllGather benchmark, edit the last line of the run-nccl-tests.sh file as follows:

      /nccl/nccl-tests/build/NCCL_BENCHMARK -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200;
      

      Replace NCCL_BENCHMARK with the type of NCCL benchmark that you want to run on the cluster nodes—for example, all_reduce_perf.

  5. To import the squash container image, run the following command:

    bash ./import_pytorch_container.sh
    

    Importing the squash container image takes approximately 10 minutes to complete.

  6. To build the test and, by using the --exclusive flag, grant it access to all of a node's resources, run the following command:

    sbatch --partition=PARTITION_NAME --exclusive build-nccl-tests.sh
    

    Building the test takes approximately five minutes to complete.

  7. To run the NCCL test, run the following command:

    sbatch -N TEST_SIZE --partition=PARTITION_NAME --exclusive run-nccl-tests.sh
    

    Replace TEST_SIZE with the number of nodes that you want to test in your nodeset. Specify a value between 1 and the number of nodes in your cluster partition.

    Running the NCCL test takes approximately five minutes to complete. When the test completes, the system creates a slurm-JOB_ID.out file in the $HOME directory that contains the results of your test, where JOB_ID is the ID of the test job.
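
    To confirm that the test job completed successfully, you can check its state by using Slurm's accounting tools. The following is a minimal sketch; replace JOB_ID with the ID of your test job:

    # Show the state and exit code of the test job.
    sacct --jobs=JOB_ID --format=JobID,JobName,State,ExitCode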

What's next