Cluster resiliency

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

Managed Training on reserved clusters integrates a comprehensive health check system to help ensure the reliability of compute nodes and the stability of your Slurm jobs. The system provides both automated and manual recovery options. An automated process runs during job execution to monitor critical components such as GPU health and disk usage, and it automatically replaces nodes that fail. For situations that require user intervention, Managed Training provides a reportFaultyNodes API that lets you manually delete a specific faulty node or report a suspected hardware failure on its underlying host.

Run a test workload to verify GPU functionality

Step 1: Connect to the cluster nodes using SSH

From Cloud Shell or the Google Cloud console, connect to the login node using IAP. The following example shows the command for Cloud Shell:

gcloud compute ssh --zone $ZONE LOGIN_NODE_NAME --tunnel-through-iap --project $PROJECT_ID
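
If you don't already know the login node's name, one way to find it is to list the cluster's instances and filter by name. This is a sketch that assumes the login node's name contains the string "login":

gcloud compute instances list --zones $ZONE --project $PROJECT_ID --filter="name~login"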

Step 2: Run a standard Slurm command

After connecting to a login node, run a few standard Slurm commands to verify that the cluster is functioning correctly.

~$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
partition1*    up   infinite      2   idle hcsa3m1236-a3mnodeset-[0-1]

Next, submit a batch job.

~$ sbatch --qos normal --wrap "echo start! && sleep 10s && echo done!"

You should see that a slurm-job-id.out file is created in your home directory.
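
For example, assuming Slurm assigned the job ID 42, you can print the job's output after it finishes:

~$ cat slurm-42.out
start!
done!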

Step 3: Run a GPU workload

Save the following content as a script file named test.sh in your home directory.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:8
#SBATCH --job-name=nvidia_smi

srun nvidia-smi -L

Set the script's permissions to 755 to make it executable, then submit the Slurm job:
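
~$ chmod 755 test.sh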

~$ sbatch ./test.sh

Slurm saves the script's output to a file named slurm-job-id.out.

Expected output:

GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-f75045e8-4d87-49d1-2eb9-39ec2baddf9b)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-b91556d8-5215-d0ed-50b8-a88720e5b29c)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-7600155a-0036-35f5-9489-a7b4ed0ce887)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-a402e125-7841-033f-f08b-7921526c121f)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-20eef8f8-b2c7-1716-5ce7-7f64475bd2c0)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-65463286-e587-b52f-4d5b-8880eecbf0e7)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-d5ff75e7-dd54-edf6-a684-33c26fc365e1)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-26e81ae2-11fd-9d7e-95b6-c186e5173007)
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-e66a185a-b40c-81d9-d35d-19cab811df34)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-d23e5cf7-afd8-bec2-1487-9e27eeb6aae0)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-4dde1b05-ea5e-01e9-5c1e-e1c0d3b4b113)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-3a0d734a-6fb8-d841-a97f-d6846553ea7f)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-76fe0d37-08b2-a3a6-8ddf-55501426bc7c)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-9e0a41e1-b399-8934-01af-6198b749c02a)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-dddd09ee-c944-1098-9c4e-d96f8762ecb1)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-df52c109-0ac1-30cc-226b-85b1a8a6bc16)

Cluster health verification

This section shows how to test your Managed Training cluster using the Cluster Health Scanner (CHS) tool, which is pre-installed on the Managed Training image. The CHS tool checks the health of the cluster, running tests such as DCGM diagnostics and NCCL tests to verify that the cluster is ready to run your workloads.

From the login node of the cluster, run the following commands to test the cluster with the CHS tool.

export CLUSTER_ID=<your_cluster_id>
export PARTITION=a3u
export MACHINE_TYPE=a3-ultragpu-8g
cd ~
/opt/cluster-health-scanner/deploy/slurm/cluster-validation.sh \
--nodelist=${CLUSTER_ID}-${PARTITION}-[0-1] \
--nodes=2 \
--partition=${PARTITION} \
--machine-type=${MACHINE_TYPE} \
--relative-exec-path=../../opt/cluster-health-scanner/deploy/slurm \
--results-dir=results

A successful test run provides two sets of results:

  • Summary Output: A brief summary is printed to the console, which should resemble the example that follows this list.
  • Detailed Logs: For a complete report, see the detailed logs saved in the ~/results directory.

Starting DCGM Diagnostics...
DCGM diagnostics passing on all nodes!
Starting NCCL all_reduce_perf...
CURR_NODES:  cluster-id-0
cluster-id-1
NCCL test passing on all nodes!
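
To review the full reports, you can list the files saved under the ~/results directory, for example:

~$ ls ~/results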

Automated health checks and recovery

To ensure node reliability, Managed Training continuously monitors node health using the following suite of automated checks. Managed Training runs health checks during the Slurm prolog (before a job starts) and epilog (after a job completes).

Health check suite

  • GPU Health: Performs detailed, individual GPU diagnostics including nvidia-smi, dcgmi, and xid code monitoring.
  • Disk Usage: Checks for high disk usage on critical partitions (/, /mnt/localssd, /mnt/localdisk) to prevent jobs from failing due to lack of space.
  • Network Health: Verifies that primary network interfaces have an IPv4 address. If an issue is found, it attempts to self-heal by resetting the interface.
  • CPU Load: Monitors the system's load average and logs a warning if it exceeds a predefined threshold.
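
These checks run automatically, but when you're debugging you can perform similar spot checks by hand from a node. The following is a minimal sketch of comparable commands, assuming the NVIDIA and DCGM tools are installed and the listed mount points exist; it isn't the exact script that Managed Training runs:

#!/bin/bash
# Illustrative spot checks that loosely mirror the automated suite.

# GPU health: confirm every GPU enumerates, then run a short DCGM diagnostic.
nvidia-smi -L
dcgmi diag -r 1

# Disk usage: flag critical mounts that are more than 90% full.
df -h / /mnt/localssd /mnt/localdisk | awk 'NR>1 && $5+0 > 90 {print "High disk usage:", $6, $5}'

# CPU load: print the 1-, 5-, and 15-minute load averages.
uptime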

Failure recovery process

If a check detects a severe, unrecoverable error, Managed Training automatically initiates a failure recovery process. The standard process involves draining the faulty nodes, requeuing the affected Slurm job, and then deleting and recreating the drained nodes to restore them to a healthy state.

This automated recovery is subject to the following conditions:

  • Restart Limit: The recovery process is skipped if the affected Slurm job has already been restarted a set number of times.

  • GPU Utilization: Node deletion and recreation are also skipped if the job running on the node doesn't use all of the available GPUs. In this case, the node is only drained to prevent new jobs from being scheduled on it.
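
For reference, the draining and requeuing steps in this recovery process correspond roughly to the following standard Slurm commands. This is only an illustration of the effect; Managed Training performs these actions, plus node deletion and recreation, automatically. The node name and job ID are placeholders:

# Drain the faulty node so that no new jobs are scheduled on it.
scontrol update NodeName=CLUSTER_ID-NODEPOOL_ID-0 State=DRAIN Reason="failed health check"

# Requeue the affected job so it can restart on healthy nodes.
scontrol requeue JOB_ID

# Inspect drained nodes and queued jobs.
sinfo -R
squeue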

Manually managing faulty compute nodes

Managed Training provides APIs for manually reporting and managing faulty compute nodes, which is particularly useful if automated health checks don't resolve an issue. You can only run these operations on one node at a time.

  • Delete Node: Removes a specified faulty node from the cluster. This is the default action. Best for general errors or when a node is unresponsive and needs to be recycled.

  • Report Host as Faulty: Reports the underlying physical host as faulty, triggering a repair or migration process. Best for suspected hardware failures on the physical machine hosting the GPU node.
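
The examples in this section use gcurl as shorthand for an authenticated curl command. If gcurl isn't already defined in your environment, you can create a similar alias yourself; this sketch assumes you have gcloud credentials available:

alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'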

Action 1: Delete a faulty node

This action deletes the specified node. The outcome of this operation depends on whether the node is classified as "static" or "dynamic" by Slurm:

  • Static Nodes: If a deleted node's index is less than the minimum node count of the node pool, a new compute node is recreated with the same name and specifications.

  • Dynamic Nodes: If a deleted node's index is greater than the minimum node count, it's only recreated if there is a pending workload scheduled for it. Otherwise, it is removed.

API request to delete a node

To delete a faulty node, execute the following POST request. The NODE_ID should be in the format CLUSTER_ID-NODEPOOL_ID-INDEX.

  gcurl -X POST \
    https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID:reportFaultyNodes \
    -d '{"nodeActions": [{"nodeId": "NODE_ID"}]}'

Check operation status

You can monitor the result of the reportFaultyNodes action by checking the operation status.

  gcurl https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/operations/OPERATION_ID
  

Action 2: Report a host as faulty

You can report the physical host of a GPU node as faulty if you suspect a hardware failure.

  • Supported VMs: A3 Ultra and A4 High-GPU

  • Node State: The target node must be in a RUNNING state before you call the API. It will transition to REPAIRING upon a successful call and return to RUNNING after the host is repaired or the node is recreated on a new host. This is a best-effort operation.

Prerequisite: Grant IAM role

To use this feature, you must grant the Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) role to the Vertex AI Service Agent.

  PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")

  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com" \
    --role="roles/compute.instanceAdmin.v1"
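
To confirm that the binding was created, you can inspect the project's IAM policy and filter for the service agent. This example reuses the PROJECT_NUMBER variable from the previous step:

  gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com" \
    --format="table(bindings.role)"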
  

API request to report a host

Execute the following POST request to report the underlying host as faulty. Provide one or more observed behaviors and descriptions for the faultReasons.

gcurl -X POST \
  https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID:reportFaultyNodes \
  -d '{"nodeActions": [{"nodeId": "NODE_ID", "reportFaultyHost": {"faultReasons": [{"behavior": "BEHAVIOR_1", "description": "DESCRIPTION_1"}, {"behavior": "BEHAVIOR_2", "description": "DESCRIPTION_2"}]}}]}'
  

Putting it all together

By combining the automated health checks with the manual controls described on this page, you can maintain a highly resilient training environment. Proactively managing cluster health by deleting faulty nodes or reporting hardware issues helps maximize uptime and keeps your training jobs on track. For persistent or complex issues, contact Google Cloud support for in-depth diagnostics and assistance.