Before starting a job on a compute node in your cluster, Slurm runs a prolog
script named a_chs_gpu_health_check_hcs.sh to quickly check the node's GPU
health. If this check fails, then Slurm drains the node, and
Cluster Director sets the node state to Unhealthy. To identify
unhealthy nodes in your cluster,
view the cluster topology.
This document explains how to troubleshoot a node's GPU health check failure and return the node to service. If the health check failed due to a false positive, or if your workload can tolerate the hardware issue that the health check detected, then you can bypass the health check for a specific job.
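If you have access to a login node, you can also list drained nodes directly from Slurm. The following command is a quick sketch that uses standard Slurm tooling; the output format depends on your Slurm version:

sinfo -R

Drained nodes appear with the reason that was recorded when Slurm drained them.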
Resolve a GPU health check failure
To resolve a compute node's health check failure and return the node to service, complete the following steps:
Identify the cause of the failure
To identify the cause of a GPU health check failure in a compute node, review the node's log files by using the Logs Explorer:
In the Google Cloud console, go to the Logs Explorer page.
In the Query pane, enter the following query. If you can't see the Query pane, then click the Show query toggle to the on position.
SEARCH("`/var/log/slurm/chs_health_check.log`") resource.labels.instance_id="NODE_NAME"

Replace NODE_NAME with the name of the compute node.

To run the query, click Run query. The Query results pane displays error messages from nvidia-smi and DCGM diagnostics. These logs show a history of diagnostic results from the prolog health check script, which runs each time Slurm assigns a job to a node. The error messages describe critical failures, including asynchronous driver events like XID errors. An error may occur when no job is running on the node, but the script only detects it when a job tries to start.
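If you prefer the command line, you can run an equivalent query with the Google Cloud CLI. The following is a minimal sketch, assuming that the gcloud CLI is installed, authenticated, and set to the cluster's project, and that your Logging filter accepts the SEARCH function:

gcloud logging read 'SEARCH("`/var/log/slurm/chs_health_check.log`") AND resource.labels.instance_id="NODE_NAME"' --limit=50

Replace NODE_NAME with the name of the compute node.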
Resolve the error
After you identify the cause of the GPU health check failure in the previous section, choose one of the following resolution methods based on the error:
If you can't resolve the error, then contact your account team or support.
Resolve a hardware error
If the GPU health check failure indicates a hardware error, then do one of the following based on the specific error:
Recreate the node for XID errors: if the log files show an XID error that requires a node recreation (to confirm the specific XID, see the sketch after this list), then recreate the node by completing the following steps:
If you haven't already, then connect to a login node in your cluster.
Recreate the node:
sudo scontrol update nodename=NODE_NAME state=POWER_DOWN_ASAP reason="Recreate a faulty compute node."
Report the node for other persistent errors: for persistent hardware errors, such as a permanent GPU fault, corrupted drivers, or other unrecoverable problems, report the node to start a repair operation.
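Before you recreate or report a node, you can confirm the XID error that the health check detected by reading the kernel log on the affected node. This is a quick sketch; the exact message format depends on the NVIDIA driver version:

sudo dmesg -T | grep -i xid

Compare the reported XID value against NVIDIA's XID error documentation to decide whether the node needs a recreation or a hardware repair.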
Resolve a temporary issue
If the GPU health check failure indicates a temporary issue, such as a transient power spike or cosmic ray-induced bit flip, then do the following:
If you haven't already, then connect to the compute node.
Stop all tasks that use the node's GPUs:
sudo systemctl stop nvidia-persistenced.service nvidia-dcgm.service google-cloud-ops-agent-opentelemetry-collector.service

Reset the GPUs:

sudo nvidia-smi -r

Restart the tasks:
sudo systemctl start nvidia-persistenced.service nvidia-dcgm.service google-cloud-ops-agent-opentelemetry-collector.service
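After the services restart, you can verify that the reset succeeded by confirming that all GPUs are visible and that no compute processes remain on them. The following commands are a sketch that uses standard nvidia-smi queries:

nvidia-smi

nvidia-smi --query-compute-apps=pid,process_name --format=csv

If the reset fails because a process still holds a GPU, stop that process and run the reset again.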
Run a manual health check
After you attempt to fix a GPU health check failure, as described in the
previous section, run a manual health check to verify the compute node's status.
If the a_chs_gpu_health_check_hcs.sh script runs without errors, then you've
resolved the issue.
To manually run a health check on a compute node, complete the following steps:
If you haven't already, then connect to the compute node.
Run the a_chs_gpu_health_check_hcs.sh prolog script:

sudo /slurm/custom_scripts/prolog.d/a_chs_gpu_health_check_hcs.sh
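Slurm treats a non-zero prolog exit code as a failure, so a quick way to interpret the manual run is to print the script's exit code and review the tail of the health check log. This is a minimal sketch; the log path is the one shown earlier in this document:

sudo /slurm/custom_scripts/prolog.d/a_chs_gpu_health_check_hcs.sh; echo "Exit code: $?"

sudo tail -n 20 /var/log/slurm/chs_health_check.log

An exit code of 0 indicates that the check passed.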
Undrain the node
If you resolved the issue in the previous sections, and you want to return the compute node to service and let it accept new jobs, then undrain the node:
If you haven't already, then connect to a login node in your cluster.
Undrain the node:
sudo scontrol update nodename=NODE_NAME state=RESUME
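To confirm that the node returned to service, check its state from the login node. This is a quick sketch that uses standard Slurm commands; the exact state strings vary by Slurm version:

sinfo -n NODE_NAME

scontrol show node NODE_NAME | grep -i state

The node is ready to accept jobs when its state no longer includes a drain flag.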
Before starting new jobs on the node, Slurm runs a GPU health check.
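To exercise that health check end to end, you can submit a small test job that is pinned to the node. The following srun invocation is a sketch; adjust the GPU count and any partition options to match your cluster:

srun --nodelist=NODE_NAME --gpus=1 nvidia-smi

If the prolog health check passes, the job runs and prints the node's GPU inventory. If the check fails again, Slurm drains the node again and the job doesn't run on it.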