Troubleshoot GPU health check failures

Before starting a job on a compute node in your cluster, Slurm runs a prolog script named a_chs_gpu_health_check_hcs.sh to quickly check the node's GPU health. If this check fails, then Slurm drains the node, and Cluster Director sets the node state to Unhealthy. To identify unhealthy nodes in your cluster, view the cluster topology.
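You can also list drained nodes from the command line by using standard Slurm tooling. The following is a minimal sketch, assuming that you run it from a login node in your cluster:

    # List nodes that Slurm has drained, along with the recorded reason.
    sinfo --states=drain --list-reasons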

This document explains how to troubleshoot a node's GPU health check failure and return the node to service. If the health check failed due to a false positive, or if your workload can tolerate the hardware issue that the health check detected, then you can bypass the health check for a specific job.

Resolve a GPU health check failure

To resolve a compute node's health check failure and return the node to service, complete the following steps:

  1. Identify the cause of the failure

  2. Resolve the error

  3. Run a manual health check

  4. Undrain the node

Identify the cause of the failure

To identify the cause of a GPU health check failure in a compute node, review the node's log files by using the Logs Explorer:

  1. In the Google Cloud console, go to the Logs Explorer page.

    Go to Logs Explorer

  2. In the Query pane, enter the following query. If you can't see the Query pane, then set the Show query toggle to the on position.

    SEARCH("`/var/log/slurm/chs_health_check.log`")
    resource.labels.instance_id="NODE_NAME"
    

    Replace NODE_NAME with the name of the compute node.

  3. To run the query, click Run query. The Query results pane displays error messages from nvidia-smi and DCGM diagnostics. These logs show a history of diagnostic results from the prolog health check script, which runs each time Slurm assigns a job to the node. The error messages describe critical failures, including asynchronous driver events such as XID errors. Because the script runs only at job start, an error can occur while no job is running on the node, but it isn't detected until the next job tries to start. You can also inspect this log directly on the node, as shown in the sketch after these steps.
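If you have SSH access to the compute node, you can inspect the same log file directly. The following is a minimal sketch, assuming that the prolog script writes its diagnostics to the /var/log/slurm/chs_health_check.log path that the preceding query searches for:

    # View the most recent health check results on the node itself.
    sudo tail -n 100 /var/log/slurm/chs_health_check.log

    # Search the log and the kernel ring buffer for XID errors.
    sudo grep -i "xid" /var/log/slurm/chs_health_check.log
    sudo dmesg | grep -i "xid"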

Resolve the error

After you identify the cause of the GPU health check failure, as described in the previous section, choose one of the following resolution methods based on the error: resolve a hardware error or resolve a temporary issue.

If you can't resolve the error, then contact your account team or support.

Resolve a hardware error

If the GPU health check failure indicates a hardware error, then do one of the following based on the specific error:

  • Recreate the node for XID errors: if the log files show an XID error that requires a node recreation, then recreate the node by completing the following steps:

    1. If you haven't already, then connect to a login node in your cluster.

    2. Recreate the node:

      sudo scontrol update nodename=NODE_NAME state=POWER_DOWN_ASAP reason="Recreate a faulty compute node."

      Replace NODE_NAME with the name of the compute node.

  • Report the node for other persistent errors: for persistent hardware errors, such as a permanent GPU fault, corrupted drivers, or other unrecoverable problems, report the node to start a repair operation.
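Before you recreate or report a node, you can confirm its current state and the reason that Slurm recorded when it drained the node. The following is a minimal sketch that uses standard Slurm commands from a login node, where NODE_NAME is the name of the compute node:

    # Show the node's state and the drain reason that Slurm recorded.
    scontrol show node NODE_NAME | grep -E "State|Reason"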

Resolve a temporary issue

If the GPU health check failure indicates a temporary issue, such as a transient power spike or a cosmic-ray-induced bit flip, then do the following:

  1. If you haven't already, then connect to the compute node.

  2. Stop all tasks that use the node's GPUs:

    sudo systemctl stop nvidia-persistenced.service nvidia-dcgm.service google-cloud-ops-agent-opentelemetry-collector.service
    
  3. Reset the GPUs:

    sudo nvidia-smi -r
    
  4. Restart the tasks:

    sudo systemctl start nvidia-persistenced.service nvidia-dcgm.service google-cloud-ops-agent-opentelemetry-collector.service
    
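The following is a minimal sketch that chains these steps and then verifies that the driver can enumerate the GPUs again. It assumes the same service names that are shown in the preceding steps; adjust them if your node image differs:

    #!/bin/bash
    # Reset the GPUs on a compute node after a temporary issue.
    set -euo pipefail

    # Services that hold the GPUs open and must be stopped before the reset.
    services="nvidia-persistenced.service nvidia-dcgm.service google-cloud-ops-agent-opentelemetry-collector.service"

    sudo systemctl stop ${services}     # stop all tasks that use the GPUs
    sudo nvidia-smi -r                  # reset the GPUs
    sudo systemctl start ${services}    # restart the tasks

    # Confirm that the driver can see the GPUs again.
    nvidia-smi -L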

Run a manual health check

After you attempt to fix a GPU health check failure, as described in the previous section, run a manual health check to verify the compute node's status. If the a_chs_gpu_health_check_hcs.sh script runs without errors, then you've resolved the issue.

To manually run a health check on a compute node, complete the following steps:

  1. If you haven't already, then connect to the compute node.

  2. Run the a_chs_gpu_health_check_hcs.sh prolog script:

    sudo /slurm/custom_scripts/prolog.d/a_chs_gpu_health_check_hcs.sh
    
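The following is a minimal sketch that runs the check and reports the result, assuming that the script signals a failure with a nonzero exit code, as Slurm prolog scripts typically do:

    # Run the prolog health check and report whether it passed.
    if sudo /slurm/custom_scripts/prolog.d/a_chs_gpu_health_check_hcs.sh; then
      echo "GPU health check passed."
    else
      echo "GPU health check failed. Review /var/log/slurm/chs_health_check.log for details."
    fi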

Undrain the node

If you resolved the issue in the previous sections, and you want to return the compute node to service and let it accept new jobs, then undrain the node:

  1. If you haven't already, then connect to a login node in your cluster.

  2. Undrain the node:

    sudo scontrol update nodename=NODE_NAME state=RESUME

    Replace NODE_NAME with the name of the compute node.


Before starting new jobs on the node, Slurm runs a GPU health check.
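To confirm that the node has returned to service, you can check its state from a login node. The following is a minimal sketch, where NODE_NAME is the name of the compute node:

    # Verify that the node is no longer drained and can accept jobs.
    sinfo --nodes=NODE_NAME --long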