Troubleshoot GPU VMs

This page shows you how to resolve issues for VMs running on Compute Engine that have attached GPUs.

If you are trying to create a VM with attached GPUs and are getting errors, then review Troubleshooting resource availability errors and Troubleshooting creating and updating VMs.

Troubleshoot GPU VMs by using NVIDIA DCGM

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments.

If you want to use DCGM to troubleshoot issues in your GPU environment, then complete the following:

  • Ensure that you are using the latest recommended NVIDIA driver for the GPU model that is attached to your VM. To review driver versions, see Recommended NVIDIA driver versions.
  • Ensure that you installed the latest version of DCGM. To install the latest version, see DCGM installation.

Diagnose issues

When you run a dcgmi diagnostic command, the issues reported by the diagnostic tool include next steps for taking action on the issue. The following example shows the actionable output from the dcgmi diag -r memory -j command.

{
  ........
   "category":"Hardware",
   "tests":[
      {
         "name":"GPU Memory",
         "results":[
            {
               "gpu_id":"0",
               "info":"GPU 0 Allocated 23376170169 bytes (98.3%)",
               "status":"Fail",
               "warnings":[
                  {
                     "warning":"Pending page retirements together with a DBE were detected on GPU 0. Drain the GPU and reset it or reboot the node to resolve this issue.",
                     "error_id":83,
                     "error_category":10,
                     "error_severity":6
                  }
               ]
            }
  .........

From the preceding output snippet, you can see that GPU 0 has pending page retirements that are caused by a non-recoverable error. The output provides the unique error_id and advice on debugging the issue. For this example output, it is recommended that you drain the GPU and then either reset it or reboot the VM. In most cases, following the instructions in this section of the output can help to resolve the issue.
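To pull only the actionable messages out of a long JSON report, you can filter for the warning fields. The following is a minimal sketch and not part of DCGM: print_dcgm_warnings is a hypothetical helper name, and it assumes that each message appears as a "warning":"..." pair on a single line without escaped quotation marks, as in the preceding snippet.

```shell
# Hypothetical helper: print each "warning" message found in dcgmi JSON on stdin.
# Assumes the messages contain no escaped double quotes.
print_dcgm_warnings() {
  grep -o '"warning" *: *"[^"]*"' | sed -e 's/^"warning" *: *"//' -e 's/"$//'
}

# Example (requires DCGM on the VM):
#   dcgmi diag -r memory -j | print_dcgm_warnings
```

Because the filter matches text rather than parsing JSON, treat its output as a quick summary and review the full report before acting on it.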

Troubleshoot GPU performance issues for A3 VMs

The A3 machine series is available with NVIDIA H200 or H100 GPUs attached. This series includes A3 Ultra (H200), A3 Mega (H100), A3 High (H100), and A3 Edge (H100) machine types.

Identify a faulty node

Large-scale training or benchmark jobs on a multi-node GPU cluster can stop responding or perform poorly. This often occurs because one or more nodes underperform and slow down the entire operation. This section describes how to identify a faulty node or host machine by either running an NCCL benchmark test or analyzing NCCL logs.

Run NCCL benchmark test

To identify the group of nodes causing the failure, systematically test subsets of your cluster by using NCCL benchmarks like all_reduce_perf.

  1. To identify your nodesets, group your nodes into logical sets, for example, partitions in Slurm.
  2. To create hostfiles, create a separate hostfile for each nodeset, listing hostnames and the number of GPUs per node. The number of slots you specify depends on the GPU count of your A3 VM type. For example, a3-highgpu-8g VMs have 8 GPUs, so you must specify slots=8.
  3. To run benchmarks, execute the all_reduce_perf benchmark against each nodeset individually.
    mpirun -x LD_LIBRARY_PATH --hostfile HOSTFILE_NAME -n TOTAL_PROCESSES \
        ./build/all_reduce_perf -b 1G -e 8G -f 2 -g NUM_GPUS_PER_NODE
              

    Replace the following:

    • HOSTFILE_NAME: the name of the hostfile that contains the list of nodes and the number of GPUs per node for the nodeset.
    • TOTAL_PROCESSES: the total number of MPI processes to launch across all hosts in the nodeset.
    • NUM_GPUS_PER_NODE: the number of GPUs per node. This value depends on your A3 machine type. For example, for a3-highgpu-8g VMs, this value is 8.
  4. To analyze the results, compare the nodesets. If a job stops responding or shows significantly lower bus bandwidth (busbw) on a particular nodeset, then that nodeset likely contains a faulty node.
  5. To narrow down the fault, divide the hostfile of the faulty nodeset in half and re-test each half. Repeat this binary search until you pinpoint the individual misbehaving node.
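The hostfile and bisection steps can be scripted. The following is a minimal sketch that assumes Open MPI hostfile syntax (one "HOST slots=N" line per node); make_hostfile and split_hostfile are hypothetical helper names, and the .a and .b suffixes are arbitrary.

```shell
# Hypothetical helpers for building and bisecting hostfiles.

# make_hostfile SLOTS HOST... : print one "HOST slots=SLOTS" line per host.
make_hostfile() {
  mh_slots="$1"
  shift
  for mh_host in "$@"; do
    printf '%s slots=%s\n' "$mh_host" "$mh_slots"
  done
}

# split_hostfile FILE : write the first half of FILE to FILE.a and the rest to FILE.b.
split_hostfile() {
  sh_total=$(wc -l < "$1")
  sh_half=$(( (sh_total + 1) / 2 ))
  head -n "$sh_half" "$1" > "$1.a"
  tail -n +"$((sh_half + 1))" "$1" > "$1.b"
}
```

For example, `make_hostfile 8 node-1 node-2 node-3 node-4 > nodeset1` followed by `split_hostfile nodeset1` produces nodeset1.a and nodeset1.b, which you can pass to mpirun in turn as HOSTFILE_NAME.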

Analyze NCCL logs

If the benchmark method doesn't pinpoint a node, analyze detailed NCCL logs.

  1. To enable debug logging, set the following environment variables in the shell session where you plan to run your workload:
    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=INIT,NET,COLL
    export NCCL_DEBUG_FILE="LOG_DIRECTORY/nccl_log.%h.%p"

    Replace LOG_DIRECTORY with the directory where you want to store your logs.

    Setting NCCL_DEBUG_FILE with %h and %p creates unique, non-interleaved log files for each process.

    If you are running a multi-node workload using mpirun, you must propagate these variables to all nodes by using the -x flag. For example:

    mpirun -x NCCL_DEBUG -x NCCL_DEBUG_SUBSYS -x NCCL_DEBUG_FILE ...
              
  2. To find the first error, use the following command to find the earliest timeout or failure events across all log files:
    grep "NCCL WARN.*NET/FasTrak" LOG_DIRECTORY/* | sed 's/.*NET\/FasTrak\(.*\)/\1/g' \
      | sort | head -n 20
              

    Replace LOG_DIRECTORY with the directory where your logs are stored.

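  3. To identify straggler nodes, count the collective operations that each suspect rank completed. A straggler node completes fewer collective operations than its peers. To count the opCount entries in a rank's log file, run the following command: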
    grep "opCount" LOG_DIRECTORY/nccl_log.HOSTNAME.PID | wc -l
              

    Replace the following:

    • LOG_DIRECTORY: the directory where your logs are stored
    • HOSTNAME: the hostname of the node
    • PID: the process ID of the NCCL process
  4. To gather more logging data before a job aborts, temporarily increase the data transfer timeout:
    export NCCL_FASTRAK_DATA_TRANSFER_TIMEOUT_MS=3600000
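To compare opCount totals across every process instead of checking one log file at a time, you can loop over the log directory. The following is a minimal sketch: rank_by_opcount is a hypothetical helper name, and it assumes that the log files follow the nccl_log.HOSTNAME.PID naming that results from the NCCL_DEBUG_FILE setting shown earlier.

```shell
# Hypothetical helper: list per-process opCount line totals, lowest first.
# Usage: rank_by_opcount LOG_DIRECTORY
rank_by_opcount() {
  for rb_file in "$1"/nccl_log.*; do
    [ -f "$rb_file" ] || continue
    # grep -c prints the number of matching lines, like "grep ... | wc -l".
    printf '%s %s\n' "$(grep -c "opCount" "$rb_file")" "$rb_file"
  done | sort -n
}
```

The files at the top of the output belong to the processes that logged the fewest collective operations, which makes their hosts the first candidates to investigate.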
            

Monitor GPU thermal throttling

A3 series VMs can experience performance degradation if they consistently reach temperatures greater than 87 °C under load. To check for GPU thermal throttling across nodes in a cluster, use either nvidia-smi or dcgmi.

Using nvidia-smi

To check the current temperature and throttling status of all GPUs on a node, run the following command:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,clocks_throttle_reasons.hw_slowdown --format=csv
    

In the output, a value of Active in the clocks_throttle_reasons.hw_slowdown column indicates the GPU is being throttled due to high temperatures.
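To list only the throttled GPUs, you can filter the CSV output. The following is a minimal sketch: throttled_gpus is a hypothetical helper name, and it assumes the five-column order of the preceding query, a header row, and no commas inside any field.

```shell
# Hypothetical helper: read the CSV from the preceding nvidia-smi query on stdin
# and print "PCI_BUS_ID TEMPERATURE" for each GPU whose
# clocks_throttle_reasons.hw_slowdown column is exactly "Active".
throttled_gpus() {
  # NR > 1 skips the header row; fields are separated by a comma and spaces.
  awk -F', *' 'NR > 1 && $5 == "Active" { print $3, $4 }'
}

# Example (requires the NVIDIA driver on the VM):
#   nvidia-smi --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,clocks_throttle_reasons.hw_slowdown --format=csv | throttled_gpus
```

An empty result means that no GPU on the node reported an active hardware slowdown at the time of the query.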

Using dcgmi

The NVIDIA Data Center GPU Manager (DCGM) diagnostic suite includes checks for thermal violations. To run a level 1 diagnostic, run the following command:

dcgmi diag -r 1

A result of Warn or Fail in the Thermal section indicates a thermal violation occurred during the test. If a thermal violation is accompanied by clock throttling, the GPU is likely overheating and requires further investigation.

Open a support case

If you're unable to resolve the issues by using the guidance on this page, gather the following information and open a support case:

  • Project ID and a list of all instance names or IDs in the cluster.
  • List of suspect nodes identified through troubleshooting.
  • Complete, non-interleaved NCCL logs with debug settings enabled.
  • Output from hardware health checks (dcgmi, nvidia-smi).
  • Exact benchmark or workload command that is failing.
  • Relevant log files such as host engine and diagnostic logs. To gather these, run gather-dcgm-logs.sh, located in /usr/local/dcgm/scripts in default installations.
  • NVIDIA bug report. Run nvidia-bug-report.sh. For Blackwell GPUs, follow Generate NVIDIA Bug Report for Blackwell GPUs.
  • Details about any recent changes that were made to your environment preceding the failure.

Review Xid messages

After you create a VM that has attached GPUs, you must install NVIDIA device drivers on your GPU VMs so that your applications can access the GPUs. However, sometimes these drivers return error messages.

An Xid message is an error report from the NVIDIA driver that is printed to the operating system's kernel log or event log for your Linux VM. These messages are placed in the /var/log/messages file.

For more information about Xid messages including potential causes, see NVIDIA documentation.
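To see which Xid messages a VM has logged and how often, you can summarize the kernel log. The following is a minimal sketch: summarize_xids is a hypothetical helper name, and it assumes log lines of the form "Xid (PCI:...): NUMBER, ...", which matches the Xid 149 example later on this page.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "COUNT XID PCI_ADDRESS" for each distinct Xid number and device.
summarize_xids() {
  sed -n 's/.*Xid (\([^)]*\)): \([0-9][0-9]*\).*/\2 \1/p' | sort | uniq -c
}

# Example:
#   sudo grep "Xid" /var/log/messages | summarize_xids
```

Lines that don't match the assumed format are skipped, so check the raw log if the summary is empty but you expect errors.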

The following sections provide guidance on handling some Xid messages, grouped by the most common types: GPU memory errors, GPU System Processor (GSP) errors, and illegal memory access errors.

GPU memory errors

GPU memory is the memory on a GPU that is available for temporary storage of data. GPU memory is protected with Error Correction Code (ECC), which detects and corrects single-bit errors (SBE) and detects and reports double-bit errors (DBE).

Prior to the release of the NVIDIA A100 GPUs, dynamic page retirement was supported. For NVIDIA A100 and later GPU releases (such as NVIDIA H100), row remap error recovery is introduced. ECC is enabled by default. Google highly recommends keeping ECC enabled.

The following are common GPU memory errors and their suggested resolutions.

Xid error message Resolution
Xid 48: Double Bit ECC
  1. Stop your workloads.
  2. Delete and recreate the VM. If the error persists, then file a case with Cloud Customer Care.
Xid 63: ECC page retirement or row remapping recording event
  1. Stop your workloads.
  2. Reset the GPUs.
Xid 64: ECC page retirement or row remapper recording failure

And the message contains the following information:

Xid 64: All reserved rows for bank are remapped
  1. Stop your workloads.
  2. Delete and recreate the VM. If the error persists, then file a case with Cloud Customer Care.

If you get at least two of the following Xid messages together:

  • Xid 48
  • Xid 63
  • Xid 64

And the message contains the following information:

Xid XX: row remap pending
  1. Stop your workloads.
  2. Reset the GPUs. Resetting the GPU allows the row remap and page retirement process to complete and heal the GPU.
Xid 92: High single-bit ECC error rate
  This Xid message is returned after the GPU driver corrects a correctable error, and it shouldn't affect your workloads. This Xid message is informational only. No action is needed.
Xid 94: Contained ECC error
  1. Stop your workloads.
  2. Reset the GPUs.
Xid 95: Uncontained ECC error
  1. Stop your workloads.
  2. Reset the GPUs.

GSP errors

A GPU System Processor (GSP) is a microcontroller that runs on GPUs and handles some of the low level hardware management functions.

Xid error message Resolution
Xid 119: GSP RPC timeout
Xid 120: GSP error
  1. Stop your workloads.
  2. Delete and recreate the VM. If the error persists, then collect the NVIDIA bug report and file a case with Cloud Customer Care.

Illegal memory access errors

The following Xids are returned when applications have illegal memory access issues:

  • Xid 13: Graphics Engine Exception
  • Xid 31: GPU memory page fault

Illegal memory access errors are typically caused by your workloads trying to access memory that is already freed or is out of bounds. This can be caused by issues such as dereferencing an invalid pointer or indexing an array out of bounds.

To resolve this issue, you need to debug your application. To debug your application, you can use cuda-memcheck and CUDA-GDB.

In some very rare cases, hardware degradation might cause illegal memory access errors to be returned. To identify whether the issue is with your hardware, use NVIDIA Data Center GPU Manager (DCGM). You can run dcgmi diag -r 3 or dcgmi diag -r 4 to run different levels of test coverage and duration. If you identify that the issue is with the hardware, then file a case with Cloud Customer Care.

Other common Xid error messages

Xid error message Resolution
Xid 74: NVLINK error
  1. Stop your workloads.
  2. Reset the GPUs.
Xid 79: GPU has fallen off the bus

This means the driver is not able to communicate with the GPU.

Reboot the VM.
Xid 149 that mentions 0x02a, such as the following example:
Xid (PCI:0000:c0:00): 149,NETIR_LINK_EVT Fatal XC0 i0 Link 04 (0x02a485c6 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000)

This indicates a known issue affecting firmware for NVIDIA B200 GPUs.

  1. Stop your workloads.
  2. Reset the GPUs.
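For scripted triage, the unconditional rows of the preceding tables can be encoded as a lookup. The following is a minimal sketch: xid_action is a hypothetical helper name, the action strings are paraphrases, and Xid 120 is assumed to share the resolution listed for Xid 119. Xid 64 and Xid 149 depend on the message text, so they fall through to the default branch.

```shell
# Hypothetical lookup that summarizes the suggested first action for an Xid number.
# Usage: xid_action XID_NUMBER
xid_action() {
  case "$1" in
    13|31)       echo "debug the application" ;;
    48|119|120)  echo "stop workloads; delete and recreate the VM" ;;  # 120: assumed same as 119
    63|74|94|95) echo "stop workloads; reset the GPUs" ;;
    79)          echo "reboot the VM" ;;
    92)          echo "informational; no action needed" ;;
    *)           echo "check the message text against the tables" ;;   # includes Xid 64 and Xid 149
  esac
}
```

For example, `xid_action 79` prints "reboot the VM". Treat the lookup as a convenience for first response; the tables remain the reference for follow-up steps such as filing a case.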

Reset GPUs

Some issues might require you to reset your GPUs. To reset GPUs, complete the following steps:

  • For N1, G2, and A2 VMs, reboot the VM.
  • For G4 VMs that have more than one GPU attached, delete and recreate the VM.
  • For A3 and A4 VMs, run sudo nvidia-smi --gpu-reset.
    • For most Linux VMs, the nvidia-smi executable is located in the /var/lib/nvidia/bin directory.
    • For GKE nodes, the nvidia-smi executable is located in the /home/kubernetes/bin/nvidia directory.
    • If you are using GKE nodes, then you can use the gpu-reset-tool to automate the reset of all GPUs on a node. This tool requires only that you specify the target node name.

Alternatively, GPUs are also reset whenever you reset a VM or restart a VM.

If errors persist after resetting the GPU, then you need to delete and recreate the VM.

If the error persists after a delete and recreate, then file a case with Cloud Customer Care to move the VM into the repair stage.
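Because the nvidia-smi location differs between environments, a reset script can look for the executable before calling it. The following is a minimal sketch: find_nvidia_smi is a hypothetical helper name. With no arguments, it checks the two directories listed in this section and then falls back to PATH.

```shell
# Hypothetical helper: print the path of the first nvidia-smi executable found.
# Usage: find_nvidia_smi [DIRECTORY...]
find_nvidia_smi() {
  if [ "$#" -eq 0 ]; then
    # Default search locations taken from this section.
    set -- /var/lib/nvidia/bin /home/kubernetes/bin/nvidia
  fi
  for fn_dir in "$@"; do
    if [ -x "$fn_dir/nvidia-smi" ]; then
      printf '%s\n' "$fn_dir/nvidia-smi"
      return 0
    fi
  done
  # Fall back to PATH; returns non-zero if nvidia-smi is not found.
  command -v nvidia-smi
}

# Example (A3 and A4 VMs):
#   sudo "$(find_nvidia_smi)" --gpu-reset
```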

What's next

Review GPU machine types.