Troubleshoot GPU VMs

This guide describes how to diagnose and resolve common issues with Compute Engine VMs that have attached GPUs, including hardware errors and performance bottlenecks.

Troubleshoot GPU VMs by using NVIDIA DCGM

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments.

To use DCGM to troubleshoot issues in your GPU environment, complete the following:

Ensure that you're using the latest recommended NVIDIA driver for the GPU model that's attached to your VM. To review driver versions, see Recommended NVIDIA driver versions.
Ensure that you installed the latest version of DCGM. To install the latest version, see DCGM installation.

Diagnose issues

When you run a dcgmi diagnostic command, the issues reported by the diagnostic tool include next steps for taking action on the issue. The following example shows the actionable output from the dcgmi diag -r memory -j command.

{
  ........
   "category":"Hardware",
   "tests":[
      {
         "name":"GPU Memory",
         "results":[
            {
               "gpu_id":"0",
               "info":"GPU 0 Allocated 23376170169 bytes (98.3%)",
               "status":"Fail",
               "warnings":[
                  {
                     "warning":"Pending page retirements together with a DBE were detected on GPU 0. Drain the GPU and reset it or reboot the node to resolve this issue.",
                     "error_id":83,
                     "error_category":10,
                     "error_severity":6
                  }
               ]
            }
  .........

The preceding output snippet shows that GPU 0 has pending page retirements that are caused by a non-recoverable double-bit error (DBE). The output provides the unique error_id and advice on debugging the issue. For this example output, we recommend that you drain the GPU and then reset it or reboot the VM. In most cases, following the instructions in this section of the output can help to resolve the issue.
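When the JSON output is long, it can help to pull out only the actionable warning text. The following is a minimal sketch, not part of DCGM: `extract_dcgm_warnings` is a hypothetical helper name, and it assumes that each "warning" string sits on one line and contains no escaped double quotes. If that assumption doesn't hold for your output, use a JSON parser instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DCGM): print each "warning" message found
# in dcgmi JSON output that is read from stdin.
# Assumption: every warning string is on a single line and contains no
# escaped double quotes.
extract_dcgm_warnings() {
  grep -oE '"warning"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -E 's/^"warning"[[:space:]]*:[[:space:]]*"(.*)"$/\1/'
}

# Example usage on a GPU VM (not executed here):
#   dcgmi diag -r memory -j | extract_dcgm_warnings
```

The pattern requires a closing quote directly after `warning`, so the enclosing "warnings" key is not matched.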

Troubleshoot GPU performance issues for A3 VMs

The A3 machine series is available with NVIDIA H200 or H100 GPUs attached. This series includes A3 Ultra (H200), A3 Mega (H100), A3 High (H100), and A3 Edge (H100) machine types.

Identify a faulty node

Large-scale training or benchmark jobs on a multi-node GPU cluster can stop responding or perform poorly. This often occurs because one or more nodes underperform and slow down the entire operation. This section describes how to identify a faulty node or host machine by either running an NCCL benchmark test or analyzing NCCL logs.

Run NCCL benchmark test

To identify the group of nodes causing the failure, systematically test subsets of your cluster by using NCCL benchmarks like all_reduce_perf.

To identify your nodesets, group your nodes into logical sets, for example, partitions in Slurm.
To create hostfiles, create a separate hostfile for each nodeset, listing hostnames and the number of GPUs per node. The number of slots you specify depends on the GPU count of your A3 VM type. For example, a3-highgpu-8g VMs have 8 GPUs, so you must specify slots=8.
To run benchmarks, execute the all_reduce_perf benchmark against each nodeset individually.
```
mpirun -x LD_LIBRARY_PATH --hostfile HOSTFILE_NAME -n TOTAL_PROCESSES \
    ./build/all_reduce_perf -b 1G -e 8G -f 2 -g NUM_GPUS_PER_NODE
```
Replace the following:
- HOSTFILE_NAME: the name of the hostfile that contains the list of nodes and the number of GPUs per node for the nodeset.
- TOTAL_PROCESSES: the total number of MPI processes to launch across all hosts in the nodeset.
- NUM_GPUS_PER_NODE: the number of GPUs per node. For all A3 machine types, this value is 8.
To analyze results, compare the runs across nodesets: if a job hangs or shows significantly lower bus bandwidth (busbw) on a particular nodeset, that nodeset likely contains a faulty node.
To sub-divide a faulty nodeset, divide its hostfile in half and re-test each half. Repeat this binary search until you pinpoint the individual misbehaving node.
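The hostfile and bisection steps above can be sketched as two small shell functions. This is an illustration under stated assumptions: `make_hostfile` and `split_hostfile` are hypothetical helper names, the hostfile uses the `HOST slots=N` form, and the `.a` and `.b` suffixes for the two halves are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the bisection workflow; names and file suffixes
# are illustrative only.

# Write a hostfile with one "HOST slots=N" line per host.
# Usage: make_hostfile OUTPUT_FILE SLOTS HOST...
make_hostfile() {
  local out=$1 slots=$2 host
  shift 2
  : > "$out"
  for host in "$@"; do
    printf '%s slots=%s\n' "$host" "$slots" >> "$out"
  done
}

# Split a hostfile into two halves for the next bisection round.
# Writes HOSTFILE.a (first half, rounded up) and HOSTFILE.b (the rest).
# Usage: split_hostfile HOSTFILE
split_hostfile() {
  local src=$1 total half
  total=$(wc -l < "$src")
  half=$(( (total + 1) / 2 ))
  head -n "$half" "$src" > "$src.a"
  tail -n +"$(( half + 1 ))" "$src" > "$src.b"
}
```

After a nodeset fails, run `split_hostfile` on its hostfile, benchmark each half, and repeat on whichever half still hangs or shows low busbw.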

Analyze NCCL logs

If the benchmark method doesn't pinpoint a node, analyze detailed NCCL logs.

To enable debug logging, set the following environment variables in the shell session where you plan to run your workload:
```
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,COLL
export NCCL_DEBUG_FILE="LOG_DIRECTORY/nccl_log.%h.%p"
```
Replace LOG_DIRECTORY with the directory where you want to store your logs.
Setting NCCL_DEBUG_FILE with %h and %p creates unique, non-interleaved log files for each process.
If you run a multi-node workload using mpirun, propagate these variables to all nodes by using the -x flag. For example:
```
mpirun -x NCCL_DEBUG -x NCCL_DEBUG_SUBSYS -x NCCL_DEBUG_FILE ...
```
To find the first error, use the following command to find the earliest timeout or failure events across all log files:
```
grep "NCCL WARN.*NET/FasTrak" LOG_DIRECTORY/* | sed 's/.*NET\/FasTrak\(.*\)/\1/g' \
  | sort | head -n 20
```
Replace LOG_DIRECTORY with the directory where your logs are stored.
To count collective operations, count the "opCount" entries in the logs for suspect ranks. A straggler node completes fewer collective operations than its peers:
```
grep "opCount" LOG_DIRECTORY/nccl_log.HOSTNAME.PID | wc -l
```
Replace the following:
- LOG_DIRECTORY: the directory where your logs are stored
- HOSTNAME: the hostname of the node
- PID: the process ID of the NCCL process
To gather more logging data before a job aborts, temporarily increase the data transfer timeout:
```
export NCCL_FASTRAK_DATA_TRANSFER_TIMEOUT_MS=3600000
```
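Counting "opCount" entries one file at a time gets tedious on a large cluster. The following sketch ranks every log file in a directory by that count so the lowest counts surface first. `rank_by_opcount` is a hypothetical helper name, and the sketch assumes the `nccl_log.%h.%p` file naming that is set through NCCL_DEBUG_FILE earlier in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list NCCL log files by ascending number of lines that
# contain "opCount". Files near the top completed the fewest collectives and
# are the first straggler candidates. Output format: "COUNT FILE".
# Assumption: log files are named nccl_log.HOSTNAME.PID inside LOG_DIRECTORY.
# Usage: rank_by_opcount LOG_DIRECTORY
rank_by_opcount() {
  local dir=$1 f
  for f in "$dir"/nccl_log.*; do
    [ -f "$f" ] || continue
    printf '%s %s\n' "$(grep -c 'opCount' "$f")" "$f"
  done | sort -n
}
```

For example, `rank_by_opcount LOG_DIRECTORY | head -n 5` lists the five processes with the fewest logged collectives.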

Monitor GPU thermal throttling

A3 series VMs can experience performance degradation if they consistently reach temperatures greater than 87 °C under load. To check for GPU thermal throttling across nodes in a cluster, use either nvidia-smi or dcgmi.

Using nvidia-smi

To check the current temperature and throttling status of all GPUs on a node, run the following command:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,clocks_throttle_reasons.hw_slowdown --format=csv

In the output, a value of Active in the clocks_throttle_reasons.hw_slowdown column indicates the GPU is being throttled due to high temperatures.
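To scan the CSV output for problem GPUs without reading every row, you can filter it. The following is a minimal sketch: `flag_hot_gpus` is a hypothetical helper name, and it assumes the five-column query shown above, a header on the first line, and fields separated by a comma followed by a space.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the CSV from the nvidia-smi query above on stdin
# and print "BUS_ID TEMPERATURE STATUS" for each GPU that reports an active
# hardware slowdown or a temperature above the limit (default: 87).
# Assumptions: five columns in the order used by the query, first line is the
# header, and fields are separated by a comma plus optional spaces.
# Usage: flag_hot_gpus [LIMIT]
flag_hot_gpus() {
  local limit=${1:-87}
  awk -F', *' -v limit="$limit" '
    NR == 1 { next }   # skip the CSV header
    $5 == "Active" || ($4 + 0) > limit { print $3, $4, $5 }
  '
}

# Example usage on a GPU VM (not executed here):
#   nvidia-smi --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,clocks_throttle_reasons.hw_slowdown --format=csv | flag_hot_gpus
```

An empty result means that no GPU on the node matched either condition at the time of the query.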

Using dcgmi

The NVIDIA Data Center GPU Manager (DCGM) diagnostic suite includes checks for thermal violations. To run a level 1 diagnostic, run the following command:

dcgmi diag -r 1

A result of Warn or Fail in the Thermal section indicates a thermal violation occurred during the test. If a thermal violation is accompanied by clock throttling, the GPU is likely overheating and requires further investigation.

Xid errors

After you create a VM that has attached GPUs, you must install NVIDIA device drivers on your GPU VMs so that your applications can access the GPUs. However, sometimes these drivers return error messages.

An Xid message is an error report from the NVIDIA driver that is printed to the operating system's kernel log or event log for your Linux VM. You can read the kernel log by running dmesg. Depending on the Linux distribution, these messages are also written to a file such as /var/log/messages or /var/log/kern.log. For more information about Xid messages including potential causes, see NVIDIA documentation.

How Google handles Xid errors

Google uses passive health checks to evaluate GPU systems. If hardware replacement is indicated, Google initiates emergency maintenance automatically. Google detects Xid errors and proactively sends machines to repair where error codes indicate a high probability of hardware failure, such as Xid 74, 79, and 140. For some Xid codes, because they can be caused by either software or hardware problems, Google uses pattern matching to trigger repairs, so not every occurrence results in an automatic repair.

Types of Xid errors

The following list describes the three main categories of Xid errors and their recommended recovery actions:

Application errors: these indicate issues inside your application code. Application errors include Xids such as Xid 13, 31, 94, 95, and 137, which indicate various types of memory access violation, similar to a segmentation fault. These don't indicate an ECC error. To troubleshoot these errors, NVIDIA recommends using either of the following debugging approaches:
- Direct debugging: run the application directly in cuda-gdb or run the Compute Sanitizer memcheck tool.
- Post-exception debugging: run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1. When an exception occurs, the GPU driver freezes the application state without exiting so you can attach a debugger later (cuda-gdb -p <PID>) to inspect the live stack trace.
Driver errors: these indicate issues caused by the NVIDIA GPU driver. To resolve these errors, ensure that you're using the latest NVIDIA driver version. Google monitors these errors and collaborates with NVIDIA on driver fixes.
Firmware or hardware recoverable errors: these indicate firmware or hardware errors that allow for recovery without hardware replacement. To resolve these errors, apply manual recovery measures such as resetting the GPU or rebooting the instance. Firmware or hardware recoverable errors include Error Correcting Code (ECC) errors (applicable to Xids such as Xid 48, 63, and 64) that indicate various stages of detecting and mitigating ECC errors. For more information about page retirement and ECC error mitigation, see NVIDIA's Dynamic Page Retirement FAQ.

Note: When you encounter an uncorrectable ECC error, your workload terminates and the volatile error count increments. Our recommendation in this case is to reset the GPU or reboot the instance and not to report the host as faulty.

Review Xid messages

To quickly diagnose why a GPU workload failed, stopped responding, or experienced performance degradation, check your instance's kernel logs (dmesg or /var/log/kern.log) for numeric NVIDIA Xid error codes.

Reviewing the Xid error tables in the following subsections helps you immediately:

Pinpoint the root cause: identify whether the failure is caused by an application bug (such as illegal memory access), a driver conflict, or a physical hardware fault (such as double-bit ECC memory errors).
Determine operational ownership: check what immediate manual recovery measures you must apply, such as resetting GPUs, rebooting VMs, or running debuggers, versus what automated repair and hardware replacement actions Google is actively managing on the host.
Take the correct recovery steps: avoid unnecessary troubleshooting procedures and know precisely when manual recovery is sufficient versus when you need to report the host as faulty. Sometimes, manual recovery isn't sufficient, for example, if the error source is in the GPU cache (SRAM), which can't be remapped, indicated by Xid 48 with SRAM Threshold Exceeded=Yes, or if the GPU has exhausted its remap bank, indicated by Xid 64: All reserved rows for bank are remapped. In these cases, Google detects that the GPU is eligible for hardware replacement and proactively sends the machine to repair. If your workloads encounter recurring errors or if you observe repeated memory faults, you can report the faulty host to initiate automated repair or replacement. For GKE, see How to report faulty hosts in GKE.
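To see at a glance which Xid codes a VM has logged, you can summarize the kernel log. The following is a sketch under stated assumptions: `count_xids` is a hypothetical helper name, and it assumes that Xid lines follow the `Xid (PCI:...): CODE,` layout. Lines in any other layout are ignored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log lines from stdin and print
# "COUNT XID" for each distinct Xid code, most frequent first.
# Assumption: Xid lines look like "... Xid (PCI:0000:c0:00): 149,...".
count_xids() {
  sed -nE 's/.*Xid \(PCI:[^)]*\): ([0-9]+),.*/\1/p' \
    | sort -n | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example usage on a VM (not executed here):
#   sudo dmesg | count_xids
```

Look up each reported code in the tables that follow to decide between manual recovery and reporting the host.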

Xid handling

The following sections group common Xid error messages by technical category alongside their authoritative resolutions and responsibilities:

GPU memory errors (Xids 48, 63, 64, 92, 94, 95)
GPU System Processor (GSP) errors (Xids 119, 120)
Illegal memory access errors (Xids 13, 31, 137)
Other common Xid error messages (Xids 74, 79, 109, 149)
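If you triage Xid codes in a script, the grouping above can be encoded as a lookup. This sketch mirrors only the codes listed on this page. `xid_category` and its output labels are hypothetical names, and any code that isn't listed falls through to `not-listed`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an Xid code to the section of this page that
# covers it. Only codes listed on this page are recognized; the output
# labels are illustrative.
# Usage: xid_category XID
xid_category() {
  case "$1" in
    48|63|64|92|94|95) echo "gpu-memory" ;;
    119|120)           echo "gsp" ;;
    13|31|137)         echo "illegal-memory-access" ;;
    74|79|109|149)     echo "other-common" ;;
    *)                 echo "not-listed" ;;
  esac
}
```

For example, `xid_category 119` prints `gsp`, which points you to the GSP errors table.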

GPU memory errors

GPU memory is the memory that is available on a GPU that can be used for temporary storage of data. GPU memory is protected with Error Correction Code (ECC), which detects and corrects single-bit errors (SBE) and detects and reports double-bit uncorrectable errors (DBE).

These memory errors are expected to occur over the lifespan of a GPU. GPUs released before the NVIDIA A100 support dynamic page retirement. For NVIDIA A100 and later GPU releases (such as NVIDIA H100), row remap error recovery was introduced for HBM (DRAM) errors. ECC is enabled by default, and Google highly recommends keeping ECC enabled.

The following table lists common GPU memory errors and their suggested resolutions:

Xid error message	Customer action	Google action
`Xid 48: Double Bit ECC` A double-bit (uncorrectable) memory error was detected by ECC. This error always interrupts the running workload and generates Xid 48.	Stop your workloads. Depending on your environment, reset the GPUs or reboot the VM to recover and resume workloads: For Compute Engine VMs: Reset the GPUs or reboot the VM. For more information about VM actions and lifecycle states, see Compute Engine instance lifecycle. For GKE nodes: Apply `kubectl label nodes NODE_NAME cloud.google.com/perform-reboot=true` to the impacted node to trigger a guest OS reboot.	Google monitors when the GPU is eligible for hardware replacement, such as if the HBM remap bank is exhausted or the GPU exceeds its lifetime SRAM error threshold, and proactively sends the machine to repair to replace the GPU.
`Xid 63: ECC page retirement or row remapping recording event` Indicates a dynamic page retirement or row remapping event was recorded due to a memory error.	Stop your workloads. Depending on your environment, reset the GPUs or reboot the VM to recover and resume workloads: For Compute Engine VMs: Reset the GPUs or reboot the VM. For more information about VM actions and lifecycle states, see Compute Engine instance lifecycle. For GKE nodes: Apply `kubectl label nodes NODE_NAME cloud.google.com/perform-reboot=true` to the impacted node to trigger a guest OS reboot.	Google monitors error thresholds and sends the machine to repair when the GPU requires physical repair or replacement.
`Xid 64: ECC page retirement or row remapper recording failure` And the message contains the following information: `Xid 64: All reserved rows for bank are remapped`	Stop your workloads. Depending on your environment, reset the GPUs or reboot the VM to recover and resume workloads: For Compute Engine VMs: Reset the GPUs or reboot the VM. For more information about VM actions and lifecycle states, see Compute Engine instance lifecycle. For GKE nodes: Apply `kubectl label nodes NODE_NAME cloud.google.com/perform-reboot=true` to the impacted node to trigger a guest OS reboot.	When the remap bank is exhausted (`All reserved rows for bank are remapped`), Google detects that the GPU is eligible for hardware replacement and proactively sends the machine to repair.
If you get at least two of the following Xid messages together: `Xid 48` `Xid 63` `Xid 64` And the message contains the following information: `Xid XX: row remap pending`	Stop your workloads. Depending on your environment, reset the GPUs or reboot the VM to recover and resume workloads: For Compute Engine VMs: Reset the GPUs or reboot the VM. For more information about VM actions and lifecycle states, see Compute Engine instance lifecycle. For GKE nodes: Apply `kubectl label nodes NODE_NAME cloud.google.com/perform-reboot=true` to the impacted node to trigger a guest OS reboot.	Google sends the machine to repair if the remap bank is exhausted or when the GPU requires physical repair or replacement.
`Xid 92: High single-bit ECC error rate`	This Xid message is returned after the GPU driver corrects a correctable error, and it shouldn't affect your workloads. This Xid message is informational only. No action is needed.	None
`Xid 94: Contained error` Indicates that a GPU error occurred and whether the error was contained within a single application. Alone, Xid 94 doesn't indicate the root cause of the error; it must be interpreted alongside other co-occurring Xid errors in order to determine the fundamental cause.	Since the error was contained within a single application, restart the application to recover. If necessary, reset the GPUs or stop your workloads. Investigate other co-occurring Xid errors for further recovery steps and root-cause determination.	None
`Xid 95: Uncontained error` Indicates that a GPU error occurred and was not contained to a single application. Alone, Xid 95 doesn't indicate the root cause of the error; it must be interpreted alongside other co-occurring Xid errors in order to determine the fundamental cause.	Since the error was not contained, stop your workloads and reset the GPUs or reboot the VM to recover. Investigate other co-occurring Xid errors to determine the underlying root cause and further recovery steps.	None

GSP errors

A GPU System Processor (GSP) is a microcontroller that runs on GPUs and handles some of the low level hardware management functions.

Xid error message	Customer action	Google action
`Xid 119: GSP RPC timeout` `Xid 120: GSP error`	Stop your workloads. Check Recommended NVIDIA driver branches to ensure you are using a supported branch and a recent or latest driver version, as driver bugs in earlier versions are a major cause of GSP errors. If the error persists after checking or updating your driver, delete and recreate the VM. If the error continues, collect the NVIDIA bug report and file a case with Cloud Customer Care.	None. If the error persists and you file a support case, Google investigates hardware or driver state through the support workflow.

Illegal memory access errors

The following Xids are returned when applications have illegal memory access faults:

Xid error message	Customer action	Google action
`Xid 13: Graphics Engine Exception` `Xid 31: GPU memory page fault` `Xid 137: Memory access fault` A memory access violation was detected, analogous to a segmentation fault. These errors typically indicate an application bug where GPU memory is accessed out of bounds, or on freed buffers such as dereferencing an invalid pointer or an out-of-bounds array. These don't represent ECC errors unless Xid 48 is also present.	To resolve this issue, debug the memory access faults in your application. You can use cuda-gdb, Compute Sanitizer, or cuda-memcheck. For more details, see the NVIDIA Xid documentation.	None. In rare cases where hardware degradation might cause falsely reported illegal memory access errors, you can use NVIDIA Data Center GPU Manager (DCGM) to run `dcgmi diag -r 3` or `dcgmi diag -r 4` for different levels of test coverage and duration. If you identify a hardware issue, file a case with Customer Care.

Other common Xid error messages

Xid error message	Customer action	Google action
`Xid 74: NVLINK error`	Stop your workloads. Reset the GPUs.	None
`Xid 79: GPU has fallen off the bus` This means the driver isn't able to communicate with the GPU because a hardware issue caused the GPU to disappear from the PCI bus.	To recover your workloads, use either of the following approaches, depending on whether emergency maintenance is enabled for your project: Request emergency maintenance: if emergency maintenance is rolled out to your project, trigger the maintenance event at your convenience. Wait for automated maintenance: otherwise, wait for an unplanned maintenance event to occur on the instance.	Google detects that the GPU has fallen off the PCI bus and sends the machine to repair.
`Xid 109: Context switch timeout` Xid 109 is a generic error reported by the NVIDIA GPU driver, generated when a GPU instance fails to preempt or switch tasks within the expected timeout period. Google has a long history of investigating Xid 109 with NVIDIA, and known causes from driver bugs are fixed in the latest drivers. Xid 109 is not caused by a hardware issue.	Stop your workloads. Depending on your environment, reset the GPUs or reboot the VM to recover and resume workloads: For Compute Engine VMs: Reset the GPUs or reboot the VM. For more information about VM actions and lifecycle states, see Compute Engine instance lifecycle. For GKE nodes: Apply `kubectl label nodes NODE_NAME cloud.google.com/perform-reboot=true` to the impacted node to trigger a guest OS reboot. Consider upgrading to a newer NVIDIA driver version for your environment such as installing the latest driver on your Compute Engine VM or upgrading your GKE node pool/driver DaemonSet.	None
`Xid 149` that mentions `0x02a`, such as the following example: `Xid (PCI:0000:c0:00): 149,NETIR_LINK_EVT Fatal XC0 i0 Link 04 (0x02a485c6 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000)` This indicates a known issue affecting firmware for NVIDIA B200 GPUs.	Stop your workloads. Reset the GPUs.	None

Reset GPUs

Some issues might require you to reset your GPUs. To reset GPUs, complete the following steps:

For N1, G2, A2, and G4 VMs with one or more GPUs attached, reboot the VM.
For G4 VMs with fractional GPUs (less than one GPU attached), complete the following steps:
1. Delete the VM.
2. Recreate the VM.
For A3, A4, A4X, and A4X Max instances, run sudo nvidia-smi --gpu-reset.
- For most Linux VMs, the nvidia-smi executable is located in the /var/lib/nvidia/bin directory.
- For GKE nodes, the nvidia-smi executable is located in the /home/kubernetes/bin/nvidia directory.
For A3, A4, A4X, and A4X Max instances on GKE nodes, you can also use the gpu-reset-tool to automate the reset of all GPUs on a node. This tool requires only that you specify the target node name.

Alternatively, GPUs are also reset whenever you reset a VM or stop and restart a VM. For more information about VM lifecycle states and the differences between VM recovery actions, see Compute Engine instance lifecycle and Suspend, stop, or reset Compute Engine instances.
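Because the nvidia-smi location differs between Linux VMs and GKE nodes, a reset script can probe the known directories before it runs the reset. The following is a sketch only: `find_nvidia_smi` is a hypothetical helper name, and the directories are passed as arguments so that you can adapt them to your image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the first nvidia-smi executable
# found on PATH or in the directories given as arguments. Returns non-zero
# if none is found.
# Usage: find_nvidia_smi DIRECTORY...
find_nvidia_smi() {
  local dir
  if command -v nvidia-smi >/dev/null 2>&1; then
    command -v nvidia-smi
    return 0
  fi
  for dir in "$@"; do
    if [ -x "$dir/nvidia-smi" ]; then
      printf '%s\n' "$dir/nvidia-smi"
      return 0
    fi
  done
  return 1
}

# Example usage on an A3, A4, A4X, or A4X Max instance (not executed here):
#   smi=$(find_nvidia_smi /var/lib/nvidia/bin /home/kubernetes/bin/nvidia) \
#     && sudo "$smi" --gpu-reset
```

Checking the return status before the reset avoids running sudo with an empty command when the driver isn't installed.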

Open a support case

If you're unable to resolve the issues by using the guidance on this page, gather the following information and open a support case:

Project ID of the project where the affected instances are located.
List of all instance names or IDs in the cluster.
List of suspect nodes identified through troubleshooting.
Complete, non-interleaved NCCL logs with debug settings enabled.
Output from hardware health checks (dcgmi, nvidia-smi).
Exact benchmark or workload command that is failing.
Relevant log files such as host engine and diagnostic logs. To gather these, run gather-dcgm-logs.sh, located in /usr/local/dcgm/scripts in default installations.
NVIDIA bug report. Run nvidia-bug-report.sh. For Blackwell GPUs, follow Generate NVIDIA Bug Report for Blackwell GPUs.
Details about any recent changes that were made to your environment preceding the failure.

What's next

Review GPU machine types.
