Known issues

This page describes known issues that you might run into while running your AI-optimized VMs or clusters. For issues with Compute Engine VMs, see Compute Engine known issues.

Issues

The following sections list known issues and workarounds for AI Hypercomputer.

Workload interruptions on A4 VMs due to firmware issues for NVIDIA B200 GPUs

NVIDIA has identified two firmware issues for B200 GPUs, which are used by A4 VMs, that cause workload interruptions. If you notice workload interruptions on A4 VMs, then check whether either of the following is true:

The VM's uptime (lastStartTimestamp field) exceeds 65 days.
Logs show an Xid 149 message that mentions 0x02a.
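
The two checks above can be sketched in Python. This is a minimal illustration, not an official tool: the helper names are hypothetical, the `lastStartTimestamp` value is assumed to be an RFC 3339 string as returned for the instance resource, and the regular expression assumes the common `Xid (PCI:...): 149, ...` kernel log layout, which might differ on your system.

```python
import re
from datetime import datetime, timezone
from typing import Iterable, Optional

UPTIME_LIMIT_DAYS = 65  # uptime threshold named in this known issue

# Assumed log layout: "... Xid (PCI:0000:1b:00): 149, ..."; adjust if yours differs.
XID_149 = re.compile(r"Xid\s*\([^)]*\):\s*149\b")


def uptime_days(last_start_timestamp: str, now: Optional[datetime] = None) -> float:
    """Return the days elapsed since the VM's lastStartTimestamp."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    started = datetime.fromisoformat(last_start_timestamp.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - started).total_seconds() / 86400


def has_xid_149_0x02a(log_lines: Iterable[str]) -> bool:
    """Return True if any line is an Xid 149 message that mentions 0x02a."""
    return any(XID_149.search(line) and "0x02a" in line.lower() for line in log_lines)


def matches_known_issue(last_start_timestamp: str, log_lines: Iterable[str],
                        now: Optional[datetime] = None) -> bool:
    """Return True if either condition from the list above holds."""
    too_long = uptime_days(last_start_timestamp, now) > UPTIME_LIMIT_DAYS
    return too_long or has_xid_149_0x02a(log_lines)
```

You would pass the `lastStartTimestamp` value of the affected VM and the lines of its kernel log. A `True` result only means that the symptoms match this issue; it does not confirm the root cause.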

To mitigate this issue, we recommend resetting your GPUs. To help prevent the issue, we recommend resetting the GPUs on A4 VMs at least once every 60 days.

Note: If you are running in GKE, you can use the gpu-reset-tool to reset the GPUs. This tool automates the reset process and requires only the target node name.

The metadata server might display old `physicalHost` VM metadata

After an instance experiences a host error, or after you use the report faulty host API to move the instance to a new host, the metadata server might return the physicalHost metadata of the instance's previous host.

To work around this issue, do one of the following:

Use the instances.get method or the gcloud compute instances describe command to retrieve the correct physicalHost information.
Stop and then start your instance. This process updates the physicalHost information in the metadata server.
Wait 24 hours for the impacted instance's physicalHost information to be updated.
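
As a rough sketch of the first workaround, the following Python helpers build a `gcloud compute instances describe` command and read the host value from its JSON output. The function names are hypothetical, and the location of the field under `resourceStatus.physicalHost` is an assumption about the instance resource layout; verify it against the output you actually receive.

```python
import json
from typing import List, Optional


def describe_command(instance: str, zone: str) -> List[str]:
    """Build the gcloud command that prints the instance resource as JSON."""
    return [
        "gcloud", "compute", "instances", "describe", instance,
        "--zone", zone,
        "--format=json",
    ]


def physical_host(describe_json: str) -> Optional[str]:
    """Extract physicalHost from the describe output, or None if it is absent.

    Assumes the value is reported under resourceStatus.physicalHost.
    """
    resource = json.loads(describe_json)
    return resource.get("resourceStatus", {}).get("physicalHost")
```

To use the helpers, run the command with `subprocess.run(describe_command(name, zone), capture_output=True, text=True, check=True)` and pass the resulting `stdout` to `physical_host`. You can then compare the result with the value that the metadata server reports.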
