This page describes known issues that you might run into while running your AI-optimized VMs or clusters. For issues with Compute Engine VMs, see Compute Engine known issues.
Issues
The following sections list known issues and workarounds for AI Hypercomputer.
Workload interruptions on A4 VMs due to firmware issues for NVIDIA B200 GPUs
NVIDIA has identified two firmware issues for B200 GPUs, which are used by A4 VMs, that are causing workload interruptions. If you notice workload interruptions on A4 VMs, then check whether either of the following is true:
- The VM's uptime (lastStartTimestamp field) exceeds 65 days.
- Logs show an Xid 149 message that mentions 0x02a.
To mitigate this issue, we recommend resetting your GPUs. To help prevent the issue, we recommend resetting the GPUs on A4 VMs at least once every 60 days.
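The two checks above can be sketched in Python. This is a minimal illustration, not an official diagnostic: it assumes that you have already retrieved the VM's lastStartTimestamp value as an RFC 3339 string and that you have the relevant log lines as text. The function names and the exact layout of the Xid log line are assumptions made for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# Threshold taken from the issue description above.
UPTIME_LIMIT = timedelta(days=65)

# Assumed log shapes: "Xid 149 ... 0x02a" or "Xid (PCI:...): 149, ... 0x02a".
XID_149_PATTERN = re.compile(
    r"Xid\s*(?:\([^)]*\)\s*:\s*)?149\b.*0x02a\b", re.IGNORECASE
)


def uptime_exceeds_limit(last_start_timestamp, now=None, limit=UPTIME_LIMIT):
    """Return True if the time since lastStartTimestamp is longer than limit."""
    # Accept a trailing "Z" on Python versions where fromisoformat rejects it.
    text = last_start_timestamp.replace("Z", "+00:00")
    started = datetime.fromisoformat(text)
    now = now or datetime.now(timezone.utc)
    return now - started > limit


def mentions_xid_149_0x02a(log_line):
    """Return True if a log line looks like an Xid 149 message with 0x02a."""
    return XID_149_PATTERN.search(log_line) is not None
```

If either function returns True for an A4 VM that shows workload interruptions, the GPU reset recommended above is the mitigation to apply.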
The metadata server might display old physicalHost VM metadata
After experiencing a host error or using the report faulty host API to move an instance to a new host, when you query the metadata server, it might display the physicalHost metadata of the instance's previous host.
To work around this issue, do one of the following:
- Use the instances.get method or the gcloud compute instances describe command to retrieve the correct physicalHost information.
- Stop and then start your instance. This process updates the physicalHost information in the metadata server.
- Wait 24 hours for the impacted instance's physicalHost information to be updated.
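The first workaround can be sketched in Python as follows. This is a hedged example rather than a supported tool: it assumes that the gcloud CLI is installed and authenticated, and that the JSON output of gcloud compute instances describe exposes the value under resourceStatus.physicalHost. The helper names are hypothetical.

```python
import json
import subprocess


def describe_instance_json(name, zone):
    """Run gcloud compute instances describe and return its JSON output."""
    result = subprocess.run(
        ["gcloud", "compute", "instances", "describe", name,
         "--zone", zone, "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def physical_host_from_describe(describe_json):
    """Extract physicalHost from the describe output, or None if absent.

    Assumption: the field is located at resourceStatus.physicalHost.
    """
    instance = json.loads(describe_json)
    return instance.get("resourceStatus", {}).get("physicalHost")


def metadata_is_stale(metadata_physical_host, api_physical_host):
    """Return True if the metadata server value differs from the API value."""
    if not api_physical_host:
        return False  # Nothing authoritative to compare against.
    return metadata_physical_host.strip() != api_physical_host.strip()
```

For example, pass the output of describe_instance_json to physical_host_from_describe, and then compare the result with the value that the metadata server returned. If metadata_is_stale returns True, then rely on the API value until the metadata server is updated.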