Known issues

This page describes known issues that you might run into while running your AI-optimized VMs or clusters. For issues with Compute Engine VMs, see Compute Engine known issues.

Issues

The following sections list known issues and workarounds for AI Hypercomputer.

Workload interruptions on A4 VMs due to firmware issues for NVIDIA B200 GPUs

NVIDIA has identified two firmware issues for B200 GPUs, which are used by A4 VMs, that are causing workload interruptions. Specifically, if you notice workload interruptions on A4 VMs, then check if either of the following is true:

To mitigate this issue, we recommend resetting your GPUs. To help prevent the issue, we recommend resetting the GPUs on A4 VMs at least once every 60 days.
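As a rough illustration of the 60-day recommendation, the following Python sketch computes the latest date by which a GPU reset should happen again and whether a reset is currently due. The function names are hypothetical and are not part of any Google Cloud or NVIDIA tooling. The sketch only tracks the schedule; it does not perform the reset itself.

```python
from datetime import date, timedelta

# Longest recommended gap between GPU resets on A4 VMs (from the text above).
RESET_INTERVAL = timedelta(days=60)


def next_reset_deadline(last_reset: date) -> date:
    """Return the latest date by which the GPUs should be reset again."""
    return last_reset + RESET_INTERVAL


def reset_is_due(last_reset: date, today: date) -> bool:
    """Return True once 60 days or more have passed since the last reset."""
    return today >= next_reset_deadline(last_reset)
```

For example, with a last reset on 2025-01-01, `next_reset_deadline` returns 2025-03-02, and `reset_is_due` becomes true on that date. You would need to record the last reset date yourself, for example in a VM label or a maintenance log.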

The metadata server might display old physicalHost VM metadata

After an instance experiences a host error, or after you use the report faulty host API to move an instance to a new host, the metadata server might still display the physicalHost metadata of the instance's previous host when you query it.
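To check whether an instance might be affected, you can compare the value that the metadata server reports now with a value that you recorded before the host move. The following Python sketch shows one way to do that from inside the VM. It assumes that the physical host is exposed at the `instance/attributes/physical_host` metadata path; verify this path for your machine type. The helper names are hypothetical.

```python
import urllib.request

# Assumption: the physical host value is served at this attribute path.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/attributes/physical_host"
)


def build_request(url: str = METADATA_URL) -> urllib.request.Request:
    """Build a metadata server request with the required header."""
    # The metadata server rejects requests that lack this header.
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def read_physical_host(timeout: float = 5.0) -> str:
    """Query the metadata server; this only works from inside the VM."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def may_be_stale(recorded_before_move: str, reported_now: str) -> bool:
    """Return True if the reported host matches the pre-move host."""
    return recorded_before_move.strip() == reported_now.strip()
```

If `may_be_stale` returns true after the instance has moved to a new host, the metadata server is probably still reporting the previous host's value.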

To work around this issue, do one of the following: