Manage maintenance events for TPUs in managed capacity mode

TPU VMs are instances of Compute Engine VMs with attached TPU hardware. Compute Engine VMs are subject to Compute Engine VM maintenance events. Each TPU is connected to a Compute Engine VM, so using more TPUs (for example, in a TPU slice) increases the likelihood of one of your VMs encountering a maintenance event.
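
The effect of slice size can be illustrated with a rough calculation. The sketch below assumes, purely for illustration, that each VM independently encounters a maintenance event with the same probability over some period. Both the independence assumption and the 1% figure are hypothetical and are not published Compute Engine numbers.

```python
def prob_any_maintenance(p_single: float, num_vms: int) -> float:
    """Probability that at least one of num_vms VMs sees a maintenance event.

    Assumes each VM is affected independently with probability p_single
    over the period of interest (an illustrative simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if num_vms < 0:
        raise ValueError("num_vms must be non-negative")
    return 1.0 - (1.0 - p_single) ** num_vms


# Hypothetical per-VM probability of 1%; compare different slice sizes.
for n in (1, 8, 64):
    print(n, round(prob_any_maintenance(0.01, n), 3))
```

Under these assumptions the probability grows quickly with the number of VMs, which is why larger slices benefit most from frequent checkpointing.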

This document discusses approaches to handle maintenance events for long-running training jobs on TPUs. For information about handling maintenance events for TPUs in Google Kubernetes Engine (GKE), see Manage GKE node disruption for GPUs and TPUs.

View notifications for upcoming maintenance

By monitoring your instance's upcoming maintenance windows, you can prepare your workloads ahead of time and keep disruption to a minimum. For more information, see Monitor and plan for a host maintenance event in the Compute Engine documentation.

Use checkpoints for fast recovery from maintenance events

Checkpoints are key to short recoveries from maintenance events and should be saved frequently. We recommend saving checkpoints approximately every hour. If you checkpoint too infrequently, a maintenance event or other training interruption can cost you all of the progress made since the last checkpoint.

Checkpoints generally refer to all of the saved parameters used in training, such as model weights. The time it takes to save a checkpoint can range from seconds to minutes.

While TPUs often recover automatically from maintenance events, there are edge cases where the job doesn't restart automatically. When this happens, you need to delete and recreate the TPU resources and restart the training job from a saved checkpoint.
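
The save-and-resume pattern described above can be sketched in a framework-agnostic way. This is a minimal illustration using only the Python standard library, not the checkpointing API of any particular ML framework. The function names, the state layout, and the placeholder "training step" are all hypothetical, and a real job would typically use its framework's checkpoint utilities and write to durable storage.

```python
import os
import pickle
import tempfile
import time


def save_checkpoint(state: dict, path: str) -> None:
    """Write state atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "weights": [0.0]}  # hypothetical initial state


def train(path: str, total_steps: int, interval_s: float = 3600.0) -> dict:
    """Run a placeholder loop, checkpointing roughly every interval_s seconds."""
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    last_save = time.monotonic()
    while state["step"] < total_steps:
        # Stand-in for a real training step.
        state["weights"] = [w + 1.0 for w in state["weights"]]
        state["step"] += 1
        if time.monotonic() - last_save >= interval_s:
            save_checkpoint(state, path)
            last_save = time.monotonic()
    save_checkpoint(state, path)  # final checkpoint at the end of training
    return state
```

If the job is restarted after an interruption, calling `train` again with the same path picks up from the last saved step instead of step 0. The default interval of 3600 seconds mirrors the hourly recommendation above.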

There are different mechanisms for saving and loading checkpoints for each ML framework. Supported Cloud TPU models generally have checkpointing built in. For more information on checkpointing, see the documentation for your ML framework.

Detect maintenance events

To detect if and when a maintenance event occurred on your TPU, check the system event audit logs in Cloud Logging. For more information, see View maintenance event logs.

You can also check for upcoming maintenance events using the gcloud compute instances describe command. For more information, see Monitor and plan for a host maintenance event in the Compute Engine documentation.
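
As a rough sketch, the check can be scripted by calling the gcloud compute instances describe command with JSON output and reading the maintenance details from the result. The sketch assumes that the gcloud CLI is installed and authenticated, and that upcoming maintenance appears under resourceStatus.upcomingMaintenance in the output. Treat that field path as an assumption and confirm it against the Compute Engine documentation. The function names and the example instance name and zone are hypothetical.

```python
import json
import subprocess
from typing import Optional


def describe_instance(name: str, zone: str) -> dict:
    """Run `gcloud compute instances describe` and return the parsed JSON."""
    result = subprocess.run(
        ["gcloud", "compute", "instances", "describe", name,
         f"--zone={zone}", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def upcoming_maintenance(instance: dict) -> Optional[dict]:
    """Return upcoming-maintenance details, or None if none are reported.

    Assumption: details live under resourceStatus.upcomingMaintenance.
    """
    return (instance.get("resourceStatus") or {}).get("upcomingMaintenance")


# Example usage (hypothetical instance name and zone):
# info = upcoming_maintenance(describe_instance("my-tpu-vm", "us-central2-b"))
# if info is not None:
#     print("Maintenance scheduled:", info)
```

Keeping the parsing separate from the gcloud call makes it possible to test the logic against saved command output.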

View maintenance event logs

You can view historical logs of maintenance events on your TPU in system event audit logs.

In the Google Cloud console navigation menu, go to the Logs Explorer page.

Use the following search query to view any TPU VMs that have been terminated for maintenance:

"compute.instances.terminateOnHostMaintenance"

The results display logs for any interruptions and repairs of your TPU workers within your search timeframe. The logs include:
- The date and time of the event
- The type of event
- The reason for the termination in the protoPayload.metadata.terminateReason field
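
The same query can also be run from a script. The following sketch assumes that the gcloud CLI is installed and authenticated, and uses gcloud logging read to fetch matching entries as JSON. It then pulls out the three pieces of information listed above. The function names are hypothetical, and using protoPayload.methodName as the event type is an assumption based on the usual audit log layout.

```python
import json
import subprocess

# Same search query as used in Logs Explorer.
QUERY = '"compute.instances.terminateOnHostMaintenance"'


def read_maintenance_logs(project: str, freshness: str = "7d") -> list:
    """Fetch matching log entries with `gcloud logging read` as parsed JSON."""
    result = subprocess.run(
        ["gcloud", "logging", "read", QUERY, f"--project={project}",
         f"--freshness={freshness}", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout or "[]")


def summarize(entries: list) -> list:
    """Extract the time, event type, and termination reason from each entry."""
    rows = []
    for entry in entries:
        payload = entry.get("protoPayload") or {}
        metadata = payload.get("metadata") or {}
        rows.append({
            "time": entry.get("timestamp"),
            "event": payload.get("methodName"),  # assumed event-type field
            "reason": metadata.get("terminateReason"),
        })
    return rows


# Example usage (hypothetical project ID):
# for row in summarize(read_maintenance_logs("my-project")):
#     print(row["time"], row["event"], row["reason"])
```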

Manually start maintenance

You can manually start a pending host maintenance event on your TPU VM to proactively handle upcoming maintenance with minimal disruption. For more information, see Manually start a host maintenance event in the Compute Engine documentation.