This document discusses how you can minimize disruptions to your GPU workloads during a maintenance event.
When Compute Engine performs host maintenance on a compute instance with attached graphics processing units (GPUs), it must stop the instance, because compute instances with attached GPUs can't be live migrated.
You must set these compute instances to stop for host maintenance events. You can set your stopped compute instances to automatically restart after the maintenance event completes.
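As an illustration of these two settings, the following sketch wraps the gcloud compute instances set-scheduling command in a small helper. The function name set_gpu_host_maintenance_policy and its two arguments are hypothetical and used only for this example. It assumes that the gcloud CLI is installed and authenticated, and that the instance already exists in the given zone.

```shell
#!/bin/sh
# Sketch: set a GPU instance to stop for host maintenance and to restart
# automatically afterward. The helper name and arguments are illustrative.

set_gpu_host_maintenance_policy() {
  instance_name="$1"
  zone="$2"

  # Both arguments are required.
  if [ -z "$instance_name" ] || [ -z "$zone" ]; then
    echo "usage: set_gpu_host_maintenance_policy INSTANCE_NAME ZONE" >&2
    return 2
  fi

  # TERMINATE stops the instance for host maintenance instead of attempting
  # live migration; --restart-on-failure enables automatic restart.
  gcloud compute instances set-scheduling "$instance_name" \
    --zone="$zone" \
    --maintenance-policy=TERMINATE \
    --restart-on-failure
}
```

For example, set_gpu_host_maintenance_policy my-gpu-vm us-central1-a, where both values are placeholders for your own instance name and zone.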
Host maintenance events typically occur once every two weeks, but might occasionally run more frequently. Compute instances with attached GPUs can take up to one hour to terminate after failures or host errors.
Receive advance notice before maintenance events
You can monitor the maintenance schedule for your Compute Engine instance, and prepare your workloads to transition through the system restart.
To receive advance notice of host events, monitor the
/computeMetadata/v1/instance/maintenance-event metadata value.
If the request to the metadata server returns NONE, then the compute instance
isn't scheduled to stop. For example, run the following command from within a
compute instance:
curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"
NONE
If the metadata server returns TERMINATE_ON_HOST_MAINTENANCE, then your
compute instance is scheduled for stopping. For compute instances that have
attached GPUs, Compute Engine provides this notice 1 hour before the compute
instance stops.
For some GPU machine series, such as A3, Compute Engine
provides notice of upcoming maintenance more than an hour in advance through the
upcoming-maintenance metadata attribute. To learn more, see
Monitor and plan for a host maintenance event.
Use these notices to configure your application to transition through host maintenance events. For example, see Migrate your temporary data off of Local SSD disks in this document.
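One possible way to act on these notices is a small watcher that runs inside the compute instance. The following is a minimal sketch, not a definitive implementation. It reads the maintenance-event value once, then uses the metadata server's wait_for_change=true query parameter to block until the value changes. The function names are hypothetical, and on_maintenance_notice is a placeholder for your own checkpoint logic.

```shell
#!/bin/sh
# Sketch: watch the maintenance-event metadata value and run a hook when
# the instance is scheduled to stop. All function names are illustrative.

METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Print the current value. When called with "wait", the request blocks
# until the value changes (wait_for_change=true).
get_maintenance_event() {
  if [ "$1" = "wait" ]; then
    curl -sf -H "Metadata-Flavor: Google" "${METADATA_URL}?wait_for_change=true"
  else
    curl -sf -H "Metadata-Flavor: Google" "${METADATA_URL}"
  fi
}

# Placeholder hook: replace the body with your own checkpoint or drain logic.
on_maintenance_notice() {
  echo "Maintenance notice received: $1; saving work in progress"
}

# Decide what to do with a single metadata value.
handle_event() {
  case "$1" in
    NONE) return 0 ;;  # No stop is scheduled.
    TERMINATE_ON_HOST_MAINTENANCE) on_maintenance_notice "$1" ;;
    *) echo "Unrecognized maintenance-event value: $1" >&2 ;;
  esac
}

# Check the current value once, then block on changes indefinitely.
watch_maintenance_events() {
  event="$(get_maintenance_event)" || return 1
  handle_event "$event"
  while true; do
    # If a request fails, pause briefly before retrying.
    event="$(get_maintenance_event wait)" || { sleep 5; continue; }
    handle_event "$event"
  done
}
```

To start watching, call watch_maintenance_events at the end of the script, for example from a startup script or a background service on the compute instance.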
Migrate your temporary data off of Local SSD disks
Data on Local SSD disks doesn't persist when Compute Engine stops a compute instance for a host maintenance event, so any data on Local SSD disks attached to the instance is unrecoverable after the stop. To help prevent data loss, configure your workload to migrate data off of the Local SSD disks before the compute instance is stopped. For example, you might use one of the following techniques:
- Configure your application to temporarily move work in progress to a Cloud Storage bucket, then retrieve that data after the compute instance restarts.
- Write data to a secondary Persistent Disk. When the compute instance automatically restarts, the Persistent Disk can be reattached and your application can resume work.
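The Cloud Storage technique might be sketched as a pair of helpers built on gcloud storage cp. This is a minimal sketch under stated assumptions: the mount path /mnt/disks/local-ssd/work and the bucket gs://example-bucket/checkpoints are hypothetical defaults, the function names are illustrative, and the instance's service account is assumed to have write access to the bucket.

```shell
#!/bin/sh
# Sketch: copy work in progress from a Local SSD disk to Cloud Storage
# before a stop, and copy it back after the instance restarts.
# Both defaults are placeholders; override them through the environment.

LOCAL_SSD_DIR="${LOCAL_SSD_DIR:-/mnt/disks/local-ssd/work}"
BACKUP_URI="${BACKUP_URI:-gs://example-bucket/checkpoints}"

# Copy the working directory into the bucket. Call this when the
# maintenance notice arrives.
save_work() {
  gcloud storage cp --recursive "${LOCAL_SSD_DIR}" "${BACKUP_URI}/"
}

# Copy the saved directory back next to its original location. Call this
# after the instance restarts and the Local SSD disk is mounted again.
restore_work() {
  parent_dir="$(dirname "${LOCAL_SSD_DIR}")"
  dir_name="$(basename "${LOCAL_SSD_DIR}")"
  mkdir -p "${parent_dir}" &&
    gcloud storage cp --recursive "${BACKUP_URI}/${dir_name}" "${parent_dir}/"
}
```

Because the notice arrives one hour before the stop, size the data that save_work copies so that the transfer completes well within that window.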
What's next?
- Learn more about GPU platforms.
- To learn more about managing and scaling groups of compute instances, see Set the group's target size.
- To monitor GPU performance, see Monitor GPU performance.
- To improve network performance, see Use higher network bandwidth.
- Learn how to troubleshoot VM shutdowns and reboots.