Perform host maintenance for nodes running training and inference workloads

Autopilot Standard

This document explains how to perform host maintenance of the underlying Compute Engine instances for nodes in Google Kubernetes Engine (GKE) clusters. You only need to actively manage this maintenance for certain types of Compute Engine instances that don't live migrate, including instances with GPUs and TPUs. The strategies described in this document work well for training and inference workloads. If you only need to manually perform host maintenance for an individual node, or your workloads can tolerate automatic host maintenance, see Understand how to do host maintenance on GKE.

These strategies perform host maintenance for groups of nodes, and, optionally, initiate GKE cluster upgrades.

Use the parallel strategy for nodes whose workloads can tolerate a single period of downtime, such as training workloads. Use the rolling strategy for nodes whose workloads can tolerate downtime in batches while most resources remain available, such as inference workloads.

Use a parallel strategy to update the nodes of training workloads

This strategy performs changes simultaneously for a group of nodes that use accelerators. Use it for training workloads, or for other types of workloads where the least disruptive way to perform changes is a single window of complete downtime for all nodes in the group and the workloads that run on them.

The strategy follows these high-level steps:

Stop workloads: select the node pools and either stop the workloads running on them or move the workloads to other nodes that remain available.
Trigger host maintenance: apply the maintenance label to all selected nodes at the same time and wait for the process to complete on all nodes.
Upgrade the GKE version: change the GKE version of the nodes.
Restart workloads: after all host maintenance and upgrades finish, restart your workloads.

The following instructions perform changes for a single node pool. However, you can adapt the steps to perform changes for multiple node pools at the same time. Before you begin, ensure that you have a window of at least a few hours during which the workload doesn't need to run on these nodes.

To minimize disruption while receiving critical changes for both the underlying Compute Engine instances and the GKE nodes, use this period of downtime to perform both the host maintenance and GKE version upgrades. However, you can perform only host maintenance if you don't want to upgrade the version of your GKE nodes.

Considerations before you begin

Review the following considerations before you begin:

Avoid redeploying workloads: to avoid unnecessary delays due to PodDisruptionBudgets, don't redeploy any workloads until you've completed all steps.
Plan for disruption: ensure that your workloads can be disrupted for a period of time. These steps take multiple hours to complete, primarily due to the time required for host maintenance.

Perform updates for all nodes simultaneously

To perform host maintenance and, optionally, GKE version upgrades, complete the following steps:

Prepare your workloads: stop your workloads, or ensure that they have taken a recent snapshot or checkpoint.
Start host maintenance: apply the maintenance label to all nodes in the selected node pool:
```
kubectl label nodes -l cloud.google.com/gke-nodepool=NODE_POOL_NAME cloud.google.com/perform-maintenance=true --overwrite
```
Note: Labeling nodes performs maintenance using the instances.performMaintenance API. To trigger maintenance at a reservation sub-block, we recommend using the gcloud beta compute reservations perform-maintenance command.

Compute Engine begins draining and updating the underlying instances simultaneously. This process might take a few hours. For more information, see Process of graceful termination.
Monitor the status of host maintenance: GKE removes the maintenance label when maintenance completes. When maintenance finishes, you can find a log with the following message in Cloud Logging:
```
Maintenance window has completed for this instance. All maintenance
notifications on the instance have been removed.
```
Optional: Upgrade the version of the GKE nodes: follow the instructions to upgrade the GKE version of the nodes.
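
The monitoring step above can be sketched as a polling loop. This is a minimal sketch, not an official tool: it assumes `kubectl` is already configured for the cluster, and it relies only on the behavior described above, that GKE removes the maintenance label when maintenance completes. The function name `wait_for_maintenance` and the default 300-second poll interval are illustrative assumptions.

```shell
# wait_for_maintenance NODE_POOL_NAME [POLL_INTERVAL_SECONDS]
# Hypothetical helper: polls until no node in the node pool still carries
# the cloud.google.com/perform-maintenance=true label.
wait_for_maintenance() {
  local pool="$1" interval="${2:-300}" names remaining
  while true; do
    # Stop on kubectl errors instead of mistaking them for "no nodes left".
    names="$(kubectl get nodes \
      -l "cloud.google.com/gke-nodepool=${pool},cloud.google.com/perform-maintenance=true" \
      --no-headers -o custom-columns=":metadata.name")" || return 1
    # Count the non-empty lines, which is the number of nodes still labeled.
    remaining="$(printf '%s\n' "$names" | grep -c . || true)"
    if [ "$remaining" -eq 0 ]; then
      echo "Maintenance label removed from all nodes in ${pool}."
      return 0
    fi
    echo "${remaining} node(s) in ${pool} still labeled; checking again in ${interval}s."
    sleep "$interval"
  done
}
```

For example, `wait_for_maintenance NODE_POOL_NAME 600` checks every ten minutes. Because host maintenance can take hours, a long interval is usually enough.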

Use a rolling strategy to update the nodes of inference workloads

This strategy outlines a manual approach to performing maintenance on GKE nodes running inference workloads. It involves updating nodes in batches to maintain service availability. This method is best suited for workloads that can tolerate a certain percentage of replicas being temporarily offline.

The strategy follows these high-level steps:

Identify and batch nodes: choose the node pools to update. Group the nodes into batches sized according to your workload's failure tolerance.
Iterate through batches: for each batch, apply the maintenance label and monitor the batch of nodes until the label is removed.
Upgrade the GKE version: after all batches complete host maintenance, change the version of the GKE nodes.

Considerations before you begin

Review the following considerations before you begin:

Understand your deployment: success requires detailed knowledge of your workload distribution, replica placement, and failure domains. Ensure that you maintain sufficient serving capacity throughout the process.
Plan batch sizes: update nodes in batches. The size of each batch is determined by your workload's fault tolerance. Factors to consider include the following:
- The number of replicas per serving model.
- The distribution of replicas across nodes and failure domains.
- Any PodDisruptionBudgets, which can help enforce the maximum number of Pods that are down simultaneously.
- Recommendation: to simplify management, consider dedicating different node pools to different sets of replicas, which lets you isolate failure domains at the node pool level.
Calculate time constraints: consider the following timing factors:
- Each batch can take several hours to complete the host maintenance step.
- Calculate the minimum batch size to help ensure that all maintenance finishes within required deadlines:
  1. MAINTENANCE_BLOCKS = floor(HOURS_TO_MAINTENANCE / 4) (where HOURS_TO_MAINTENANCE is the total time available).
  2. MIN_PER_BATCH = TOTAL_NODE_COUNT / MAINTENANCE_BLOCKS
- Your chosen batch size must be equal to or greater than MIN_PER_BATCH.
Review specific workload types: consider the following for the respective configuration types:
- Mixture of Experts (MoE): ensure that your batching strategy maintains the minimum required number of replicas for each model.
- Disaggregated serving: ensure that you track all replicas involved in the disaggregated setup when planning batches.
- Multi-host node pools (TPU, MNNVL): for these configurations, you'll likely take down an entire node pool at a time. Plan your failure domains across multiple node pools accordingly.
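
The batch-size arithmetic from the Calculate time constraints item can be sketched in shell. This is a minimal sketch that uses the 4-hour divisor from the formula above. The function name `min_per_batch` is an illustrative assumption, and rounding the result up to a whole node is a choice made here so that the batch size never falls below MIN_PER_BATCH.

```shell
# min_per_batch HOURS_TO_MAINTENANCE TOTAL_NODE_COUNT
# Hypothetical helper: prints the smallest whole-number batch size that
# satisfies the MIN_PER_BATCH formula.
min_per_batch() {
  local hours="$1" total="$2" blocks
  # MAINTENANCE_BLOCKS = floor(HOURS_TO_MAINTENANCE / 4); shell integer
  # division already rounds down for positive values.
  blocks=$(( hours / 4 ))
  if [ "$blocks" -lt 1 ]; then
    echo "Less than one 4-hour maintenance block is available." >&2
    return 1
  fi
  # TOTAL_NODE_COUNT / MAINTENANCE_BLOCKS, rounded up to a whole node.
  echo $(( (total + blocks - 1) / blocks ))
}
```

For example, with 48 hours available and 100 nodes, `min_per_batch 48 100` gives 12 maintenance blocks and a minimum batch size of 9 nodes.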

Perform rolling updates in batches

To perform rolling updates, complete the following steps:

Identify nodes for maintenance: identify all nodes where you want to perform maintenance, and save this list. To identify nodes, use any of the following methods or manually select them:
- Get all nodes in the cluster that use accelerators (TPUs or GPUs):
```
kubectl get nodes -o json | jq -r '.items[] | select(.spec.taints[]? | select(.key=="nvidia.com/gpu" or .key=="google.com/tpu")) | .metadata.name'
```
- Get all nodes in a specific node pool:
```
kubectl get nodes -l cloud.google.com/gke-nodepool=NODE_POOL_NAME --no-headers -o custom-columns=":metadata.name"
```
  Replace NODE_POOL_NAME with the name of the node pool.
- Get all nodes with a specific label:
```
kubectl get nodes -l LABEL -o jsonpath='{.items[*].metadata.name}'
```
  Replace LABEL with the node label.
Divide nodes into batches: divide the identified nodes into equal batches. Determine the batch size using the formula described in the Calculate time constraints list item in the earlier Considerations before you begin section.
Perform host maintenance: for each batch, complete the following steps:
1. Select a batch of nodes and apply the maintenance label:
```
kubectl label nodes LIST_OF_NODES_IN_BATCH cloud.google.com/perform-maintenance=true --overwrite
```
  Replace LIST_OF_NODES_IN_BATCH with a space-separated list of nodes from the batch. For example, node-1 node-2 node-3.
  
  Note: Labeling nodes performs maintenance using the instances.performMaintenance API. To trigger maintenance at a reservation sub-block, we recommend using the gcloud beta compute reservations perform-maintenance command.
2. Monitor the status of host maintenance. GKE removes the maintenance label when maintenance completes. When maintenance finishes, you can find a log with the following message in Cloud Logging:
```
Maintenance window has completed for this instance. All maintenance
notifications on the instance have been removed.
```
3. Repeat the previous two steps for each remaining batch until you've completed host maintenance for all batches.
Optional: Upgrade the version of the GKE nodes: perform this step only after host maintenance completes for all nodes, to avoid scenarios where the GKE nodes are deployed on hosts that haven't finished maintenance yet. See the instructions in the following section.
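
The batching in the steps above can be sketched as a helper that turns a saved node list into one label command per batch. This is a minimal sketch: the function name `print_batch_commands` is an illustrative assumption, and the helper only prints the commands so that you can review them and run them one batch at a time.

```shell
# print_batch_commands BATCH_SIZE
# Hypothetical helper: reads node names (separated by spaces or newlines)
# from standard input and prints one kubectl label command per batch.
# Nothing is executed.
print_batch_commands() {
  local size="$1"
  awk -v size="$size" '
    function emit() {
      print "kubectl label nodes" batch " cloud.google.com/perform-maintenance=true --overwrite"
      batch = ""
      n = 0
    }
    {
      # Treat every whitespace-separated field as one node name.
      for (i = 1; i <= NF; i++) {
        batch = batch " " $i
        if (++n == size) emit()
      }
    }
    END { if (n > 0) emit() }   # the last batch can be smaller
  '
}
```

For example, you can pipe the output of one of the earlier `kubectl get nodes` commands into `print_batch_commands 3`. Run each printed command only after the maintenance label has been removed from every node in the previous batch.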

Upgrade the GKE version of the nodes

Consider the number of nodes that you want to upgrade at the same time. With the parallel strategy, you performed host maintenance for your entire node pool or multiple node pools at the same time. With the rolling strategy, you performed host maintenance in batches. Determine which upgrade method to use based on the size of the groups of nodes:

Parallel strategy: if your node pools each have 20 or fewer nodes per zone, use surge upgrades. If your node pools each have more than 20 nodes per zone, delete and re-create the node pools.
Rolling strategy: if your batches have 20 or fewer nodes per zone, per node pool, use surge upgrades. If your batches have more than 20 nodes per zone, per node pool, delete and re-create the nodes.
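
The threshold above can be expressed as a small check. This is a minimal sketch: the function name `choose_upgrade_method` and its output strings are illustrative assumptions, and the input is the number of nodes per zone in the node pool or batch.

```shell
# choose_upgrade_method NODES_PER_ZONE
# Hypothetical helper: applies the 20-nodes-per-zone threshold described above.
choose_upgrade_method() {
  local nodes_per_zone="$1"
  if [ "$nodes_per_zone" -le 20 ]; then
    echo "surge-upgrade"
  else
    echo "delete-and-recreate"
  fi
}
```

For example, `choose_upgrade_method 18` prints `surge-upgrade`, and `choose_upgrade_method 32` prints `delete-and-recreate`.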

Use surge upgrades

Configure surge upgrades, using the maxUnavailable setting to determine how many nodes can be unavailable at the same time, per zone, in a node pool. For example, if you have 18 nodes in one zone in a node pool, set the value of the maxUnavailable field to 18.

This setting works best when using capacity from a reservation where you don't have excess capacity. For more information about why to use this setting, see Upgrade in a resource-constrained environment.
Upgrade the node pool by running the following command. If you want to upgrade multiple node pools, run this command for each node pool:
```
gcloud container clusters upgrade CLUSTER_NAME \
    --node-pool NODE_POOL_NAME \
    --cluster-version VERSION \
    --location CONTROL_PLANE_LOCATION \
    --quiet
```
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- NODE_POOL_NAME: the name of the node pool.
- VERSION: a recommended auto-upgrade target for the node pool. For more information, see Get upgrades information for Standard cluster node pools. If your cluster doesn't have a recommended auto-upgrade target, check the latest Version updates entries in the GKE release notes.
- CONTROL_PLANE_LOCATION: the location of the control plane of your cluster.
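
For the first step in this section, configuring surge upgrades, the following minimal sketch prints a `gcloud container node-pools update` command that sets maxUnavailable to the number of nodes per zone. The function name `print_surge_config` is an illustrative assumption. Setting `--max-surge-upgrade` to 0 is also an assumption, chosen to match a reservation without excess capacity; adjust it if you have spare capacity. The helper prints the command rather than running it.

```shell
# print_surge_config CLUSTER_NAME NODE_POOL_NAME CONTROL_PLANE_LOCATION NODES_PER_ZONE
# Hypothetical helper: prints (does not run) a command that configures
# surge upgrade settings for a node pool.
print_surge_config() {
  local cluster="$1" pool="$2" location="$3" nodes_per_zone="$4"
  # Above 20 nodes per zone, this document recommends deleting and
  # re-creating the nodes instead of using surge upgrades.
  if [ "$nodes_per_zone" -gt 20 ]; then
    echo "More than 20 nodes per zone: delete and re-create the nodes instead." >&2
    return 1
  fi
  printf 'gcloud container node-pools update %s --cluster %s --location %s --max-surge-upgrade 0 --max-unavailable-upgrade %s\n' \
    "$pool" "$cluster" "$location" "$nodes_per_zone"
}
```

For example, `print_surge_config CLUSTER_NAME NODE_POOL_NAME CONTROL_PLANE_LOCATION 18` prints a command that sets maxUnavailable to 18, matching the earlier example of 18 nodes in one zone.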

Delete and re-create the nodes

Delete the node pool and re-create it using the later version:

Delete the node pool:

```
gcloud container node-pools delete NODE_POOL_NAME \
    --cluster CLUSTER_NAME \
    --location CONTROL_PLANE_LOCATION
```

Re-create the node pool, passing the new version by using the --cluster-version flag. Pass the recommended auto-upgrade target for the node pool. For more information, see Get upgrades information for Standard cluster node pools. If your cluster doesn't have a recommended auto-upgrade target, check the latest Version updates entries in the GKE release notes.
