Enable node health prediction in a GKE cluster

After you create an AI-optimized Google Kubernetes Engine (GKE) cluster, you can enable node health prediction. If you plan to schedule workloads by using Topology Aware Scheduling (TAS) and Kueue, then enabling node health prediction lets the cluster's scheduler do the following:

  1. Identify nodes that are likely to degrade within the next five hours.

  2. Avoid scheduling new workloads on those nodes.

This approach helps you minimize interruptions on critical and interruption-sensitive workloads, such as large-scale training workloads.

This document explains how to enable node health prediction in a GKE cluster that uses A4X, A4, or A3 Ultra nodes. To learn how to use the node health prediction metric in a Cloud Monitoring dashboard, for example to troubleshoot performance issues on a Slurm cluster, see Monitor VMs and Slurm clusters instead.

Limitations

Before you enable node health prediction in your GKE cluster, consider the following limitations:

  • Nodes must use an A4X, A4, or A3 Ultra machine type.

  • Nodes must use the reservation-bound provisioning model.

Understand node health prediction

When you enable node health prediction in a GKE cluster, a CronJob that you deploy later in this document applies the gke.google.com/recommended-to-run-large-training-workload label to each node in your cluster. The CronJob sets each label value based on the predicted likelihood that the node's GPU health will degrade, and updates the values every 10 minutes. If the label value is true, then the node is predicted to stay healthy. If the label value is false, then the node is likely to degrade within the next five hours. The label value can change over time based on the node's GPU health.
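
For example, to see which nodes are currently flagged as likely to degrade, you can list the nodes whose label value is false:

kubectl get nodes -l gke.google.com/recommended-to-run-large-training-workload=false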

If you see that a node is likely to degrade, then you can do one or both of the following:

  • Avoid scheduling workloads on the node. You can configure Kueue to avoid scheduling workloads on nodes that show a value of false, as described in this document.

  • Report the node as faulty. If the node is experiencing issues like high GPU temperature or slow performance, then you can report the node as faulty. This action starts a host maintenance event for the node, making it available again for running workloads after maintenance completes. For instructions, see Report faulty hosts through GKE.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • To connect to your cluster, run the following command:

    gcloud container clusters get-credentials CLUSTER_NAME
    

    Replace CLUSTER_NAME with the name of your cluster.

Enable node health prediction

After you've prepared to schedule workloads on your GKE cluster by using TAS, you can enable node health prediction by completing the following steps:

  1. Deploy automatic node labeling

  2. Update your Job configuration

  3. Verify node labeling

Deploy automatic node labeling

To deploy automatic node labeling for node health prediction in your GKE cluster, complete the following steps:

  1. Clone the hardware accelerators in GKE (container-engine-accelerators) Git repository:

    git clone https://github.com/GoogleCloudPlatform/container-engine-accelerators.git
    
  2. Go to the topology-scheduler directory:

    cd container-engine-accelerators/gpudirect-tcpxo/topology-scheduler
    
  3. Create the Kubernetes ConfigMap that contains the Python scripts schedule-daemon.py and label-nodes-daemon.py, which query the node health scores:

    kubectl create configmap predictor-scheduler-scripts \
        --namespace=kube-system \
        --from-file=schedule-daemon.py=schedule-daemon.py \
        --from-file=label-nodes-daemon.py=label-nodes-daemon.py
    
  4. Apply the service account configuration to grant the necessary permissions (reading Monitoring metrics and patching Node objects) to the CronJob:

    kubectl apply -f service-account.yaml
    
  5. Deploy the CronJob that labels the nodes:

    kubectl apply -f label-nodes-daemon.yaml
    
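After you apply these manifests, you can optionally confirm that the resources were created. The first command checks the ConfigMap that you created in step 3. The second command assumes, as described earlier in this document, that label-nodes-daemon.yaml deploys a CronJob; because that manifest defines the target namespace, the command lists CronJobs across all namespaces:

kubectl get configmap predictor-scheduler-scripts --namespace=kube-system
kubectl get cronjobs --all-namespaces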

Update your Job configuration

To enable node health prediction when using Kueue, you must update your Job configuration to check for health prediction values and, if supported, topology requirements before starting a workload.

To update your Job configuration and enable node health prediction, add the following fields to the spec field of the Job's Pod template:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gke.google.com/recommended-to-run-large-training-workload
            operator: NotIn
            values:
            - "False"
...
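
The following minimal Job sketch shows where these fields belong in a complete manifest: the affinity rule goes in the Pod template's spec. The Job name, the container name and image, and the kueue.x-k8s.io/queue-name label value are placeholders that you replace with your own values.

apiVersion: batch/v1
kind: Job
metadata:
  name: JOB_NAME
  labels:
    kueue.x-k8s.io/queue-name: LOCAL_QUEUE_NAME  # the Kueue LocalQueue that admits this Job
spec:
  suspend: true  # Kueue unsuspends the Job when it admits the workload
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gke.google.com/recommended-to-run-large-training-workload
                operator: NotIn
                values:
                - "false"
      containers:
      - name: training
        image: IMAGE
      restartPolicy: Never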

Verify node labeling

After the CronJob runs for the first time, which is approximately 10 minutes after deployment, verify whether it has applied the gke.google.com/recommended-to-run-large-training-workload label to your nodes.

View a list of nodes that have the gke.google.com/recommended-to-run-large-training-workload label applied to them:

kubectl get nodes -L gke.google.com/recommended-to-run-large-training-workload
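
The output is similar to the following example, where the node names and versions are placeholders:

NAME                                        STATUS   ROLES    AGE   VERSION               RECOMMENDED-TO-RUN-LARGE-TRAINING-WORKLOAD
gke-example-cluster-a4-pool-1a2b3c4d-abcd   Ready    <none>   12d   v1.31.5-gke.1023000   true
gke-example-cluster-a4-pool-1a2b3c4d-efgh   Ready    <none>   12d   v1.31.5-gke.1023000   false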

The label value can be one of the following:

  • true: the node is predicted to remain healthy for the next five hours.

  • false: the node is likely to degrade within the next five hours. If you updated your Job configuration as described in this document, then Kueue avoids scheduling new workloads on the node.

What's next