After you create an AI-optimized Google Kubernetes Engine (GKE) cluster, you can enable node health prediction. If you plan to schedule workloads by using Topology Aware Scheduling (TAS) and Kueue, then enabling node health prediction lets the cluster's scheduler do the following:
Identify nodes that are likely to degrade within the next five hours.
Avoid scheduling new workloads on those nodes.
This approach helps you minimize interruptions to critical, interruption-sensitive workloads, such as large-scale training workloads.
This document explains how to enable node health prediction in a GKE cluster that uses A4X, A4, or A3 Ultra nodes. To learn how to use the node health prediction metric in a Cloud Monitoring dashboard when, for example, you want to troubleshoot performance issues on a Slurm cluster, see instead Monitor VMs and Slurm clusters.
Limitations
Before you enable node health prediction in your GKE cluster, consider the following limitations:
The node must use A4X, A4, or A3 Ultra machine types.
The node must use the reservation-bound provisioning model.
Understand node health prediction
When you enable node health prediction in a GKE cluster, a CronJob applies the
gke.google.com/recommended-to-run-large-training-workload label to each node in
your cluster. The CronJob sets each label value based on the likelihood that the
node's GPU health will degrade, and updates these values every 10 minutes. If
the label value is true, then the node is healthy. If the label value is false,
then the node is likely to degrade within the next five hours. The label value
can change over time based on the node's GPU health.
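For example, to read the current label value on a single node, you can query the node object directly. The NODE_NAME placeholder is a value that you supply; the jsonpath expression is standard kubectl syntax and isn't specific to this feature:

kubectl get node NODE_NAME -o jsonpath='{.metadata.labels.gke\.google\.com/recommended-to-run-large-training-workload}'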
If you see that a node is likely to degrade, then you can do one or both of the following:
- Avoid scheduling workloads on the node. You can configure Kueue to avoid scheduling workloads on nodes that show a value of false, as described in this document.
- Report the node as faulty. If the node is experiencing issues like high GPU temperature or slow performance, then you can report the node as faulty. This action starts a host maintenance event for the node, making it available again for running workloads after maintenance completes. For instructions, see Report faulty hosts through GKE.
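For example, to list only the nodes that are currently flagged as likely to degrade, you can filter on the label value. This is standard kubectl label filtering rather than a command from the scripts in this document; use the exact label value casing that you see on your nodes:

kubectl get nodes -l gke.google.com/recommended-to-run-large-training-workload=false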
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
To connect to your cluster, run the following command:
gcloud container clusters get-credentials CLUSTER_NAME

Replace CLUSTER_NAME with the name of your cluster.
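For example, assuming a hypothetical cluster named my-a3ultra-cluster located in us-central1, the command looks similar to the following. The --location flag is needed only if the cluster's location isn't already set as a default in your gcloud configuration:

gcloud container clusters get-credentials my-a3ultra-cluster \
    --location=us-central1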
Enable node health prediction
After you've prepared to schedule workloads on your GKE cluster by using TAS, you can enable node health prediction by completing the following steps:
Deploy automatic node labeling
To deploy automatic node labeling for node health prediction in your GKE cluster, complete the following steps:
1. Clone the hardware accelerators in GKE git repository:

   git clone https://github.com/GoogleCloudPlatform/container-engine-accelerators.git

2. Go to the topology-scheduler directory:

   cd container-engine-accelerators/gpudirect-tcpxo/topology-scheduler

3. Create the Kubernetes ConfigMap containing the Python scripts, schedule-daemon.py and label-nodes-daemon.py, that query the health scores:

   kubectl create configmap predictor-scheduler-scripts \
       --namespace=kube-system \
       --from-file=schedule-daemon.py=schedule-daemon.py \
       --from-file=label-nodes-daemon.py=label-nodes-daemon.py

4. Apply the service account configuration to grant the necessary permissions (reading Monitoring metrics and patching Node objects) to the CronJob:

   kubectl apply -f service-account.yaml

5. Deploy the DaemonSet that schedules the node labeling job:

   kubectl apply -f label-nodes-daemon.yaml
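To confirm that the objects were created before you wait for the first labeling run, you can list them in the kube-system namespace. The exact resource kinds depend on the manifests in the repository at the time you clone it, so treat this as a general check rather than expected output:

kubectl get configmap predictor-scheduler-scripts --namespace=kube-system
kubectl get serviceaccounts,cronjobs,daemonsets --namespace=kube-system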
Update your Job configuration
To enable node health prediction when using Kueue, you must update your Job configuration to check for health prediction values and, if supported, topology requirements before starting a workload.
To update your Job configuration and enable node health prediction, in the
spec field, add the following fields:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gke.google.com/recommended-to-run-large-training-workload
            operator: NotIn
            values:
            - "False"
...
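For context, the following sketch shows where this block fits in a complete Job manifest that's submitted to Kueue; in a Job, the affinity settings belong in the Pod template spec. The Job name, queue name, image, and resource requests are hypothetical placeholders, not values from this document or the source repository:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job                            # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: training-queue   # hypothetical Kueue LocalQueue
spec:
  suspend: true              # Kueue admits the Job by unsuspending it
  parallelism: 2
  completions: 2
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gke.google.com/recommended-to-run-large-training-workload
                operator: NotIn
                values:
                - "False"
      restartPolicy: Never
      containers:
      - name: trainer                           # hypothetical container
        image: us-docker.pkg.dev/PROJECT_ID/REPO/trainer:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 8                   # for example, a full 8-GPU node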
Verify node labeling
After the CronJob runs for the first time, which is approximately 10 minutes
after deployment, verify whether it has applied the
gke.google.com/recommended-to-run-large-training-workload label to your nodes.
View a list of nodes that have the
gke.google.com/recommended-to-run-large-training-workload label applied to
them:
kubectl get nodes -L gke.google.com/recommended-to-run-large-training-workload
The label value can be one of the following:
- true: the node is predicted to be healthy in the next five hours.
- false: the node is likely to degrade within the next five hours. If you updated your Job configuration as described in this document, then Kueue avoids scheduling new workloads on the node.
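Because the label values are refreshed approximately every 10 minutes, you can also watch the labels for changes instead of re-running the command manually. The --watch flag is standard kubectl behavior and isn't specific to this feature:

kubectl get nodes -L gke.google.com/recommended-to-run-large-training-workload --watch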
What's next
To learn about managing common events relevant to GKE clusters and AI workloads, see Manage AI-optimized GKE clusters.
To learn more about scheduling jobs on GKE with Kueue, see Deploy a batch system using Kueue.