This document teaches you how to configure a GKE cluster to maximize AI training availability on Tensor Processing Units (TPUs). You configure an automated fallback system using an open-source job queueing tool called Kueue, and Google Cloud's Dynamic Workload Scheduler (DWS).
The configuration that you define in this document sets up a primary node pool and a backup pool of nodes:
- On-demand (Plan A): This node pool is your first choice for nodes. The system first attempts to schedule jobs on these on-demand machines. The term on-demand means that once these machines start running, you have reliable, uninterrupted access to them.
- DWS flex-start (Plan B): This is your backup node pool. When Plan A machines aren't available, the Kueue scheduling program automatically assigns your job to this Plan B pool. DWS then searches for the Plan B hardware, but it doesn't guarantee immediate access because that hardware might also be unavailable. DWS doesn't give up though: it holds your request in line for up to 7 days and automatically provides the machines as soon as they're available.
This approach minimizes the time jobs spend waiting in line. It means that you don't have to manually check for available resources or rewrite your script for different machines.
Overview of configuration steps
To configure the automated fallback system, you complete several configuration steps. It helps to divide these steps into two categories:
- Cluster Administrator tasks: One-time infrastructure configuration, such as creating the GKE cluster, provisioning node pools, and installing the Kueue scheduling controller.
- AI Developer tasks: Repetitive, daily workflows, such as defining the training job requirements and submitting the workload.
Even if you perform all of these steps yourself, keeping this distinction in mind helps clarify the overall process.
Before you configure the system, review the configuration steps that you'll perform.
Key concepts
- On-demand (Plan A) node pool: The primary, high-priority node pool. Your job always tries to use this pool first.
- DWS flex-start (Plan B) node pool: The backup node pool. If machines in the primary pool aren't available, the system automatically uses this pool to search for available hardware.
- Kueue: A scheduling program that manages the job queue. It intercepts your job request and decides which node pool to use (Plan A or Plan B).
- Job: The AI training workload you want to run. In this document, you define it using a RayJob manifest.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project. Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on. To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Google Kubernetes Engine and Cloud TPU APIs. To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission.
- In the Google Cloud console, activate Cloud Shell.
- Ensure that you have enough preemptible quota to use TPU flex-start VMs. If the default quota isn't enough for your needs, request a higher allocation. For details, see Cloud TPU quotas and Set up the Cloud TPU environment.
Define environment variables
To simplify the commands that you run in this document, you can set environment variables in Cloud Shell. These variables store values such as the ID of your Google Cloud project, the names of your node pools, and the location of your GKE cluster.
After you define these variables, you can reuse them in multiple commands by
referencing the variable name (for example, $CLUSTER_NAME) instead of retyping
or replacing values each time. This approach makes the process easier to follow
and reduces the risk of errors.
To define these environment variables in Cloud Shell, run the following commands:
export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export ZONE="us-east5-b"
export REGION="us-east5"
export CLUSTER_NAME="tpu-cluster"
export GKE_VERSION="1.34"
export ONDEMAND_NODEPOOL="on-demand-pool"
export DWS_NODEPOOL="dws-pool"
Here's an explanation of these environment variables:
- PROJECT_ID: The ID of your Google Cloud project.
- PROJECT_NUMBER: The unique identifier number for your project (for example, 123456789012).
- ZONE: The compute zone for your cluster (for example, us-east5-b). Select a zone with availability for your chosen accelerator type. For availability information, see Cloud TPU quotas or GPU quotas.
- REGION: The region where you create the cluster resources (for example, us-east5).
- CLUSTER_NAME: The name you choose for your GKE cluster.
- GKE_VERSION: The GKE version of your cluster. Use version 1.34 or later.
- ONDEMAND_NODEPOOL: The name of your standard on-demand node pool (your Plan A node pool).
- DWS_NODEPOOL: The name of your DWS flex-start node pool (your Plan B node pool).
Configure the infrastructure (Cluster Administrator)
As the Cluster Administrator, you configure the GKE cluster and node pools to support the fallback mechanism.
Create a GKE cluster
First, create the GKE cluster. This cluster is the environment in which you install the Kueue controller, configure your node pools, and run your AI training jobs. To create and connect to a cluster, perform these steps:
Create the cluster:
gcloud container clusters create ${CLUSTER_NAME} \
    --cluster-version=${GKE_VERSION} \
    --machine-type=n2-standard-16 \
    --location=${ZONE} \
    --enable-image-streaming \
    --addons=RayOperator \
    --project=${PROJECT_ID}

This command uses the following key flags:
- --addons=RayOperator: Installs the Ray Operator on your cluster. You need this operator to manage the RayJob workload that you submit later in this document.
- --enable-image-streaming: Allows your cluster to pull container images faster. This feature significantly reduces the time it takes for large AI container images to start running.
Retrieve the cluster's credentials so that the kubectl CLI can connect to it. This command updates your Kubernetes config file, which is stored by default at ~/.kube/config:

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${ZONE} \
    --project=${PROJECT_ID}
Create the node pools
Create the primary and backup node pools for your environment: the on-demand node pool (Plan A) and the DWS flex-start node pool (Plan B):
Create the on-demand node pool: This pool serves as the primary resource for training jobs:
gcloud container node-pools create ${ONDEMAND_NODEPOOL} \
    --cluster=${CLUSTER_NAME} \
    --location=${ZONE} \
    --machine-type=ct6e-standard-4t \
    --tpu-topology=4x4 \
    --reservation-affinity=none \
    --enable-autoscaling \
    --num-nodes=0 \
    --min-nodes=0 \
    --max-nodes=4

In this example, your first-choice machines are TPU v6e accelerators, which you specify with the --machine-type=ct6e-standard-4t flag. You can change this machine type to match the hardware, such as GPUs or different TPUs, that you want for your AI model.

Create the DWS flex-start node pool: In this example, you select the same machine type (--machine-type=ct6e-standard-4t) that you chose for the primary on-demand pool. Your Plan B node pool doesn't have to use a different machine type; it's a different method of acquiring the same hardware when that hardware isn't immediately available on demand. With this method, DWS continuously searches for the available hardware for up to 7 days:

gcloud container node-pools create ${DWS_NODEPOOL} \
    --cluster=${CLUSTER_NAME} \
    --location=${ZONE} \
    --machine-type=ct6e-standard-4t \
    --tpu-topology=4x4 \
    --reservation-affinity=none \
    --enable-autoscaling \
    --enable-queued-provisioning \
    --flex-start \
    --num-nodes=0 \
    --min-nodes=0 \
    --max-nodes=4

These commands use the following key flags:
- --num-nodes=0, --min-nodes=0, --max-nodes=4, and --enable-autoscaling: This combination allows the node pools to scale up from zero nodes when a job needs them and scale back down when idle, which helps save costs.
- --tpu-topology: Defines the physical arrangement of the TPU chips. You specify this layout because the physical arrangement of the chips affects how fast your distributed training job runs.
- --reservation-affinity=none: Helps ensure that the node pool doesn't consume your pre-reserved hardware. Google Cloud lets you reserve specific machines to help ensure availability. Setting this flag to none tells the system to bypass those reservations and dynamically request unreserved machines instead.
- --enable-queued-provisioning and --flex-start: (Plan B pool only) These flags enable DWS to provision nodes for your Plan B pool from flexible capacity when it becomes available.
Verify the status of flex-start in the node pool
Inspect the DWS flex-start node pool and verify that flex-start is enabled:
gcloud container node-pools describe ${DWS_NODEPOOL} \
--cluster=${CLUSTER_NAME} \
--location=${ZONE} \
--format="get(config.flexStart)"
If flex-start is enabled, the output is True.
Install and configure Kueue (Cluster Administrator)
In this section, you install the Kueue controller on your cluster. Recall that Kueue is a scheduling program that manages the job queue. It intercepts your job request, decides which node pool to use (on-demand or DWS flex-start), and then assigns the job.
Install Kueue
Run the following command to install Kueue. This command installs the official Kueue Helm chart on your cluster and enables the ElasticJobsViaWorkloadSlices feature gate:
helm install kueue oci://registry.k8s.io/kueue/charts/kueue \
--namespace kueue-system \
--create-namespace \
--set "controllerManager.featureGates[0].name=ElasticJobsViaWorkloadSlices" \
--set "controllerManager.featureGates[0].enabled=true"
Define the configuration rules
Create a YAML manifest that defines the priority rules. These rules tell Kueue to use the on-demand pool first and the DWS flex-start pool second:
Create a file named dws-tpu-queue.yaml with the following content. This file defines two flavors of resources (on-demand and DWS flex-start) and a cluster queue that prioritizes them. This configuration file defines the logic that Kueue uses to handle your jobs:
- ResourceFlavor: At the beginning of this document, you created two node pools and assigned their names to the environment variables ${ONDEMAND_NODEPOOL} and ${DWS_NODEPOOL}. When GKE created those node pools, it automatically labeled every node in them with its node pool's name. The ResourceFlavor sections tell Kueue to look for nodes with those labels.
- ClusterQueue: This section of the manifest defines the priority rule. It lists the on-demand flavor first, so Kueue tries to provision on-demand machines first. If Kueue can't get those machines, it tries to provision DWS flex-start machines instead.
- Quotas: The file sets a quota, which is a limit on the total resources (such as CPU, memory, and TPU chips) that your jobs can use at any given time in the on-demand node pool. When your jobs reach this limit, Kueue automatically tries to provision DWS flex-start machines (your Plan B machines), which dws-tpu-queue.yaml configures with a much higher quota limit.
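The exact contents of dws-tpu-queue.yaml aren't reproduced here. As a rough sketch of the structure that the preceding bullets describe, such a file might look like the following. The flavor names, queue names, and quota values are illustrative assumptions, and a real DWS flex-start setup typically also defines a Kueue AdmissionCheck that creates ProvisioningRequests for the flex-start flavor:

```yaml
# Illustrative sketch only; names and quota values are assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: on-demand-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: ${ONDEMAND_NODEPOOL}
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: dws-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: ${DWS_NODEPOOL}
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-cluster-queue
spec:
  namespaceSelector: {}        # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: on-demand-flavor   # listed first, so tried first (Plan A)
      resources:
      - name: "google.com/tpu"
        nominalQuota: 16       # small quota: fills up quickly
    - name: dws-flavor         # fallback (Plan B)
      resources:
      - name: "google.com/tpu"
        nominalQuota: 1000     # much higher quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: tpu-queue
spec:
  clusterQueue: tpu-cluster-queue
```

Because the ClusterQueue evaluates flavors in order, jobs consume the on-demand quota first and spill over to the DWS flavor only when that quota is exhausted.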
Apply the configuration to your cluster. The following command uses a command-line tool called envsubst to replace the placeholder variables that appear in the file dws-tpu-queue.yaml with the values of the environment variables that you defined earlier:

envsubst < dws-tpu-queue.yaml | kubectl apply -f -
Run a training job (AI Developer)
As the AI Developer, you define and submit a training workload by creating a RayJob manifest. You specify your resource requirements in this manifest, and the automated fallback system, which the Cluster Administrator configured earlier with Kueue and DWS, handles the underlying node pools for you.
In this section, you perform the following steps:
- Create a Python training script.
- Store that script in a Kubernetes ConfigMap.
- Deploy a RayJob that mounts the ConfigMap as a volume so that the training script can be executed on the nodes.
After performing these steps, Ray Train automatically distributes the JAX workload across the nodes, and Kueue handles getting the machines you need.
The training script
Copy and paste the following Python script into a file called train.py:
The training script uses JAX, a Python library for high-performance numerical computing, to train a linear regression model. This script is a simplified example designed to demonstrate how to use DWS and Kueue for automated fallback, and doesn't perform data parallelism or model parallelism.
Notice that the ScalingConfig section of the training script defines the
hardware requirements for the training job. That section requests a 4x4 TPU
topology, which matches the physical layout of the node pools you configured
earlier.
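The train.py script itself isn't reproduced here. As a pure-Python stand-in that illustrates the same training logic, the sketch below fits a linear regression by gradient descent; the real script implements this with JAX and wraps it in Ray Train, whose ScalingConfig requests the 4x4 TPU topology. The data, epoch count, and learning rate here are illustrative assumptions:

```python
# Hypothetical stand-in for train.py: the real script uses JAX and Ray Train.
# This version only illustrates the linear-regression training loop itself.

def train(epochs=200, lr=0.5):
    # Synthetic data for the target line y = 2x + 1.
    xs = [i / 10 for i in range(10)]
    ys = [2.0 * x + 1.0 for x in xs]
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss

if __name__ == "__main__":
    w, b, loss = train()
    print(f"Final loss: {loss:.4f}")
    print("Training Complete!")
```

In the real script, the same loop body would be a JAX-jitted update step, and Ray Train would run copies of it on each TPU host.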
Create the ConfigMap
Upload the contents of your train.py script into a Kubernetes
ConfigMap object. This allows the cluster to store the script and make it
available to your RayJob:
kubectl create configmap jax-train-script --from-file=train.py
The RayJob that you define in the next section mounts this ConfigMap as a volume. This makes the script file appear inside the Ray containers so that the Ray software can find and execute it.
Apply the RayJob manifest
Create a file named rayjob-tpu-v6e-dws.yaml with the following content. This
manifest defines your training job and tells the system how to route it:
This manifest includes three configurations that make the fallback system work:
- Requests specific hardware: The nodeSelector section specifies the hardware that your script requires (in this example, a tpu-v6e-slice with a 4x4 topology).
- Selects the queue: The kueue.x-k8s.io/queue-name label routes your job directly to Kueue. This enables the automated fallback logic.
- Tolerates DWS flex-start nodes: The tolerations section allows the job to run on your Plan B node pool. Because GKE taints DWS flex-start nodes so that normal workloads don't accidentally run on them, your job must explicitly tolerate the cloud.google.com/gke-queued taint.
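The manifest isn't reproduced here. As a rough sketch of how those three configurations might fit together in a RayJob, consider the following; the container image, worker group sizes, queue name, and resource values are illustrative assumptions, not the exact manifest from this document:

```yaml
# Illustrative sketch only; image, sizes, and queue name are assumptions.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-tpu-v6e-dws-${JOB_ID}
  labels:
    kueue.x-k8s.io/queue-name: tpu-queue    # routes the job to Kueue
spec:
  entrypoint: python /scripts/train.py
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:latest
    workerGroupSpecs:
    - groupName: tpu-workers
      replicas: 1
      numOfHosts: 4                          # a 4x4 v6e slice spans 4 hosts
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 4x4
          tolerations:
          - key: cloud.google.com/gke-queued  # allow Plan B (DWS) nodes
            operator: Exists
            effect: NoSchedule
          containers:
          - name: ray-worker
            image: rayproject/ray:latest
            resources:
              limits:
                google.com/tpu: "4"           # 4 chips per ct6e-standard-4t host
            volumeMounts:
            - name: train-script
              mountPath: /scripts
          volumes:
          - name: train-script
            configMap:
              name: jax-train-script
```

The ${JOB_ID} placeholder is what lets the submission loop in the next section stamp a unique name onto each job.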
Submit the workload
To prove that the fallback system works, you need to submit two jobs. The first job consumes the Plan A on-demand capacity, which forces the second job to fall back to the Plan B DWS flex-start capacity.
Run the following command to submit the two jobs. The command uses a for loop
and envsubst to inject a unique job ID into the manifest for each run:
for i in 1 2; do
export JOB_ID=$i
envsubst < rayjob-tpu-v6e-dws.yaml | kubectl apply -f -
echo "Submitted Job $i"
sleep 2
done
After you submit the jobs, the system handles the workload as follows:
- Intercept: Kueue detects the jobs by using the queue label and temporarily suspends them.
- Decision: Kueue evaluates resource availability against the administrator's rules. It checks the Plan A pool first.
- Assignment:
- Because Plan A resources are available for the first job, Kueue assigns Job 1 there.
- Because Job 1 consumes the Plan A resources, Kueue automatically assigns Job 2 to the Plan B (DWS flex-start) pool.
- Launch: Kueue unsuspends the jobs. This action triggers the GKE cluster autoscaler to provision the nodes and start the training scripts.
Connect to the RayJob
As a final verification step, you can use the kubectl port-forward command to
connect to the Ray Dashboard and watch your jobs run.
To check the status of your first job, run the following command:
kubectl port-forward service/rayjob-tpu-v6e-dws-1-head-svc 8265:8265 &
After you run this command, open a web browser and go to
http://localhost:8265. In the Ray Dashboard, you can view the job status and
reported metrics to verify that both jobs complete successfully on their
respective node pools.
You can also view the logs for the first job by running the following command:
kubectl logs job/rayjob-tpu-v6e-dws-1
The truncated output of the training script should resemble the following. You
should see the messages Training Complete! and Job
'rayjob-tpu-v6e-dws-1-498t6' succeeded near the end of the output:
(pid=, ip=10.68.3.4) 5] XLA::TPU program HBM usage: 52.5K / 31.25G
(pid=, ip=10.68.9.4) :2152] XLA::TPU program VMEM usage: 141.0K / 128.00M [repeated 5x across cluster]
(pid=, ip=10.68.9.4) I0320 03:59:34.722540 855 deepsea_compiler_backend.cc:2163] Total hbm usage >= 260.14M: [repeated 5x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777634 888 deepsea_compiler_backend.cc:2167] reserved 204B [repeated 19x across cluster]
(pid=, ip=10.68.9.4) I0320 03:59:34.722542 855 deepsea_compiler_backend.cc:2163] program 70.0K [repeated 5x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777626 888 deepsea_compiler_backend.cc:2163] arguments 0B [repeated 12x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777627 888 deepsea_compiler_backend.cc:2163] Output size 0B; shares 0B with arguments. [repeated 14x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777625 888 deepsea_compiler_backend.cc:2163] Total host usage >= 0B: [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777626 888 deepsea_compiler_backend.cc:2163] program unknown size [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777634 888 deepsea_compiler_backend.cc:2167] Program sflag requirement 224B: [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777637 888 deepsea_compiler_backend.cc:2167] scoped 40B [repeated 21x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777636 888 deepsea_compiler_backend.cc:2167] Program vmem requirement 141.0K: [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777637 888 deepsea_compiler_backend.cc:2167] Program smem requirement 40B: [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777637 888 deepsea_compiler_backend.cc:2167] Program host requirement 0B: [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777637 888 deepsea_compiler_backend.cc:2167] Program hbm requirement 70.0K: [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777638 888 deepsea_compiler_backend.cc:2167] overlays 70.0K [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777638 888 deepsea_compiler_backend.cc:2175] XLA::TPU program SMEM usage: 1.9K / 1.00M (3 parameters) [repeated 7x across cluster]
(pid=, ip=10.68.6.4) I0320 03:59:34.777636 888 deepsea_compiler_backend.cc:2167] HLO temp 76.0K (0.0% utilization: Unpadded (0B) Padded (0B), 100.0% fragmentation (76.0K)) [repeated 14x across cluster]
(RayTrainWorker pid=542, ip=10.68.6.4) Training Complete! [repeated 3x across cluster]
(RayTrainWorker pid=542, ip=10.68.6.4) Epoch 40: Loss 0.0000 [repeated 3x across cluster]
2026-03-20 03:59:51,008 SUCC cli.py:65 -- ------------------------------------------
2026-03-20 03:59:51,008 SUCC cli.py:66 -- Job 'rayjob-tpu-v6e-dws-1-498t6' succeeded
2026-03-20 03:59:51,008 SUCC cli.py:67 -- ------------------------------------------
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this document, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
Delete the individual resources
If you want to keep the Google Cloud project you used in this document, run the following command to delete the cluster:
gcloud container clusters delete ${CLUSTER_NAME} \
--location=${ZONE} \
--project=${PROJECT_ID} \
--quiet
Summary
In this document, you configured and tested a Ray training environment. This environment uses a primary node pool and a backup DWS pool to maximize hardware availability. By automatically falling back to DWS when primary machines are unavailable, you minimized the time your training jobs spend waiting in line.
To get this working, you performed the following steps:
- Created a GKE cluster: Established the environment to host the node pools and scheduling tools.
- Configured the node pools: Created an on-demand node pool (Plan A) and a DWS node pool (Plan B).
- Installed and configured Kueue: Deployed the Kueue controller and applied priority rules instructing the system to try Plan A first and fall back to Plan B.
- Created a ConfigMap: Deployed a simplified JAX training script to the cluster to serve as the test workload.
- Defined a RayJob manifest: Configured the job to request specific hardware, route to the Kueue controller, and tolerate DWS nodes.
- Submitted the workload: Submitted two jobs to force Kueue to automatically route the second job to Plan B when Plan A resources are consumed.
- Verified the results: Used port forwarding to connect to the Ray Dashboard and confirmed that both jobs ran successfully.
What's next
- Learn how to view logs and metrics for your Ray clusters and jobs.
- Learn how to manage larger, distributed TPU workloads that span multiple slices.