This page describes how to accelerate machine learning (ML) workloads by using Cloud TPU accelerators (TPUs) in Google Kubernetes Engine (GKE) Autopilot clusters. This guidance can help you to select the correct libraries for your ML application frameworks, set up your TPU workloads to run optimally on GKE, and monitor your workloads after deployment.
This page is for Platform admins and operators, Data and AI specialists, and Application developers who want to prepare and run ML workloads on TPUs. To learn more about the common roles, responsibilities, and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Before reading this page, ensure that you're familiar with the following resources:
How TPUs work in Autopilot
To use TPUs in Autopilot workloads, you specify the following in your workload manifest:
- The TPU version in the spec.nodeSelector field.
- The TPU topology in the spec.nodeSelector field. The topology must be supported by the specified TPU version.
- The number of TPU chips in the spec.containers.resources.requests and the spec.containers.resources.limits fields.
When you deploy the workload, GKE provisions nodes that have the requested TPU configuration and schedules your Pods on the nodes. GKE places each workload on its own node so that each Pod can access the full resources of the node with minimized risk of disruption.
TPUs in Autopilot are compatible with the following capabilities:
Plan your TPU configuration
Before you use this guide to deploy TPU workloads, plan your TPU configuration based on your model and how much memory it requires. For details, see Plan your TPU configuration.
Pricing
For pricing information, see Autopilot pricing.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
- Ensure that you have an Autopilot cluster running GKE version 1.32.3-gke.1927000 or later. For instructions, see Create an Autopilot cluster.
- To use reserved TPUs, ensure that you have an existing specific capacity reservation. For instructions, see Consume a reservation.
Ensure quota for TPUs and other GKE resources
The following sections help you ensure that you have enough quota when using TPUs in GKE. To create TPU slice nodes, you must have TPU quota available unless you're using an existing capacity reservation. If you're using reserved TPUs, skip this section.
Creating TPU slice nodes in GKE requires Compute Engine API quota (compute.googleapis.com), not Cloud TPU API quota (tpu.googleapis.com). The name of the quota is different in regular Autopilot Pods and in Spot Pods.
To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:
- Go to the Quotas page in the Google Cloud console.
- In the Filter box, do the following:
    - Use the following table to select and copy the property of the quota based on the TPU version and the value in the cloud.google.com/gke-tpu-accelerator node selector. For example, if you plan to create on-demand TPU v5e nodes whose value in the cloud.google.com/gke-tpu-accelerator node selector is tpu-v5-lite-podslice, enter Name: TPU v5 Lite PodSlice chips.

      | TPU version | cloud.google.com/gke-tpu-accelerator | Property and name of the quota for on-demand instances | Property and name of the quota for Spot instances |
      |---|---|---|---|
      | TPU v3 | tpu-v3-device | Dimensions (e.g. location): tpu_family:CT3 | Not applicable |
      | TPU v3 | tpu-v3-slice | Dimensions (e.g. location): tpu_family:CT3P | Not applicable |
      | TPU v4 | tpu-v4-podslice | Name: TPU v4 PodSlice chips | Name: Preemptible TPU v4 PodSlice chips |
      | TPU v5e | tpu-v5-lite-podslice | Name: TPU v5 Lite PodSlice chips | Name: Preemptible TPU v5 Lite Podslice chips |
      | TPU v5p | tpu-v5p-slice | Name: TPU v5p chips | Name: Preemptible TPU v5p chips |
      | TPU Trillium | tpu-v6e-slice | Dimensions (e.g. location): tpu_family:CT6E | Name: Preemptible TPU slices v6e |

    - Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.
If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the region that you need, and you must request a TPU quota adjustment.
When a TPU reservation is created, both the limit and current use values for the corresponding quota increase by the number of chips in the TPU reservation. For example, when a reservation is created for 16 TPU v5e chips whose value in the cloud.google.com/gke-tpu-accelerator node selector is tpu-v5-lite-podslice, then both the Limit and Current usage for the TPU v5 Lite PodSlice chips quota in the relevant region increase by 16.
Quotas for additional GKE resources
You may need to increase the following GKE-related quotas in the regions where GKE creates your resources.
- Persistent Disk SSD (GB) quota: The boot disk of each Kubernetes node requires 100GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating and 100GB (nodes * 100GB).
- In-use IP addresses quota: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating.
- Ensure that max-pods-per-node aligns with the subnet range: Each Kubernetes node uses secondary IP ranges for Pods. For example, max-pods-per-node of 32 requires 64 IP addresses, which translates to a /26 subnet per node. Note that this range shouldn't be shared with any other cluster. To avoid exhausting the IP address range, use the --max-pods-per-node flag to limit the number of Pods allowed to be scheduled on a node. The quota for max-pods-per-node should be set at least as high as the maximum number of GKE nodes you anticipate creating.
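The sizing rules in this list reduce to simple arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and the rule that a node's Pod range holds twice the max-pods-per-node value, rounded up to a power of two, is an assumption generalized from the 32-Pod example above rather than something this page states for every value.

```python
def pod_range_prefix(max_pods_per_node: int) -> int:
    """Return the IPv4 prefix length of the per-node Pod range.

    Assumption: the range holds 2 * max_pods_per_node addresses,
    rounded up to the next power of two (32 Pods -> 64 addresses -> /26).
    """
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be positive")
    addresses = 2 * max_pods_per_node
    host_bits = (addresses - 1).bit_length()  # integer ceil(log2(addresses))
    return 32 - host_bits


def minimum_quotas(max_nodes: int, boot_disk_gb: int = 100) -> dict:
    """Estimate minimum quota values for the largest expected node count."""
    return {
        "persistent_disk_ssd_gb": max_nodes * boot_disk_gb,  # nodes * 100GB
        "in_use_ip_addresses": max_nodes,  # one IP address per node
    }


print(pod_range_prefix(32))  # prints 26
print(minimum_quotas(16))
```

Treat the results as lower bounds for a quota request, not as values that GKE reports.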
To request an increase in quota, see Request a quota adjustment.
Options for provisioning TPUs in GKE
GKE Autopilot lets you use TPUs directly in individual workloads by using Kubernetes nodeSelectors.
Alternatively, you can request TPUs by using custom compute classes. Custom compute classes let platform administrators define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware.
For instructions, see the Centrally provision TPUs with custom compute classes section.
Prepare your TPU application
TPU workloads have the following preparation requirements.
- Frameworks like JAX, PyTorch, and TensorFlow access TPU VMs using the libtpu shared library. libtpu includes the XLA compiler, TPU runtime software, and the TPU driver. Each release of PyTorch and JAX requires a certain libtpu.so version. To avoid package version conflicts, we recommend using a JAX AI image. To use TPUs in GKE, ensure that you use the following versions for your TPU type:
    - TPU Trillium (v6e), tpu-v6e-slice:
        - Recommended JAX AI image: jax0.4.35-rev1 or later
        - Recommended jax[tpu] version: v0.4.9 or later
        - Recommended torch_xla[tpuvm] version: v2.1.0 or later
    - TPU v5e, tpu-v5-lite-podslice:
        - Recommended JAX AI image: jax0.4.35-rev1 or later
        - Recommended jax[tpu] version: v0.4.9 or later
        - Recommended torch_xla[tpuvm] version: v2.1.0 or later
    - TPU v5p, tpu-v5p-slice:
        - Recommended JAX AI image: jax0.4.35-rev1 or later
        - Recommended jax[tpu] version: 0.4.19 or later
        - Recommended torch_xla[tpuvm] version: we suggest using a nightly version built on October 23, 2023
    - TPU v4, tpu-v4-podslice:
        - Recommended JAX AI image: jax0.4.35-rev1 or later
        - Recommended jax[tpu] version: v0.4.4 or later
        - Recommended torch_xla[tpuvm] version: v2.0.0 or later
    - TPU v3, tpu-v3-slice and tpu-v3-device:
        - Recommended JAX AI image: jax0.4.35-rev1 or later
        - Recommended jax[tpu] version: v0.4.4 or later
        - Recommended torch_xla[tpuvm] version: v2.0.0 or later
- Set the following environment variables for the container requesting the TPU resources:
    - TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique worker ID in the TPU slice. The supported values for this field range from zero to the number of Pods minus one.
    - TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses that need to communicate with each other within the slice. There should be a hostname or IP address for each TPU VM in the slice. The list of IP addresses or hostnames is ordered and zero-indexed by the TPU_WORKER_ID.
  GKE automatically injects these environment variables by using a mutating webhook when a Job is created with completionMode: Indexed, a subdomain, parallelism > 1, and a google.com/tpu resource request. GKE adds a headless Service so that the DNS records are added for the Pods backing the Service.
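To make the relationship between the two variables concrete, the following Python sketch reads them the way a worker process might. The helper name and the example hostnames are hypothetical, and the range check assumes one Pod per TPU VM, so that the worker ID indexes directly into the hostname list.

```python
import os


def read_tpu_worker_env(env=None):
    """Return (worker_id, hostnames) parsed from the TPU worker variables."""
    env = os.environ if env is None else env
    worker_id = int(env["TPU_WORKER_ID"])
    hostnames = [
        name.strip()
        for name in env["TPU_WORKER_HOSTNAMES"].split(",")
        if name.strip()
    ]
    # The list is ordered and zero-indexed by TPU_WORKER_ID.
    if not 0 <= worker_id < len(hostnames):
        raise ValueError(
            f"TPU_WORKER_ID {worker_id} is outside 0..{len(hostnames) - 1}"
        )
    return worker_id, hostnames


# Hypothetical values for a slice with two TPU VMs.
example_env = {
    "TPU_WORKER_ID": "1",
    "TPU_WORKER_HOSTNAMES": "tpu-job-0.headless-svc,tpu-job-1.headless-svc",
}
worker_id, hostnames = read_tpu_worker_env(example_env)
print(f"This worker is {hostnames[worker_id]}")
```

Calling read_tpu_worker_env() with no argument reads the real process environment inside the container.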
After you complete the workload preparation, you can run a Job that uses TPUs.
Request TPUs in a workload
This section shows you how to create a Job that requests TPUs in Autopilot. In any workload that needs TPUs, you must specify the following:
- Node selectors for the TPU version and topology
- The number of TPU chips for a container in your workload
For a list of supported TPU versions, topologies, and the corresponding number of TPU chips and nodes in a slice, see Choose the TPU version.
Considerations for TPU requests in workloads
Only one container in a Pod can use TPUs. The number of TPU chips that a container
requests must be equal to the number of TPU chips attached to a node in the slice.
For example, if you request TPU v5e (tpu-v5-lite-podslice) with a 2x4
topology, you can request any of the following:
- 4 chips, which creates two multi-host nodes with 4 TPU chips each
- 8 chips, which creates one single-host node with 8 TPU chips
As a best practice to maximize your cost efficiency, always consume all of the TPU in the slice that you request. If you request a multi-host slice of two nodes with 4 TPU chips each, you should be deploying a workload that runs on both nodes and consumes all 8 TPU chips in the slice.
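The relationship between topology, chips per node, and node count can be sketched in a few lines of Python. This is an illustration under one assumption: the total chip count equals the product of the topology dimensions, which matches the 2x4 example above. The function name is hypothetical, and the sketch does not check whether GKE supports a given combination.

```python
from math import prod


def slice_layout(topology: str, chips_per_node: int) -> dict:
    """Derive chip and node counts for a TPU slice from its topology string."""
    dims = [int(d) for d in topology.lower().split("x")]
    total_chips = prod(dims)
    if chips_per_node < 1 or total_chips % chips_per_node != 0:
        raise ValueError("chips_per_node must evenly divide the slice")
    nodes = total_chips // chips_per_node
    return {
        "total_chips": total_chips,
        "nodes": nodes,
        "multi_host": nodes > 1,
    }


print(slice_layout("2x4", 4))  # two nodes with 4 chips each
print(slice_layout("2x4", 8))  # one node with 8 chips
```

A workload that consumes the whole slice needs one Pod per node, each requesting chips_per_node chips.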
Create a workload that requests TPUs
The following steps create a Job that requests TPUs. If you have workloads that run on multi-host TPU slices, you must also create a headless Service that selects your workload by name. This headless Service lets Pods on different nodes in the multi-host slice communicate with each other by updating the Kubernetes DNS configuration to point at the Pods in the workload.
- Save the following manifest as tpu-autopilot.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-job
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-job
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          # Optional: Run in GKE Sandbox
          # runtimeClassName: gvisor
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: TPU_TYPE
            cloud.google.com/gke-tpu-topology: TOPOLOGY
          containers:
          - name: tpu-job
            image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
            ports:
            - containerPort: 8471 # Default port using which TPU VMs communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            command:
            - bash
            - -c
            - |
              python -c 'import jax; print("TPU cores:", jax.device_count())'
            resources:
              requests:
                cpu: 10
                memory: MEMORY_SIZE
                google.com/tpu: NUMBER_OF_CHIPS
              limits:
                cpu: 10
                memory: MEMORY_SIZE
                google.com/tpu: NUMBER_OF_CHIPS

  Replace the following:
    - TPU_TYPE: the TPU type to use, like tpu-v4-podslice. Must be a value supported by GKE.
    - TOPOLOGY: the arrangement of TPU chips in the slice, like 2x2x4. Must be a supported topology for the selected TPU type.
    - NUMBER_OF_CHIPS: the number of TPU chips for the container to use. Must be the same value for limits and requests.
    - MEMORY_SIZE: The maximum amount of memory that the TPU uses. Memory limits depend on the TPU version and topology that you use. To learn more, see Minimums and maximums for accelerators.

  Optionally, you can also change the following fields:
    - image: the JAX AI image to use. In the example manifest, this field is set to the latest JAX AI image. To set a different version, see the list of current JAX AI images.
    - runtimeClassName: gvisor: the setting that lets you run this Pod in GKE Sandbox. To use, uncomment this line. GKE Sandbox supports TPUs version v4 and later. To learn more, see GKE Sandbox.
- Deploy the Job:

    kubectl create -f tpu-autopilot.yaml

  When you create this Job, GKE automatically does the following:
    - Provisions nodes to run the Pods. Depending on the TPU type, topology, and resource requests that you specified, these nodes are either single-host slices or multi-host slices.
    - Adds taints to the nodes and tolerations to the Pods to prevent any of your other workloads from running on the same nodes as TPU workloads.
- When you finish this section, you can avoid continued billing by deleting the workload you created:

    kubectl delete -f tpu-autopilot.yaml
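Before you deploy a manifest like the one in these steps, it can help to check the request rules mechanically. The following Python sketch is a hypothetical pre-flight check, not a GKE tool: it inspects a Pod spec that is already loaded as a dictionary and reports violations of the rules on this page (one TPU container per Pod, equal requests and limits, and both TPU node selectors present when TPUs are requested directly).

```python
TPU_RESOURCE = "google.com/tpu"
TPU_SELECTORS = (
    "cloud.google.com/gke-tpu-accelerator",
    "cloud.google.com/gke-tpu-topology",
)


def check_tpu_requests(pod_spec: dict) -> list:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    tpu_containers = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        requested = resources.get("requests", {}).get(TPU_RESOURCE)
        limit = resources.get("limits", {}).get(TPU_RESOURCE)
        if requested is None and limit is None:
            continue  # This container does not use TPUs.
        tpu_containers.append(container["name"])
        if requested != limit:
            problems.append(
                f"{container['name']}: requests ({requested}) and "
                f"limits ({limit}) differ"
            )
    if len(tpu_containers) > 1:
        problems.append("more than one container requests TPUs")
    selector = pod_spec.get("nodeSelector", {})
    if tpu_containers:
        for key in TPU_SELECTORS:
            if key not in selector:
                problems.append(f"missing node selector {key}")
    return problems


# Example Pod spec with a deliberate mismatch between requests and limits.
spec = {
    "nodeSelector": {
        "cloud.google.com/gke-tpu-accelerator": "tpu-v5-lite-podslice",
        "cloud.google.com/gke-tpu-topology": "2x4",
    },
    "containers": [
        {
            "name": "tpu-job",
            "resources": {
                "requests": {TPU_RESOURCE: 4},
                "limits": {TPU_RESOURCE: 8},
            },
        }
    ],
}
print(check_tpu_requests(spec))
```

The selector check applies only to workloads that select TPUs directly; workloads that select a custom compute class would need a different check.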
Create a workload that requests TPUs and collection scheduling
In TPU Trillium, you can use collection scheduling to group TPU slice nodes. Grouping these TPU slice nodes makes it easier to adjust the number of replicas to meet the workload demand. Google Cloud controls software updates to ensure that sufficient slices within the collection are always available to serve traffic.
TPU Trillium supports collection scheduling for single-host and multi-host node pools that run inference workloads. The following describes how collection scheduling behavior depends on the type of TPU slice that you use:
- Multi-host TPU slice: GKE groups multi-host TPU slices to form a collection. Each GKE node pool is a replica within this collection. To define a collection, create a multi-host TPU slice and assign a unique name to the collection. To add more TPU slices to the collection, create another multi-host TPU slice node pool with the same collection name and workload type.
- Single-host TPU slice: GKE considers the entire single-host TPU slice node pool as a collection. To add more TPU slices to the collection, you can resize the single-host TPU slice node pool.
To learn about the limitations of collection scheduling, see How collection scheduling works.
Use a multi-host TPU slice
Collection scheduling in multi-host TPU slice nodes is available for Autopilot clusters in version 1.31.2-gke.1537000 and later. Multi-host TPU slice nodes with a 2x4 topology are only supported in 1.31.2-gke.1115000 or later. To create multi-host TPU slice nodes and group them as a collection, add the following Kubernetes labels to your workload specification:
- cloud.google.com/gke-nodepool-group-name: each collection should have a unique name at the cluster level. The value in the cloud.google.com/gke-nodepool-group-name label must adhere to the requirements for cluster labels.
- cloud.google.com/gke-workload-type: HIGH_AVAILABILITY
For example, the following code block defines a collection with a multi-host TPU slice:
  nodeSelector:
    cloud.google.com/gke-nodepool-group-name: ${COLLECTION_NAME}
    cloud.google.com/gke-workload-type: HIGH_AVAILABILITY
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
    cloud.google.com/gke-tpu-topology: 4x4
  ...
Use a single-host TPU slice
Collection scheduling in single-host TPU slice nodes is available for Autopilot clusters in version 1.31.2-gke.1088000 and later. To create single-host TPU slice nodes and group them as a collection, add the cloud.google.com/gke-workload-type: HIGH_AVAILABILITY label in your workload specification.
For example, the following code block defines a collection with a single-host TPU slice:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
    cloud.google.com/gke-tpu-topology: 2x2
    cloud.google.com/gke-workload-type: HIGH_AVAILABILITY
  ...
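The two selector variants differ only in whether a collection name is present. The following Python sketch builds either variant as a dictionary that could be placed under nodeSelector; the helper name and the example collection name are hypothetical.

```python
def collection_node_selector(accelerator, topology, collection_name=None):
    """Build a nodeSelector mapping for collection scheduling.

    Pass collection_name for multi-host slices; omit it for single-host
    slices, where the whole node pool is treated as the collection.
    """
    selector = {
        "cloud.google.com/gke-tpu-accelerator": accelerator,
        "cloud.google.com/gke-tpu-topology": topology,
        "cloud.google.com/gke-workload-type": "HIGH_AVAILABILITY",
    }
    if collection_name is not None:
        selector["cloud.google.com/gke-nodepool-group-name"] = collection_name
    return selector


multi_host = collection_node_selector("tpu-v6e-slice", "4x4", "my-collection")
single_host = collection_node_selector("tpu-v6e-slice", "2x2")
print(multi_host)
print(single_host)
```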
Use custom compute classes to deploy a collection
For more information about deploying a workload that requests TPUs and collection scheduling by using custom compute classes, see TPU multi-host collection and Define workload type for TPU SLO.
Centrally provision TPUs with custom compute classes
To provision TPUs with a custom compute class that follows the TPU rules and deploy the workload, complete the following steps:
- Save the following manifest as tpu-compute-class.yaml:

    apiVersion: cloud.google.com/v1
    kind: ComputeClass
    metadata:
      name: tpu-class
    spec:
      priorities:
      - tpu:
          type: tpu-v5-lite-podslice
          count: 4
          topology: 2x4
      - spot: true
        tpu:
          type: tpu-v5-lite-podslice
          count: 4
          topology: 2x4
      - flexStart:
          enabled: true
        tpu:
          type: tpu-v6e-slice
          count: 4
          topology: 2x4
      nodePoolAutoCreation:
        enabled: true

- Deploy the compute class:

    kubectl apply -f tpu-compute-class.yaml

  For more information about custom compute classes and TPUs, see TPU configuration.
- Save the following manifest as tpu-job.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-job
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-job
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/compute-class: tpu-class
          containers:
          - name: tpu-job
            image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
            ports:
            - containerPort: 8471 # Default port using which TPU VMs communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            command:
            - bash
            - -c
            - |
              python -c 'import jax; print("TPU cores:", jax.device_count())'
            resources:
              requests:
                cpu: 10
                memory: MEMORY_SIZE
                google.com/tpu: NUMBER_OF_CHIPS
              limits:
                cpu: 10
                memory: MEMORY_SIZE
                google.com/tpu: NUMBER_OF_CHIPS

  Replace the following:
    - NUMBER_OF_CHIPS: the number of TPU chips for the container to use. Must be the same value for limits and requests, equal to the value in the tpu.count field in the selected custom compute class.
    - MEMORY_SIZE: The maximum amount of memory that the TPU uses. Memory limits depend on the TPU version and topology that you use. To learn more, see Minimums and maximums for accelerators.
- Deploy the Job:

    kubectl create -f tpu-job.yaml

  When you create this Job, GKE automatically does the following:
    - Provisions nodes to run the Pods. Depending on the TPU type, topology, and resource requests that you specified, these nodes are either single-host slices or multi-host slices. Depending on the availability of TPU resources in the top priority, GKE might fall back to lower priorities to maximize obtainability.
    - Adds taints to the nodes and tolerations to the Pods to prevent any of your other workloads from running on the same nodes as TPU workloads.

  To learn more, see About custom compute classes.
- When you finish this section, you can avoid continued billing by deleting the resources you created:

    kubectl delete -f tpu-job.yaml
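The fallback behavior of the priorities list can be pictured as walking the rules in order and taking the first one with capacity. The following Python sketch is a deliberately simplified model for intuition only; it is not how GKE implements node scaling, and the availability function is a stand-in for real capacity checks.

```python
def pick_priority(priorities, is_available):
    """Return the first rule whose capacity is available, or None."""
    for rule in priorities:
        if is_available(rule):
            return rule
    return None


# The three rules from the tpu-class compute class, expressed as dictionaries.
priorities = [
    {"tpu": {"type": "tpu-v5-lite-podslice", "count": 4, "topology": "2x4"}},
    {"spot": True,
     "tpu": {"type": "tpu-v5-lite-podslice", "count": 4, "topology": "2x4"}},
    {"flexStart": {"enabled": True},
     "tpu": {"type": "tpu-v6e-slice", "count": 4, "topology": "2x4"}},
]


def on_demand_exhausted(rule):
    """Hypothetical scenario: only Spot or flex-start capacity is available."""
    return rule.get("spot", False) or "flexStart" in rule


chosen = pick_priority(priorities, on_demand_exhausted)
print(chosen)  # the Spot rule, because the on-demand rule is skipped
```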
Example: Display the total TPU chips in a multi-host slice
The following workload returns the number of TPU chips across all of the nodes in a multi-host TPU slice. To create a multi-host slice, the workload has the following parameters:
- TPU version: TPU v4
- Topology: 2x2x4
This version and topology selection result in a multi-host slice.
- Save the following manifest as available-chips-multihost.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-available-chips
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-available-chips
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice # Node selector to target TPU v4 slice nodes.
            cloud.google.com/gke-tpu-topology: 2x2x4 # Specifies the physical topology for the TPU slice.
          containers:
          - name: tpu-job
            image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
            ports:
            - containerPort: 8471 # Default port using which TPU VMs communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            command:
            - bash
            - -c
            - |
              python -c 'import jax; print("TPU cores:", jax.device_count())' # Python command to count available TPU chips.
            resources:
              requests:
                cpu: 10
                memory: 407Gi
                google.com/tpu: 4 # Request 4 TPU chips for this workload.
              limits:
                cpu: 10
                memory: 407Gi
                google.com/tpu: 4 # Limit to 4 TPU chips for this workload.

- Deploy the manifest:

    kubectl create -f available-chips-multihost.yaml

  GKE runs a TPU v4 slice with four VMs (multi-host TPU slice). The slice has 16 interconnected TPU chips.
- Verify that the Job created four Pods:

    kubectl get pods

  The output is similar to the following:

    NAME                       READY   STATUS      RESTARTS   AGE
    tpu-job-podslice-0-5cd8r   0/1     Completed   0          97s
    tpu-job-podslice-1-lqqxt   0/1     Completed   0          97s
    tpu-job-podslice-2-f6kwh   0/1     Completed   0          97s
    tpu-job-podslice-3-m8b5c   0/1     Completed   0          97s

- Get the logs of one of the Pods:

    kubectl logs POD_NAME

  Replace POD_NAME with the name of one of the created Pods. For example, tpu-job-podslice-0-5cd8r.

  The output is similar to the following:

    TPU cores: 16

- Optional: Remove the workload:

    kubectl delete -f available-chips-multihost.yaml
Example: Display the TPU chips in a single node
The following workload is a standalone Pod that displays the number of TPU chips that are attached to a specific node. To create a single-host node, the workload has the following parameters:
- TPU version: TPU v5e
- Topology: 2x4
This version and topology selection result in a single-host slice.
- Save the following manifest as available-chips-singlehost.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: tpu-job-jax-v5
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice # Node selector to target TPU v5e slice nodes.
        cloud.google.com/gke-tpu-topology: 2x4 # Specify the physical topology for the TPU slice.
      containers:
      - name: tpu-job
        image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
        ports:
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        command:
        - bash
        - -c
        - |
          python -c 'import jax; print("Total TPU chips:", jax.device_count())'
        resources:
          requests:
            google.com/tpu: 8 # Request 8 TPU chips for this container.
          limits:
            google.com/tpu: 8 # Limit to 8 TPU chips for this container.

- Deploy the manifest:

    kubectl create -f available-chips-singlehost.yaml

  GKE provisions a single-host TPU slice node that uses TPU v5e. The node has eight TPU chips (single-host TPU slice).
- Get the logs of the Pod:

    kubectl logs tpu-job-jax-v5

  The output is similar to the following:

    Total TPU chips: 8

- Optional: Remove the workload:

    kubectl delete -f available-chips-singlehost.yaml
Observe and monitor TPUs
Dashboard
Node pool observability in the Google Cloud console is generally available. To view the status of your TPU multi-host node pools on GKE, go to the GKE TPU Node Pool Status dashboard provided by Cloud Monitoring.
This dashboard gives you comprehensive insights into the health of your multi-host TPU node pools. For more information, see Monitor health metrics for TPU nodes and node pools.
In the Kubernetes Clusters page in the Google Cloud console, the Observability tab also displays TPU observability metrics, such as TPU usage, under the Accelerators > TPU heading. For more information, see View observability metrics.
The TPU dashboard is populated only if you have system metrics enabled in your GKE cluster.
Runtime metrics
In GKE version 1.27.4-gke.900 or later, TPU workloads that both use JAX version 0.4.14 or later and specify containerPort: 8431 export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU workload's runtime performance:
- Duty cycle: percentage of time over the past sampling period (60 seconds) during which the TensorCores were actively processing on a TPU chip. Larger percentage means better TPU utilization.
- Memory used: amount of accelerator memory allocated in bytes. Sampled every 60 seconds.
- Memory total: total accelerator memory in bytes. Sampled every 60 seconds.
These metrics are located in the Kubernetes node (k8s_node) and Kubernetes
container (k8s_container) schema.
Kubernetes container:
- kubernetes.io/container/accelerator/duty_cycle
- kubernetes.io/container/accelerator/memory_used
- kubernetes.io/container/accelerator/memory_total
Kubernetes node:
- kubernetes.io/node/accelerator/duty_cycle
- kubernetes.io/node/accelerator/memory_used
- kubernetes.io/node/accelerator/memory_total
Monitor health metrics for TPU nodes and node pools
When a training job has an error or terminates in failure, you can check metrics related to the underlying infrastructure to figure out if the interruption was caused by an issue with the underlying node or node pool.
Node status
In GKE version 1.32.1-gke.1357001 or later, the following GKE system metric exposes the condition of a GKE node:
- kubernetes.io/node/status_condition
The condition field reports conditions on the node, such as Ready, DiskPressure, and MemoryPressure. The status field shows the reported status of the condition,
which can be True, False, or Unknown. This is a metric with the k8s_node monitored resource type.
This PromQL query shows if a particular node is Ready:
kubernetes_io:node_status_condition{
    monitored_resource="k8s_node",
    cluster_name="CLUSTER_NAME",
    node_name="NODE_NAME",
    condition="Ready",
    status="True"}
To help troubleshoot issues in a cluster, you might want to look at nodes that have exhibited other conditions:
kubernetes_io:node_status_condition{
    monitored_resource="k8s_node",
    cluster_name="CLUSTER_NAME",
    condition!="Ready",
    status="True"}
You might want to specifically look at nodes that aren't Ready:
kubernetes_io:node_status_condition{
    monitored_resource="k8s_node",
    cluster_name="CLUSTER_NAME",
    condition="Ready",
    status="False"}
If there is no data, then the nodes are ready. The status condition is sampled every 60 seconds.
You can use the following query to understand the node status across the fleet:
avg by (condition,status)(
  avg_over_time(
    kubernetes_io:node_status_condition{monitored_resource="k8s_node"}[${__interval}]))
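The node-condition selectors above differ only in their label matchers, so they can be generated from parameters. The following Python sketch is a hypothetical convenience helper that only builds query strings; it does not call Cloud Monitoring and does not escape special characters in label values.

```python
def node_condition_query(cluster_name, condition="Ready", status="True",
                         node_name=None, negate_condition=False):
    """Build a PromQL selector for kubernetes_io:node_status_condition."""
    matchers = [
        'monitored_resource="k8s_node"',
        f'cluster_name="{cluster_name}"',
    ]
    if node_name is not None:
        matchers.append(f'node_name="{node_name}"')
    operator = "!=" if negate_condition else "="
    matchers.append(f'condition{operator}"{condition}"')
    matchers.append(f'status="{status}"')
    return "kubernetes_io:node_status_condition{" + ",".join(matchers) + "}"


# Nodes that report any condition other than Ready as True.
print(node_condition_query("CLUSTER_NAME", negate_condition=True))
# Nodes that are not Ready.
print(node_condition_query("CLUSTER_NAME", status="False"))
```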
Node pool status
The following GKE system metric for the k8s_node_pool monitored resource
exposes the status of a GKE node pool:
- kubernetes.io/node_pool/status
This metric is reported only for multi-host TPU node pools.
The status field reports the status of the node pool, such as Provisioning, Running, Error,
Reconciling, or Stopping. Status updates happen after GKE API operations complete.
To verify if a particular node pool has Running status, use the following PromQL query:
kubernetes_io:node_pool_status{
    monitored_resource="k8s_node_pool",
    cluster_name="CLUSTER_NAME",
    node_pool_name="NODE_POOL_NAME",
    status="Running"}
To monitor the number of node pools in your project grouped by their status, use the following PromQL query:
count by (status)(
  count_over_time(
    kubernetes_io:node_pool_status{monitored_resource="k8s_node_pool"}[${__interval}]))
Node pool availability
The following GKE system metric shows whether a multi-host TPU node pool is available:
- kubernetes.io/node_pool/multi_host/available
The metric has a value of True if all of the nodes in the node pool are available,
and False otherwise. The metric is sampled every 60 seconds.
To check the availability of multi-host TPU node pools in your project, use the following PromQL query:
avg by (node_pool_name)(
  avg_over_time(
    kubernetes_io:node_pool_multi_host_available{
      monitored_resource="k8s_node_pool",
      cluster_name="CLUSTER_NAME"}[${__interval}]))
Node interruption count
The following GKE system metric reports the count of interruptions for a GKE node since the last sample (the metric is sampled every 60 seconds):
- kubernetes.io/node/interruption_count
The interruption_type (such as TerminationEvent, MaintenanceEvent, or PreemptionEvent) and interruption_reason
(like HostError, Eviction, or AutoRepair) fields can help provide the reason for why
a node was interrupted.
To get a breakdown of the interruptions and their causes in TPU nodes in the clusters in your project, use the following PromQL query:
  sum by (interruption_type,interruption_reason)(
    sum_over_time(
      kubernetes_io:node_interruption_count{monitored_resource="k8s_node"}[${__interval}]))
To see only the host maintenance events, update the query to filter for the HW/SW Maintenance value in the interruption_reason field. Use the following PromQL query:
  sum by (interruption_type,interruption_reason)(
    sum_over_time(
      kubernetes_io:node_interruption_count{monitored_resource="k8s_node", interruption_reason="HW/SW Maintenance"}[${__interval}]))
To see the interruption count aggregated by node pool, use the following PromQL query:
  sum by (node_pool_name,interruption_type,interruption_reason)(
    sum_over_time(
      kubernetes_io:node_pool_interruption_count{monitored_resource="k8s_node_pool", interruption_reason="HW/SW Maintenance", node_pool_name="NODE_POOL_NAME"}[${__interval}]))
Node pool times to recover (TTR)
The following GKE system metric reports the distribution of recovery period durations for GKE multi-host TPU node pools:
- kubernetes.io/node_pool/accelerator/times_to_recover
Each sample recorded in this metric indicates a single recovery event for the node pool from a downtime period.
This metric is useful for tracking the multi-host TPU node pool time to recover and time between interruptions.
You can use the following PromQL query to calculate the mean time to recovery (MTTR) for the last 7 days in your cluster:
sum(sum_over_time(
  kubernetes_io:node_pool_accelerator_times_to_recover_sum{
    monitored_resource="k8s_node_pool", cluster_name="CLUSTER_NAME"}[7d]))
/
sum(sum_over_time(
  kubernetes_io:node_pool_accelerator_times_to_recover_count{
    monitored_resource="k8s_node_pool",cluster_name="CLUSTER_NAME"}[7d]))
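The MTTR query divides the sum of all recorded recovery durations by the number of recovery events. The same arithmetic in Python, using made-up durations and assuming they are expressed in seconds, looks like this:

```python
def mean_time_to_recover(recovery_durations):
    """Return the mean of the recovery durations, or None if there are none."""
    if not recovery_durations:
        return None  # No recovery events in the window.
    return sum(recovery_durations) / len(recovery_durations)


# Hypothetical recovery events observed over 7 days, in seconds.
durations = [300.0, 420.0, 180.0]
print(mean_time_to_recover(durations))  # prints 300.0
```

Returning None for an empty window mirrors the query, which yields no data when the count is zero.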
Node pool times between interruptions (TBI)
Node pool times between interruptions measures how long your infrastructure runs before experiencing an interruption. It is computed as the average over a window of time, where the numerator measures the total time that your infrastructure was up and the denominator measures the total interruptions to your infrastructure.
The following PromQL example shows the 7-day mean time between interruptions (MTBI) for the given cluster:
sum(count_over_time(
  kubernetes_io:node_memory_total_bytes{
    monitored_resource="k8s_node", node_name=~"gke-tpu.*|gk3-tpu.*", cluster_name="CLUSTER_NAME"}[7d]))
/
sum(sum_over_time(
  kubernetes_io:node_interruption_count{
    monitored_resource="k8s_node", node_name=~"gke-tpu.*|gk3-tpu.*", cluster_name="CLUSTER_NAME"}[7d]))
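The MTBI query counts metric samples as a proxy for uptime and divides by the number of interruptions, so its result is in samples per interruption. The following Python sketch converts that ratio to seconds. It assumes a 60-second sampling period for the memory metric, which this page states for other node metrics but not for this one, and the input numbers are made up.

```python
def mean_time_between_interruptions(uptime_samples, interruptions,
                                    sample_period_s=60):
    """Return the mean uptime per interruption in seconds, or None."""
    if interruptions == 0:
        return None  # No interruptions in the window, so MTBI is undefined.
    return uptime_samples * sample_period_s / interruptions


# Hypothetical: one node sampled every minute for 7 days, interrupted 3 times.
samples = 7 * 24 * 60
mtbi_seconds = mean_time_between_interruptions(samples, 3)
print(mtbi_seconds / 3600)  # prints 56.0 (hours)
```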
Host metrics
In GKE version 1.28.1-gke.1066000 or later, VMs in a TPU slice export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU host's performance:
- TensorCore utilization: current percentage of the TensorCore that is utilized. The TensorCore value equals the sum of the matrix-multiply units (MXUs) plus the vector unit. The TensorCore utilization value is the division of the TensorCore operations that were performed over the past sample period (60 seconds) by the supported number of TensorCore operations over the same period. Larger value means better utilization.
- Memory bandwidth utilization: current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period (60s) by the maximum supported bandwidth over the same sample period.
These metrics are located in the Kubernetes node (k8s_node) and Kubernetes
container (k8s_container) schema.
Kubernetes container:
- kubernetes.io/container/accelerator/tensorcore_utilization
- kubernetes.io/container/accelerator/memory_bandwidth_utilization
Kubernetes node:
- kubernetes.io/node/accelerator/tensorcore_utilization
- kubernetes.io/node/accelerator/memory_bandwidth_utilization
For more information, see Kubernetes metrics and GKE system metrics.
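Both host metrics described in this section are ratios of observed work to the supported maximum over the same sample period. As a minimal illustration with made-up numbers (the function name is hypothetical):

```python
def utilization_percent(observed, supported):
    """Return observed work as a percentage of the supported maximum."""
    if supported <= 0:
        raise ValueError("supported must be positive")
    return 100.0 * observed / supported


# Hypothetical counts for one 60-second sample period.
print(utilization_percent(30, 60))  # prints 50.0
```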
Logging
Logs emitted by containers running on GKE nodes, including TPU VMs, are collected by the GKE logging agent and sent to Cloud Logging, where you can view them.
Recommendations for TPU workloads in Autopilot
The following recommendations might improve the efficiency of your TPU workloads:
- Use extended run time Pods for a grace period of up to seven days before GKE terminates your Pods for scale-downs or node upgrades. You can use maintenance windows and exclusions with extended run time Pods to further delay automatic node upgrades.
- Use capacity reservations to ensure that your workloads receive requested TPUs without being placed in a queue for availability.
To learn how to set up Cloud TPU in GKE, see the following Google Cloud resources:
- Plan TPUs in GKE to start your TPU setup
- Deploy TPU workloads in GKE Autopilot
- Deploy TPU workloads in GKE Standard
- Learn about best practices for using Cloud TPU for your machine learning tasks.
- Video: Build large-scale machine learning on Cloud TPU with GKE.
- Serve Large Language Models with KubeRay on TPUs.
- Learn about sandboxing GPU workloads with GKE Sandbox