This page describes how to plan your usage of Tensor Processing Units (TPUs) in Google Kubernetes Engine (GKE) to reduce the risk of TPU misconfiguration, non-availability errors, or out-of-quota interruptions.
Before you use TPUs in GKE, ensure that you are familiar with TPU definitions and terminology in GKE.
Plan your TPU configuration
To work with TPUs in GKE clusters, you must plan their configuration. We recommend that you follow these steps:
- Choose a GKE mode of operation: Run your workloads on TPUs in a GKE Autopilot or Standard cluster.
  - Best practice: Use an Autopilot cluster for a fully managed Kubernetes experience.
- Choose the TPU version: Different TPU types have different capabilities, like price-performance ratios, training throughput, and serving latency. The TPU types affect the available CPU and memory capacities. 
- Validate TPU availability: TPUs are available in specific Google Cloud regions. To use a TPU type in your GKE workload, your cluster must be in a supported region for that type. 
- Choose the TPU topology: The topology is the physical arrangement of the TPUs within a TPU slice. Select a topology that matches your model's parallelism requirements.
Use the reference tables on this page to identify if your node pools are single-host or multi-host TPU slice nodes.
Choose a GKE mode of operation
You can use TPUs in the available GKE modes of operation for clusters:
- Autopilot mode (recommended): GKE manages the underlying infrastructure such as node configuration, autoscaling, auto-upgrades, baseline security configurations, and baseline networking configuration. In Autopilot, you choose a TPU type and topology, then specify them in your Kubernetes manifest. GKE manages provisioning nodes with TPUs and scheduling your workloads.
- Standard mode: You manage the underlying infrastructure, including configuring the individual nodes.
To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Choose a TPU consumption option
When you plan your TPU configuration in GKE, select a consumption option that aligns with your workload needs. Your choice of consumption option impacts the available TPU versions and the quota you need to configure. GKE offers the following TPU consumption options to help you optimize resource allocation and cost while maintaining workload performance:
- Flex-start: provisions Flex-start VMs for up to seven days. GKE automatically allocates the hardware on a best-effort basis, depending on availability. For more information, see About GPU and TPU provisioning with flex-start provisioning mode.
- Spot VMs: provisions Spot VMs at a significant discount. Spot VMs can be preempted at any time, with a 30-second warning. For more information, see Spot VMs.
- Future reservation for up to 90 days (in calendar mode): provisions TPU resources for a specified time period of up to 90 days. For more information, see Request TPUs with future reservation in calendar mode.
- TPU reservations: requests a future reservation for one year or longer.
To choose the consumption option that meets your workload requirements, see About accelerator consumption options for AI/ML workloads in GKE.
Choose the TPU version
The VMs in a TPU slice have the following technical characteristics.
Autopilot
| TPU version | Machine type | Number of vCPUs | Memory (GiB) | Number of NUMA nodes | Maximum TPU chips in a TPU slice node | 
|---|---|---|---|---|---|
| TPU Trillium (v6e) | tpu-v6e-slice | 44 to 180 | 176 to 1440 | 1 to 2 | 256 | 
| TPU v5p | tpu-v5p-slice | 208 | 448 | 2 | 6,144 | 
| TPU v5e | tpu-v5-lite-podslice | 24 to 224 | 48 to 384 | 1 | 256 | 
| TPU v4 | tpu-v4-podslice | 240 | 407 | 2 | 4,096 | 
| TPU v3 (single-host only) | tpu-v3-device | 96 | 340 | 2 | 8 | 
| TPU v3 | tpu-v3-slice | 48 | 340 | 1 | 256 | 
Standard
| TPU version | Machine type | Number of vCPUs | Memory (GiB) | Number of NUMA nodes | Likelihood of being preempted | 
|---|---|---|---|---|---|
| TPU Trillium (v6e) | ct6e-standard-1t | 44 | 448 | 2 | Higher | 
| TPU Trillium (v6e) | ct6e-standard-4t | 180 | 720 | 1 | Medium | 
| TPU Trillium (v6e) | ct6e-standard-8t | 180 | 1440 | 2 | Lower | 
| TPU v5p | ct5p-hightpu-4t | 208 | 448 | 2 | |
| TPU v5e | ct5lp-hightpu-1t | 24 | 48 | 1 | Higher | 
| TPU v5e | ct5lp-hightpu-4t | 112 | 192 | 1 | Medium | 
| TPU v5e | ct5lp-hightpu-8t | 224 | 384 | 1 | Lower | 
| TPU v4 | ct4p-hightpu-4t | 240 | 407 | 2 | |
| TPU v3 (single-host only) | ct3-hightpu-4t | 96 | 340 | 2 | |
| TPU v3 | ct3p-hightpu-4t | 48 | 340 | 1 | |
Multi-host ct5lp- machine types are more suitable for serving large models or training. Multi-host ct5lp- machines are interconnected with high-speed links.
Review the TPU specifications and pricing in the Cloud TPU pricing documentation to decide which TPU configuration to use.
Limitations
Consider these limitations when choosing the TPU to use:
- TPU Trillium is available in the following versions:
  - Standard clusters in version 1.31.1-gke.1846000 and later.
  - Autopilot clusters in version 1.31.2-gke.1115000 and later.
- TPU Trillium doesn't support setting SMT to 2 on ct6e-standard-8t.
- TPU v5p autoscaling is supported on GKE clusters with control planes running at least version 1.29.2-gke.1035000 or 1.28.7-gke.1020000.
- For capacity reservations, use a specific reservation.
- You can run a maximum of 256 Pods in a single TPU VM.
- GKE cost allocation and usage metering don't include any data about the usage or costs of TPUs.
- The cluster autoscaler cancels TPU node pool scale-up operations that remain in waiting status for more than 10 hours. The cluster autoscaler retries such scale-up operations when resources are available. This behavior might reduce TPU obtainability if you don't use reservations.
- Ubuntu nodes are not supported.
- TPU Node architecture is deprecated. TPU v3 is the only TPU version that still supports the TPU Node architecture in GKE.
Validate TPU availability in GKE
TPUs are available in specific Google Cloud regions. To use a TPU type in your GKE cluster, your cluster must be in a supported region for that type.
Autopilot
| TPU version | cloud.google.com/gke-tpu-accelerator | Minimum GKE version | Availability | Zone | 
|---|---|---|---|---|
| TPU Trillium (v6e) | tpu-v6e-slice | 1.31.2-gke.1384000 | GA | |
| TPU v5e | tpu-v5-lite-podslice | 1.27.2-gke.2100 | GA | |
| TPU v5p | tpu-v5p-slice | 1.28.3-gke.1024000 | GA | |
| TPU v4 | tpu-v4-podslice | 1.26.1-gke.1500 | GA | |
| TPU v3 | tpu-v3-slice | 1.31.1-gke.1146000 | GA | |
| TPU v3 | tpu-v3-device | 1.31.0-gke.1500 | GA | |
Standard
| TPU version | Machine type beginning with | Minimum GKE version | Availability | Zone | 
|---|---|---|---|---|
| TPU Trillium (v6e) | ct6e- | 1.31.2-gke.1115000 | GA | |
| TPU v5e | ct5lp- | 1.27.2-gke.2100 | GA | |
| TPU v5p | ct5p- | 1.28.3-gke.1024000 | GA | |
| TPU v4 | ct4p- | 1.26.1-gke.1500 | GA | |
| TPU v3 | ct3p- | 1.31.1-gke.1146000 | GA | |
| TPU v3 | ct3- | 1.31.0-gke.1500 | GA | |
Choose a topology
After you decide on a TPU version, select a topology that's supported by that TPU type. Depending on the TPU type, the topology is two- or three-dimensional. Your model's parallelism requirements help you to decide on a topology. You can identify the number of TPU chips in the slice by calculating the product of each size in the topology. For example:
- 2x2x2 is an 8-chip multi-host TPU v4 slice
- 2x2 is a 4-chip single-host TPU v5e slice
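As a minimal sketch of this calculation, the following Python function multiplies the sizes in a topology string. The function name `chips_in_topology` is hypothetical and used only for illustration.

```python
from math import prod


def chips_in_topology(topology: str) -> int:
    """Return the number of TPU chips implied by a topology such as "2x2x2"."""
    sizes = [int(size) for size in topology.lower().split("x")]
    # Topologies in this guide are two- or three-dimensional.
    if len(sizes) not in (2, 3) or any(size < 1 for size in sizes):
        raise ValueError(f"expected a 2D or 3D topology, got {topology!r}")
    return prod(sizes)


print(chips_in_topology("2x2x2"))  # 8
print(chips_in_topology("2x2"))    # 4
```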
If a specific topology supports both single-host and multi-host TPU slice nodes, the number of TPU chips that your workload requests determines the host type.
For example, TPU v5e (tpu-v5-lite-podslice) supports the 2x4 topology as both single- and multi-host. If you:
- Request 4 chips in your workload, you get a multi-host node that has 4 TPU chips.
- Request 8 chips in your workload, you get a single-host node that has 8 TPU chips.
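The following sketch shows how such a request might look, written as a Python dictionary that mirrors a Pod manifest. It assumes that the topology is selected with the `cloud.google.com/gke-tpu-topology` node selector and that chips are requested through the `google.com/tpu` resource; neither detail is stated in this section, so confirm both against the deployment guide. The helper name, Pod name, and image are hypothetical.

```python
import json


def tpu_pod_manifest(name, accelerator, topology, chips, image):
    """Build a Pod manifest that selects a TPU type and topology."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {
                "cloud.google.com/gke-tpu-accelerator": accelerator,
                # Assumed label for the topology; verify before use.
                "cloud.google.com/gke-tpu-topology": topology,
            },
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {
                    # Assumed resource name for TPU chips.
                    "requests": {"google.com/tpu": chips},
                    "limits": {"google.com/tpu": chips},
                },
            }],
        },
    }


# Requesting 8 chips with the 2x4 topology corresponds to the single-host case.
manifest = tpu_pod_manifest(
    "tpu-demo", "tpu-v5-lite-podslice", "2x4", 8,
    "example.com/hypothetical/trainer:latest",
)
print(json.dumps(manifest, indent=2))
```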
Use the following guidance to choose the TPU machine type and topology for your use case:
- For small-scale model training or inference, use TPU v4 or TPU v5e with single-host TPU slice node pools.
- For large-scale model training or inference, use TPU v4 or TPU v5e with multi-host TPU slice node pools.
- For large-scale training or inference, use Pathways. Pathways simplifies large-scale machine learning computations by enabling a single JAX client to orchestrate workloads across multiple large TPU slices. For more information, see Pathways.
Autopilot
After you choose a TPU type and topology, specify these in your workload manifest. For instructions, see Deploy TPU workloads on GKE Autopilot.
| TPU version | Machine type | Node pool type | Technical specifications | 
|---|---|---|---|
| TPU Trillium (v6e) | tpu-v6e-slice | Single-host | |
| TPU Trillium (v6e) | tpu-v6e-slice | Single-host | |
| TPU Trillium (v6e) | tpu-v6e-slice | Single-host | |
| TPU Trillium (v6e) | tpu-v6e-slice | Multi-host | |
| TPU Trillium (v6e) | tpu-v6e-slice | Multi-host | |
| TPU Trillium (v6e) | tpu-v6e-slice | Multi-host | |
| TPU Trillium (v6e) | tpu-v6e-slice | Multi-host | |
| TPU Trillium (v6e) | tpu-v6e-slice | Multi-host | |
| TPU v5p | tpu-v5p-slice | Single-host | |
| TPU v5p | tpu-v5p-slice | Multi-host | |
| TPU v5p | tpu-v5p-slice | Multi-host | |
| TPU v5p | tpu-v5p-slice | Multi-host | |
| TPU v5p | tpu-v5p-slice | Multi-host | |
| TPU v5p | tpu-v5p-slice | Multi-host | |
| TPU v5e | tpu-v5-lite-podslice | Single-host | |
| TPU v5e | tpu-v5-lite-podslice | Single-host | |
| TPU v5e | tpu-v5-lite-podslice | Single-host | |
| TPU v5e | tpu-v5-lite-podslice | Multi-host | |
| TPU v5e | tpu-v5-lite-podslice | Multi-host | |
| TPU v5e | tpu-v5-lite-podslice | Multi-host | |
| TPU v5e | tpu-v5-lite-podslice | Multi-host | |
| TPU v5e | tpu-v5-lite-podslice | Multi-host | |
| TPU v5e | tpu-v5-lite-podslice | Multi-host | |
| TPU v5e (single-host only) | tpu-v5-lite-device | Single-host | |
| TPU v5e (single-host only) | tpu-v5-lite-device | Single-host | |
| TPU v5e (single-host only) | tpu-v5-lite-device | Single-host | |
| TPU v4 | tpu-v4-podslice | Single-host | |
| TPU v4 | tpu-v4-podslice | Multi-host | |
| TPU v4 | tpu-v4-podslice | Multi-host | |
| TPU v4 | tpu-v4-podslice | Multi-host | |
| TPU v4 | tpu-v4-podslice | Multi-host | |
| TPU v4 | tpu-v4-podslice | Multi-host | |
| TPU v3 | tpu-v3-slice | Multi-host | |
| TPU v3 | tpu-v3-slice | Multi-host | |
| TPU v3 | tpu-v3-slice | Multi-host | |
| TPU v3 | tpu-v3-slice | Multi-host | |
| TPU v3 | tpu-v3-slice | Multi-host | |
| TPU v3 | tpu-v3-device | Single-host | |
- Calculated by the topology product divided by four.
- Custom topologies for more than 64 chips are supported. The following conditions apply:
  - For more than 64 chips, {A}, {B}, and {C} must be multiples of 4.
  - The largest topology is 16x16x24.
  - The values must be {A}≤{B}≤{C}, like 8x12x16.
- Custom topologies aren't supported.
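As a rough illustration, the conditions listed above for custom topologies of more than 64 chips can be checked with a small Python helper. The function name is hypothetical, and the sketch assumes that "the largest topology is 16x16x24" acts as a per-dimension cap, which this page doesn't state explicitly.

```python
def custom_topology_problems(topology: str) -> list:
    """Return the stated conditions that an AxBxC topology violates."""
    a, b, c = (int(size) for size in topology.lower().split("x"))
    problems = []
    if not a <= b <= c:
        problems.append("sizes must satisfy A <= B <= C")
    if a * b * c > 64 and any(size % 4 for size in (a, b, c)):
        problems.append("above 64 chips, every size must be a multiple of 4")
    # Assumption: the largest topology is treated as a cap on each dimension.
    if a > 16 or b > 16 or c > 24:
        problems.append("exceeds the largest topology, 16x16x24")
    return problems


print(custom_topology_problems("8x12x16"))  # []
print(custom_topology_problems("6x8x8"))
```

An empty list means that none of the listed conditions is violated. It doesn't guarantee that capacity for that topology is available.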
Standard
After you choose a TPU type and topology, specify these in your workload manifest. For instructions, see Deploy TPU workloads on GKE Standard.
| TPU version | Machine type | Node pool type | Technical specifications | 
|---|---|---|---|
| TPU Trillium (v6e) | ct6e-standard-1t | Single-host | |
| TPU Trillium (v6e) | ct6e-standard-8t | Single-host | |
| TPU Trillium (v6e) | ct6e-standard-4t | Single-host | |
| TPU Trillium (v6e) | ct6e-standard-4t | Multi-host | |
| TPU Trillium (v6e) | ct6e-standard-4t | Multi-host | |
| TPU Trillium (v6e) | ct6e-standard-4t | Multi-host | |
| TPU Trillium (v6e) | ct6e-standard-4t | Multi-host | |
| TPU Trillium (v6e) | ct6e-standard-4t | Multi-host | |
| TPU Trillium (v6e) | ct6e-standard-4t | Multi-host | |
| TPU v5p | ct5p-hightpu-4t | Single-host | |
| TPU v5p | ct5p-hightpu-4t | Multi-host | |
| TPU v5p | ct5p-hightpu-4t | Multi-host | |
| TPU v5p | ct5p-hightpu-4t | Multi-host | |
| TPU v5p | ct5p-hightpu-4t | Multi-host | |
| TPU v5e | ct5lp-hightpu-1t | Single-host | |
| TPU v5e | ct5lp-hightpu-4t | Single-host | |
| TPU v5e | ct5lp-hightpu-8t | Single-host | |
| TPU v5e | ct5lp-hightpu-4t | Multi-host | |
| TPU v5e | ct5lp-hightpu-4t | Multi-host | |
| TPU v5e | ct5lp-hightpu-4t | Multi-host | |
| TPU v5e | ct5lp-hightpu-4t | Multi-host | |
| TPU v5e | ct5lp-hightpu-4t | Multi-host | |
| TPU v5e | ct5p-hightpu-4t | Multi-host | |
| TPU v5e | ct5p-hightpu-4t | Single-host | |
| TPU v4 | ct4p-hightpu-4t | Multi-host | |
| TPU v4 | ct4p-hightpu-4t | Multi-host | |
| TPU v4 | ct4p-hightpu-4t | Multi-host | |
| TPU v4 | ct4p-hightpu-4t | Multi-host | |
| TPU v3 | ct3-hightpu-4t | Single-host | |
| TPU v3 | ct3p-hightpu-4t | Multi-host | |
| TPU v3 | ct3p-hightpu-4t | Multi-host | |
| TPU v3 | ct3p-hightpu-4t | Multi-host | |
| TPU v3 | ct3p-hightpu-4t | Multi-host | |
| TPU v3 | ct3p-hightpu-4t | Multi-host | |
| TPU v3 | ct3p-hightpu-4t | Multi-host | |
| TPU v3 | ct3p-hightpu-4t | Multi-host | |
- Calculated by the topology product divided by four.
Advanced configurations
The following sections describe scheduling best practices for advanced TPU configurations.
Autoscaling TPUs in GKE
GKE supports Tensor Processing Units (TPUs) to accelerate machine learning workloads. Both single-host and multi-host TPU slice node pools support autoscaling and auto-provisioning.
With the --enable-autoprovisioning flag on a GKE cluster, GKE creates or deletes single-host or multi-host TPU slice node pools with a TPU version and topology that meet the requirements of pending workloads.
When you use --enable-autoscaling, GKE scales the node pool based on its type, as follows:
- Single-host TPU slice node pool: GKE adds or removes TPU nodes in the existing node pool. The node pool can contain any number of TPU nodes between zero and the maximum size of the node pool, as determined by the --max-nodes and --total-max-nodes flags. When the node pool scales, all the TPU nodes in the node pool have the same machine type and topology. To learn more about how to create a single-host TPU slice node pool, see Create a node pool.
- Multi-host TPU slice node pool: GKE atomically scales up the node pool from zero to the number of nodes required to satisfy the TPU topology. For example, a TPU node pool with a machine type of ct5lp-hightpu-4t and a topology of 16x16 contains 64 nodes. The GKE autoscaler ensures that this node pool has exactly 0 or 64 nodes. When scaling back down, GKE evicts all scheduled Pods and drains the entire node pool to zero. To learn more about how to create a multi-host TPU slice node pool, see Create a node pool.
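The node count in the multi-host example follows from dividing the chips in the topology by the chips on each VM. The following sketch reproduces that arithmetic. It assumes that a ct5lp-hightpu-4t VM hosts four chips, as the `4t` suffix suggests, and the function names are hypothetical.

```python
from math import prod


def nodes_in_multi_host_pool(topology: str, chips_per_vm: int) -> int:
    """Return how many nodes a multi-host slice with this topology needs."""
    chips = prod(int(size) for size in topology.lower().split("x"))
    if chips % chips_per_vm:
        raise ValueError("topology does not divide evenly across VMs")
    return chips // chips_per_vm


def allowed_node_counts(topology: str, chips_per_vm: int) -> tuple:
    """A multi-host pool scales atomically: either empty or full size."""
    return (0, nodes_in_multi_host_pool(topology, chips_per_vm))


print(nodes_in_multi_host_pool("16x16", 4))  # 64
print(allowed_node_counts("16x16", 4))       # (0, 64)
```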
Provision additional storage to a TPU slice
A VM in a TPU slice includes a 100 GiB boot disk. If your TPU slice needs additional storage for training or preprocessing, or if you need to save checkpoints, you can use Google Cloud Hyperdisk or Balanced Persistent Disk storage if it's available for your TPU. For more information about supported disk types for each TPU version, see TPU support for Hyperdisk and Persistent Disk.
CPU for Standard clusters
This section doesn't apply to Autopilot clusters because GKE places each TPU slice on its own node. To learn more, see How TPUs work in Autopilot mode.
For Standard clusters, consider the following scheduling best practices.
To schedule a non-TPU workload on a VM in a TPU slice node, ensure that your GKE Pod can tolerate the google.com/tpu taint. If you want the workload to be deployed to specific nodes, use node selectors.
Kubernetes resource management and priority treat VMs in TPUs the same as other VM types. To give scheduling priority to Pods that require TPUs over other Pods on the same nodes, request the maximum CPU or memory for those TPU slices. Low-priority TPU slices should do the following:
- Set low CPU and memory requests to ensure that the node has enough allocatable resources for the TPU workloads. To learn more, see How Kubernetes applies resource requests and limits.
- Set no CPU limit (unlimited) to ensure that Pods can burst to use all unused cycles.
- Set appropriate memory limits to ensure Pods can function correctly without risking node-pressure eviction.
If a Kubernetes Pod doesn't request CPU and memory (even if it requests TPUs), then Kubernetes considers it a best-effort Pod, and there is no guarantee that it receives any CPU or memory. Only Pods that explicitly request CPU and memory have such guarantees. For specific Kubernetes scheduling, configure the Pod with explicit CPU and memory requests. For more information, see Resource Management for Pods and Containers.
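The following sketch illustrates how the guidance for a low-priority, non-TPU Pod on a TPU slice node could translate into a manifest, expressed as a Python dictionary. The helper name, Pod name, image, and resource values are hypothetical placeholders. The toleration uses the `Exists` operator without an effect, which tolerates the google.com/tpu taint regardless of its effect.

```python
import json


def low_priority_pod(name, image, cpu_request="100m",
                     memory_request="128Mi", memory_limit="256Mi"):
    """Build a Pod manifest for a non-TPU workload on a TPU slice node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Tolerate the TPU taint so that the Pod can land on TPU nodes.
            "tolerations": [{"key": "google.com/tpu", "operator": "Exists"}],
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {
                    # Low explicit requests leave room for TPU workloads.
                    "requests": {"cpu": cpu_request, "memory": memory_request},
                    # Memory limit only; no CPU limit, so the Pod can burst.
                    "limits": {"memory": memory_limit},
                },
            }],
        },
    }


pod = low_priority_pod("log-shipper", "example.com/hypothetical/agent:latest")
print(json.dumps(pod, indent=2))
```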
To learn more best practices, see Kubernetes best practices: Resource requests and limits.
Reduce workload interruption
If you are using TPUs to train a machine learning model and your workload is interrupted, all work performed since the last checkpoint is lost. To decrease the probability that your workload is interrupted, do the following:
- Set a higher priority for this Job than for all other Jobs: If resources are scarce, the GKE scheduler preempts lower priority Jobs to schedule a higher priority Job. This also ensures that your higher priority workload receives all the resources that it needs (up to the total resources available in the cluster). To learn more, see Pod priority and preemption.
- Configure maintenance exclusion: A maintenance exclusion is a non-repeating window of time during which automatic maintenance is forbidden. To learn more, see Maintenance exclusions.
- Use extended run time Pods in Autopilot: Use extended run time Pods for a grace period of up to seven days before GKE terminates your Pods for scale-downs or node upgrades.
- Use collection scheduling in TPU Trillium: Use collections to indicate that a TPU slice node pool is part of a serving workload. Google Cloud limits and streamlines interruptions to the operations of inference workloads. To learn more, see How collection scheduling works.
These recommendations help to minimize interruptions, but not to prevent them. For example, a preemption due to a hardware failure or preemption for defragmentation can still occur. Similarly, setting a GKE maintenance exclusion doesn't prevent Compute Engine maintenance events.
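As an illustration of the first recommendation, the following sketch builds a PriorityClass and a Job that references it, as Python dictionaries that mirror Kubernetes manifests. The names, priority value, and image are hypothetical. Choose a value that is higher than the values that your other Jobs use.

```python
import json


def priority_manifests(class_name, value, job_name, image):
    """Return a PriorityClass and a Job that uses it."""
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": class_name},
        "value": value,
        "globalDefault": False,
        "description": "Priority for TPU training Jobs (illustrative).",
    }
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "template": {
                "spec": {
                    # Pods of this Job are scheduled with the class above.
                    "priorityClassName": class_name,
                    "restartPolicy": "Never",
                    "containers": [{"name": "main", "image": image}],
                },
            },
        },
    }
    return priority_class, job


priority_class, job = priority_manifests(
    "tpu-training-high", 1000000, "train-model",
    "example.com/hypothetical/trainer:latest",
)
print(json.dumps([priority_class, job], indent=2))
```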
Save checkpoints frequently and add code to your training script to start from the last checkpoint when resumed.
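A minimal, framework-agnostic sketch of this pattern follows. It stores only a step counter in a JSON file. A real training script would also save model and optimizer state with its framework's own checkpointing tools, and the function names and checkpoint layout here are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so that an interruption can't leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=10):
    """Resume from the last checkpoint and save progress periodically."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # A real training step would update model parameters here.
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

A smaller `checkpoint_every` value loses less work on interruption, at the cost of more frequent writes.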
Handle disruption due to node maintenance
The GKE nodes that host the TPUs are subject to maintenance events or other disruptions that might cause node shutdown. In GKE clusters with the control plane running version 1.29.1-gke.1425000 and later, you can reduce disruption to workloads by configuring GKE to terminate your workloads gracefully.
To understand, configure, and monitor disruption events that might occur on GKE nodes running AI/ML workloads, see Manage GKE node disruption for GPUs and TPUs.
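On the workload side, graceful termination typically means reacting to the SIGTERM signal that Kubernetes sends to containers before it stops a Pod. The following sketch shows one possible shape for that handling. The class and function names are hypothetical, and `on_stop` stands in for whatever checkpoint-saving logic your training script uses.

```python
import signal


class GracefulStop:
    """Records that a termination signal arrived."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler small: only set a flag for the main loop to read.
        self.requested = True

    def install(self):
        signal.signal(signal.SIGTERM, self.handle)


def run_steps(total_steps, stop, on_stop):
    """Run steps until finished or until a stop is requested."""
    for step in range(total_steps):
        if stop.requested:
            on_stop(step)  # For example, save a final checkpoint.
            return step
        # A real training step would run here.
    return total_steps
```

In a training script, call `install()` once at startup and then pass the same `GracefulStop` object to the training loop.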
Maximize TPU utilization
To maximize your investment in TPUs, schedule a mix of Job priorities and queue them to maximize the amount of time that your TPUs are operating. For Job-level scheduling and preemption, you need to use an add-on to Kubernetes that orchestrates Jobs into queues.
Use Kueue to orchestrate Jobs into queues.
What's next
- Follow Deploy TPU workloads in GKE to set up Cloud TPU with GKE.
- Learn about best practices for using Cloud TPU for your machine learning tasks.
- Build large-scale machine learning on Cloud TPUs with GKE.
- Serve Large Language Models with KubeRay on TPUs.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-10-20 UTC.