TPU7x (Ironwood)

This page describes the architecture and available configurations for TPU7x, the latest TPU available on Google Cloud. TPU7x is the first release within the Ironwood family, Google Cloud's seventh generation TPU. The Ironwood generation is designed for large-scale AI training and inference.

With a 9,216-chip footprint per pod, TPU7x shares many similarities with TPU v5p. TPU7x provides high performance for large-scale dense and mixture-of-experts (MoE) models, pre-training, sampling, and decode-heavy inference.

To use TPU7x, you must use Google Kubernetes Engine (GKE). For more information, see About TPUs in GKE.

You can also use TPU7x and GKE with TPU Cluster Director. TPU Cluster Director is available through an All Capacity mode reservation, which gives you full access to all of your reserved capacity (no hold-backs) and full visibility into the TPU hardware topology, utilization status, and health status. For more information, see All Capacity mode overview.

To get access to TPU7x, contact your account team.

System architecture

Each TPU7x chip contains two TensorCores and four SparseCores. The following table shows the key specifications and their values for TPU7x compared to prior generations.

Specification | v5p | v6e (Trillium) | TPU7x (Ironwood)
--- | --- | --- | ---
Number of chips per pod | 8960 | 256 | 9216
Peak compute per chip (BF16) (TFLOPS) | 459 | 918 | 2307
Peak compute per chip (FP8) (TFLOPS) | 459 | 918 | 4614
HBM capacity per chip (GiB) | 95 | 32 | 192
HBM bandwidth per chip (GB/s) | 2765 | 1638 | 7380
Number of vCPUs (4-chip VM) | 208 | 180 | 224
RAM (GB) (4-chip VM) | 448 | 720 | 960
Number of TensorCores per chip | 2 | 1 | 2
Number of SparseCores per chip | 4 | 2 | 4
Bi-directional inter-chip interconnect (ICI) bandwidth per chip (GB/s) | 1200 | 800 | 1200
Data center network (DCN) bandwidth per chip (Gb/s) | 50 | 100 | 100
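To put the per-chip figures in context, the following minimal sketch multiplies the TPU7x column of the specification table by the pod size. The function name `pod_totals` is hypothetical, and the results are theoretical peaks derived only from the table values, not measured throughput.

```python
# Per-chip TPU7x figures copied from the specification table above.
CHIPS_PER_POD = 9216
BF16_TFLOPS_PER_CHIP = 2307
FP8_TFLOPS_PER_CHIP = 4614
HBM_GIB_PER_CHIP = 192


def pod_totals() -> dict:
    """Return theoretical pod-wide peaks (illustrative arithmetic only)."""
    return {
        # 1 exaFLOPS = 1e6 teraFLOPS
        "bf16_exaflops": CHIPS_PER_POD * BF16_TFLOPS_PER_CHIP / 1e6,
        "fp8_exaflops": CHIPS_PER_POD * FP8_TFLOPS_PER_CHIP / 1e6,
        # 1 TiB = 1024 GiB
        "hbm_tib": CHIPS_PER_POD * HBM_GIB_PER_CHIP / 1024,
    }


totals = pod_totals()
print(f"BF16: {totals['bf16_exaflops']:.2f} EFLOPS")
print(f"FP8:  {totals['fp8_exaflops']:.2f} EFLOPS")
print(f"HBM:  {totals['hbm_tib']:.0f} TiB")
```

Under these assumptions, a full pod peaks at roughly 21 EFLOPS in BF16 and 42.5 EFLOPS in FP8, with 1,728 TiB of HBM in total.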

The following diagram illustrates the architecture of Ironwood:

Ironwood architecture diagram

Dual-chiplet architecture

The Ironwood programming model lets you access two TPU devices instead of the single logical core (also known as MegaCore) architecture used in previous generations (TPU v4 and v5p). This change improves the cost-effectiveness and efficiency of manufacturing the chip. While this represents an architectural shift, the new design ensures that you can reuse existing software models with minimal changes.

Each Ironwood TPU chip is composed of two distinct chiplets. This is a departure from the unified memory space of the MegaCore architecture.

  • Chiplet composition: Each chiplet is a self-contained unit with one TensorCore, two SparseCores, and 96 GB of high-bandwidth memory (HBM).

  • High-speed interconnect: The two chiplets are connected by a die-to-die (D2D) interface that is six times faster than a 1D inter-chip interconnect (ICI) link. Inter-chiplet communication is managed using collective operations.
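As a quick consistency check, the per-chiplet composition listed above can be doubled to recover the per-chip values in the specification table. This is an illustrative sketch; the `Chiplet` class and `chip_totals` function are hypothetical names, not part of any TPU API.

```python
from dataclasses import dataclass

CHIPLETS_PER_CHIP = 2


@dataclass(frozen=True)
class Chiplet:
    """Resources of one chiplet, as listed above."""
    tensor_cores: int = 1
    sparse_cores: int = 2
    hbm_capacity: int = 96  # per chiplet; the table lists 192 per chip


def chip_totals(chiplet: Chiplet = Chiplet()) -> dict:
    """Sum the resources of both chiplets on one chip."""
    return {
        "tensor_cores": CHIPLETS_PER_CHIP * chiplet.tensor_cores,
        "sparse_cores": CHIPLETS_PER_CHIP * chiplet.sparse_cores,
        "hbm_capacity": CHIPLETS_PER_CHIP * chiplet.hbm_capacity,
    }


print(chip_totals())
```

Because each chiplet owns its own HBM, a program running on one device sees only that chiplet's share of the chip's memory.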

Programming model and framework exposure

The programming model for Ironwood is similar to that of TPU generations earlier than v4, such as TPU v3. The new architecture is exposed in the following ways:

  • Two devices per chip: Frameworks like JAX expose each Ironwood chip as two separate "devices," one for each chiplet.

  • 4D topology: JAX adds a fourth dimension to the topology to specify which of the two on-chip devices to use. This lets you use existing software models with minimal modification.
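The following sketch illustrates how a 3D chip topology expands into a 4D device space once each chip is exposed as two devices. It is plain Python and does not call JAX. The function `device_coords` and the `(x, y, z, chiplet)` ordering are assumptions made for illustration, and a real framework may order or label the axes differently.

```python
from itertools import product

CHIPLETS_PER_CHIP = 2  # each Ironwood chip is exposed as two devices


def device_coords(topology: str) -> list:
    """List hypothetical (x, y, z, chiplet) coordinates for a chip topology.

    `topology` is a 3D chip shape such as "2x2x1". The fourth coordinate
    selects one of the two on-chip devices (assumed axis order).
    """
    x, y, z = (int(d) for d in topology.split("x"))
    return list(
        product(range(x), range(y), range(z), range(CHIPLETS_PER_CHIP))
    )


coords = device_coords("2x2x1")
print(len(coords))  # 4 chips x 2 chiplets = 8 devices
print(coords[:2])   # the two devices on the chip at (0, 0, 0)
```

In this model, a slice with N chips always presents 2N devices, so code that sizes a device mesh from the chip count needs to account for the extra factor of two.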

For more information about achieving optimal performance with the dual-chiplet architecture, see Performance recommendations for Ironwood's dual-chiplet architecture.

Supported configurations

TPU7x chips have a direct connection to the nearest neighboring chips in three dimensions, resulting in a 3D mesh of networking connections. Slices of 64 chips or more are made up of one or more 4x4x4 "cubes" of chips.

The following table shows common 3D slice shapes that are supported for TPU7x:

Topology | TPU chips | Hosts | VMs | Cubes | Scope
--- | --- | --- | --- | --- | ---
2x2x1 | 4 | 1 | 1 | 1/16 | Single-host
2x2x2 | 8 | 2 | 2 | 1/8 | Multi-host
2x2x4 | 16 | 4 | 4 | 1/4 | Multi-host
2x4x4 | 32 | 8 | 8 | 1/2 | Multi-host
4x4x4 | 64 | 16 | 16 | 1 | Multi-host
4x4x8 | 128 | 32 | 32 | 2 | Multi-host
4x8x8 | 256 | 64 | 64 | 4 | Multi-host
8x8x8 | 512 | 128 | 128 | 8 | Multi-host
8x8x16 | 1024 | 256 | 256 | 16 | Multi-host
8x16x16 | 2048 | 512 | 512 | 32 | Multi-host
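The columns in the table above follow directly from the topology string: the chip count is the product of the dimensions, each VM holds 4 chips, and one cube is 64 chips. The following sketch reproduces that arithmetic. The helper `slice_info` is a hypothetical name used for illustration, and it does not check whether a given shape is actually supported.

```python
from fractions import Fraction
from math import prod

CHIPS_PER_VM = 4     # every TPU7x VM is a full host with 4 chips
CHIPS_PER_CUBE = 64  # one 4x4x4 cube


def slice_info(topology: str) -> dict:
    """Derive chip, VM, and cube counts from a topology such as "4x4x8"."""
    dims = [int(d) for d in topology.split("x")]
    chips = prod(dims)
    return {
        "chips": chips,
        "vms": chips // CHIPS_PER_VM,
        # Fraction keeps sub-cube slices exact, for example 1/16.
        "cubes": Fraction(chips, CHIPS_PER_CUBE),
        "scope": "Single-host" if chips <= CHIPS_PER_VM else "Multi-host",
    }


for shape in ("2x2x1", "4x4x4", "8x16x16"):
    info = slice_info(shape)
    print(shape, info["chips"], info["vms"], info["cubes"], info["scope"])
```

The same calculation is a convenient sanity check when sizing a node pool: the number of VMs to request equals the chip count divided by 4.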

TPU7x VM

Each TPU7x virtual machine (VM) contains 4 chips. Each VM has access to two NUMA nodes. For more information about NUMA nodes, see Non-uniform memory access on Wikipedia.

All TPU7x slices use full-host, 4-chip VMs. The technical specifications for a TPU7x VM are:

  • Number of vCPUs per VM: 224
  • RAM per VM: 960 GB
  • Number of NUMA nodes per VM: 2
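Because every slice is built from identical 4-chip VMs, the host-side resources of a slice scale linearly with its chip count. The following sketch shows that arithmetic using the VM specifications listed above; `host_resources` is a hypothetical helper, not part of any Google Cloud API.

```python
# Per-VM specifications from the list above.
VCPUS_PER_VM = 224
RAM_GB_PER_VM = 960
CHIPS_PER_VM = 4


def host_resources(num_chips: int) -> dict:
    """Return the total host vCPUs and RAM behind a slice of `num_chips`."""
    if num_chips <= 0 or num_chips % CHIPS_PER_VM:
        raise ValueError("chip count must be a positive multiple of 4")
    vms = num_chips // CHIPS_PER_VM
    return {
        "vms": vms,
        "vcpus": vms * VCPUS_PER_VM,
        "ram_gb": vms * RAM_GB_PER_VM,
    }


# A 4x4x4 slice has 64 chips.
print(host_resources(64))
```

For a 64-chip slice, this gives 16 VMs with 3,584 vCPUs and 15,360 GB of host RAM in total, which can help when budgeting host-side work such as input pipelines.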

Hyperdisk

By default, the VM boot disk for TPU7x is Hyperdisk Balanced. You can attach additional Hyperdisk Balanced disks to your TPU VM for additional storage.

For more information about Hyperdisk, see Hyperdisk overview. For more information about storage options for Cloud TPU, see Storage options for Cloud TPU data.

What's next