Create Cloud TPU VMs with MIGs

Managed instance groups (MIGs) automate the creation, configuration, and lifecycle management of a collection of VMs. MIGs provide benefits such as high availability through autohealing and regional (multi-zone) deployments, automatic scaling to handle variable loads, and simplified rolling updates for applications. For more information, see Managed instance groups.

You can use MIGs to create and manage TPU VMs for TPU versions v5p and later. A MIG can contain a single TPU VM, multiple independent TPU VMs (also called single-host slices), or interconnected TPU VMs (also called a multi-host slice).

Each slice in a single-host MIG has at most one TPU VM. The TPU VMs within the MIG are not connected with inter-chip interconnect (ICI) links.

A multi-host slice contains multiple TPU VMs that are interconnected with ICI links.

MIGs with single-host TPU slices

Creating a managed instance group (MIG) with multiple, independent TPU instances is beneficial for workloads that require several individual TPU VMs but don't need the VMs to be interconnected with ICI links. For example:

  • Inference serving: Each VM in the MIG can independently handle inference requests. A MIG lets you scale the number of serving instances based on demand and manage them as a group.
  • Parallel independent tasks: A MIG provides a way to manage many small, independent training jobs or other computations that can run in parallel on single TPU VMs.
  • Management: MIGs provide the following features:
    • Deployment: Define an instance template once and use the MIG to create multiple identical TPU VMs.
    • Scalability: Adjust the number of TPU VMs by resizing the MIG.
    • Rolling updates: Update the software or machine type across all VMs in a controlled manner.
  • Cost-effectiveness: For tasks that don't require the full power or interconnectivity of a large TPU slice, using multiple smaller, independent TPU slices can be more cost-effective.

For more information, see Create a MIG with single-host TPU slices.

MIGs with a multi-host slice

Unlike groups of independent TPU slices, a MIG configured for a multi-host slice manages a set of TPU VMs that are tightly coupled through ICI links. This creates a single, logical TPU slice.

Benefits and performance

MIGs for multi-host TPU slices provide the scale and performance required for intensive machine learning workloads.

  • Distributed training: Training machine learning models often requires more TPU power than a single TPU VM can provide. Larger TPU slices distribute computation across many TPU chips and VMs, with the ICI links enabling fast communication between them. This is crucial for training performance.
  • High interconnect bandwidth: The ICI network provides higher bandwidth and lower latency between TPU chips in the slice than the standard data center network (DCN). This is essential for the synchronous operations common in large model training.

Atomic lifecycle operations

To ensure the integrity of the interconnected topology, the MIG manages the entire slice as a single, indivisible unit throughout its lifecycle.

  • Creation: All VMs in the slice are provisioned together. If enough healthy, interconnected capacity isn't available for the entire requested topology, the slice isn't created.
  • Deletion: The MIG deletes the entire slice as a unit.
  • Resizing: Resizing is restricted to scaling from 0 to the full slice size, or from the full slice size back down to 0. You can't partially resize a multi-VM slice.
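The all-or-nothing resize rule can be sketched as a small check. This is an illustrative helper, not part of any Google Cloud API: the function name and the full_slice_size parameter are assumptions, and the full slice size must come from your own topology.

```python
def validate_slice_target_size(requested: int, full_slice_size: int) -> int:
    """Return `requested` if it is a valid target size for a multi-host slice.

    Hypothetical helper: a multi-host slice can only be sized to 0 or to
    the full number of VMs that form the slice.
    """
    if full_slice_size <= 0:
        raise ValueError("full_slice_size must be positive")
    if requested not in (0, full_slice_size):
        raise ValueError(
            f"target size must be 0 or {full_slice_size}, got {requested}"
        )
    return requested


# Scaling to 0 or to the full slice is allowed; a partial resize is rejected.
validate_slice_target_size(0, 16)
validate_slice_target_size(16, 16)
```

Running a check like this before issuing a resize request surfaces a partial-resize mistake locally instead of as a rejected operation.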

Configuration requirements

Configuring a multi-host MIG requires defining both the physical interconnection topology and the individual instance properties.

  • Workload policy: You must specify a workload policy with the accelerator-topology parameter (for example, 4x4, 8x8, or 4x4x4). This configures the MIG to treat the instances as a single, interconnected slice. For information about topology, see System architecture.
  • Instance template: Defines properties like machine type, disk image, and other settings for each VM within the slice.
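As a rough illustration of how a topology string relates to slice size, the following sketch parses a value such as 4x4 or 4x4x4 and multiplies the dimensions to get a chip count. The function names are hypothetical, and chips_per_vm is an assumption that you must supply: the number of chips attached to each VM depends on the TPU version and machine type.

```python
from math import prod


def parse_topology(topology: str) -> tuple[int, ...]:
    """Split a topology string such as '4x4' or '4x4x4' into its dimensions."""
    dims = tuple(int(part) for part in topology.lower().split("x"))
    if len(dims) not in (2, 3) or any(d <= 0 for d in dims):
        raise ValueError(f"unsupported topology: {topology!r}")
    return dims


def chips_in_topology(topology: str) -> int:
    """Number of chips described by the topology (product of its dimensions)."""
    return prod(parse_topology(topology))


def vms_in_slice(topology: str, chips_per_vm: int) -> int:
    """VM count for the slice, given an assumed number of chips per VM."""
    chips = chips_in_topology(topology)
    if chips_per_vm <= 0 or chips % chips_per_vm != 0:
        raise ValueError("chips_per_vm must evenly divide the chip count")
    return chips // chips_per_vm


# Assuming 4 chips per VM: 4x4x4 is 64 chips, which is 16 VMs.
print(vms_in_slice("4x4x4", chips_per_vm=4))
```

The resulting VM count is the only nonzero target size that a multi-host MIG accepts for that topology.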

Slice availability and failure recovery

When you use MIGs to create a multi-host TPU slice, the MIG automatically manages the slice recovery process. If a host or ICI failure occurs, the slice transitions to the REACTIVATING state. All VMs in the slice transition to the REPAIRING state, though not necessarily at the same time. The MIG then automatically restarts the VMs together on healthy capacity to restore the slice.

However, when you use Spot VMs, preemption results in instances being terminated. The MIG doesn't automatically reactivate the slice.

Failure recovery from an instance interruption

If you delete or stop a TPU instance, or stop an instance from within the operating system, then the slice will transition to the FAILED state. In this scenario, the slice remains in the FAILED state until you recreate the slice. To recreate the slice, you must either delete and recreate the MIG, or resize the MIG to 0 and then increase its size.

For more information about the slice states, see View the status of a TPU slice.
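The recovery behavior described in this section can be summarized as a lookup table. This is a sketch for reasoning about the rules, not an API: the Event names and the RecoveryOutcome type are hypothetical, and only the REACTIVATING and FAILED state names come from this page. The slice state after a Spot VM preemption isn't stated here, so it is left as None.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Event(Enum):
    """Hypothetical labels for the interruptions described above."""
    HOST_FAILURE = "host failure"
    ICI_FAILURE = "ICI failure"
    SPOT_PREEMPTION = "Spot VM preemption"
    INSTANCE_DELETED_OR_STOPPED = "instance deleted or stopped"


@dataclass(frozen=True)
class RecoveryOutcome:
    slice_state: Optional[str]  # None where this page doesn't name a state
    mig_recovers_automatically: bool


OUTCOMES = {
    Event.HOST_FAILURE: RecoveryOutcome("REACTIVATING", True),
    Event.ICI_FAILURE: RecoveryOutcome("REACTIVATING", True),
    Event.SPOT_PREEMPTION: RecoveryOutcome(None, False),
    Event.INSTANCE_DELETED_OR_STOPPED: RecoveryOutcome("FAILED", False),
}


def needs_manual_recreation(event: Event) -> bool:
    """True when you must recreate the slice yourself."""
    return not OUTCOMES[event].mig_recovers_automatically


for event in Event:
    print(f"{event.value}: manual recreation = {needs_manual_recreation(event)}")
```

When manual recreation is needed, the options described above apply: delete and recreate the MIG, or resize it to 0 and then increase its size.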

Limitations

MIGs with TPUs have the following limitations:

  • Lifecycle operations: You can't stop, start, resume, or suspend TPU instances. To change configurations that require a restart or to stop incurring charges, you must delete the instances.

  • Regional MIG zone distribution: You must set the target distribution shape to ANY_SINGLE_ZONE.

  • Configuration updates in a MIG:

    • You can't update a MIG that forms a multi-host TPU slice, because the slice is bound to its defined accelerator topology.
    • You can update a MIG that forms single-host TPU slices by using the automatic or selective methods. However, updates to single-host TPU slices don't support the restart (RESTART) action. If a restart is necessary and the most disruptive action allowed is replace (REPLACE), then the updater replaces the instance; otherwise, the update attempt fails with an error.

  • For a MIG that forms a multi-host TPU slice, the following limitations also apply:

    • Target size policy: You must set the target size policy mode to BULK. After you set this mode, you can't change it.

    • Target size: In bulk mode, you can set the target size to either 0 or the number of instances that are required to form the accelerator topology.

    • Workload policy: You must specify a workload policy in which the accelerator topology is defined. After you set the workload policy, you can't change or remove the policy from the MIG.

  • Unsupported features: MIGs with TPUs don't support the following features:

What's next