Storage options for Cloud TPU data

This document describes data storage options you can use when training models on Cloud TPU.

Introduction

Cloud TPU requires data storage for:

  • Dataset downloading and preprocessing
  • Host input pipeline processing
  • Model training input
  • Model training output

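The storage options for TPU application data and training datasets are:

  • Durable block storage (the TPU VM boot disk, or an attached Hyperdisk or Persistent Disk volume)
  • Cloud Storage buckets
  • Cloud Storage FUSE
  • Filestore file shares
  • Managed Lustre file shares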

Durable block storage

Durable block storage, also known as disks or volumes, is for data that you want to preserve after you stop, suspend, or delete your TPU VM. Durable block storage is still available even if the TPU VM crashes or fails. You can use the TPU VM boot disk or attach additional block storage to your TPU.

You might want to attach an additional disk in the following scenarios:

  • The size of your training dataset exceeds the size of the TPU boot disk.
  • You have read-only data and want faster read access using a Hyperdisk ML volume.

TPU generation and supported disk types

The following table shows the types of disks supported by each TPU generation:

TPU generation    Supported disk types
--------------    --------------------
TPU7x             Hyperdisk Balanced, Hyperdisk ML
TPU v6e           Hyperdisk Balanced, Hyperdisk ML
TPU v5p           Balanced Persistent Disk, Hyperdisk ML
TPU v5e           Balanced Persistent Disk, Hyperdisk ML
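
If you script your disk provisioning, you can encode the table as a lookup and check a disk type before you request it. The following is a minimal sketch: the mapping is transcribed from the table above, and the helper name `is_supported` is hypothetical, not part of any Google Cloud library.

```python
# Supported disk types per TPU generation, transcribed from the table above.
SUPPORTED_DISK_TYPES = {
    "TPU7x": {"Hyperdisk Balanced", "Hyperdisk ML"},
    "TPU v6e": {"Hyperdisk Balanced", "Hyperdisk ML"},
    "TPU v5p": {"Balanced Persistent Disk", "Hyperdisk ML"},
    "TPU v5e": {"Balanced Persistent Disk", "Hyperdisk ML"},
}


def is_supported(generation: str, disk_type: str) -> bool:
    """Return True if the table lists disk_type for the given TPU generation.

    Generations that are not in the table are treated as unsupported.
    """
    return disk_type in SUPPORTED_DISK_TYPES.get(generation, set())


if __name__ == "__main__":
    print(is_supported("TPU v6e", "Hyperdisk Balanced"))
    print(is_supported("TPU v5e", "Hyperdisk Balanced"))
```

Keeping the table in one place means a change in supported disk types only needs to be updated once in your tooling.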

TPU VM boot disk

By default, each TPU VM has a single 10 GB boot disk. When you create your VMs, you can configure a larger boot disk. For more information, see Create a customized boot disk. The boot disk contains the operating system, TPU drivers, and libraries. The boot disk can also store downloaded datasets temporarily for preprocessing and model input and output data, as long as the total size of the data doesn't exceed the available space on the boot disk.
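
Before you download a dataset to the boot disk, you can check whether it fits in the available space. The following is a minimal sketch that uses the Python standard library; the function name `fits_on_disk` and the 1 GiB default headroom are assumptions chosen for illustration, not recommended values.

```python
import shutil


def fits_on_disk(data_bytes: int, path: str = "/",
                 headroom_bytes: int = 1024**3) -> bool:
    """Return True if data_bytes fits in the free space of the file system
    that contains path, while leaving headroom_bytes unused.

    The headroom is an illustrative safety margin for logs, checkpoints,
    and temporary files.
    """
    free_bytes = shutil.disk_usage(path).free
    return data_bytes + headroom_bytes <= free_bytes


if __name__ == "__main__":
    dataset_size = 6 * 1024**3  # hypothetical 6 GiB dataset
    if fits_on_disk(dataset_size):
        print("Dataset fits on the boot disk.")
    else:
        print("Attach an additional disk or use another storage option.")
```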

If your application requires more storage space than the boot disk provides, you can add one or more durable disks to your TPU VM instance.

Attached storage

Both Hyperdisk and Persistent Disk are durable network storage devices that your VM instances can access like physical disks in a desktop or a server. You create both types of disks independently from your VM instances, so you can keep your data even after you delete your VM.

Advantages of using Hyperdisk over Persistent Disk include customizable performance and higher IOPS and throughput limits. For more information about Hyperdisk and Persistent Disk, see Choose a disk type.

When you attach a disk to a managed instance group (MIG) with a multi-host TPU VM slice, the system attaches the disk to each VM in that TPU slice. To prevent two or more TPU VMs from writing to a disk at the same time, you must configure all disks that you attach to a multi-host TPU slice as read-only. Read-only disks are useful for storing a dataset for processing on a TPU slice. Because Hyperdisk Balanced doesn't support read-only mode, you can attach a Hyperdisk Balanced volume only to a single TPU VM instance.
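
The attachment rules in the preceding paragraph can be expressed as a small pre-flight check. The following is a sketch only: it encodes just the two rules stated above, and the names `NO_READ_ONLY_MODE` and `check_attachment` are hypothetical. It does not call any Google Cloud API.

```python
# Disk types that the text above says have no read-only mode.
NO_READ_ONLY_MODE = {"Hyperdisk Balanced"}


def check_attachment(disk_type: str, num_hosts: int, read_only: bool) -> list[str]:
    """Return a list of rule violations; an empty list means the plan is OK."""
    problems = []
    if read_only and disk_type in NO_READ_ONLY_MODE:
        problems.append(f"{disk_type} does not support read-only mode")
    if num_hosts > 1 and not read_only:
        problems.append("disks attached to a multi-host slice must be read-only")
    return problems


if __name__ == "__main__":
    # A read-only Hyperdisk ML volume shared by a 4-host slice passes.
    print(check_attachment("Hyperdisk ML", num_hosts=4, read_only=True))
    # Hyperdisk Balanced on a multi-host slice fails in either mode.
    print(check_attachment("Hyperdisk Balanced", num_hosts=4, read_only=False))
    print(check_attachment("Hyperdisk Balanced", num_hosts=4, read_only=True))
```

Note that Hyperdisk Balanced fails for a multi-host slice whether you request read-only or read-write mode, which matches the single-VM restriction described above.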

For more information about using durable block storage, see Add a persistent disk to your VM and Add a Hyperdisk.

Disk backups

It can be difficult to retrieve data from the boot disk if the TPU VM gets stuck in an "unknown" state, and difficult to recover data that you deleted. To protect against both cases, back up your data using another storage option, such as Cloud Storage buckets.

If you store data on an attached disk, you can use disk snapshots, which incrementally back up data on a disk. The TPU VM boot disk does not support disk snapshots. For more information, see About disk snapshots.

Cloud Storage buckets

Cloud Storage buckets are flexible, scalable, and durable storage options for your VM instances. If your training job does not require the lower latency of durable block storage, you can store your dataset in a Cloud Storage bucket.

The performance of Cloud Storage buckets depends on the storage class that you select and the location of the bucket relative to your instance.

Creating your Cloud Storage bucket in the same zone as your TPU VM gives you performance that is comparable to durable block storage but with higher latency and less consistent throughput characteristics.

All Cloud Storage buckets have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events. Cloud Storage calculates checksums for all operations to help ensure that what you read is what you wrote.
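
Cloud Storage performs its checksum validation for you. If you also want to verify a local copy of a file yourself, you can compute a hash locally and compare it with the value reported for the object. The following is a minimal sketch, assuming that you compare against a base64-encoded MD5 value, which is how Cloud Storage object metadata reports MD5 for non-composite objects. The function name `md5_base64` is hypothetical.

```python
import base64
import hashlib


def md5_base64(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the base64-encoded MD5 digest of the file at path.

    The file is read in chunks so that large dataset files don't need to
    fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")
```

Comparing the returned string with the object's reported MD5 value tells you whether the local file matches the stored object.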

Unlike durable block storage, Cloud Storage buckets don't restrict you to the zone where your instance is located. Additionally, you can read and write data to a bucket from multiple instances simultaneously. For example, you can configure instances in multiple zones to read and write data in the same bucket rather than replicate the data to durable block storage in multiple zones.
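
Because multiple instances can read from the same bucket at the same time, one common pattern is to give each TPU host a disjoint subset of the dataset files. The following is a minimal sketch of that idea. The object names and the function `shard_for_host` are hypothetical, and listing the objects in the bucket is left out.

```python
def shard_for_host(object_names: list[str], host_index: int,
                   num_hosts: int) -> list[str]:
    """Return the objects that one host should read.

    Names are sorted first so that every host computes the same ordering,
    then assigned round-robin, which keeps the shards disjoint.
    """
    if num_hosts < 1 or not 0 <= host_index < num_hosts:
        raise ValueError("host_index must be in the range [0, num_hosts)")
    return sorted(object_names)[host_index::num_hosts]


if __name__ == "__main__":
    # Hypothetical object names in a bucket.
    names = [f"train/part-{i:05d}" for i in range(10)]
    for host in range(4):
        print(host, shard_for_host(names, host, 4))
```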

For more information, see Connect to Cloud Storage buckets.

Cloud Storage FUSE

Cloud Storage FUSE lets you mount and access Cloud Storage buckets as local file systems. This allows applications to read and write objects in your bucket using standard file system semantics.
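
After a bucket is mounted, your code can use ordinary file APIs instead of a storage client library. The following is a minimal sketch, assuming that the bucket is already mounted at a directory such as `/mnt/gcs/my-dataset`. The mount path, the `*.tfrecord` pattern, and both function names are hypothetical.

```python
from pathlib import Path


def list_data_files(mount_dir: str, pattern: str = "*.tfrecord") -> list[str]:
    """Return the sorted paths of files under mount_dir that match pattern."""
    return sorted(str(p) for p in Path(mount_dir).rglob(pattern) if p.is_file())


def read_head(path: str, num_bytes: int = 16) -> bytes:
    """Read the first num_bytes of a file with a standard open() call."""
    with open(path, "rb") as f:
        return f.read(num_bytes)


if __name__ == "__main__":
    # Hypothetical mount point; replace with your own.
    for file_path in list_data_files("/mnt/gcs/my-dataset"):
        print(file_path, read_head(file_path))
```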

See the Cloud Storage FUSE documentation for details about how Cloud Storage FUSE works and a description of how Cloud Storage FUSE operations map to Cloud Storage operations. You can find additional information on GitHub about how to use Cloud Storage FUSE, such as how to install the Cloud Storage FUSE CLI and how to mount buckets.

Filestore file share

A Filestore file share is fully managed network-attached storage (NAS) for Compute Engine. Filestore offers compatibility with existing enterprise applications and supports any NFSv3-compatible client.

Filestore offers low latency for file operations. For latency-sensitive workloads, Filestore supports up to 100 TiB of capacity, 25 GiB per second of throughput, and 720K IOPS, with minimal variability in performance.
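
To get a rough sense of what the throughput figure means for training, you can estimate the minimum time to read a dataset once. The following sketch is simple arithmetic under the assumption that reads sustain the full 25 GiB per second, so treat the result as a theoretical lower bound rather than an expected value. The function name `min_read_seconds` is hypothetical.

```python
def min_read_seconds(dataset_gib: float,
                     throughput_gib_per_s: float = 25.0) -> float:
    """Return the lower bound, in seconds, for reading dataset_gib once
    at a sustained throughput of throughput_gib_per_s."""
    if throughput_gib_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_gib / throughput_gib_per_s


if __name__ == "__main__":
    # A full 100 TiB share is 100 * 1024 GiB.
    seconds = min_read_seconds(100 * 1024)
    print(f"{seconds:.0f} s, or about {seconds / 60:.0f} minutes")
```

For a full 100 TiB share, the calculation is 102,400 GiB divided by 25 GiB per second, which is 4,096 seconds, or roughly 68 minutes per pass over the data.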

With Filestore, you can mount file shares on TPU VMs.

Managed Lustre file share

Managed Lustre is a fully managed parallel file system for data-intensive AI and HPC workloads. It provides high performance, multi-petabyte scale capacity, and POSIX compliance.

With Managed Lustre, you can mount file shares on TPU VMs. It is especially useful for handling large datasets and high throughput requirements of machine learning workloads, enabling efficient training and inference.

For more information, see the Managed Lustre documentation.

What's next