Storage options for Cloud TPU data

This document describes data storage options that can be used when training models on Cloud TPU.

Introduction

Cloud TPU requires data storage for:

  • Dataset downloading and preprocessing
  • Host input pipeline processing
  • Model training input
  • Model training output

The storage options for Cloud TPU application data and training datasets are:

  • Durable block storage (the TPU VM boot disk and attached Hyperdisk or Persistent Disk volumes)
  • Cloud Storage buckets, including buckets mounted with Cloud Storage FUSE
  • Filestore file shares

Each option is described in the sections that follow.

Durable block storage

Durable block storage, also known as disks or volumes, is for data that you want to preserve after you stop, suspend, or delete your TPU VM. Durable block storage is still available even if the TPU VM crashes or fails. You can use the TPU VM boot disk or attach additional block storage to your TPU.

You might want to attach an additional disk in the following scenarios:

  • The size of your training dataset exceeds the size of the TPU boot disk.
  • You have read-only data and want faster read access using a Hyperdisk ML volume.

You can attach two types of durable block storage to a Cloud TPU: Google Cloud Hyperdisk and Persistent Disk. Persistent Disk is not supported for the latest machine series, including Cloud TPU v6e. Google recommends using Google Cloud Hyperdisk for the highest performance and advanced features.

TPU VM boot disk

By default, each Cloud TPU VM has a single 100 GiB boot disk that contains the operating system. The boot disk can also be used for temporary storage of downloaded datasets for preprocessing and model input and output data, as long as the total amount doesn't exceed the available space on the boot disk.

You can't resize the boot disk on a Cloud TPU. If your application requires additional storage space beyond the boot disk default, you can add one or more durable disks to your TPU VM instance. For more information, see Attach durable block storage to a TPU VM.
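As a sketch, creating a Hyperdisk volume and attaching it to an existing TPU VM might look like the following. The disk name, TPU name, zone, and size are placeholders, and the `attach-disk` command is in the gcloud alpha track:

```shell
# Hypothetical example: create a Hyperdisk volume, then attach it to a
# TPU VM. Resource names and the zone are illustrative placeholders.
gcloud compute disks create my-data-disk \
    --size=200GB \
    --type=hyperdisk-balanced \
    --zone=us-central2-b

gcloud alpha compute tpus tpu-vm attach-disk my-tpu \
    --zone=us-central2-b \
    --disk=my-data-disk \
    --mode=read-write
```

After attaching the disk, you still need to format and mount it from inside the TPU VM before it can be used.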

Best practices for customizing Cloud TPU VM boot disks

Cloud TPU VMs provide the flexibility to customize the guest OS environment using startup scripts or by creating custom images. However, boot disk recovery for Cloud TPU VMs is limited. You cannot detach or snapshot the boot disk for offline repair, so use caution when making changes that affect the boot process. By following these best practices, you can reduce the risk of boot failures when customizing your Cloud TPU VM environments.

Key principles:

  • Minimize boot disk modifications: Whenever possible, install applications and store data on Persistent Disk or Hyperdisk volumes rather than heavily modifying the boot disk.

  • Use UUIDs for mounting: When adding entries to /etc/fstab, always use UUIDs to identify disks and partitions (UUID=...) rather than device names like /dev/sdb1, because auto-generated device names are not guaranteed to be stable across reboots.

Recommendations:

  • Error handling: Implement robust error checking and graceful failure modes in your scripts. Log detailed messages to the serial console and Cloud Logging to aid in debugging.

  • Critical dependencies: Be extremely careful when modifying files essential for booting, such as /etc/fstab, network configurations, or bootloader settings. A syntax error or incorrect entry can render the VM unbootable.

  • Secondary disks: If your script relies on secondary disks, ensure it handles cases where the disk might not be present or takes longer to attach than expected. Avoid making the boot process critically dependent on secondary disk mounts unless absolutely necessary.

    Example '/etc/fstab' entry:

    • Recommended: UUID=a1b2c3d4-e5f6-7890-1234-567890abcdef /mnt/mydata ext4 defaults,nofail 0 2
    • Not recommended: /dev/sdb1 /mnt/mydata ext4 defaults 0 2

    Using nofail can prevent the system from halting if the disk is not found, but ensure your application can handle the mount point being unavailable.

  • Package management: Be cautious when adding third-party repositories. Ensure they are trusted and compatible with the base OS image. Understand the dependencies of any packages you install and their potential impact on system libraries.

  • Disk space: Monitor boot disk usage. Extensive logging or large software installations can fill the boot disk, preventing the VM from starting.

  • Logging: Configure your applications and scripts to log verbosely to the serial console, as this is the primary tool for diagnosing boot issues on Cloud TPU VMs.
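The secondary-disk and error-handling guidance above can be sketched as a startup-script fragment. The device path and mount point here are illustrative assumptions, not fixed Cloud TPU names:

```shell
#!/bin/bash
# Sketch of a startup-script fragment that tolerates a missing or slow
# secondary disk instead of blocking boot. The device path and mount
# point are hypothetical.
set -u

DEVICE="/dev/disk/by-id/google-data-disk"  # hypothetical attached disk
MOUNT_POINT="/mnt/mydata"

# Poll for a device path, with a bounded number of retries.
wait_for_device() {
  local path="$1" retries="$2" delay="$3" i
  for ((i = 1; i <= retries; i++)); do
    if [[ -e "$path" ]]; then
      echo "device $path is present"
      return 0
    fi
    echo "waiting for $path (attempt $i/$retries)"
    sleep "$delay"
  done
  return 1
}

if wait_for_device "$DEVICE" 3 1; then
  sudo mount -o defaults,nofail "$DEVICE" "$MOUNT_POINT"
else
  # Log to stderr (visible on the serial console) and keep booting:
  # don't make the boot process critically dependent on this disk.
  echo "secondary disk $DEVICE not found; continuing without it" >&2
fi
```

Because the failure path only logs and continues, a delayed or missing disk degrades the workload rather than leaving the VM unbootable.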

Attached storage

Both Hyperdisk and Persistent Disk are durable network storage devices that your VM instances can access like physical disks in a desktop or a server. Both types of disks are created independently from your virtual machine (VM) instances, so you can keep your data even after you delete your VM instances.

Advantages of using Hyperdisk over Persistent Disk include customizable performance and higher IOPS and throughput limits. For more information about Hyperdisk and Persistent Disk, see Choose a disk type.

For more information about using durable block storage with TPU VMs, see Attach durable block storage to a TPU VM.

Disk backups

It can be difficult to retrieve data from the boot disk if the TPU VM gets stuck in an "unknown" state, or to recover data after it has been deleted. Make sure to back up your data using another storage option, such as Cloud Storage buckets.

If you store data on an attached disk, you can use disk snapshots, which incrementally back up data on a disk. Disk snapshots aren't supported for the TPU boot disk. For more information, see About disk snapshots.
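For example, a snapshot of an attached disk might be created with a command like the following; the snapshot name, disk name, and zone are placeholders:

```shell
# Hypothetical example: snapshot an attached data disk (not the boot
# disk, which doesn't support snapshots on Cloud TPU).
gcloud compute snapshots create my-data-disk-backup \
    --source-disk=my-data-disk \
    --source-disk-zone=us-central2-b
```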

Cloud Storage buckets

Cloud Storage buckets are the most flexible, scalable, and durable storage option for your VM instances. If your training job does not require the lower latency of durable block storage, you can store your dataset in a Cloud Storage bucket.

The performance of Cloud Storage buckets depends on the storage class that you select and the location of the bucket relative to your instance.

Creating your Cloud Storage bucket in the same region as your TPU VM gives performance that is comparable to durable block storage, but with higher latency and less consistent throughput characteristics.
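As a sketch, creating a bucket co-located with a TPU VM in zone us-central2-b and uploading a dataset might look like the following; the bucket name and local path are placeholders:

```shell
# Hypothetical example: create a regional bucket in the TPU VM's region,
# then upload a training dataset to it.
gcloud storage buckets create gs://my-training-bucket \
    --location=us-central2

gcloud storage cp --recursive ./dataset gs://my-training-bucket/dataset
```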

All Cloud Storage buckets have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events. Checksums are calculated for all Cloud Storage operations to help ensure that what you read is what you wrote.

Unlike durable block storage, Cloud Storage buckets are not restricted to the zone where your instance is located. Additionally, you can read and write data to a bucket from multiple instances simultaneously. For example, you can configure instances in multiple zones to read and write data in the same bucket rather than replicate the data to durable block storage in multiple zones.

For more information about connecting your TPU VM to a Cloud Storage bucket, see Connecting to Cloud Storage buckets.

Cloud Storage FUSE

Cloud Storage FUSE lets you mount and access Cloud Storage buckets as local file systems. This allows applications to read and write objects in your bucket using standard file system semantics.

See the Cloud Storage FUSE documentation for details about how Cloud Storage FUSE works and a description of how Cloud Storage FUSE operations map to Cloud Storage operations. You can find additional information about how to use Cloud Storage FUSE, such as how to install the Cloud Storage FUSE CLI and mount buckets, on GitHub.
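Once the CLI is installed, mounting a bucket might look like the following sketch; the bucket name and mount point are placeholders:

```shell
# Hypothetical example: mount a Cloud Storage bucket as a local
# file system with the gcsfuse CLI.
sudo mkdir -p /mnt/my-bucket
gcsfuse --implicit-dirs my-training-bucket /mnt/my-bucket

# Applications can now read and write objects with standard file
# system semantics, for example:
ls /mnt/my-bucket

# Unmount when finished.
fusermount -u /mnt/my-bucket
```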

Filestore file share

Filestore provides fully managed network attached storage (NAS) for Compute Engine. Filestore offers compatibility with existing enterprise applications and supports any NFSv3-compatible client.

Filestore offers low latency for file operations. For workloads that are latency sensitive, Filestore supports capacity of up to 100 TiB, throughput of up to 25 GiB per second, and 720K IOPS, with minimal variability in performance.

With Filestore, you can mount file shares on TPU VMs.
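As a sketch, mounting a Filestore file share over NFS from a TPU VM might look like the following; the Filestore instance IP address, share name, and mount point are placeholders:

```shell
# Hypothetical example: mount a Filestore file share on a TPU VM
# using the NFS client. The IP and share name are placeholders.
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/filestore
sudo mount -o rw,intr 10.0.0.2:/my_share /mnt/filestore
```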

What's next