Train a model using TPU7x (Ironwood)

This document describes how to provision TPU7x resources and gives an example of deploying a training workload using MaxText and XPK.

TPU7x is the first release within the Ironwood family, Google Cloud's seventh generation TPU. The Ironwood generation is designed for large-scale AI training and inference. For more information, see TPU7x.

For more examples optimized for TPU7x, see Training Recipes for Ironwood TPU on GitHub.

Provision TPUs

You can provision and manage TPU7x using the following methods:

  • GKE: You can use GKE to provision and manage TPUs as a pool of accelerators for your containerized machine learning workloads. Use the Google Cloud CLI to create your GKE cluster manually for precise customization or to expand existing production GKE environments. For more information, see About TPUs in GKE.
  • GKE and XPK: XPK is a command-line tool that simplifies cluster creation and workload execution on GKE. It's designed for ML practitioners to provision TPUs and run training jobs without needing deep Kubernetes expertise. Use XPK to quickly create GKE clusters and run workloads for proof-of-concept and testing. For more information, see the XPK GitHub repository.
  • GKE and TPU Cluster Director: TPU Cluster Director is available through an All Capacity mode reservation, which gives you full access to all of your reserved capacity (without hold-backs) and full visibility into the TPU hardware topology, utilization status, and health status. For more information, see All Capacity mode overview.

Deploy a training workload with MaxText and XPK

Use Accelerated Processing Kit (XPK) to create GKE clusters for proof-of-concept and testing. XPK is a command-line tool designed to simplify provisioning, managing, and running machine learning workloads.

The following sections show how to deploy a training workload using MaxText and XPK.

Before you begin

Before you start, complete the following steps:

  • Ensure you have a Google Cloud project with billing enabled.
  • Get access to TPU7x. For more information, contact your account team.
  • Ensure the account you're using with XPK has the roles listed in the XPK GitHub repository.

Install XPK and dependencies

  1. Install XPK. Follow the instructions in the XPK GitHub repository.

  2. Install Docker using instructions provided by your administrator or follow the official installation instructions. Once installed, run the following commands to configure Docker and test the installation:

    gcloud auth configure-docker
    sudo usermod -aG docker $USER # relaunch the terminal and activate venv after running this command
    docker run hello-world # Test Docker
    
  3. Set the following environment variables:

    export PROJECT_ID=YOUR_PROJECT_ID
    export ZONE=YOUR_ZONE
    export CLUSTER_NAME=YOUR_CLUSTER_NAME
    export ACCELERATOR_TYPE=YOUR_ACCELERATOR_TYPE
    export NUM_SLICES=1
    export BASE_OUTPUT_DIR="gs://YOUR_BUCKET_NAME"

    Replace the following:

    • YOUR_PROJECT_ID: Your Google Cloud project ID.
    • YOUR_ZONE: The zone in which to create the cluster. For Preview, only us-central1-c is supported.
    • YOUR_CLUSTER_NAME: The name of the new cluster.
    • YOUR_ACCELERATOR_TYPE: The TPU version and topology. For example, tpu7x-4x4x8. For a list of supported topologies, see Supported configurations.
    • YOUR_BUCKET_NAME: The name of your Cloud Storage bucket, which will be the output directory for model training.

    NUM_SLICES is set to 1 because this document creates a single-slice cluster. The xpk cluster create commands later in this document reference this variable.
  4. If you don't have an existing Cloud Storage bucket, create one using the following command:

    gcloud storage buckets create ${BASE_OUTPUT_DIR} \
        --project=${PROJECT_ID} \
        --location=US \
        --default-storage-class=STANDARD \
        --uniform-bucket-level-access
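
Before moving on, you can confirm that the required variables are set. The following is a minimal sketch (not part of XPK or the official setup); the `check_vars` helper name and the `DEMO_PROJECT_ID` variable are illustrative placeholders:

```shell
# Sketch: report any required environment variable that is empty or unset.
# Uses bash indirect expansion (${!var}) to look up each name.
check_vars() {
  local missing=0
  for var in "$@"; do
    if [ -z "${!var}" ]; then
      echo "Missing: ${var}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example usage with a placeholder variable:
export DEMO_PROJECT_ID=my-project
check_vars DEMO_PROJECT_ID && echo "All required variables are set."
```

In practice, you would call `check_vars PROJECT_ID ZONE CLUSTER_NAME ACCELERATOR_TYPE NUM_SLICES BASE_OUTPUT_DIR` before running the cluster commands that follow.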
    

Create a single-NIC, single-slice cluster

Choose one of the following options to create your cluster. Using a custom network with an MTU of 8,896 is recommended for optimal performance.

Custom network

To create a custom network with 8,896 MTU and use it for your cluster, follow these steps:

  1. Set environment variables for the network and firewall names:

    export NETWORK_NAME=NETWORK_NAME
    export NETWORK_FW_NAME=FIREWALL_NAME

    Replace the following:

    • NETWORK_NAME: A name for the network.
    • FIREWALL_NAME: A name for the network firewall rule.
  2. Create a custom network with an MTU of 8,896:

    gcloud compute networks create ${NETWORK_NAME} \
        --mtu=8896 \
        --project=${PROJECT_ID} \
        --subnet-mode=auto \
        --bgp-routing-mode=regional
  3. Create a firewall rule that allows TCP, ICMP, and UDP traffic on your network:

    gcloud compute firewall-rules create ${NETWORK_FW_NAME} \
        --network=${NETWORK_NAME} \
        --allow tcp,icmp,udp \
        --project=${PROJECT_ID}
  4. Set an environment variable for the XPK cluster arguments to use the network you created:

    export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${NETWORK_NAME}"
  5. Create the XPK cluster. The following command provisions on-demand capacity:

    xpk cluster create --cluster=${CLUSTER_NAME} \
        --cluster-cpu-machine-type=n1-standard-8 \
        --num-slices=${NUM_SLICES} \
        --tpu-type=${ACCELERATOR_TYPE} \
        --zone=${ZONE} \
        --project=${PROJECT_ID} \
        --on-demand \
        --custom-cluster-arguments="${CLUSTER_ARGUMENTS}"

    To use reserved capacity, replace --on-demand with --reservation=RESERVATION_NAME. To use TPU Spot VMs, replace --on-demand with --spot.

Default network

If you don't require a high-MTU network, you can create a cluster that uses the default VPC network. The following command provisions on-demand capacity:

xpk cluster create --cluster=${CLUSTER_NAME} \
    --cluster-cpu-machine-type=n1-standard-8 \
    --num-slices=${NUM_SLICES} \
    --tpu-type=${ACCELERATOR_TYPE} \
    --zone=${ZONE} \
    --project=${PROJECT_ID} \
    --on-demand

To use reserved capacity, replace --on-demand with --reservation=RESERVATION_NAME. To use TPU Spot VMs, replace --on-demand with --spot.

Build or upload the MaxText Docker image

You can either build a Docker image locally using scripts provided by MaxText or use a prebuilt image.

Build locally

The following commands copy your local directory into the container:

# Verify that you're running Python 3.12, ideally in a virtual environment.
# If nothing is printed, you have the correct version.
[[ "$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]] || { >&2 echo "Error: Python version must be 3.12."; false; }

# Clone MaxText
git clone https://github.com/AI-Hypercomputer/maxtext.git
cd maxtext
git checkout maxtext-tutorial-v1.0.0

# Build the Docker image
bash docker_build_dependency_image.sh MODE=stable JAX_VERSION=0.8.2

After the successful execution of the commands, you should see an image named maxtext_base_image created locally. You can use your local image directly in the xpk workload command.

Upload image (optional)

After building the Docker image locally using the instructions in the previous section, you can upload the MaxText Docker image into the registry using the following command:

export CLOUD_IMAGE_NAME="${USER}-maxtext-runner"
bash docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME}

After the successful execution of this command, you should see the MaxText image in gcr.io with the name gcr.io/PROJECT_ID/CLOUD_IMAGE_NAME.
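
If you uploaded the image, you can pass its full registry path to XPK with the --docker-image flag. The following sketch assembles that path, following the gcr.io/PROJECT_ID/CLOUD_IMAGE_NAME naming described above; the my-project value is a placeholder:

```shell
# Sketch: build the full registry path for the uploaded MaxText image.
export PROJECT_ID=my-project                           # placeholder
export CLOUD_IMAGE_NAME="${USER:-demo}-maxtext-runner"
export DOCKER_IMAGE="gcr.io/${PROJECT_ID}/${CLOUD_IMAGE_NAME}"
echo "${DOCKER_IMAGE}"
```

You would then pass --docker-image=${DOCKER_IMAGE} to xpk workload create instead of --base-docker-image.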

Define the MaxText training command

Prepare the command to run your training script within the Docker container.

The MaxText 1B model is a configuration within the MaxText framework for training a language model with approximately 1 billion parameters. Use it to experiment at small chip counts; its performance is not optimized.

export MAXTEXT_COMMAND="JAX_PLATFORMS=tpu,cpu \
    ENABLE_PJRT_COMPATIBILITY=true \
    python3 src/MaxText/train.py src/MaxText/configs/base.yml \
        base_output_directory=${BASE_OUTPUT_DIR} \
        dataset_type=synthetic \
        per_device_batch_size=2 \
        enable_checkpointing=false \
        gcs_metrics=true \
        run_name=maxtext_xpk \
        steps=30"
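
The effective global batch size is the per-device batch size multiplied by the number of chips in the slice. As a rough illustration (assuming the tpu7x-AxBxC naming shown earlier, where AxBxC is the chip topology), a 4x4x8 slice with per_device_batch_size=2 gives:

```shell
# Sketch: compute the chip count and implied global batch size
# from an accelerator type of the form tpu7x-AxBxC.
accelerator_type="tpu7x-4x4x8"
per_device_batch_size=2

topology="${accelerator_type#tpu7x-}"      # "4x4x8"
IFS='x' read -r a b c <<< "${topology}"
num_chips=$(( a * b * c ))                 # 4 * 4 * 8 = 128
global_batch_size=$(( num_chips * per_device_batch_size ))
echo "chips=${num_chips} global_batch_size=${global_batch_size}"
```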

Deploy the training workload

Run the xpk workload create command to deploy your training job. Specify either the --base-docker-image flag to use the locally built MaxText base image, or the --docker-image flag with the full path of the image you want to use. Optionally, include the --enable-debug-logs flag to enable debug logging.

xpk workload create \
    --cluster ${CLUSTER_NAME} \
    --base-docker-image maxtext_base_image \
    --workload maxtext-1b-$(date +%H%M) \
    --tpu-type=${ACCELERATOR_TYPE} \
    --zone ${ZONE} \
    --project ${PROJECT_ID} \
    --command "${MAXTEXT_COMMAND}"
    # [--enable-debug-logs]

Workload names must be unique within the cluster. In this example, $(date +%H%M) is appended to the workload name to ensure uniqueness.
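
Because the workload runs on GKE, the name must also be a valid Kubernetes (RFC 1123) label: lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, and at most 63 characters. A small sketch (not part of XPK) that checks a generated name before submitting:

```shell
# Sketch: validate a generated workload name against RFC 1123 label rules.
workload_name="maxtext-1b-$(date +%H%M)"
if [[ "${workload_name}" =~ ^[a-z0-9]([a-z0-9-]*[a-z0-9])?$ ]] && [ "${#workload_name}" -le 63 ]; then
  echo "valid workload name: ${workload_name}"
else
  echo "invalid workload name: ${workload_name}" >&2
fi
```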

What's next