
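Run TPU workloads in a Docker container

Docker containers make it easier to configure applications by combining your code and all required dependencies in a single distributable package. You can run Docker containers in TPU VMs to simplify configuring and sharing your Cloud TPU applications. This document describes how to set up a Docker container for each ML framework that Cloud TPU supports.

Train a PyTorch model in a Docker container

TPU device

Create a Cloud TPU VM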

gcloud compute tpus tpu-vm create your-tpu-name \
--zone=europe-west4-a \
--accelerator-type=v2-8 \
--version=tpu-ubuntu2204-base

Connect to the TPU VM using SSH

gcloud compute tpus tpu-vm ssh your-tpu-name \
--zone=europe-west4-a

Make sure your Google Cloud user has been granted the Artifact Registry Reader role. For more information, see "Granting Artifact Registry roles".
Start a container in the TPU VM using the PyTorch/XLA r2.6.0 image
```
sudo docker run --net=host -ti --rm --name your-container-name --privileged us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.6.0_3.10_tpuvm_cxx11 \
bash
```
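Note: After you run this command, the command prompt changes to indicate that your terminal is now attached to the running container.
Configure the TPU runtime

PyTorch/XLA has two runtime options: PJRT and XRT. Use PJRT unless you have a specific reason to use XRT. To learn more about the different runtime configurations, see the PJRT runtime documentation.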
PJRT
```
export PJRT_DEVICE=TPU
```
XRT
```
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
```
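As a minimal sketch of the recommendation above, the following snippet falls back to PJRT whenever no XRT configuration has been exported, then prints the settings that are active in the current shell. The fallback logic is an illustration and is not part of PyTorch/XLA itself.

```shell
# Sketch: default to PJRT unless an XRT configuration is already exported.
if [ -z "${XRT_TPU_CONFIG:-}" ]; then
  export PJRT_DEVICE="${PJRT_DEVICE:-TPU}"
fi

# Show which runtime settings are active in this shell.
echo "PJRT_DEVICE=${PJRT_DEVICE:-unset}"
echo "XRT_TPU_CONFIG=${XRT_TPU_CONFIG:-unset}"
```

Run it inside the container before launching a training script, so the script inherits the exported variable.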

Clone the PyTorch XLA repository

git clone --recursive https://github.com/pytorch/xla.git

Train ResNet50

python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

When the training script completes, make sure you clean up the resources.

Type exit to exit the Docker container
Type exit to exit the TPU VM

Delete the TPU VM

gcloud compute tpus tpu-vm delete your-tpu-name --zone=europe-west4-a
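One way to confirm the cleanup is to list the TPU VMs that remain in the zone; the deleted TPU should no longer appear. This is a small sketch that assumes the same zone as the commands above and skips the check when the gcloud CLI is not installed.

```shell
ZONE="europe-west4-a"

# List the TPU VMs that remain in the zone; the deleted TPU should be absent.
if command -v gcloud >/dev/null 2>&1; then
  gcloud compute tpus tpu-vm list --zone="$ZONE" \
    || echo "Could not list TPU VMs in $ZONE"
else
  echo "gcloud CLI not found; run this check where it is installed."
fi
```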

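TPU slice

When you run PyTorch code on a TPU slice, you must run the code on all TPU workers at the same time. One way to do this is to use the gcloud compute tpus tpu-vm ssh command with the --worker=all and --command flags. The following procedure uses a prebuilt Docker image to make setting up each TPU worker easier.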
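Because every step in this procedure repeats the same gcloud invocation, it can help to wrap it in a small shell function. The sketch below is only an illustration: run_on_all_workers, TPU_NAME, ZONE, and DRY_RUN are hypothetical names introduced here, not part of the gcloud CLI. With DRY_RUN=1 the function prints the command instead of contacting the TPU VM.

```shell
# Hypothetical helper: run one command on every worker of a TPU slice.
# TPU_NAME, ZONE, and DRY_RUN are illustrative variable names.
TPU_NAME="${TPU_NAME:-your-tpu-name}"
ZONE="${ZONE:-us-central2-b}"

run_on_all_workers() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE --worker=all --command='$1'"
  else
    gcloud compute tpus tpu-vm ssh "$TPU_NAME" \
      --zone="$ZONE" \
      --worker=all \
      --command="$1"
  fi
}

# Example: preview the call without contacting any TPU VM.
DRY_RUN=1 run_on_all_workers "hostname"
```

Dropping DRY_RUN=1 runs the command on all workers, which is equivalent to typing the full gcloud command by hand.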

Create the TPU VM

gcloud compute tpus tpu-vm create your-tpu-name \
--zone=us-central2-b \
--accelerator-type=v4-32 \
--version=tpu-ubuntu2204-base

Add the current user to the Docker group

gcloud compute tpus tpu-vm ssh your-tpu-name \
--zone=us-central2-b \
--worker=all \
--command='sudo usermod -a -G docker $USER'

Clone the PyTorch XLA repository

gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
--zone=us-central2-b \
--command="git clone --recursive https://github.com/pytorch/xla.git"

Run the training script in a container on all TPU workers

gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
--zone=us-central2-b \
--command="docker run --rm --privileged --net=host  -v ~/xla:/xla -e PJRT_DEVICE=TPU us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.6.0_3.10_tpuvm_cxx11 python /xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1"

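Docker command flags:

--rm removes the container after its process terminates.
--privileged exposes the TPU device to the container.
--net=host binds all of the container's ports to the TPU VM so that the hosts in the Pod can communicate with each other.
-v mounts a directory from the TPU VM into the container (here, the cloned ~/xla repository at /xla).
-e sets an environment variable.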
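To see how these flags fit together, the following sketch assembles the same docker run command one flag at a time and prints it for review. Nothing is executed; the use of positional parameters to build the command is just a presentation choice for this example.

```shell
# Sketch: build the docker run command one flag at a time (nothing is executed).
IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.6.0_3.10_tpuvm_cxx11"

set -- docker run
set -- "$@" --rm                   # remove the container after its process exits
set -- "$@" --privileged           # expose the TPU device to the container
set -- "$@" --net=host             # bind the container's ports to the TPU VM
set -- "$@" -v "$HOME/xla:/xla"    # mount the cloned repository at /xla
set -- "$@" -e PJRT_DEVICE=TPU     # set an environment variable in the container
set -- "$@" "$IMAGE"
set -- "$@" python /xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

# Print the assembled command for review.
echo "$*"
```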

When the training script completes, make sure you clean up the resources.

Delete the TPU VM using the following command:

gcloud compute tpus tpu-vm delete your-tpu-name \
--zone=us-central2-b

Train a JAX model in a Docker container

TPU device

Create the TPU VM

gcloud compute tpus tpu-vm create your-tpu-name \
--zone=europe-west4-a \
--accelerator-type=v2-8 \
--version=tpu-ubuntu2204-base

Connect to the TPU VM using SSH

gcloud compute tpus tpu-vm ssh your-tpu-name  --zone=europe-west4-a

Start the Docker daemon in the TPU VM
```
sudo systemctl start docker
```

Start the Docker container

sudo docker run --net=host -ti --rm --name your-container-name \
--privileged us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.6.0_3.10_tpuvm_cxx11 \
bash

Install JAX
```
pip install "jax[tpu]"
```

Install FLAX

pip install --upgrade clu
git clone https://github.com/google/flax.git
pip install --user -e flax

Install the tensorflow and tensorflow-datasets packages

pip install tensorflow
pip install tensorflow-datasets

Run the FLAX MNIST training script

cd flax/examples/mnist
python3 main.py --workdir=/tmp/mnist \
--config=configs/default.py \
--config.learning_rate=0.05 \
--config.num_epochs=5

When the training script completes, make sure you clean up the resources.

Type exit to exit the Docker container
Type exit to exit the TPU VM

Delete the TPU VM

gcloud compute tpus tpu-vm delete your-tpu-name --zone=europe-west4-a

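TPU slice

When you run JAX code on a TPU slice, you must run the code on all TPU workers at the same time. One way to do this is to use the gcloud compute tpus tpu-vm ssh command with the --worker=all and --command flags. The following procedure shows how to create a Docker image to make setting up each TPU worker easier.

In your current directory, create a file named Dockerfile and paste in the following text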

FROM python:3.10
RUN pip install jax[tpu]
RUN pip install --upgrade clu
RUN git clone https://github.com/google/flax.git
RUN pip install --user -e flax
RUN pip install tensorflow
RUN pip install tensorflow-datasets
WORKDIR ./flax/examples/mnist

Prepare the Artifact Registry

gcloud artifacts repositories create your-repo \
--repository-format=docker \
--location=europe-west4 --description="Docker repository" \
--project=your-project

gcloud artifacts repositories list \
--project=your-project

gcloud auth configure-docker europe-west4-docker.pkg.dev

Build the Docker image
```
docker build -t your-image-name .
```
Add a tag to your Docker image before pushing it to Artifact Registry. For more information about working with Artifact Registry, see "Work with container images".
```
docker tag your-image-name europe-west4-docker.pkg.dev/your-project/your-repo/your-image-name:your-tag
```
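The tag used above follows the pattern LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG. As a small sketch, you could compose it once from its parts and reuse it in the tag, push, pull, and run steps. The variable names are illustrative, and the values are the placeholders used throughout this document.

```shell
# Sketch: compose the Artifact Registry image path from its parts.
LOCATION="europe-west4"
PROJECT="your-project"
REPO="your-repo"
IMAGE="your-image-name"
TAG="your-tag"

IMAGE_URI="${LOCATION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}:${TAG}"
echo "$IMAGE_URI"
```

With the variable defined, the tagging step could be written as docker tag "$IMAGE" "$IMAGE_URI".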

Push the Docker image to Artifact Registry

docker push europe-west4-docker.pkg.dev/your-project/your-repo/your-image-name:your-tag

Create the TPU VM

gcloud compute tpus tpu-vm create your-tpu-name \
--zone=europe-west4-a \
--accelerator-type=v2-8 \
--version=tpu-ubuntu2204-base

Pull the Docker image from Artifact Registry on all TPU workers

gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
--zone=europe-west4-a \
--command='sudo usermod -a -G docker ${USER}'

gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
--zone=europe-west4-a \
--command="gcloud auth configure-docker europe-west4-docker.pkg.dev --quiet"

gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
--zone=europe-west4-a \
--command="docker pull europe-west4-docker.pkg.dev/your-project/your-repo/your-image-name:your-tag"

Run the container on all TPU workers

gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
--zone=europe-west4-a \
--command="docker run -ti -d --privileged --net=host --name your-container-name europe-west4-docker.pkg.dev/your-project/your-repo/your-image-name:your-tag bash"
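Before starting a long training run, it may be worth checking that JAX inside each container can see the TPU devices. The sketch below is an optional check that is not part of the original procedure. It assumes the container started above is still running and uses jax.device_count() and jax.local_device_count() to print the global and per-host device counts on each worker. When the gcloud CLI is not installed, it only prints the command it would run.

```shell
CONTAINER="your-container-name"

# Command to run on each worker: print global and local TPU device counts.
CHECK_CMD="docker exec $CONTAINER python3 -c 'import jax; print(jax.device_count(), jax.local_device_count())'"

if command -v gcloud >/dev/null 2>&1; then
  gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
    --zone=europe-west4-a \
    --command="$CHECK_CMD" || echo "Device check failed"
else
  echo "Would run on all workers: $CHECK_CMD"
fi
```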

Run the training script on all TPU workers

gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
--zone=europe-west4-a \
--command="docker exec --privileged your-container-name python3 main.py --workdir=/tmp/mnist \
--config=configs/default.py \
--config.learning_rate=0.05 \
--config.num_epochs=5"

When the training script completes, make sure you clean up the resources.

Shut down the container on all workers

gcloud compute tpus tpu-vm ssh your-tpu-name --worker=all \
--zone=europe-west4-a \
--command="docker kill your-container-name"

Delete the TPU VM

gcloud compute tpus tpu-vm delete your-tpu-name \
--zone=europe-west4-a
