Train a model on GPUs with PyTorch, Ray, and Google Kubernetes Engine (GKE)

This guide shows how to train a model on Google Kubernetes Engine (GKE) by using Ray, PyTorch, and the Ray Operator add-on.

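About Ray

Ray is an open source, scalable compute framework for AI/ML applications. Ray Train is a component of Ray designed for distributed model training and fine-tuning. You can use the Ray Train API to scale training across multiple machines and to integrate with machine learning libraries such as PyTorch.

You can deploy Ray training jobs by using either the RayCluster or the RayJob resource. When you deploy Ray jobs in production, use the RayJob resource for the following reasons:

The RayJob resource creates an ephemeral Ray cluster that can be deleted automatically when the job finishes.
The RayJob resource supports retry policies, which make job execution more resilient.
You can manage Ray jobs by using familiar Kubernetes API patterns.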

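Prepare your environment

To prepare your environment, follow these steps:

In the Google Cloud console, click "Activate Cloud Shell" to launch a Cloud Shell session. The session opens in the bottom pane of the Google Cloud console.

Set the environment variables: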

export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=ray-cluster
export COMPUTE_REGION=us-central1
export COMPUTE_ZONE=us-central1-c
export CLUSTER_VERSION=CLUSTER_VERSION
export TUTORIAL_HOME=`pwd`

Replace the following:

PROJECT_ID: your Google Cloud project ID.
CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
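
The version requirement can be checked before you create the cluster. The following is a minimal sketch, assuming CLUSTER_VERSION is a full version string such as 1.30.1-gke.1234000 (an illustrative value, not one taken from this guide). The helper names parse_gke_version and meets_minimum are hypothetical. Shorter forms such as 1.30 are rejected by this sketch.

```python
import re

# Minimum GKE version stated in this guide.
MINIMUM_GKE_VERSION = (1, 30, 1)


def parse_gke_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '1.30.1-gke.1234000'."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def meets_minimum(version: str, minimum: tuple = MINIMUM_GKE_VERSION) -> bool:
    """Return True when the version is at or above the minimum."""
    # Tuples compare element by element, so (1, 31, 0) >= (1, 30, 1).
    return parse_gke_version(version) >= minimum


if __name__ == "__main__":
    # Illustrative version strings only.
    for candidate in ("1.29.6-gke.1000", "1.30.1-gke.1234000", "1.31.0"):
        print(candidate, meets_minimum(candidate))
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically, where "1.9.0" would sort after "1.30.1".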

Clone the GitHub repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples

Change to the working directory:

cd kubernetes-engine-samples/ai-ml/gke-ray/raytrain/pytorch-mnist

Create a Python virtual environment:

python -m venv myenv && \
source myenv/bin/activate

Install Ray.

Create a GKE cluster

Create an Autopilot or Standard GKE cluster:

Autopilot

Create an Autopilot cluster:

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

Standard

Create a Standard cluster:

gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION}  \
    --machine-type=e2-standard-8 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=4

Deploy a RayCluster resource

Deploy a RayCluster resource to your cluster:

Review the following manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: pytorch-mnist-cluster
spec:
  rayVersion: '2.37.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      metadata:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.37.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "2"
              ephemeral-storage: "9Gi"
              memory: "4Gi"
            requests:
              cpu: "2"
              ephemeral-storage: "9Gi"
              memory: "4Gi"
  workerGroupSpecs:
  - replicas: 4
    minReplicas: 1
    maxReplicas: 5
    groupName: worker-group
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.37.0
          resources:
            limits:
              cpu: "4"
              ephemeral-storage: "9Gi"
              memory: "8Gi"
            requests:
              cpu: "4"
              ephemeral-storage: "9Gi"
              memory: "8Gi"

This manifest describes a RayCluster custom resource with one head Pod and a worker group of four replicas.
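
To judge whether your node pool can hold this RayCluster, it helps to total the resource requests in the manifest. The sketch below copies the replica counts and per-Pod requests from the manifest by hand. The PodGroup class and total_requests function are illustrative names, not part of Ray or Kubernetes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodGroup:
    name: str
    replicas: int
    cpu: int         # CPU request per Pod
    memory_gib: int  # memory request per Pod, in Gi


# Values copied from the RayCluster manifest in this guide.
GROUPS = [
    PodGroup("head", replicas=1, cpu=2, memory_gib=4),
    PodGroup("worker-group", replicas=4, cpu=4, memory_gib=8),
]


def total_requests(groups):
    """Sum CPU and memory requests across every Pod in every group."""
    cpu = sum(group.replicas * group.cpu for group in groups)
    memory = sum(group.replicas * group.memory_gib for group in groups)
    return cpu, memory


if __name__ == "__main__":
    cpu, memory = total_requests(GROUPS)
    # prints: Total requests: 18 CPUs, 36Gi memory
    print(f"Total requests: {cpu} CPUs, {memory}Gi memory")
```

The totals cover only the Ray containers. System Pods and per-node overhead also consume capacity, so the cluster needs more than these figures.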

Apply the manifest to your GKE cluster:
```
kubectl apply -f ray-cluster.yaml
```

Verify that the RayCluster resource is ready:

kubectl get raycluster

The output is similar to the following:

NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
pytorch-mnist-cluster   2                 2                   6      20Gi     0      ready    63s

In this output, ready in the STATUS column indicates that the RayCluster resource is ready.
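
If you want to script this readiness check, the column-aligned output can be parsed. The following is a rough sketch that works on the sample output above. The parse_table and is_ready helpers are hypothetical, and the splitter assumes no cell is empty.

```python
import re

# Sample output copied from this guide.
SAMPLE_OUTPUT = """\
NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
pytorch-mnist-cluster   2                 2                   6      20Gi     0      ready    63s
"""


def parse_table(text):
    """Parse column-aligned kubectl output into a list of dicts.

    Columns are split on runs of two or more spaces so that headers that
    contain a single space (for example "DESIRED WORKERS") stay intact.
    Rows with empty cells are not handled by this simple splitter.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    headers = re.split(r"\s{2,}", lines[0].strip())
    return [
        dict(zip(headers, re.split(r"\s{2,}", line.strip())))
        for line in lines[1:]
    ]


def is_ready(text, name):
    """Return True if the named RayCluster row reports STATUS 'ready'."""
    return any(
        row.get("NAME") == name and row.get("STATUS") == "ready"
        for row in parse_table(text)
    )


if __name__ == "__main__":
    print(is_ready(SAMPLE_OUTPUT, "pytorch-mnist-cluster"))  # prints True
```

For anything beyond a quick check, structured output from kubectl get raycluster -o json is usually more robust than parsing the table.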

Connect to the RayCluster resource

Connect to the RayCluster resource to submit a Ray job.

Verify that GKE created the RayCluster Service:

kubectl get svc pytorch-mnist-cluster-head-svc

The output is similar to the following:

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
pytorch-mnist-cluster-head-svc   ClusterIP   34.118.238.247   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP   109s

Establish a port-forwarding session to the Ray head node:

kubectl port-forward svc/pytorch-mnist-cluster-head-svc 8265:8265 2>&1 >/dev/null &

Verify that the Ray client can connect to the Ray cluster through localhost:

ray list nodes --address http://localhost:8265

The output is similar to the following:

Stats:
------------------------------
Total: 3

Table:
------------------------------
    NODE_ID                                                   NODE_IP     IS_HEAD_NODE    STATE    NODE_NAME    RESOURCES_TOTAL                 LABELS
0  1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2  10.28.1.21  False           ALIVE    10.28.1.21   CPU: 2.0                        ray.io/node_id: 1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2
# Several lines of output omitted

Train a model

Train a PyTorch model using the Fashion MNIST dataset:

Submit a Ray job and wait for the job to complete:

ray job submit --submission-id pytorch-mnist-job --working-dir . --runtime-env-json='{"pip": ["torch", "torchvision"], "excludes": ["myenv"]}' --address http://localhost:8265 -- python train.py

The output is similar to the following:

Job submission server address: http://localhost:8265

--------------------------------------------
Job 'pytorch-mnist-job' submitted successfully
--------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs pytorch-mnist-job
  Query the status of the job:
    ray job status pytorch-mnist-job
  Request the job to be stopped:
    ray job stop pytorch-mnist-job

Handling connection for 8265
Tailing logs until the job exits (disable with --no-wait):
...
...
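
The submit command passes the runtime environment as inline JSON: it installs torch and torchvision with pip and excludes the myenv virtual environment from the uploaded working directory. If you generate this command from a script, building the JSON with a serializer avoids quoting mistakes. The sketch below only assembles the argument list and does not run it. The build_submit_command function is a hypothetical helper.

```python
import json
import shlex


def build_submit_command(submission_id, address, entrypoint, pip_packages, excludes):
    """Assemble the `ray job submit` argument list used in this guide."""
    runtime_env = {"pip": list(pip_packages), "excludes": list(excludes)}
    return [
        "ray", "job", "submit",
        "--submission-id", submission_id,
        "--working-dir", ".",
        # json.dumps produces valid JSON regardless of shell quoting rules.
        f"--runtime-env-json={json.dumps(runtime_env)}",
        "--address", address,
        "--",
        *entrypoint,
    ]


command = build_submit_command(
    submission_id="pytorch-mnist-job",
    address="http://localhost:8265",
    entrypoint=["python", "train.py"],
    pip_packages=["torch", "torchvision"],
    excludes=["myenv"],
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

You could pass the resulting list to subprocess.run to execute it, which skips the shell entirely.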

Check the job status:

ray job status pytorch-mnist-job

The output is similar to the following:

Job submission server address: http://localhost:8265
Status for job 'pytorch-mnist-job': RUNNING
Status message: Job is currently running.

Wait until Status for job shows SUCCEEDED. This can take 15 minutes or longer.
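
Waiting can be automated with a polling loop. Ray reports a job's status as PENDING, RUNNING, SUCCEEDED, FAILED, or STOPPED, and the last three are terminal. The sketch below takes a get_status callable as a stand-in for however you query the status, so it runs without a cluster. The wait_for_job function and its parameters are illustrative, not part of Ray.

```python
import time

# Ray job statuses after which no further change is expected.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=10.0, timeout_seconds=1800.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or time runs out.

    get_status is any zero-argument callable that returns the current
    status string. sleep and clock are injectable to make testing easy.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated status sequence, so the example needs no Ray cluster.
    fake_statuses = iter(["PENDING", "RUNNING", "RUNNING", "SUCCEEDED"])
    result = wait_for_job(lambda: next(fake_statuses), sleep=lambda _seconds: None)
    print(result)  # prints SUCCEEDED
```

The default timeout of 30 minutes is an arbitrary choice that leaves headroom over the 15 minutes mentioned above.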

View the Ray job logs:

ray job logs pytorch-mnist-job

The output is similar to the following:

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                  │
├──────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8  │
│ train_loop_config/epochs                     10  │
│ train_loop_config/lr                      0.001  │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result                │
├────────────────────────────────┤
│ checkpoint_dir_name            │
│ time_this_iter_s      25.7394  │
│ time_total_s          351.233  │
│ training_iteration         10  │
│ accuracy               0.8656  │
│ loss                  0.37827  │
╰───────────────────────────────╯

# Several lines omitted
-------------------------------
Job 'pytorch-mnist-job' succeeded
-------------------------------

Deploy a RayJob

The RayJob custom resource manages the lifecycle of a RayCluster resource during the execution of a single Ray job.

Review the following manifest:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-mnist-job
spec:
  shutdownAfterJobFinishes: true
  entrypoint: python ai-ml/gke-ray/raytrain/pytorch-mnist/train.py
  runtimeEnvYAML: |
    pip:
      - torch
      - torchvision
    working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
    env_vars:
      NUM_WORKERS: "4"
      CPUS_PER_WORKER: "2"
  rayClusterSpec:
    rayVersion: '2.37.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.37.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
                requests:
                  cpu: "2"
                  ephemeral-storage: "9Gi"
                  memory: "4Gi"
    workerGroupSpecs:
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.37.0
                resources:
                  limits:
                    cpu: "4"
                    ephemeral-storage: "9Gi"
                    memory: "8Gi"
                  requests:
                    cpu: "4"
                    ephemeral-storage: "9Gi"
                    memory: "8Gi"

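This manifest describes a RayJob custom resource. Because shutdownAfterJobFinishes is set to true, the RayCluster that the RayJob creates is deleted after the job finishes.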
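
The manifest passes NUM_WORKERS and CPUS_PER_WORKER to the job as environment variables. The following sketch shows one way a training script might read them and check that they fit the worker group. It is an assumption about how such a script could behave, not a copy of train.py, and it ignores the CPUs on the head Pod for simplicity. The function names are hypothetical.

```python
import os

# Worker group values copied from the RayJob manifest in this guide.
WORKER_REPLICAS = 4       # workerGroupSpecs[0].replicas
CPUS_PER_WORKER_POD = 4   # CPU request of each ray-worker container


def read_scaling_settings(environ=os.environ):
    """Read the two settings, falling back to the manifest's values."""
    num_workers = int(environ.get("NUM_WORKERS", "4"))
    cpus_per_worker = int(environ.get("CPUS_PER_WORKER", "2"))
    return num_workers, cpus_per_worker


def fits_cluster(num_workers, cpus_per_worker,
                 replicas=WORKER_REPLICAS, cpus_per_pod=CPUS_PER_WORKER_POD):
    """Check that every training worker can be placed on a worker Pod."""
    if num_workers <= 0 or cpus_per_worker <= 0:
        raise ValueError("worker count and CPUs per worker must be positive")
    if cpus_per_worker > cpus_per_pod:
        return False  # a single training worker cannot span Pods
    workers_per_pod = cpus_per_pod // cpus_per_worker
    return num_workers <= replicas * workers_per_pod


if __name__ == "__main__":
    workers, cpus = read_scaling_settings(
        {"NUM_WORKERS": "4", "CPUS_PER_WORKER": "2"}
    )
    print(workers, cpus, fits_cluster(workers, cpus))  # prints 4 2 True
```

With the manifest's values, four training workers at two CPUs each need 8 CPUs, while the worker group provides 16.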

Apply the manifest to your GKE cluster:
```
kubectl apply -f ray-job.yaml
```

Verify that the RayJob resource is running:

kubectl get rayjob

The output is similar to the following:

NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
pytorch-mnist-job   RUNNING      Running             2024-06-19T15:43:32Z              2m29s

In this output, the DEPLOYMENT STATUS column shows that the RayJob resource is Running.

View the logs of the RayJob resource:

kubectl logs -f -l job-name=pytorch-mnist-job

The output is similar to the following:

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                  │
├──────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8  │
│ train_loop_config/epochs                     10  │
│ train_loop_config/lr                      0.001  │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result                │
├────────────────────────────────┤
│ checkpoint_dir_name            │
│ time_this_iter_s      25.7394  │
│ time_total_s          351.233  │
│ training_iteration         10  │
│ accuracy               0.8656  │
│ loss                  0.37827  │
╰───────────────────────────────╯

# Several lines omitted
-------------------------------
Job 'pytorch-mnist' succeeded
-------------------------------

Verify that the Ray job completed:

kubectl get rayjob

The output is similar to the following:

NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
pytorch-mnist-job   SUCCEEDED    Complete            2024-06-19T15:43:32Z   2024-06-19T15:51:12Z   9m6s

In this output, the DEPLOYMENT STATUS column shows that the RayJob resource is Complete.
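
The START TIME and END TIME columns give the job's run time. As a small worked example, the sketch below computes the duration from the two timestamps in the sample output. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_k8s_timestamp(value):
    """Parse a UTC timestamp such as '2024-06-19T15:43:32Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def job_duration_seconds(start, end):
    """Return the whole number of seconds between start and end."""
    delta = parse_k8s_timestamp(end) - parse_k8s_timestamp(start)
    return int(delta.total_seconds())


# Timestamps copied from the sample output in this guide.
seconds = job_duration_seconds("2024-06-19T15:43:32Z", "2024-06-19T15:51:12Z")
minutes, remainder = divmod(seconds, 60)
print(f"{minutes}m{remainder}s")  # prints 7m40s
```

The AGE column measures the time since the RayJob resource was created, so it does not have to match this duration.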
