Fine-tune and scale reinforcement learning models with NVIDIA NeMo RL on GKE

This tutorial shows you how to orchestrate a distributed training environment for reinforcement learning (RL) on Google Kubernetes Engine (GKE). You set up the distributed training environment with Ray and the NVIDIA NeMo RL framework, and then fine-tune a model.

This tutorial focuses on a Group Relative Policy Optimization (GRPO) training pipeline that uses Ray and NeMo RL on GKE. GRPO is a reinforcement learning algorithm designed to improve a model's reasoning ability. This memory-efficient algorithm removes the critic (value) model in favor of group-based relative computations, which simplifies the RL pipeline.

We recommend that you complete the Fine-tune and scale reinforcement learning with Verl on GKE tutorial before running this one. This tutorial uses the same cluster setup and configuration as that Verl tutorial.

Background

The following sections briefly describe the concepts used in this tutorial.

Reinforcement learning (RL)

RL trains models through experience, exploration, and feedback rather than static imitation. Pre-training teaches a model what to say; reinforcement learning from human feedback (RLHF) teaches it how to be helpful, safe, and logical. RL acts as the bridge between a base model and a model that is fine-tuned for a specific use case.

For more information, see What is reinforcement learning?.

Group Relative Policy Optimization (GRPO)

GRPO is an algorithm popularized by DeepSeek that removes the critic model, providing a memory-efficient alternative to Proximal Policy Optimization (PPO) for LLM alignment. Instead of using a critic network, GRPO generates a group of responses to the same prompt and uses the group's mean reward as the baseline: each response's advantage is its reward relative to that baseline, commonly normalized by the group's standard deviation.

For more information, see GRPO.

NVIDIA NeMo RL

NeMo RL is NVIDIA's open source post-training library built for scalable RL. NeMo RL is part of the broader NeMo framework ecosystem and supports everything from small experiments on a single GPU to multi-node deployments across thousands of GPUs.

For more information, see NVIDIA NeMo RL.

The GSM8k dataset

In this tutorial, you use the GSM8k dataset, which contains 8,500 high-quality, linguistically diverse grade school math word problems.

Using GSM8k with GRPO, the model generates a group of n different responses to the same problem, and GRPO compares each response against the group average. A path that is consistently correct and logical relative to the other paths earns more reward. Over time, the model learns that clearly explaining its steps is the most reliable way to earn the highest reward, which effectively reduces the reward for lower-performing answers.

For more information, see GSM8k.

Objectives

This tutorial shows you how to set up RL with NeMo RL on GKE through the following steps:

  1. Prepare your environment.
  2. Set up a GKE cluster with B200 or H200 GPUs.
  3. Set up KubeRay to manage a distributed Ray cluster.
  4. Use Managed Lustre for high-performance storage.
  5. Run a GRPO training job with NeMo RL.

Before you begin

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • Install the Google Cloud CLI.

  • If you use an external identity provider (IdP), first sign in to the gcloud CLI with your federated identity.

  • To initialize the gcloud CLI, run the following command:

    gcloud init
  • Create or select a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role. You can select any project where you've been granted a role.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which includes the resourcemanager.projects.create permission. Learn how to grant roles.
    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project that you want to create.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  • Verify that billing is enabled for your Google Cloud project.

  • Enable the required APIs:

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which includes the serviceusage.services.enable permission. Learn how to grant roles.

    gcloud services enable container.googleapis.com storage.googleapis.com compute.googleapis.com
  • Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.admin, roles/iam.serviceAccountAdmin, roles/storage.admin

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

    Replace the following:

    • PROJECT_ID: your project ID.
    • USER_IDENTIFIER: the identifier for your user account, for example, myemail@example.com.
    • ROLE: the IAM role that you grant to your user account.

Prepare your environment

In this tutorial, you use Cloud Shell.

  1. Go to the Google Cloud console.

  2. At the top of the Google Cloud console window, click the Activate Cloud Shell button.

  3. Set the following environment variables:

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
    export NODE_LOCATION=NODE_LOCATION
    export CLUSTER_NAME=CLUSTER_NAME
    export GPU_TYPE=GPU_TYPE
    export MACHINE_TYPE=MACHINE_TYPE
    export GKE_VERSION=GKE_VERSION
    export KSA_NAME=generic-ksa
    export NAMESPACE=default
    export GS_BUCKET=BUCKET_NAME-${PROJECT_ID}
    export HF_TOKEN=YOUR_HUGGING_FACE_TOKEN
    

    Replace the following values:

    • CLUSTER_NAME: the name of your GKE cluster.
    • CONTROL_PLANE_LOCATION: the Compute Engine region of your GKE cluster's control plane.
    • NODE_LOCATION: the location of the nodes. Select a zone where NVIDIA B200 or H200 GPUs are available.
    • GPU_TYPE: the accelerator that you reserved in your Compute Engine capacity reservation. It must be one of the following values:
      • nvidia-b200: NVIDIA B200 (180 GB)
      • nvidia-h200-141gb: NVIDIA H200 (141 GB)
    • MACHINE_TYPE: the machine type to use:
      • For NVIDIA B200 (180 GB) GPUs, use a4-highgpu-8g or later.
      • For NVIDIA H200 (141 GB) GPUs, use a3-ultragpu-8g or later.
    • GKE_VERSION: the GKE version to use:
      • For NVIDIA B200 (180 GB) GPUs, use version 1.32.2-gke.1422000 or later.
      • For NVIDIA H200 (141 GB) GPUs, use version 1.31.4-gke.1183000 or later.
    • BUCKET_NAME: the base name of the Cloud Storage bucket.
    • YOUR_HUGGING_FACE_TOKEN: your Hugging Face token.

  4. Create the following environment variables for the networks:

    export GVNIC_NETWORK_PREFIX="GVNIC-NAME"
    export RDMA_NETWORK_PREFIX="RDMA-NAME"
    

    Replace the following values:

    • GVNIC-NAME: the prefix for the gVNIC network names. You can use any prefix.
    • RDMA-NAME: the prefix for the remote direct memory access (RDMA) network names. You can use any prefix.
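
    For example, you might set arbitrary prefixes like the following (these names are only illustrative):

    export GVNIC_NETWORK_PREFIX="nemo-gvnic"
    export RDMA_NETWORK_PREFIX="nemo-rdma"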

Set up the infrastructure

In this section, you create the VPC networks and the GKE cluster.

Create the VPC networks

  1. Create the VPC network for the gVNIC interface:

    gcloud compute networks create ${GVNIC_NETWORK_PREFIX}-net \
        --project=${PROJECT_ID} \
        --subnet-mode=custom
    gcloud compute networks subnets create ${GVNIC_NETWORK_PREFIX}-sub \
        --network=${GVNIC_NETWORK_PREFIX}-net \
        --region=${CONTROL_PLANE_LOCATION} \
        --range=192.168.0.0/24
    gcloud compute firewall-rules create ${GVNIC_NETWORK_PREFIX}-internal \
        --network=${GVNIC_NETWORK_PREFIX}-net \
        --action=ALLOW \
        --rules=tcp:0-65535,udp:0-65535,icmp \
        --source-ranges=192.168.0.0/16
    
  2. Create the VPC network and subnets for RDMA, with one subnet for each of the eight GPUs (eight subnets in total):

    gcloud compute networks create ${RDMA_NETWORK_PREFIX}-net \
        --network-profile=${CONTROL_PLANE_LOCATION}-vpc-roce \
        --subnet-mode=custom
    
    for N in $(seq 0 7); do
      gcloud compute networks subnets create ${RDMA_NETWORK_PREFIX}-sub-$N \
        --network=${RDMA_NETWORK_PREFIX}-net \
        --region=${CONTROL_PLANE_LOCATION} \
        --range=192.168.$((N+1)).0/24 &
    done
    wait
    
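
    Optionally, list the subnets to confirm that the eight RDMA subnets were created:

    gcloud compute networks subnets list --regions=${CONTROL_PLANE_LOCATION} | grep ${RDMA_NETWORK_PREFIX}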

Create a GKE cluster

You can set up NeMo RL in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see About GKE modes of operation.

Autopilot

  1. Create an Autopilot cluster:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION} \
        --enable-multi-networking  \
        --enable-ray-operator
    
  2. Get the cluster credentials:

    gcloud container clusters get-credentials ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION}
    
  3. Install the NCCL RDMA installer for Autopilot:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer-autopilot.yaml
    
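
    Optionally, confirm that the NCCL RDMA installer DaemonSet was created. This quick check doesn't assume a particular namespace or DaemonSet name:

    kubectl get daemonsets --all-namespaces | grep -i nccl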

Standard

  1. Create a Standard cluster:

    gcloud container clusters create ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION} \
        --enable-dataplane-v2 \
        --enable-ip-alias \
        --enable-multi-networking \
        --addons=RayOperator \
        --num-nodes=1
    
  2. Get the cluster credentials:

    gcloud container clusters get-credentials ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION}
    
  3. Create a GPU node pool:

    gcloud container node-pools create gpu-pool \
        --cluster=${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION} \
        --node-locations=${NODE_LOCATION} \
        --machine-type=${MACHINE_TYPE} \
        --accelerator=type=${GPU_TYPE},count=8 \
        --spot \
        --additional-node-network=network=${GVNIC_NETWORK_PREFIX}-net,subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-0 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-1 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-2 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-3 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-4 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-5 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-6 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-7
    
  4. Install the NCCL RDMA installer:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer.yaml
    
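
    Optionally, confirm that the GPU nodes registered with the cluster and that the NCCL RDMA installer DaemonSet was created. This check assumes the standard cloud.google.com/gke-accelerator node label that GKE sets on accelerator node pools:

    kubectl get nodes -l cloud.google.com/gke-accelerator=${GPU_TYPE}
    kubectl get daemonsets --all-namespaces | grep -i nccl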

Configure the network mapping

  1. Save the following manifest as network-mapping.yaml. The manifest references your network names through the shell variables that you set earlier:

    # Copyright 2026 Google LLC. All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: gvnic-1
    spec:
      vpc: ${GVNIC_NETWORK_PREFIX}-net
      vpcSubnet: ${GVNIC_NETWORK_PREFIX}-sub
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: gvnic-1
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: gvnic-1
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-0
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-0
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-0
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-0
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-1
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-1
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-1
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-1
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-2
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-2
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-2
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-2
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-3
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-3
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-3
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-3
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-4
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-4
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-4
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-4
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-5
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-5
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-5
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-5
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-6
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-6
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-6
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-6
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-7
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-7
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-7
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-7
    
  2. Apply the manifest:

    kubectl apply -f network-mapping.yaml
    
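
    Note that the manifest contains the literal placeholders ${GVNIC_NETWORK_PREFIX} and ${RDMA_NETWORK_PREFIX}, which kubectl doesn't expand. If you saved the file with those placeholders, one way to substitute your environment variables and then apply, assuming envsubst (from gettext) is available in your shell, is:

    envsubst < network-mapping.yaml > network-mapping-rendered.yaml
    kubectl apply -f network-mapping-rendered.yaml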

Prepare the storage

In this section, you create a Cloud Storage bucket and a Managed Lustre instance to provision the high-performance storage that your RL workload needs.

  1. Create a Cloud Storage bucket:

    gcloud storage buckets create gs://${GS_BUCKET} \
        --location=${CONTROL_PLANE_LOCATION} \
        --enable-hierarchical-namespace \
        --uniform-bucket-level-access
    
  2. Create a Kubernetes ServiceAccount (KSA) and bind it to the bucket:

    kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
    
    gcloud storage buckets add-iam-policy-binding gs://${GS_BUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"
    
  3. Complete the following steps to set up Managed Lustre:

    1. Create a Managed Lustre instance by following the steps in Create a Managed Lustre instance. Make sure that the instance uses the same network as your GKE cluster.
    2. Access the Managed Lustre instance by following the steps in Access an existing Managed Lustre instance. The RayCluster manifest in the next section mounts a PersistentVolumeClaim named lustre-pvc, so expose the instance through a PVC with that name.
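
    Optionally, verify that the claim exists and is bound before you deploy the Ray cluster (lustre-pvc is the claim name referenced by the values.yaml manifest in the next section):

    kubectl get pvc lustre-pvc -n ${NAMESPACE}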

Deploy the RayCluster

In this section, you clone the sample repository, prepare the manifests, and run the launcher.sh script:

  1. Clone the sample repository:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
    
  2. Go to the working directory:

    cd ai-ml/nemo-rl-on-gke/nemoRL
    
  3. Review the values.yaml manifest:

    # Copyright 2026 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    image:
      repository: "nvcr.io/nvidia/nemo-rl"
      tag: "v0.5.0" 
      pullPolicy: Always
    
    nameOverride: "kuberay"
    fullnameOverride: ""
    
    common:
      containerEnv: {}
    
    configMap:
      fluentbit:
        data:
          fluent-bit.conf: |
            [INPUT]
                Name              tail
                Path              /tmp/ray/session_latest/logs/worker-*
                Tag               ray-worker
            [INPUT]
                Name              tail
                Path              /tmp/ray/session_latest/logs/raylet*
                Tag               raylet
            [INPUT]
                Name              tail
                Path              /tmp/ray/session_latest/logs/*
                Exclude_Path      /tmp/ray/session_latest/logs/debug_state.txt,/tmp/ray/session_latest/logs/raylet*,/tmp/ray/session_latest/logs/worker-*
                Tag               ray-misc
            [OUTPUT]
                Name              stackdriver
                Match             *
                resource          gce_instance
                labels_key        labels
    
    # --- Head Node Configuration ---
    head:
      enableInTreeAutoscaling: false
      serviceAccountName: ""
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            networking.gke.io/default-interface: 'eth0'
      containerEnv:
      - name: RAY_GROUP
        value: "head"
      resources:
        limits:
          cpu: "64"
          memory: "500G"
          nvidia.com/gpu: 0
        requests:
          cpu: "64"
          memory: "500G"
          nvidia.com/gpu: 0
      tolerations:
        # - operator: "Exists"
        #   key: "components.gke.io/gke-managed-components"
        # - key: "nvidia.com/gpu"
        #   operator: "Exists"
        #   effect: "NoSchedule"
      volumeMounts:
        - mountPath: /data
          name: lustre-data
    
      volumes:
        - name: log-volume
          emptyDir: {}
        - name: fluentbit-config-volume
          configMap:
            name: "ray-cluster-kuberay-fluentbit-config"
        - name: lustre-data
          persistentVolumeClaim:
            claimName: lustre-pvc
      sidecarContainers:
        - name: fluent-bit
          image: fluent/fluent-bit:latest
          env:
          - name: RAY_GROUP
            value: "head"
          volumeMounts:
            - name: fluentbit-config-volume
              mountPath: /fluent-bit/etc/
            - mountPath: /tmp/ray
              name: log-volume
    
      # --- HEAD POD STARTUP SCRIPT ---
      command:
        - "bash"
        - "-c"
        - |
          set -ex
          echo "--- Head Pod Setup ---"
          apt-get update
          apt-get install -y sudo netcat-openbsd pciutils
          cd /opt/nemo-rl
          /usr/bin/python -m pip install uv
          /usr/bin/python -m uv venv
          echo "Head pod setup complete. Starting Ray..."
    
          exec ${KUBERAY_GEN_RAY_START_CMD}
    
      args: []
      headService: {}
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: nvidia-b200 #cloud.google.com/gke-nodepool: cpu-node-pool-llama #cpu-node-pool
    
    # --- Default Worker (Disabled) ---
    worker:
      disabled: true
    
    # --- A4 GPU Worker Groups ---
    additionalWorkerGroups:
      worker-grp-0:
        disabled: false
        replicas: 4
        annotations:
          networking.gke.io/default-interface: 'eth0'
          networking.gke.io/interfaces: |
            [
              {"interfaceName":"eth0","network":"default"},
              {"interfaceName":"eth1","network":"gvnic-1"},
              {"interfaceName":"eth2","network":"rdma-0"},
              {"interfaceName":"eth3","network":"rdma-1"},
              {"interfaceName":"eth4","network":"rdma-2"},
              {"interfaceName":"eth5","network":"rdma-3"},
              {"interfaceName":"eth6","network":"rdma-4"},
              {"interfaceName":"eth7","network":"rdma-5"},
              {"interfaceName":"eth8","network":"rdma-6"},
              {"interfaceName":"eth9","network":"rdma-7"}
            ]
        containerEnv:
          - name: RAY_GROUP
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['ray.io/group']
          - name: NCCL_NET  
            value: "gIB"
          - name: NCCL_IB_GID_INDEX
            value: "3"   
          - name: GLOO_SOCKET_IFNAME
            value: "eth0"
          - name: NCCL_CROSS_NIC
            value: "0"
          - name: NCCL_SOCKET_IFNAME
            value: "eth0"
          - name: TP_SOCKET_IFNAME # Specific to DTensor/PyTorch Distributed
            value: "eth0"
          - name: NCCL_TUNER_CONFIG_PATH
            value: "/usr/local/gib/configs/tuner_config_a4.txtpb"
          - name: NCCL_NET_GDR_LEVEL
            value: "PIX"
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64
        resources:
          limits:
            nvidia.com/gpu: 8
            cpu: "206"
            memory: "2400Gi"
          requests:
            nvidia.com/gpu: 8
            cpu: "206"
            memory: "2400Gi"
    
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-b200
        tolerations:
          - operator: "Exists"
            key: "nvidia.com/gpu"
          - operator: "Exists"
            key: "cloud.google.com/impending-node-termination"
          - operator: "Exists"
            key: "user-workload"
        securityContext:
          privileged: true
        volumes:
          - name: log-volume
            emptyDir: {}
          - name: shared-memory
            emptyDir:
              medium: "Memory"
              sizeLimit: 240Gi
          - name: ray-tmp
            emptyDir:
              medium: "Memory"
          - name: fluentbit-config-volume
            configMap:
              name: "ray-cluster-kuberay-fluentbit-config"
          - name: nvidia-install-dir-host
            hostPath:
              path: /home/kubernetes/bin/nvidia
          - name: gib-nccl-plugin-volume
            hostPath: 
              path: /home/kubernetes/bin/gib
          - name: lustre-data
            persistentVolumeClaim:
              claimName: lustre-pvc
        volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
          - name: shared-memory
            mountPath: /dev/shm
          - name: nvidia-install-dir-host
            mountPath: /usr/local/nvidia
          - name: gib-nccl-plugin-volume
            mountPath: /usr/local/gib
          - mountPath: /data
            name: lustre-data   
        # --- WORKER POD STARTUP SCRIPT ---
        command:
          - "bash"
          - "-c"
          - |
            set -ex
    
            echo "--- Worker Pod Setup ---"
            apt-get update
            apt-get install -y sudo netcat-openbsd pciutils
            cd /opt/nemo-rl
            /usr/bin/python -m pip install uv
            /usr/bin/python -m uv venv
    
            ldconfig /usr/local/nvidia/lib64/
            ldconfig -p | grep libcuda | sed 's/^/  /'
            export LD_LIBRARY_PATH="/usr/local/gib/lib64:$LD_LIBRARY_PATH"
            source /usr/local/gib/scripts/set_nccl_env.sh
    
            echo "Worker pod setup complete. Starting Ray..."
    
            exec ${KUBERAY_GEN_RAY_START_CMD}
    
    
        sidecarContainers:
          - name: fluent-bit
            env:
              - name: RAY_GROUP
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['ray.io/group']
            image: fluent/fluent-bit:latest
            volumeMounts:
              - name: fluentbit-config-volume
                mountPath: /fluent-bit/etc/
              - mountPath: /tmp/ray
                name: log-volume
    
    # --- Service Config ---
    service:
      type: ClusterIP
    

    Replace the value of NCCL_TUNER_CONFIG_PATH with one of the following values, depending on the accelerator that you use in this tutorial:

    • NVIDIA B200 (180 GB): /usr/local/gib/configs/tuner_config_a4.txtpb
    • NVIDIA H200 (141 GB): /usr/local/gib/configs/tuner_config_a3u.txtpb

    If you use NVIDIA H200 GPUs, also update the cloud.google.com/gke-accelerator value in the worker group's nodeSelector, which targets nvidia-b200 in the manifest above.

    In this manifest, the head node manages the job and hosts the Ray dashboard, and the worker nodes run the training job.

  4. Install the Ray cluster:

    export REPLICA_COUNT=2
    helm install ray-cluster . \
      --set values.additionalWorkerGroups.worker-grp-0.replicas=$REPLICA_COUNT
    

    In this tutorial, you use two worker nodes. To change the number of worker nodes, change the REPLICA_COUNT value.

  5. To deploy the Ray cluster, run the launcher.sh script:

    bash launcher.sh
    
  6. Verify that the worker nodes and the head node are running:

    kubectl get pods
    

    The output is similar to the following:

    NAME                                          READY STATUS RESTARTS AGE
    ray-cluster-kuberay-head-sw7dp                3/3   Running 0      33h
    ray-cluster-kuberay-worker-grp-0-worker-gkbxw 3/3   Running 0      33h
    ray-cluster-kuberay-worker-grp-0-worker-kdg62 3/3   Running 0      33h
    
  7. Verify that the Ray cluster is running:

    kubectl ray get clusters
    

    The output is similar to the following:

    NAME                 NAMESPACE DESIRED WORKERS AVAILABLE WORKERS CPUS GPUS TPUS MEMORY CONDITION STATUS AGE
    ray-cluster-kuberay  default   2       2           618     17   0    1573741824k RayClusterProvisioned ready 33h
    

Launch the GRPO job

After the Ray cluster is ready, you can submit Ray jobs to the Ray cluster running on GKE. NeMo RL downloads the model automatically when the RL training job runs.

To submit the Ray job, start an interactive session and run the job from it.

  1. To create a local connection to the Ray cluster, run the following command:

      kubectl ray session ray-cluster-kuberay
    

    This command starts port forwarding between your local machine and the Ray head node in the GKE cluster. Your terminal remains busy while this session is active; to continue, open another terminal instance.

  2. Edit the gemma3-27b-gsm8k.sh file:

    # Copyright 2026 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    #!/bin/bash
    WANDB_API_KEY='YOUR_WANDB_API_KEY' # Update this with your WANDB API key
    HF_TOKEN='YOUR_HF_TOKEN' # Update this with your HF token
    WORLD_SIZE=16
    
    # --- Step 1: Find the Ray Head Pod ---
    echo "Finding Ray head pod..."
    export HEAD_POD_NAME=$(kubectl get pods --selector=ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
    if [ -z "$HEAD_POD_NAME" ]; then
        echo "Error: No running Ray head pod found. Please check your cluster."
        exit 1
    fi
    echo "Found head pod: $HEAD_POD_NAME"
    echo ""
    
    # --- Step 2: Define the Job Script to Run ---
    # This is the script that will be executed *inside* the head pod.
    # It assumes the 'uv venv' setup from the values.yaml is already done.
    JOB_SCRIPT=$(cat <<EOF
    set -ex
    
    echo "--- Running on Ray Head Pod ($HOSTNAME) ---"
    cd /opt/nemo-rl
    
    git pull && git checkout main
    
    sed -i 's/subset: Optional\[str\] = None/subset: Optional[str] = "main"/' /opt/nemo-rl/nemo_rl/data/datasets/response_datasets/response_dataset.py
    sed -i 's/raw_dataset = load_dataset(data_path)/raw_dataset = load_dataset(data_path, "main")/' /opt/nemo-rl/nemo_rl/data/datasets/utils.py
    
    echo "Setting environment variables..."
    export WANDB_API_KEY=$WANDB_API_KEY
    export HF_TOKEN=$HF_TOKEN
    export HF_HOME=/opt/nemo-rl/
    
    ###-----Example to launch Gemma3-27B on 3 nodes (24 GPUs)----------
    uv run python examples/run_grpo_math.py \
      --config examples/configs/recipes/llm/grpo-gemma3-27b-it-8n4g-fsdp2tp4-actckpt-long.yaml \
      cluster.num_nodes=2 \
      cluster.gpus_per_node=8 \
      grpo.max_num_steps=300 \
      checkpointing.checkpoint_dir=/data/nemo_rl_gemma3_27b_3_17 \
      data.dataset_name=ResponseDataset \
      +data.train_data_path=openai/gsm8k \
      +data.val_data_path=openai/gsm8k \
      +data.val_split=test \
      +data.train_split=train \
      +data.subset="main" \
      +data.input_key="question" \
      +data.output_key="answer" \
      logger.tensorboard_enabled=False \
      logger.wandb_enabled=True \
      logger.wandb.name='nemo_rl_gemma3_27b_3_17' \
      grpo.num_prompts_per_step=16 \
      grpo.num_generations_per_prompt=64 \
      policy.generation.colocated.enabled=False \
      policy.generation.colocated.resources.num_nodes=1 \
      policy.generation.colocated.resources.gpus_per_node=8 \
      policy.generation.vllm_cfg.tensor_parallel_size=8 \
      policy.generation.vllm_cfg.gpu_memory_utilization=0.9 \
      policy.dtensor_cfg.tensor_parallel_size=8
    
    echo "--- Job Finished ---"
    EOF
    )
    
    # --- Step 3: Execute the Job ---
    echo "Submitting job to $HEAD_POD_NAME..."
    echo "$JOB_SCRIPT" | tr -d '\r' | kubectl exec -i $HEAD_POD_NAME -c ray-head -- /bin/bash
    
    echo ""
    echo "Job submission complete."
    

    In the gemma3-27b-gsm8k.sh file, replace the following values:

    • YOUR_WANDB_API_KEY: your Weights & Biases (WandB) API key.
    • YOUR_HF_TOKEN: your Hugging Face token.

    This file contains the configuration for running the job with the gemma3-27b-it model on the GSM8k dataset. To complete the GRPO training pipeline, the script defines the following parameters:

    • num_prompts_per_step: 16 and num_generations_per_prompt: 64: the gemma3-27b-it model generates a large number of responses for each prompt. With this configuration, the model generates 1,024 responses per step (16 × 64 = 1,024).
    • policy.generation.colocated.enabled=False: this parameter disables colocated generation, meaning the model doesn't generate responses on the same nodes that run the training process. In standard RL, the same GPUs handle both training and generation. In this NeMo RL configuration, you dedicate specific nodes (managed through the policy.generation.colocated.resources parameters) to vLLM inference, while the rest of the cluster focuses on the heavy training math. Separating these workloads prevents the memory-intensive training buffers from competing for resources with the compute-intensive inference workload.
  3. To submit the job, run the following command:

    bash gemma3-27b-it/gemma3-27b-gsm8k.sh
    

    While the job runs, the output shows training results, timing, and performance metrics.

Monitor the GRPO job status

After Ray completes the job, NeMo RL stores the checkpoints in the configured path.

  1. Install the tree utility:

    apt install tree
    
  2. To monitor the health of the GRPO job, open a shell on the Ray head node and inspect the checkpoint directory:

    kubectl exec -it $(kubectl get pods -l ray.io/node-type=head -o name) -c ray-head -- bash
    

    The output is similar to the following:

    root@ray-cluster-kuberay-worker-grp-0-worker-gkbxw:/opt/nemo-rl# tree /data/nemo_rl_gemma3_27b_3_17/
    /data/nemo_rl_gemma3_27b_3_17/
    `-- step_10
        |-- config.yaml
        |-- policy
        |   |-- optimizer
        |   |   |-- __0_0.distcp
        |   |   |-- __10_0.distcp
        |   |   |-- __11_0.distcp
        |   |   |-- __12_0.distcp
        |   |   |-- __13_0.distcp
        |   |   |-- __14_0.distcp
        |   |   |-- __15_0.distcp
        |   |   |-- __1_0.distcp
        |   |   |-- __2_0.distcp
        |   |   |-- __3_0.distcp
        |   |   |-- __4_0.distcp
        |   |   |-- __5_0.distcp
        |   |   |-- __6_0.distcp
        |   |   |-- __7_0.distcp
        |   |   |-- __8_0.distcp
        |   |   `-- __9_0.distcp
        |   |-- tokenizer
        |   |   |-- chat_template.jinja
        |   |   |-- special_tokens_map.json
        |   |   |-- tokenizer.json
        |   |   `-- tokenizer_config.json
        |   `-- weights
        |       |-- __0_0.distcp
        |       |-- __10_0.distcp
        |       |-- __11_0.distcp
        |       |-- __12_0.distcp
        |       |-- __13_0.distcp
        |       |-- __14_0.distcp
        |       |-- __15_0.distcp
        |       |-- __1_0.distcp
        |       |-- __2_0.distcp
        |       |-- __3_0.distcp
        |       |-- __4_0.distcp
        |       |-- __5_0.distcp
        |       |-- __6_0.distcp
        |       |-- __7_0.distcp
        |       |-- __8_0.distcp
        |       `-- __9_0.distcp
        |-- train_dataloader.pt
        `-- training_info.json
    
    6 directories, 39 files
    

Clean up

To avoid incurring charges to your account, delete the resources:

helm delete ray-cluster
gcloud container clusters delete ${CLUSTER_NAME} --location=${CONTROL_PLANE_LOCATION}
gcloud storage rm -r gs://${GS_BUCKET}
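
The previous commands don't delete the VPC networks that you created for this tutorial or the Managed Lustre instance. If you no longer need the networks, one way to remove them (firewall rules and subnets must be deleted before their networks) is the following; delete the Managed Lustre instance by following its documentation:

gcloud compute firewall-rules delete ${GVNIC_NETWORK_PREFIX}-internal --quiet
gcloud compute networks subnets delete ${GVNIC_NETWORK_PREFIX}-sub --region=${CONTROL_PLANE_LOCATION} --quiet
for N in $(seq 0 7); do
  gcloud compute networks subnets delete ${RDMA_NETWORK_PREFIX}-sub-$N --region=${CONTROL_PLANE_LOCATION} --quiet
done
gcloud compute networks delete ${GVNIC_NETWORK_PREFIX}-net --quiet
gcloud compute networks delete ${RDMA_NETWORK_PREFIX}-net --quiet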

What's next