Fine-tune and scale reinforcement learning with NVIDIA NeMo RL on GKE

This tutorial shows you how to orchestrate a distributed training environment for reinforcement learning (RL) on Google Kubernetes Engine (GKE). You use Ray and the NVIDIA NeMo RL framework to set up a distributed training environment and fine-tune a model.

This tutorial focuses on running a Group Relative Policy Optimization (GRPO) training pipeline on GKE with Ray and NeMo RL. GRPO is a reinforcement learning algorithm designed to improve a model's reasoning ability. This memory-efficient algorithm simplifies the RL pipeline by removing the critic (value) model and using group-based relative advantage computation instead.

Before you run this tutorial, we recommend that you complete the Fine-tune and scale reinforcement learning with verl on GKE tutorial. This tutorial uses the same cluster setup and configuration as the verl tutorial.

Background

The following sections briefly describe the concepts used in this tutorial.

Reinforcement learning (RL)

RL trains models through experience, exploration, and feedback rather than static imitation. While pre-training teaches a model what to say, reinforcement learning from human feedback (RLHF) teaches it how to be helpful, safe, and logical. RL acts as the bridge between a base model and a model fine-tuned for a specific use case.

For more information, see What is reinforcement learning?

Group Relative Policy Optimization (GRPO)

GRPO is an algorithm popularized by DeepSeek that offers a memory-efficient alternative to Proximal Policy Optimization (PPO) for LLM alignment by removing the critic model. Instead of using a critic network, GRPO generates a group of responses to the same prompt and uses the group's average reward as the baseline.
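
As a rough sketch of the idea (the exact objective that NeMo RL implements may differ in its details), each response's advantage is computed relative to its own group:

    # Illustrative only: group-relative advantage for one prompt.
    # Rewards for 4 sampled responses:  r = [1, 0, 0, 1]
    # Group mean:                       mean(r) = 0.5
    # Group standard deviation:         std(r)  = 0.5
    # Advantage of response i:          A_i = (r_i - mean(r)) / std(r)
    #                                   A   = [+1, -1, -1, +1]

Responses that beat the group average are reinforced and responses below it are discouraged, with no separate critic model needed to estimate the baseline.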

For more information, see GRPO.

NVIDIA NeMo RL

NeMo RL is NVIDIA's open-source post-training library designed for scalable RL. As part of the broader NeMo framework ecosystem, NeMo RL supports both small-scale experiments on a single GPU and multi-node deployments across thousands of GPUs.

For more information, see NVIDIA NeMo RL.

GSM8k dataset

In this tutorial, you use the GSM8k dataset, which contains 8,500 high-quality, linguistically diverse grade-school math word problems.

With GSM8k and GRPO, the model generates a group of n different responses to the same problem. GRPO compares these responses against the group average. The model receives a larger reward for answers that are consistently correct and logically sound compared to the other responses in the group. Over time, the model learns that clearly laying out its steps is the most reliable way to maximize its reward, which effectively down-weights low-performing answers.
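
The reward for GSM8k is typically rule-based: each reference answer in the dataset ends with a "#### <number>" marker, and a response earns a reward when its final number matches. The following snippet only illustrates that idea; it is not the reward code that NeMo RL ships:

    # Illustration only: compare a model's final number against the GSM8k reference.
    reference='Natalia sold 48/2 = 24 clips in May. 48 + 24 = 72. #### 72'
    prediction='Adding both months gives 48 + 24 = 72, so the answer is 72.'
    ref_answer=$(printf '%s\n' "$reference" | sed -n 's/.*#### *//p')
    pred_answer=$(printf '%s\n' "$prediction" | grep -oE '[0-9]+' | tail -1)
    if [ "$ref_answer" = "$pred_answer" ]; then echo "reward=1"; else echo "reward=0"; fi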

For more information, see GSM8k.

Objectives

This tutorial shows you how to set up RL with NeMo RL on GKE by completing the following steps:

  1. Prepare your environment.
  2. Set up a GKE cluster with B200 or H200 GPUs.
  3. Configure KubeRay to manage a distributed Ray cluster.
  4. Use Managed Lustre for high-performance storage.
  5. Run a GRPO training job with NeMo RL.

Before you begin

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • Install the Google Cloud CLI.

  • If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  • To initialize the gcloud CLI, run the following command:

    gcloud init
  • Create or select a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role. You can select any project in which you've been granted a role.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project that you want to create.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  • Verify that billing is enabled for your Google Cloud project.

  • Enable the required APIs:

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    gcloud services enable container.googleapis.com storage.googleapis.com compute.googleapis.com
  • Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.admin, roles/iam.serviceAccountAdmin, roles/storage.admin

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

    Replace the following:

    • PROJECT_ID: your project ID.
    • USER_IDENTIFIER: the identifier for your user account. For example, myemail@example.com.
    • ROLE: the IAM role that you grant to your user account.
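
    Alternatively, you can grant all three roles in one pass with a short loop over the same command:

    for ROLE in roles/container.admin roles/iam.serviceAccountAdmin roles/storage.admin; do
      gcloud projects add-iam-policy-binding PROJECT_ID \
          --member="user:USER_IDENTIFIER" \
          --role="${ROLE}"
    done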

Prepare your environment

In this tutorial, you use Cloud Shell.

  1. Go to the Google Cloud console.

  2. At the top of the Google Cloud console window, click the Activate Cloud Shell button.

  3. Set the following environment variables:

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
    export NODE_LOCATION=NODE_LOCATION
    export CLUSTER_NAME=CLUSTER_NAME
    export GPU_TYPE=GPU_TYPE
    export MACHINE_TYPE=MACHINE_TYPE
    export GKE_VERSION=GKE_VERSION
    export KSA_NAME=generic-ksa
    export NAMESPACE=default
    export GS_BUCKET=BUCKET_NAME-${PROJECT_ID}
    export HF_TOKEN=YOUR_HUGGING_FACE_TOKEN
    

    Replace the following values:

    • CLUSTER_NAME: the name of your GKE cluster.
    • CONTROL_PLANE_LOCATION: the Compute Engine region of your GKE cluster's control plane.
    • NODE_LOCATION: the location of the nodes. Choose a zone where NVIDIA B200 or H200 GPUs are available.
    • GPU_TYPE: the accelerator that you reserved in your Compute Engine capacity reservation. Must be one of the following values:
      • nvidia-b200: NVIDIA B200 (180 GB)
      • nvidia-h200-141gb: NVIDIA H200 (141 GB)
    • MACHINE_TYPE: the machine type to use:
      • For NVIDIA B200 (180 GB) GPUs, use a4-highgpu-8g or later.
      • For NVIDIA H200 (141 GB) GPUs, use a3-ultragpu-8g or later.
    • GKE_VERSION: the GKE version to use:
      • For NVIDIA B200 (180 GB) GPUs, use 1.32.2-gke.1422000 or later.
      • For NVIDIA H200 (141 GB) GPUs, use 1.31.4-gke.1183000 or later.
    • BUCKET_NAME: the base name of your Cloud Storage bucket.
    • YOUR_HUGGING_FACE_TOKEN: your Hugging Face token.
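
    For example, an illustrative B200 setup might use values like the following. The zone is only a placeholder; verify GPU zone availability and the supported GKE version for your accelerator before you copy these values:

    export CONTROL_PLANE_LOCATION=us-central1
    export NODE_LOCATION=us-central1-b
    export CLUSTER_NAME=nemo-rl-cluster
    export GPU_TYPE=nvidia-b200
    export MACHINE_TYPE=a4-highgpu-8g
    export GKE_VERSION=1.32.2-gke.1422000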

  4. Create the following environment variables for networking:

    export GVNIC_NETWORK_PREFIX="GVNIC-NAME"
    export RDMA_NETWORK_PREFIX="RDMA-NAME"
    

    Replace the following values:

    • GVNIC-NAME: a prefix for the gVNIC network names. You can use any prefix you want.
    • RDMA-NAME: a prefix for the Remote Direct Memory Access (RDMA) network names. You can use any prefix you want.

Set up the infrastructure

In this section, you create the VPC networks and the GKE cluster.

Create the VPC networks

  1. Create a VPC network for the gVNIC interface:

    gcloud compute networks create ${GVNIC_NETWORK_PREFIX}-net \
        --project=${PROJECT_ID} \
        --subnet-mode=custom
    gcloud compute networks subnets create ${GVNIC_NETWORK_PREFIX}-sub \
        --network=${GVNIC_NETWORK_PREFIX}-net \
        --location=${CONTROL_PLANE_LOCATION} \
        --range=192.168.0.0/24
    gcloud compute firewall-rules create ${GVNIC_NETWORK_PREFIX}-internal \
        --network=${GVNIC_NETWORK_PREFIX}-net \
        --action=ALLOW \
        --rules=tcp:0-65535,udp:0-65535,icmp \
        --source-ranges=192.168.0.0/16
    
  2. Create a VPC network and subnets for RDMA, including eight subnets for the eight GPUs:

    gcloud compute networks create ${RDMA_NETWORK_PREFIX}-net \
        --network-profile=${CONTROL_PLANE_LOCATION}-vpc-roce \
        --subnet-mode=custom
    
    for N in $(seq 0 7); do
      gcloud compute networks subnets create ${RDMA_NETWORK_PREFIX}-sub-$N \
        --network=${RDMA_NETWORK_PREFIX}-net \
        --location=${CONTROL_PLANE_LOCATION} \
        --range=192.168.$((N+1)).0/24 &
    done
    wait
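
    Optionally, verify that the networks and subnets were created:

    gcloud compute networks list --filter="name~${GVNIC_NETWORK_PREFIX} OR name~${RDMA_NETWORK_PREFIX}"
    gcloud compute networks subnets list --filter="network~${RDMA_NETWORK_PREFIX}-net"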
    

Create a GKE cluster

You can set up NeMo RL in a GKE Autopilot or Standard cluster. We recommend an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see About GKE modes of operation.

Autopilot

  1. Create an Autopilot cluster:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION} \
        --enable-multi-networking  \
        --enable-ray-operator
    
  2. Get credentials for the cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION}
    
  3. Install the NCCL RDMA installer for Autopilot:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer-autopilot.yaml
    

Standard

  1. Create a Standard cluster:

    gcloud container clusters create ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION} \
        --enable-dataplane-v2 \
        --enable-ip-alias \
        --enable-multi-networking \
        --addons=RayOperator \
        --num-nodes=1
    
  2. Get credentials for the cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_LOCATION}
    
  3. Create a GPU node pool:

    gcloud container node-pools create gpu-pool \
        --cluster=${CLUSTER_NAME} \
        --node-locations=${NODE_LOCATION} \
        --machine-type=${MACHINE_TYPE} \
        --accelerator=type=${GPU_TYPE},count=8 \
        --spot \
        --additional-node-network=network=${GVNIC_NETWORK_PREFIX}-net,subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-0 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-1 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-2 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-3 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-4 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-5 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-6 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-7
    
  4. Install the NCCL RDMA installer:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer.yaml
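
    Optionally, confirm that the NCCL RDMA installer workload is running. The exact DaemonSet name comes from the manifest that you applied, so adjust the filter if needed:

    kubectl get daemonsets -n kube-system | grep -i rdma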
    

Configure the network mapping

  1. Save the following manifest as network-mapping.yaml:

    # Copyright 2026 Google LLC. All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: gvnic-1
    spec:
      vpc: ${GVNIC_NETWORK_PREFIX}-net
      vpcSubnet: ${GVNIC_NETWORK_PREFIX}-sub
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: gvnic-1
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: gvnic-1
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-0
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-0
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-0
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-0
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-1
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-1
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-1
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-1
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-2
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-2
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-2
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-2
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-3
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-3
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-3
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-3
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-4
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-4
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-4
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-4
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-5
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-5
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-5
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-5
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-6
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-6
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-6
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-6
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-7
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-7
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-7
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-7
    
  2. Apply the manifest:

    kubectl apply -f network-mapping.yaml
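
    Optionally, confirm that the Network and GKENetworkParamSet objects were created:

    kubectl get networks.networking.gke.io
    kubectl get gkenetworkparamsets.networking.gke.io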
    

Prepare the storage

In this section, you create a Cloud Storage bucket and a Managed Lustre instance, which provisions the high-performance storage that the RL workload needs.

  1. Create a Cloud Storage bucket:

    gcloud storage buckets create gs://${GS_BUCKET} \
        --location=${CONTROL_PLANE_LOCATION} \
        --enable-hierarchical-namespace \
        --uniform-bucket-level-access
    
  2. Create a Kubernetes ServiceAccount (KSA) and bind it to the bucket:

    kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
    
    gcloud storage buckets add-iam-policy-binding gs://${GS_BUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"
    
  3. Follow these steps to set up Managed Lustre:

    1. Create a Managed Lustre instance by following the steps in Create a Managed Lustre instance. Make sure that the instance uses the same network as your GKE cluster.
    2. Access the Managed Lustre instance by following the steps in Access an existing Managed Lustre instance.
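
    The values.yaml manifest in the next section mounts a PersistentVolumeClaim named lustre-pvc, so make sure that the claim you create for the Managed Lustre instance uses that name (or update the manifest) and is bound before you deploy the Ray cluster:

    kubectl get pvc lustre-pvc -n ${NAMESPACE}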

Deploy the RayCluster

In this section, you clone the sample repository, prepare the manifests, and run the launcher.sh script:

  1. Clone the sample repository:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
    
  2. Navigate to the working directory:

    cd ai-ml/nemo-rl-on-gke/nemoRL
    
  3. Review the values.yaml manifest:

    # Copyright 2026 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    image:
      repository: "nvcr.io/nvidia/nemo-rl"
      tag: "v0.5.0" 
      pullPolicy: Always
    
    nameOverride: "kuberay"
    fullnameOverride: ""
    
    common:
      containerEnv: {}
    
    configMap:
      fluentbit:
        data:
          fluent-bit.conf: |
            [INPUT]
                Name              tail
                Path              /tmp/ray/session_latest/logs/worker-*
                Tag               ray-worker
            [INPUT]
                Name              tail
                Path              /tmp/ray/session_latest/logs/raylet*
                Tag               raylet
            [INPUT]
                Name              tail
                Path              /tmp/ray/session_latest/logs/*
                Exclude_Path      /tmp/ray/session_latest/logs/debug_state.txt,/tmp/ray/session_latest/logs/raylet*,/tmp/ray/session_latest/logs/worker-*
                Tag               ray-misc
            [OUTPUT]
                Name              stackdriver
                Match             *
                resource          gce_instance
                labels_key        labels
    
    # --- Head Node Configuration ---
    head:
      enableInTreeAutoscaling: false
      serviceAccountName: ""
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            networking.gke.io/default-interface: 'eth0'
      containerEnv:
      - name: RAY_GROUP
        value: "head"
      resources:
        limits:
          cpu: "64"
          memory: "500G"
          nvidia.com/gpu: 0
        requests:
          cpu: "64"
          memory: "500G"
          nvidia.com/gpu: 0
      tolerations:
        # - operator: "Exists"
        #   key: "components.gke.io/gke-managed-components"
        # - key: "nvidia.com/gpu"
        #   operator: "Exists"
        #   effect: "NoSchedule"
      volumeMounts:
        - mountPath: /data
          name: lustre-data
    
      volumes:
        - name: log-volume
          emptyDir: {}
        - name: fluentbit-config-volume
          configMap:
            name: "ray-cluster-kuberay-fluentbit-config"
        - name: lustre-data
          persistentVolumeClaim:
            claimName: lustre-pvc
      sidecarContainers:
        - name: fluent-bit
          image: fluent/fluent-bit:latest
          env:
          - name: RAY_GROUP
            value: "head"
          volumeMounts:
            - name: fluentbit-config-volume
              mountPath: /fluent-bit/etc/
            - mountPath: /tmp/ray
              name: log-volume
    
      # --- HEAD POD STARTUP SCRIPT ---
      command:
        - "bash"
        - "-c"
        - |
          set -ex
          echo "--- Head Pod Setup ---"
          apt-get update
          apt-get install -y sudo netcat-openbsd pciutils
          cd /opt/nemo-rl
          /usr/bin/python -m pip install uv
          /usr/bin/python -m uv venv
          echo "Head pod setup complete. Starting Ray..."
    
          exec ${KUBERAY_GEN_RAY_START_CMD}
    
      args: []
      headService: {}
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: nvidia-b200 #cloud.google.com/gke-nodepool: cpu-node-pool-llama #cpu-node-pool
    
    # --- Default Worker (Disabled) ---
    worker:
      disabled: true
    
    # --- A4 GPU Worker Groups ---
    additionalWorkerGroups:
      worker-grp-0:
        disabled: false
        replicas: 4
        annotations:
          networking.gke.io/default-interface: 'eth0'
          networking.gke.io/interfaces: |
            [
              {"interfaceName":"eth0","network":"default"},
              {"interfaceName":"eth1","network":"gvnic-1"},
              {"interfaceName":"eth2","network":"rdma-0"},
              {"interfaceName":"eth3","network":"rdma-1"},
              {"interfaceName":"eth4","network":"rdma-2"},
              {"interfaceName":"eth5","network":"rdma-3"},
              {"interfaceName":"eth6","network":"rdma-4"},
              {"interfaceName":"eth7","network":"rdma-5"},
              {"interfaceName":"eth8","network":"rdma-6"},
              {"interfaceName":"eth9","network":"rdma-7"}
            ]
        containerEnv:
          - name: RAY_GROUP
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['ray.io/group']
          - name: NCCL_NET  
            value: "gIB"
          - name: NCCL_IB_GID_INDEX
            value: "3"   
          - name: GLOO_SOCKET_IFNAME
            value: "eth0"
          - name: NCCL_CROSS_NIC
            value: "0"
          - name: NCCL_SOCKET_IFNAME
            value: "eth0"
          - name: TP_SOCKET_IFNAME # Specific to DTensor/PyTorch Distributed
            value: "eth0"
          - name: NCCL_TUNER_CONFIG_PATH
            value: "/usr/local/gib/configs/tuner_config_a4.txtpb"
          - name: NCCL_NET_GDR_LEVEL
            value: "PIX"
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64
        resources:
          limits:
            nvidia.com/gpu: 8
            cpu: "206"
            memory: "2400Gi"
          requests:
            nvidia.com/gpu: 8
            cpu: "206"
            memory: "2400Gi"
    
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-b200
        tolerations:
          - operator: "Exists"
            key: "nvidia.com/gpu"
          - operator: "Exists"
            key: "cloud.google.com/impending-node-termination"
          - operator: "Exists"
            key: "user-workload"
        securityContext:
          privileged: true
        volumes:
          - name: log-volume
            emptyDir: {}
          - name: shared-memory
            emptyDir:
              medium: "Memory"
              sizeLimit: 240Gi
          - name: ray-tmp
            emptyDir:
              medium: "Memory"
          - name: fluentbit-config-volume
            configMap:
              name: "ray-cluster-kuberay-fluentbit-config"
          - name: nvidia-install-dir-host
            hostPath:
              path: /home/kubernetes/bin/nvidia
          - name: gib-nccl-plugin-volume
            hostPath: 
              path: /home/kubernetes/bin/gib
          - name: lustre-data
            persistentVolumeClaim:
              claimName: lustre-pvc
        volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
          - name: shared-memory
            mountPath: /dev/shm
          - name: nvidia-install-dir-host
            mountPath: /usr/local/nvidia
          - name: gib-nccl-plugin-volume
            mountPath: /usr/local/gib
          - mountPath: /data
            name: lustre-data   
        # --- WORKER POD STARTUP SCRIPT ---
        command:
          - "bash"
          - "-c"
          - |
            set -ex
    
            echo "--- Worker Pod Setup ---"
            apt-get update
            apt-get install -y sudo netcat-openbsd pciutils
            cd /opt/nemo-rl
            /usr/bin/python -m pip install uv
            /usr/bin/python -m uv venv
    
            ldconfig /usr/local/nvidia/lib64/
            ldconfig -p | grep libcuda | sed 's/^/  /'
            export LD_LIBRARY_PATH="/usr/local/gib/lib64:$LD_LIBRARY_PATH"
            source /usr/local/gib/scripts/set_nccl_env.sh
    
            echo "Worker pod setup complete. Starting Ray..."
    
            exec ${KUBERAY_GEN_RAY_START_CMD}
    
    
        sidecarContainers:
          - name: fluent-bit
            env:
              - name: RAY_GROUP
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['ray.io/group']
            image: fluent/fluent-bit:latest
            volumeMounts:
              - name: fluentbit-config-volume
                mountPath: /fluent-bit/etc/
              - mountPath: /tmp/ray
                name: log-volume
    
    # --- Service Config ---
    service:
      type: ClusterIP
    

    Depending on the accelerator that you use in this tutorial, replace NCCL_TUNER_CONFIG_PATH with the appropriate value:

    • NVIDIA B200 (180 GB): /usr/local/gib/configs/tuner_config_a4.txtpb
    • NVIDIA H200 (141 GB): /usr/local/gib/configs/tuner_config_a3u.txtpb

    In this manifest, the head node manages jobs and hosts the Ray dashboard. The worker nodes run the training job.

  4. Install the Ray cluster:

    export REPLICA_COUNT=2
    helm install ray-cluster . \
      --set values.additionalWorkerGroups.worker-grp-0.replicas=$REPLICA_COUNT
    

    In this tutorial, you use two worker nodes. If you want to change the number of worker nodes, change the REPLICA_COUNT value.
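
    For example, to scale an existing installation to four workers, you can pass the new value through a helm upgrade. This assumes the same chart value path that the install command uses:

    helm upgrade ray-cluster . \
      --set values.additionalWorkerGroups.worker-grp-0.replicas=4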

  5. To deploy the Ray cluster, run the launcher.sh script:

    bash launcher.sh
    
  6. Verify that the worker nodes and the head node are running:

    kubectl get pods
    

    The output is similar to the following:

    NAME                                          READY STATUS RESTARTS AGE
    ray-cluster-kuberay-head-sw7dp                3/3   Running 0      33h
    ray-cluster-kuberay-worker-grp-0-worker-gkbxw 3/3   Running 0      33h
    ray-cluster-kuberay-worker-grp-0-worker-kdg62 3/3   Running 0      33h
    
  7. Verify that the Ray cluster is running:

    kubectl ray get clusters
    

    The output is similar to the following:

    NAME                 NAMESPACE DESIRED WORKERS AVAILABLE WORKERS CPUS GPUS TPUS MEMORY CONDITION STATUS AGE
    ray-cluster-kuberay  default   2       2           618     17   0    1573741824k RayClusterProvisioned ready 33h
    

Launch the GRPO job

After the Ray cluster is ready, you can submit Ray jobs to the running Ray cluster on GKE. NeMo RL downloads the model automatically while the RL training job runs.

To submit the Ray job, start an interactive session to execute the job.

  1. To establish a local connection to the Ray cluster, run the following command:

      kubectl ray session ray-cluster-kuberay
    

    This command starts port forwarding between your local machine and the Ray head node in your GKE cluster. Note that your terminal is occupied while this session is active; to continue, open a separate terminal instance.
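
    In the second terminal, you can optionally confirm that the forwarded connection works. The kubectl ray session command typically forwards the Ray dashboard port (8265), so a plain HTTP request should return the dashboard page:

    curl -s http://localhost:8265 | head -n 5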

  2. Modify the gemma3-27b-gsm8k.sh file:

    # Copyright 2026 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    #!/bin/bash
    WANDB_API_KEY='YOUR_WANDB_API_KEY' # Update this with your WANDB API key
    HF_TOKEN='YOUR_HF_TOKEN' # Update this with your HF token
    WORLD_SIZE=16
    
    # --- Step 1: Find the Ray Head Pod ---
    echo "Finding Ray head pod..."
    export HEAD_POD_NAME=$(kubectl get pods --selector=ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
    if [ -z "$HEAD_POD_NAME" ]; then
        echo "Error: No running Ray head pod found. Please check your cluster."
        exit 1
    fi
    echo "Found head pod: $HEAD_POD_NAME"
    echo ""
    
    # --- Step 2: Define the Job Script to Run ---
    # This is the script that will be executed *inside* the head pod.
    # It assumes the 'uv venv' setup from the values.yaml is already done.
    JOB_SCRIPT=$(cat <<EOF
    set -ex
    
    echo "--- Running on Ray Head Pod ($HOSTNAME) ---"
    cd /opt/nemo-rl
    
    git pull && git checkout main
    
    sed -i 's/subset: Optional\[str\] = None/subset: Optional[str] = "main"/' /opt/nemo-rl/nemo_rl/data/datasets/response_datasets/response_dataset.py
    sed -i 's/raw_dataset = load_dataset(data_path)/raw_dataset = load_dataset(data_path, "main")/' /opt/nemo-rl/nemo_rl/data/datasets/utils.py
    
    echo "Setting environment variables..."
    export WANDB_API_KEY=$WANDB_API_KEY
    export HF_TOKEN=$HF_TOKEN
    export HF_HOME=/opt/nemo-rl/
    
    ###-----Example to launch Gemma3-27B on 3 nodes (24 GPUs)----------
    uv run python examples/run_grpo_math.py \
      --config examples/configs/recipes/llm/grpo-gemma3-27b-it-8n4g-fsdp2tp4-actckpt-long.yaml \
      cluster.num_nodes=2 \
      cluster.gpus_per_node=8 \
      grpo.max_num_steps=300 \
      checkpointing.checkpoint_dir=/data/nemo_rl_gemma3_27b_3_17 \
      data.dataset_name=ResponseDataset \
      +data.train_data_path=openai/gsm8k \
      +data.val_data_path=openai/gsm8k \
      +data.val_split=test \
      +data.train_split=train \
      +data.subset="main" \
      +data.input_key="question" \
      +data.output_key="answer" \
      logger.tensorboard_enabled=False \
      logger.wandb_enabled=True \
      logger.wandb.name='nemo_rl_gemma3_27b_3_17' \
      grpo.num_prompts_per_step=16 \
      grpo.num_generations_per_prompt=64 \
      policy.generation.colocated.enabled=False \
      policy.generation.colocated.resources.num_nodes=1 \
      policy.generation.colocated.resources.gpus_per_node=8 \
      policy.generation.vllm_cfg.tensor_parallel_size=8 \
      policy.generation.vllm_cfg.gpu_memory_utilization=0.9 \
      policy.dtensor_cfg.tensor_parallel_size=8
    
    echo "--- Job Finished ---"
    EOF
    )
    
    # --- Step 3: Execute the Job ---
    echo "Submitting job to $HEAD_POD_NAME..."
    echo "$JOB_SCRIPT" | tr -d '\r' | kubectl exec -i $HEAD_POD_NAME -c ray-head -- /bin/bash
    
    echo ""
    echo "Job submission complete."
    

    Replace the following values in the gemma3-27b-gsm8k.sh file:

    • YOUR_WANDB_API_KEY: your Weights & Biases (WandB) API key.
    • YOUR_HF_TOKEN: your Hugging Face token.

    This file contains the configuration for running the job with the gemma3-27b-it model on the GSM8k dataset. To complete the GRPO training pipeline, the script defines the following parameters:

    • num_prompts_per_step: 16 and num_generations_per_prompt: 64: the Gemma3-27b-it model generates a large number of responses for each prompt. With this configuration, the model generates a total of 1,024 responses per step (16 × 64 = 1,024).
    • policy.generation.colocated.enabled=False: this parameter disables colocated generation, which means that the model doesn't generate responses on the same nodes as the training process. In standard RL, the same GPUs handle both training and generation. In this NeMo RL setup, you dedicate nodes (managed through the policy.generation.colocated.resources parameters) exclusively to vLLM inference, while the rest of the cluster focuses on the heavy training math. Separating these workloads prevents resource contention between the memory-intensive training buffers and the compute-intensive inference workload.
  3. To submit the job, run the following command:

    bash gemma3-27b-it/gemma3-27b-gsm8k.sh
    

    While the job runs, the output shows the training results, timing, and performance metrics.

Monitor the health of the GRPO job

After Ray completes the job, NeMo RL stores checkpoints in the configured path.

  1. Install the tree utility (run this inside the Ray head pod, which you connect to in the next step):

    apt install tree
    
  2. To monitor the health of the GRPO job, connect to the Ray head node:

    kubectl exec -it $(kubectl get pods -l ray.io/node-type=head -o name) -c ray-head -- bash
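
    Inside the pod, list the checkpoint directory. The path matches the checkpointing.checkpoint_dir value from the job script:

    tree /data/nemo_rl_gemma3_27b_3_17/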
    

    The output is similar to the following:

    root@ray-cluster-kuberay-worker-grp-0-worker-gkbxw:/opt/nemo-rl# tree /data/nemo_rl_gemma3_27b_3_17/
    /data/nemo_rl_gemma3_27b_3_17/
    `-- step_10
        |-- config.yaml
        |-- policy
        |   |-- optimizer
        |   |   |-- __0_0.distcp
        |   |   |-- __10_0.distcp
        |   |   |-- __11_0.distcp
        |   |   |-- __12_0.distcp
        |   |   |-- __13_0.distcp
        |   |   |-- __14_0.distcp
        |   |   |-- __15_0.distcp
        |   |   |-- __1_0.distcp
        |   |   |-- __2_0.distcp
        |   |   |-- __3_0.distcp
        |   |   |-- __4_0.distcp
        |   |   |-- __5_0.distcp
        |   |   |-- __6_0.distcp
        |   |   |-- __7_0.distcp
        |   |   |-- __8_0.distcp
        |   |   `-- __9_0.distcp
        |   |-- tokenizer
        |   |   |-- chat_template.jinja
        |   |   |-- special_tokens_map.json
        |   |   |-- tokenizer.json
        |   |   `-- tokenizer_config.json
        |   `-- weights
        |       |-- __0_0.distcp
        |       |-- __10_0.distcp
        |       |-- __11_0.distcp
        |       |-- __12_0.distcp
        |       |-- __13_0.distcp
        |       |-- __14_0.distcp
        |       |-- __15_0.distcp
        |       |-- __1_0.distcp
        |       |-- __2_0.distcp
        |       |-- __3_0.distcp
        |       |-- __4_0.distcp
        |       |-- __5_0.distcp
        |       |-- __6_0.distcp
        |       |-- __7_0.distcp
        |       |-- __8_0.distcp
        |       `-- __9_0.distcp
        |-- train_dataloader.pt
        `-- training_info.json
    
    6 directories, 39 files
    

Clean up

To avoid incurring charges, delete the resources:

helm delete ray-cluster
gcloud container clusters delete ${CLUSTER_NAME} --location=${CONTROL_PLANE_LOCATION}
gcloud storage rm -r gs://${GS_BUCKET}
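
If you no longer need them, also delete the networking resources that you created for this tutorial, and delete the Managed Lustre instance by following the Managed Lustre documentation. The following commands assume that the environment variables from earlier in this tutorial are still set:

gcloud compute firewall-rules delete ${GVNIC_NETWORK_PREFIX}-internal --quiet
gcloud compute networks subnets delete ${GVNIC_NETWORK_PREFIX}-sub --region=${CONTROL_PLANE_LOCATION} --quiet
gcloud compute networks delete ${GVNIC_NETWORK_PREFIX}-net --quiet
for N in $(seq 0 7); do
  gcloud compute networks subnets delete ${RDMA_NETWORK_PREFIX}-sub-$N --region=${CONTROL_PLANE_LOCATION} --quiet
done
gcloud compute networks delete ${RDMA_NETWORK_PREFIX}-net --quiet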

What's next