在 GKE 上使用 Verl 微调和扩缩强化学习

本教程介绍了如何在 Google Kubernetes Engine (GKE) 上编排用于强化学习的分布式训练环境。您将使用 Ray 和 verl(Volcano Engine 强化学习)框架设置分布式训练环境,以微调 Qwen2.5-32B-Instruct 模型。

本教程重点介绍如何使用 Ray 和 verl 在 GKE 上运行 Group Relative Policy Optimization (GRPO) 训练流水线。GRPO 是一种旨在提升模型推理能力的强化学习算法。这种节省内存的算法通过以下方式简化了强化学习 (RL) 流程:消除 Critic 或价值模型,并使用基于相对群组的计算。

如果您需要设置一个分布式训练环境,以便将数据、模型权重和训练引擎分离以提高效率,那么本教程是一个很好的起点。

背景

以下部分简要概述了本教程中使用的概念。

强化学习 (RL)

RL 通过经验、探索和反馈来训练模型,而不是静态模仿。虽然预训练可以教会模型说什么,但基于人类反馈的强化学习 (RLHF) 可以教会模型如何提供有用的、安全的、符合逻辑的回答。RL 可作为基础模型与针对特定用例微调的模型之间的桥梁。

如需了解详情,请参阅什么是强化学习?

群组相对策略优化 (GRPO)

GRPO 是一种由 DeepSeek 推广的算法,通过移除 Critic 模型,为 LLM 对齐提供了一种内存高效的近端策略优化 (PPO) 替代方案。与 Critic 网络不同,GRPO 会针对同一提示生成一组回答,并使用该组回答的平均奖励作为基准。

如需了解详情,请参阅 GRPO

Volcano Engine 强化学习 (verl)

verl 是一个高性能框架,旨在处理基于 LLM 的强化学习的复杂内存和计算模式。

如需了解详情,请参阅 verl

目标

本教程将通过完成以下步骤,向您展示如何在 GKE 上使用 verl 设置强化学习:

  1. 设置具有 B200 或 H200 GPU 的 GKE 集群。
  2. 配置 KubeRay 以管理分布式 Ray 集群。
  3. 使用 Cloud Storage FUSE 在所有节点上装载 Cloud Storage 存储桶。
  4. 使用 verl 运行 GRPO 训练作业,使 Qwen2.5-32B-Instruct 模型与 GSM8K 数据集保持一致。

准备工作

  • 登录您的 Google Cloud 账号。如果您是 Google Cloud新手,请 创建一个账号来评估我们的产品在实际场景中的表现。新客户还可获享 $300 赠金,用于运行、测试和部署工作负载。
  • 安装 Google Cloud CLI。

  • 如果您使用的是外部身份提供方 (IdP),则必须先使用联合身份登录 gcloud CLI

  • 如需初始化 gcloud CLI,请运行以下命令:

    gcloud init
  • 创建或选择 Google Cloud 项目

    选择或创建项目所需的角色

    • 选择项目:选择项目不需要特定的 IAM 角色,您可以选择已获授角色的任何项目。
    • 创建项目:如需创建项目,您需要拥有 Project Creator 角色 (roles/resourcemanager.projectCreator),该角色包含 resourcemanager.projects.create 权限。了解如何授予角色
    • 创建 Google Cloud 项目:

      gcloud projects create PROJECT_ID

      PROJECT_ID 替换为您要创建的 Google Cloud 项目的名称。

    • 选择您创建的 Google Cloud 项目:

      gcloud config set project PROJECT_ID

      PROJECT_ID 替换为您的 Google Cloud 项目名称。

  • 验证是否已为您的 Google Cloud 项目启用结算功能

  • 启用所需的 API:

    启用 API 所需的角色

    如需启用 API,您需要拥有 Service Usage Admin IAM 角色 (roles/serviceusage.serviceUsageAdmin),该角色包含 serviceusage.services.enable 权限。了解如何授予角色

    gcloud services enable container.googleapis.com storage.googleapis.com compute.googleapis.com
  • 安装 Google Cloud CLI。

  • 如果您使用的是外部身份提供方 (IdP),则必须先使用联合身份登录 gcloud CLI

  • 如需初始化 gcloud CLI,请运行以下命令:

    gcloud init
  • 创建或选择 Google Cloud 项目

    选择或创建项目所需的角色

    • 选择项目:选择项目不需要特定的 IAM 角色,您可以选择已获授角色的任何项目。
    • 创建项目:如需创建项目,您需要拥有 Project Creator 角色 (roles/resourcemanager.projectCreator),该角色包含 resourcemanager.projects.create 权限。了解如何授予角色
    • 创建 Google Cloud 项目:

      gcloud projects create PROJECT_ID

      PROJECT_ID 替换为您要创建的 Google Cloud 项目的名称。

    • 选择您创建的 Google Cloud 项目:

      gcloud config set project PROJECT_ID

      PROJECT_ID 替换为您的 Google Cloud 项目名称。

  • 验证是否已为您的 Google Cloud 项目启用结算功能

  • 启用所需的 API:

    启用 API 所需的角色

    如需启用 API,您需要拥有 Service Usage Admin IAM 角色 (roles/serviceusage.serviceUsageAdmin),该角色包含 serviceusage.services.enable 权限。了解如何授予角色

    gcloud services enable container.googleapis.com storage.googleapis.com compute.googleapis.com
  • 向您的用户账号授予角色。对以下每个 IAM 角色运行以下命令一次: roles/container.admin, roles/iam.serviceAccountAdmin, roles/storage.admin

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

    替换以下内容:

    • PROJECT_ID:您的项目 ID。
    • USER_IDENTIFIER:您的用户 账号的标识符。例如,myemail@example.com
    • ROLE:您授予用户账号的 IAM 角色。

准备环境

在本教程中,您将使用 Cloud Shell

  1. 前往 Google Cloud 控制台

  2. 点击 Google Cloud 控制台窗口顶部的激活 Cloud Shell 按钮。

  3. 设置以下环境变量:

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export GPU_TYPE=GPU_TYPE
    export CONTROL_PLANE_REGION=CONTROL_PLANE_REGION
    export NODE_ZONE=NODE_ZONE
    export CLUSTER_NAME=CLUSTER_NAME
    export KSA_NAME=CLUSTER_NAME
    export GS_BUCKET=BUCKET_NAME-${PROJECT_ID}
    export NAMESPACE=default
    export HF_TOKEN=YOUR_HUGGING_FACE_TOKEN
    export MACHINE_TYPE=MACHINE_TYPE
    export RESERVATION=RESERVATION
    

    替换以下值:

    • CONTROL_PLANE_REGION:GKE 集群控制平面的 Compute Engine 区域。
    • GPU_TYPE:您在 Compute Engine 容量预留中预留的加速器。必须是以下值之一:
      • nvidia-b200:NVIDIA B200 (180GB)
      • nvidia-h200-141gb:NVIDIA H200 (141GB)
    • NODE_ZONE:GKE 节点的可用区。选择 NVIDIA B200 或 H200 GPU 可用的可用区
    • CLUSTER_NAME:GKE 集群的名称。
    • BUCKET_NAME:Cloud Storage 存储桶的基本名称。您无需指定 gs:// 前缀。
    • YOUR_HUGGING_FACE_TOKEN:您的 Hugging Face 令牌,用于访问模型。
    • MACHINE_TYPE:要使用的机器类型:
      • 对于 NVIDIA B200 (180 GB) GPU,请使用 a4-highgpu-8g 或更高版本。
      • 对于 NVIDIA H200 (141 GB) GPU,请使用 a3-ultragpu-8g 或更高版本。
    • RESERVATION:GPU 预留的名称。
  4. 为网络创建以下环境变量:

    export GVNIC_NETWORK_PREFIX="GVNIC-NAME"
    export RDMA_NETWORK_PREFIX="RDMA-NAME"
    

    替换以下值:

    • GVNIC-NAME:gVNIC 网络名称的前缀。您可以使用任何前缀。
    • RDMA-NAME:远程直接内存访问 (RDMA) 网络的网络前缀。您可以使用任何前缀。

设置基础架构

在本部分中,您将创建一个 RDMA 网络和一个 GKE 集群。

创建 RDMA 网络和子网

  1. 为 gVNIC 接口创建 VPC 网络:

    gcloud compute networks create ${GVNIC_NETWORK_PREFIX}-net \
        --subnet-mode=custom \
        --project=${PROJECT_ID}
    gcloud compute networks subnets create ${GVNIC_NETWORK_PREFIX}-sub \
        --network=${GVNIC_NETWORK_PREFIX}-net \
        --region=${CONTROL_PLANE_REGION} \
        --range=192.168.0.0/24
    gcloud compute firewall-rules create ${GVNIC_NETWORK_PREFIX}-internal \
        --network=${GVNIC_NETWORK_PREFIX}-net \
        --action=ALLOW \
        --rules=tcp:0-65535,udp:0-65535,icmp \
        --source-ranges=192.168.0.0/16
    
  2. 创建 VPC 网络和子网,为 RDMA 创建 8 个子网,以供 8 个 GPU 使用:

    gcloud beta compute networks create ${RDMA_NETWORK_PREFIX}-net \
        --network-profile=${NODE_ZONE}-vpc-roce \
        --subnet-mode=custom
    
    for N in $(seq 0 7); do
      gcloud compute networks subnets create ${RDMA_NETWORK_PREFIX}-sub-$N \
        --network=${RDMA_NETWORK_PREFIX}-net \
        --region=${CONTROL_PLANE_REGION} \
        --range=192.168.$((N+1)).0/24 &
    done
    wait
    
  3. 克隆示例代码库:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
    
  4. 导航到工作目录:

    cd ai-ml/verl-on-gke
    

创建 GKE 集群

您可以在 GKE Autopilot 或 Standard 集群中设置 verl。我们建议您使用 Autopilot 集群获得全托管式 Kubernetes 体验。如需选择最适合您的工作负载的 GKE 操作模式,请参阅选择 GKE 操作模式

Autopilot

  1. 创建 Autopilot 集群:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_REGION} \
        --enable-multi-networking  \
        --enable-ray-operator
    
  2. 获取集群的凭据:

    gcloud container clusters get-credentials ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_REGION}
    
  3. 为 Autopilot 安装 NCCL RDMA 安装程序:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer-autopilot.yaml
    

标准

  1. 创建 Standard 集群:

    gcloud container clusters create ${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_REGION} \
        --enable-dataplane-v2 \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --enable-ip-alias \
        --enable-multi-networking \
        --addons=RayOperator,GcsFuseCsiDriver \
        --machine-type=c2-standard-16 \
        --num-nodes=1 \
        --min-nodes=1 \
        --max-nodes=5 \
        --enable-autoscaling
    
  2. 获取集群的凭据:

    gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CONTROL_PLANE_REGION}
    
  3. 创建 GPU 节点池。这些节点池使用您的预留来确保可用性。我们从 2 个节点开始:

    gcloud container node-pools create gpu-pool \
        --cluster=${CLUSTER_NAME} \
        --location=${CONTROL_PLANE_REGION} \
        --node-locations=${NODE_ZONE} \
        --machine-type=${MACHINE_TYPE} \
        --accelerator=type=${GPU_TYPE},count=8,gpu-driver-version=DEFAULT \
        --reservation-affinity=specific \
        --reservation=${RESERVATION} \
        --enable-autoscaling \
        --num-nodes=2 \
        --total-max-nodes=10 \
        --additional-node-network=network=${GVNIC_NETWORK_PREFIX}-net,subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-0 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-1 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-2 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-3 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-4 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-5 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-6 \
        --additional-node-network=network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-7
    
  4. 安装用于标准集群的 NCCL RDMA 安装程序:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer.yaml
    

配置网络映射

  1. 检查 network-mapping.yaml 清单:

    # Copyright 2026 Google LLC. All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: gvnic-1
    spec:
      vpc: ${GVNIC_NETWORK_PREFIX}-net
      vpcSubnet: ${GVNIC_NETWORK_PREFIX}-sub
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: gvnic-1
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: gvnic-1
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-0
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-0
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-0
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-0
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-1
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-1
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-1
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-1
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-2
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-2
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-2
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-2
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-3
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-3
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-3
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-3
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-4
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-4
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-4
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-4
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-5
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-5
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-5
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-5
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-6
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-6
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-6
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-6
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: rdma-7
    spec:
      vpc: ${RDMA_NETWORK_PREFIX}-net
      vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-7
      deviceMode: RDMA
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: rdma-7
    spec:
      type: "Device"
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: rdma-7
    
  2. 应用清单:

    envsubst < network-mapping.yaml > network-mapping-updated.yaml
    kubectl apply -f network-mapping-updated.yaml
    

准备数据和存储空间

  1. 创建 Cloud Storage 存储分区,请运行以下命令:

    gcloud storage buckets create gs://${GS_BUCKET} --location=${REGION} --enable-hierarchical-namespace --uniform-bucket-level-access
    
  2. 创建 Kubernetes 服务账号 (KSA) 并将其绑定到相应存储桶:

    kubectl create serviceaccount ${KSA_NAME} --namespace ${NAMESPACE}
    
    gcloud storage buckets add-iam-policy-binding gs://${GS_BUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"
    
  3. 为 Hugging Face 创建 Secret:

    kubectl create secret generic hf-secret --from-literal=hf_api_token=${HF_TOKEN}
    
  4. 检查 gcsfuse-storage.yaml 清单:

    # Copyright 2026 Google LLC. All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: training-bucket-pv
    spec:
      accessModes:
      -   ReadWriteMany
      capacity:
        storage: 768Gi
      persistentVolumeReclaimPolicy: Delete
      storageClassName: gcsfuse-sc
      mountOptions:
      -   implicit-dirs
      -   metadata-cache:negative-ttl-secs:0
      -   metadata-cache:ttl-secs:0
      -   metadata-cache:stat-cache-max-size-mb:-1
      -   metadata-cache:type-cache-max-size-mb:-1
      -   file-cache:max-size-mb:-1
      -   file-cache:cache-file-for-range-read:true
      -   file-cache:enable-parallel-downloads:true
      -   read_ahead_kb=1024
      -   write:enable-streaming-writes:true
      -   write:global-max-blocks:200000
      csi:
        driver: gcsfuse.csi.storage.gke.io
        volumeHandle: ${GS_BUCKET}
        volumeAttributes:
          skipCSIBucketAccessCheck: "true"
          gcsfuseMetadataPrefetchOnMount: "true"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: training-bucket-pvc
    spec:
      accessModes:
      -   ReadWriteMany
      resources:
        requests:
          storage: 768Gi
      storageClassName: gcsfuse-sc
    
  5. 应用清单:

    envsubst < gcsfuse-storage.yaml > gcsfuse-storage-updated.yaml
    kubectl apply -f gcsfuse-storage-updated.yaml
    

准备模型和数据

您可以在本地或 GKE Pod 上运行这些命令来填充存储桶。

  1. 克隆 verl 代码库,准备虚拟环境并处理 GSM8K 数据集:

    git clone https://github.com/volcengine/verl.git
    
    VENV_DIR=.venv
    python3 -m venv $VENV_DIR
    source $VENV_DIR/bin/activate
    pip install verl
    
    python verl/examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
    
  2. 使用 Hugging Face CLI 下载 Qwen2.5-32B-Instruct 模型(这需要大约 66 Gb 的磁盘空间):

    hf download Qwen/Qwen2.5-32B-Instruct --local-dir Qwen2.5-32B-Instruct
    
  3. 将模型、数据和 verl 代码上传到您的 Cloud Storage 存储桶:

    gcloud storage cp --recursive verl gs://${GS_BUCKET}/verl
    gcloud storage cp --recursive Qwen2.5-32B-Instruct gs://${GS_BUCKET}/Qwen2.5-32B-Instruct
    gcloud storage cp --recursive ~/data/gsm8k/* gs://${GS_BUCKET}/gsm8k/
    

部署 RayCluster 自定义资源

部署 RayCluster 自定义资源,该资源通常由一个系统 Pod 和多个工作器 Pod 组成。

Autopilot

  1. 部署 RayCluster。将以下内容保存到 ray-cluster-auto.yaml

    # Copyright 2026 Google LLC. All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: b200-ray-cluster
      annotations:
    spec:
      rayVersion: '2.47.0'
      headGroupSpec:
        rayStartParams:
          dashboard-host: '0.0.0.0'
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
          spec:
            serviceAccountName: ${KSA_NAME}
            nodeSelector:
              cloud.google.com/gke-spot: "true"
              cloud.google.com/machine-family: "c2"
              cloud.google.com/compute-class: Performance
            containers:
            - name: ray-head
              image: verlai/verl:vllm011.latest 
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "12"
                  memory: "32G"
                  ephemeral-storage: "9Gi"
                requests:
                  cpu: "12"
                  memory: "32G"
                  ephemeral-storage: "9Gi"
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
                - name: training-bucket-vol
                  mountPath: /data
            volumes:
              - name: ray-logs
                emptyDir: {}
              - name: training-bucket-vol
                persistentVolumeClaim:
                  claimName: training-bucket-pvc
      workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: gpu-group
        rayStartParams:
          num-cpus: "220"
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"gvnic-1"},
                  {"interfaceName":"eth2","network":"rdma-0"},
                  {"interfaceName":"eth3","network":"rdma-1"},
                  {"interfaceName":"eth4","network":"rdma-2"},
                  {"interfaceName":"eth5","network":"rdma-3"},
                  {"interfaceName":"eth6","network":"rdma-4"},
                  {"interfaceName":"eth7","network":"rdma-5"},
                  {"interfaceName":"eth8","network":"rdma-6"},
                  {"interfaceName":"eth9","network":"rdma-7"}
                ]
          spec:
            initContainers:
            - name: verl-setup
              image: verlai/verl:vllm011.latest
              command: ["/bin/bash", "-c"]
              args:
                - |
                  echo "Performing local editable install..."
                  cd /data/verl && pip3 install --no-deps -e .
              volumeMounts:
              - name: training-bucket-vol
                mountPath: /data
            serviceAccountName: ${KSA_NAME}
            nodeSelector:
              cloud.google.com/gke-accelerator: ${GPU_TYPE}
              cloud.google.com/gke-accelerator-count: 8
              cloud.google.com/gke-spot: "true"
              cloud.google.com/compute-class: Performance
            tolerations:
              - key: "nvidia.com/gpu"
                operator: "Exists"
                effect: "NoSchedule"
            containers:
            - name: ray-worker
              image: verlai/verl:vllm011.latest
              env:
               - name: LD_LIBRARY_PATH
                 value: /usr/local/nvidia/lib64
              resources:
                limits:
                  cpu: "220"
                  memory: "2800Gi"
                  nvidia.com/gpu: "8"
                  ephemeral-storage: "1000Gi"
                requests:
                  cpu: "220"
                  memory: "2800Gi"
                  nvidia.com/gpu: "8"
                  ephemeral-storage: "1000Gi"
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
                readOnly: true
              - name: gib
                mountPath: /usr/local/gib
                readOnly: true
              - name: shared-memory
                mountPath: /dev/shm
              - name: ray-tmp-storage
                mountPath: /tmp
              - name: training-bucket-vol
                mountPath: /data
            volumes:
            - name: gib
              hostPath:
                path: /home/kubernetes/bin/gib
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi 
            - name: sys
              hostPath:
                path: /sys
            - name: proc-sys
              hostPath:
                path: /proc/sys
            - name: ray-tmp-storage
              emptyDir: {}
            - name: training-bucket-vol
              persistentVolumeClaim:
                claimName: training-bucket-pvc
    
  2. 应用 RayCluster:

    envsubst < ray-cluster-auto.yaml > ray-cluster-auto-updated.yaml
    kubectl apply -f ray-cluster-updated.yaml
    

标准

  1. 部署 RayCluster。将以下内容保存到 ray-cluster-standard.yaml

    # Copyright 2026 Google LLC. All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: b200-ray-cluster
      annotations:
    spec:
      rayVersion: '2.47.0'
      headGroupSpec:
        rayStartParams:
          dashboard-host: '0.0.0.0'
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
          spec:
            serviceAccountName: ${KSA_NAME}
            nodeSelector:
              cloud.google.com/gke-nodepool: "default-pool"
            containers:
            - name: ray-head
              image: verlai/verl:vllm011.latest 
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "12"
                  memory: "32G"
                  ephemeral-storage: "9Gi"
                requests:
                  cpu: "12"
                  memory: "32G"
                  ephemeral-storage: "9Gi"
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
                - name: training-bucket-vol
                  mountPath: /data
            volumes:
              - name: ray-logs
                emptyDir: {}
              - name: training-bucket-vol
                persistentVolumeClaim:
                  claimName: training-bucket-pvc
      workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: gpu-group
        rayStartParams:
          num-cpus: "220"
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"gvnic-1"},
                  {"interfaceName":"eth2","network":"rdma-0"},
                  {"interfaceName":"eth3","network":"rdma-1"},
                  {"interfaceName":"eth4","network":"rdma-2"},
                  {"interfaceName":"eth5","network":"rdma-3"},
                  {"interfaceName":"eth6","network":"rdma-4"},
                  {"interfaceName":"eth7","network":"rdma-5"},
                  {"interfaceName":"eth8","network":"rdma-6"},
                  {"interfaceName":"eth9","network":"rdma-7"}
                ]
          spec:
            initContainers:
            - name: verl-setup
              image: verlai/verl:vllm011.latest
              command: ["/bin/bash", "-c"]
              args:
                - |
                  echo "Performing local editable install..."
                  cd /data/verl && pip3 install --no-deps -e .
              volumeMounts:
              - name: training-bucket-vol
                mountPath: /data
            serviceAccountName: ${KSA_NAME}
            nodeSelector:
              cloud.google.com/gke-accelerator: ${GPU_TYPE}
            tolerations:
              - key: "nvidia.com/gpu"
                operator: "Exists"
                effect: "NoSchedule"
            containers:
            - name: ray-worker
              image: verlai/verl:vllm011.latest
              env:
               - name: LD_LIBRARY_PATH
                 value: /usr/local/nvidia/lib64
              resources:
                limits:
                  cpu: "220"
                  memory: "2800Gi"
                  nvidia.com/gpu: "8"
                  ephemeral-storage: "1000Gi"
                requests:
                  cpu: "220"
                  memory: "2800Gi"
                  nvidia.com/gpu: "8"
                  ephemeral-storage: "1000Gi"
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: gib
                mountPath: /usr/local/gib
              - name: shared-memory
                mountPath: /dev/shm
              - name: ray-tmp-storage
                mountPath: /tmp
              - name: training-bucket-vol
                mountPath: /data
            volumes:
            - name: gib
              hostPath:
                path: /home/kubernetes/bin/gib
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi 
            - name: sys
              hostPath:
                path: /sys
            - name: proc-sys
              hostPath:
                path: /proc/sys
            - name: ray-tmp-storage
              emptyDir: {}
            - name: training-bucket-vol
              persistentVolumeClaim:
                claimName: training-bucket-pvc
    
  2. 应用 RayCluster:

    envsubst < ray-cluster-standard.yaml > ray-cluster-updated.yaml
    kubectl apply -f ray-cluster-updated.yaml
    

启动 GRPO 作业

  1. 设置到 Ray 信息中心节点的端口转发。请使用单独的终端窗口执行此操作,因为此命令在运行期间会阻止终端。使用 Ctrl+C 停止该进程:

    kubectl port-forward svc/b200-ray-cluster-head-svc 8265:8265
    
  2. 检查 runtime-env.yaml 清单:

    # Copyright 2026 Google LLC. All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    py_modules: ["."]
    working_dir": "."
    py_executable": "uv run"
    setup_hook: runtime_env.uv_runtime_env_hook.hook 
    env_vars:
      PYTHONPATH: "/data/verl"
      LD_LIBRARY_PATH: "/usr/local/nvidia/lib64"
      NCCL_DEBUG: "INFO"
      NUM_WORKERS: "2"
      CPUS_PER_WORKER: "192"
      GPUS_PER_WORKER: "8"
      NCCL_NET_PLUGIN: "/usr/local/gib/lib64/libnccl-net_internal.so"
      NCCL_CROSS_NIC: "0"
      NCCL_NET_GDR_LEVEL: "PIX"
      NCCL_P2P_NET_CHUNKSIZE: "131072"
      NCCL_NVLS_CHUNKSIZE: "524288"
      NCCL_IB_ADAPTIVE_ROUTING: "1"
      NCCL_IB_QPS_PER_CONNECTION: "4"
      NCCL_IB_TC: "52"
      NCCL_IB_FIFO_TC: "84"
      NCCL_TUNER_CONFIG_PATH: "/usr/local/gib/configs/tuner_config_a4.txtpb" 
      HF_HOME: "/data/huggingface_cache"
      GLOO_SOCKET_IFNAME: "eth0" 
    pip:
      packages:
        - torch 
        - torchvision
    

    如果您使用 H200 GPU,请将 NCCL_TUNER_CONFIG_PATH 更改为 /usr/local/gib/configs/tuner_config_a3u.txtpb

    Ray 客户端使用此文件。您无需将此清单应用于集群。

  3. 使用 ray job submit 提交作业:

    ray -- job submit \
    --address "http://localhost:8265" \
    --runtime-env runtime-env.yaml \
    -- \
    bash -c "
        cd /data/verl && PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
        data.train_files=/data/gsm8k/train.parquet \
        data.val_files=/data/gsm8k/test.parquet \
        data.train_batch_size=256 \
        data.max_prompt_length=512 \
        data.max_response_length=512 \
        actor_rollout_ref.model.path=/data/Qwen2.5-32B-Instruct \
        actor_rollout_ref.actor.optim.lr=1e-5 \
        actor_rollout_ref.actor.ppo_mini_batch_size=256 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=64 \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
        actor_rollout_ref.actor.strategy=fsdp2 \
        algorithm.kl_ctrl.kl_coef=0.001 \
        trainer.logger=console \
        trainer.val_before_train=False \
        trainer.n_gpus_per_node=8 \
        trainer.nnodes=2 \
        trainer.save_freq=10 \
        trainer.test_freq=10 \
        trainer.default_local_dir=/data/verl/checkpoints \
        algorithm.adv_estimator=grpo \
        actor_rollout_ref.rollout.n=8 \
        trainer.total_epochs=2"
    

    在 Ray 信息中心或输出中监控日志。寻找表示学习的 critic/score/mean 以增加。

  4. 训练结束后,可以在 gs://$GS_BUCKET/verl/checkpoints 中找到经过训练的模型的检查点。

清理

为避免产生费用,请删除以下资源:

kubectl delete raycluster b200-ray-cluster # change to variables
gcloud container clusters delete ${CLUSTER_NAME} --location=${CONTROL_PLANE_REGION}
gcloud storage rm -r gs://${GS_BUCKET}

后续步骤