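Serve an LLM using TPUs on GKE with KubeRay

This tutorial shows how to serve large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Ray Operator add-on and the vLLM serving framework.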

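In this tutorial, you can serve the following LLMs on TPU v5e or TPU Trillium (v6e):

  • Llama 3 8B Instruct on a single-host TPU v5e
  • Mistral 7B Instruct v0.3 on a single-host TPU v5e
  • Llama 3.1 70B on a single-host TPU Trillium (v6e)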

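This guide is for generative AI customers, new and existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who want to use Kubernetes container orchestration to serve models using Ray on TPUs with vLLM.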

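Background

This section describes the key technologies used in this guide.

GKE managed Kubernetes service

Google Cloud offers a wide range of services, including GKE, which is well suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure, including scalable resources, distributed computing, and efficient networking, to handle the computational demands of LLMs.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about GKE and how it helps you scale, automate, and manage Kubernetes, see GKE overview.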

Ray Operator

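The Ray Operator add-on on GKE provides an end-to-end AI/ML platform for serving, training, and fine-tuning machine learning workloads. In this tutorial, you use Ray Serve, a framework in Ray, to serve popular LLMs from Hugging Face.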

TPU

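TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning and AI models built using frameworks such as TensorFlow, PyTorch, and JAX.

This tutorial covers serving LLMs on TPU v5e or TPU Trillium (v6e) nodes, with TPU topologies configured according to each model's requirements for serving prompts with low latency.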

vLLM

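vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on TPUs, with features such as:

  • Optimized transformer implementation with PagedAttention
  • Continuous batching to improve the overall serving throughput
  • Tensor parallelism and distributed serving on multiple GPUs

To learn more, see the vLLM documentation.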

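Objectives

This tutorial covers the following steps:

  1. Create a GKE cluster with a TPU node pool.
  2. Deploy a RayCluster custom resource with a single-host TPU slice. GKE deploys the RayCluster custom resource as Kubernetes Pods.
  3. Serve an LLM.
  4. Interact with the models.

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

  • Deploy a RayService custom resource.
  • Compose multiple models with model composition.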

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
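  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Create a Hugging Face account, if you don't already have one.
  • Ensure that you have a Hugging Face token.
  • Ensure that you have access to the Hugging Face model that you want to use. You usually gain this access by signing an agreement and requesting access from the model owner on the Hugging Face model page.
  • Ensure that you have the following IAM roles: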
    • roles/container.admin
    • roles/iam.serviceAccountAdmin
    • roles/container.clusterAdmin
    • roles/artifactregistry.writer

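Prepare your environment

  1. Check that you have enough quota in your Google Cloud project for a single-host TPU v5e or a single-host TPU Trillium (v6e). To manage your quota, see TPU quotas.

  2. In the Google Cloud console, start a Cloud Shell instance.

  3. Clone the sample repository: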

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
    
  4. Navigate to the working directory:

    cd ai-ml/gke-ray/rayserve/llm
    
  5. Set the default environment variables for the GKE cluster creation:

    Llama-3-8B-Instruct

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

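    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.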

    Mistral-7B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
    export TOKENIZER_MODE=mistral
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

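    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.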

    Llama 3.1 70B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Llama-3.1-70B"
    export MAX_MODEL_LEN=8192
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

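    Replace the following:

    • HUGGING_FACE_TOKEN: your Hugging Face access token.
    • REGION: the region where you have TPU quota. Ensure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
    • ZONE: the zone with available TPU quota.
    • VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.

  6. Pull the vLLM container image: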

    sudo usermod -aG docker ${USER}
    newgrp docker
    docker pull ${VLLM_IMAGE}
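
Before you create the cluster, it can help to confirm that the placeholder values from step 5 were actually replaced. The following is a minimal sketch, not part of the sample repository: the function names check_placeholder and check_tutorial_env are hypothetical, and the check only covers the three placeholders that this tutorial names (REGION, ZONE, and HUGGING_FACE_TOKEN).

```shell
# Hypothetical helper: fail if a variable is unset, empty, or still
# holds the literal placeholder text from the tutorial.
# Usage: check_placeholder VAR_NAME PLACEHOLDER
check_placeholder() {
  eval "cp_value=\${$1:-}"
  if [ -z "$cp_value" ] || [ "$cp_value" = "$2" ]; then
    echo "$1 is unset or still set to the placeholder $2" >&2
    return 1
  fi
}

# Hypothetical helper: check every placeholder that step 5 asks you to replace.
check_tutorial_env() {
  cte_status=0
  check_placeholder COMPUTE_REGION REGION || cte_status=1
  check_placeholder COMPUTE_ZONE ZONE || cte_status=1
  check_placeholder HF_TOKEN HUGGING_FACE_TOKEN || cte_status=1
  return "$cte_status"
}

# Example (run manually): check_tutorial_env && echo "environment looks complete"
```

The function returns a non-zero status and prints one line per problem, so it can gate later commands with `check_tutorial_env && ...`.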
    

Create a cluster

You can serve an LLM on TPUs with Ray in a GKE Autopilot or Standard cluster by using the Ray Operator add-on.

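Best practice:

Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

  1. Create a GKE Autopilot cluster with the Ray Operator add-on enabled: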

    gcloud container clusters create-auto ${CLUSTER_NAME}  \
        --enable-ray-operator \
        --release-channel=rapid \
        --location=${COMPUTE_REGION}
    

Standard

  1. Create a Standard cluster with the Ray Operator add-on enabled:

    gcloud container clusters create ${CLUSTER_NAME} \
        --release-channel=rapid \
        --location=${COMPUTE_ZONE} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --machine-type="n1-standard-4" \
        --addons=RayOperator,GcsFuseCsiDriver
    
  2. Create a single-host TPU slice node pool:

    Llama-3-8B-Instruct

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

    Mistral-7B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

    Llama 3.1 70B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct6e-standard-8t \
        --num-nodes=1
    

    GKE creates a TPU v6e node pool with a ct6e-standard-8t machine type.
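
The three node pool commands above differ only in the machine type. As a rough sketch, a small lookup can derive the machine type from the MODEL_ID value set earlier. The function name tpu_machine_type is hypothetical, and the mapping covers only the three models used in this tutorial.

```shell
# Hypothetical helper: map a tutorial MODEL_ID to the TPU machine type
# that this tutorial uses for it.
tpu_machine_type() {
  case "$1" in
    meta-llama/Meta-Llama-3-8B-Instruct|mistralai/Mistral-7B-Instruct-v0.3)
      echo "ct5lp-hightpu-8t"   # single-host TPU v5e
      ;;
    meta-llama/Llama-3.1-70B)
      echo "ct6e-standard-8t"   # single-host TPU Trillium (v6e)
      ;;
    *)
      echo "unknown model: $1" >&2
      return 1
      ;;
  esac
}

# Example (run manually):
#   gcloud container node-pools create tpu-1 \
#       --location=${COMPUTE_ZONE} \
#       --cluster=${CLUSTER_NAME} \
#       --machine-type="$(tpu_machine_type "${MODEL_ID}")" \
#       --num-nodes=1
```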

Configure kubectl to communicate with your cluster

To configure kubectl to communicate with your cluster, run the following command:

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

Standard

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -

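Create a Cloud Storage bucket

To accelerate the vLLM deployment startup time and minimize the required disk space per node, use the Cloud Storage FUSE CSI driver to mount the downloaded model and compilation cache to the Ray nodes.

In Cloud Shell, run the following command: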

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

This command creates a Cloud Storage bucket to store the model files that you download from Hugging Face.

Set up a Kubernetes ServiceAccount to access the bucket

  1. Create the Kubernetes ServiceAccount:

    kubectl create serviceaccount ${KSA_NAME} \
        --namespace ${NAMESPACE}
    
  2. Grant the Kubernetes ServiceAccount read-write access to the Cloud Storage bucket:

    gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"
    

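    GKE creates the following resources for the LLM:

    1. A Cloud Storage bucket to store the downloaded model and the compilation cache. A Cloud Storage FUSE CSI driver reads the content of the bucket.
    2. Volumes with file caching enabled and the parallel download feature of Cloud Storage FUSE.

    Best practice:

    Use a file cache backed by tmpfs or Hyperdisk / Persistent Disk, depending on the expected size of the model contents, for example, weight files. In this tutorial, you use a Cloud Storage FUSE file cache backed by RAM.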
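
The --member value in the IAM binding is long and easy to mistype. The following sketch builds the same principal string from its four parts so that you can print and inspect it before you grant access. The function name workload_identity_principal is hypothetical; the string format is copied from the command in step 2.

```shell
# Hypothetical helper: build the Workload Identity principal used as --member.
# Args: PROJECT_NUMBER PROJECT_ID NAMESPACE KSA_NAME
workload_identity_principal() {
  printf 'principal://iam.googleapis.com/projects/%s/locations/global/workloadIdentityPools/%s.svc.id.goog/subject/ns/%s/sa/%s\n' \
    "$1" "$2" "$3" "$4"
}

# Example (run manually):
#   workload_identity_principal "${PROJECT_NUMBER}" "${PROJECT_ID}" "${NAMESPACE}" "${KSA_NAME}"
```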

Deploy a RayCluster custom resource

Deploy a RayCluster custom resource, which typically consists of one head Pod and multiple worker Pods.

Llama-3-8B-Instruct

Complete the following steps to create the RayCluster custom resource to deploy the Llama 3 8B instruction tuned model:

  1. Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

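    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v5e single-host in a 2x4 topology.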
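
Because envsubst replaces an unset variable with an empty string without reporting an error, a missing export can produce a manifest with blank fields. The following sketch lists the placeholders in a manifest whose variables are unset or empty. The function name missing_manifest_vars is hypothetical. It checks shell variables in the current session, whereas envsubst only sees exported ones, so it assumes that you set the variables with export as shown earlier.

```shell
# Hypothetical helper: print each $NAME or ${NAME} placeholder found in a
# manifest whose variable is unset or empty in the current shell.
# Usage: missing_manifest_vars PATH_TO_MANIFEST
missing_manifest_vars() (
  manifest="$1"
  # Extract placeholders, strip the $, { and } characters, and de-duplicate.
  grep -oE '\$\{?[A-Za-z_][A-Za-z0-9_]*\}?' "$manifest" | tr -d '${}' | sort -u |
    while read -r name; do
      eval "value=\${$name:-}"
      [ -n "$value" ] || echo "$name"
    done
)

# Example (run manually):
#   missing_manifest_vars tpu/ray-cluster.tpu-v5e-singlehost.yaml
```

An empty result means that every placeholder in the file has a value.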

Mistral-7B

Complete the following steps to create the RayCluster custom resource to deploy the Mistral-7B model:

  1. Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

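    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v5e single-host in a 2x4 topology.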

Llama 3.1 70B

Complete the following steps to create the RayCluster custom resource to deploy the Llama 3.1 70B model:

  1. Inspect the ray-cluster.tpu-v6e-singlehost.yaml manifest:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Apply the manifest:

    envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

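    The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v6e single-host in a 2x4 topology.

Connect to the RayCluster custom resource

After the RayCluster custom resource is created, you can connect to the RayCluster resource and start serving the model.

  1. Verify that GKE created the RayCluster Service: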

    kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
        --output wide
    

    The output is similar to the following:

    NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
    vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###
    

    Wait for the STATUS to be ready and for the HEAD POD IP and HEAD SERVICE IP columns to have an IP address.

  2. Establish port-forwarding sessions to the Ray head:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 10001:10001"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
    
  3. Verify that the Ray client can connect to the remote RayCluster custom resource:

    docker run --net=host -it ${VLLM_IMAGE} \
    ray list nodes --address http://localhost:8265
    

    The output is similar to the following:

    ======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
    Stats:
    ------------------------------
    Total: 2
    
    Table:
    ------------------------------
        NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
    0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                               memory: #.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               node:__internal_head__: 1.0
                                                                                               object_store_memory: #.### GiB
    1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                               TPU: 8.0
                                                                                               TPU-v#e-8-head: 1.0
                                                                                               accelerator_type:TPU-V#E: 1.0
                                                                                               memory: ###.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               object_store_memory: ##.### GiB
                                                                                               tpu-group-0: 1.0
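
The wait in step 1 can also be scripted instead of rerunning the command by hand. The following is a sketch under stated assumptions: wait_for and raycluster_ready are hypothetical helper names, and raycluster_ready assumes that the STATUS column is backed by the .status.state field of the RayCluster resource, which may differ across KubeRay versions.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  wf_attempts="$1"
  wf_delay="$2"
  shift 2
  wf_i=1
  while [ "$wf_i" -le "$wf_attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$wf_delay"
    wf_i=$((wf_i + 1))
  done
  return 1
}

# Hypothetical check: succeeds when the RayCluster reports the ready state.
# Assumption: the state is exposed at .status.state.
raycluster_ready() {
  rr_state="$(kubectl --namespace "${NAMESPACE}" get raycluster/vllm-tpu \
    --output jsonpath='{.status.state}')"
  [ "$rr_state" = "ready" ]
}

# Example (run manually): poll every 10 seconds, up to 60 times.
#   wait_for 60 10 raycluster_ready && echo "RayCluster is ready"
```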
    

Deploy the model with vLLM

To deploy a specific model with vLLM, follow these instructions.

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Llama-3.1-70B"}}'

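View the Ray Dashboard

You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.

  1. Click the Web Preview button, which can be found on the top right of the Cloud Shell taskbar.
  2. Click Change port and set the port number to 8265.
  3. Click Change and Preview.
  4. On the Ray Dashboard, click the Serve tab.

After the Serve deployment has a HEALTHY status, the model is ready to begin processing inputs.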

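Serve the model

This guide highlights models that support text generation, a technique that allows for text content creation from a prompt.

Llama-3-8B-Instruct

  1. Set up port forwarding to the server: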

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Mistral-7B

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Llama 3.1 70B

  1. Set up port forwarding to the server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Send a prompt to the Serve endpoint:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
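
The curl commands above embed the prompt directly in a single-quoted JSON string, which breaks as soon as the prompt contains a double quote. The following sketch builds the same request body from shell arguments. The function names json_escape and build_generate_payload are hypothetical, and the escaping only handles backslashes and double quotes, so it assumes a single-line prompt without control characters.

```shell
# Hypothetical helper: escape backslashes and double quotes so that the text
# can be placed inside a JSON string. Newlines and control characters are
# not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build the request body that the curl commands send.
# Usage: build_generate_payload PROMPT MAX_TOKENS
build_generate_payload() {
  printf '{"prompt": "%s", "max_tokens": %d}' "$(json_escape "$1")" "$2"
}

# Example (run manually, with the port forwarding from step 1 active):
#   curl -X POST http://localhost:8000/v1/generate \
#       -H "Content-Type: application/json" \
#       -d "$(build_generate_payload 'What does "TPU" stand for? Be brief.' 256)"
```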
    

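Additional configuration

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService

You can deploy the same models from this tutorial by using a RayService custom resource.

  1. Delete the RayCluster custom resource that you created in this tutorial: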

    kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
    
  2. Create the RayService custom resource to deploy a model:

    Llama-3-8B-Instruct

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      The envsubst command replaces the environment variables in the manifest.

      GKE creates a RayService with a workergroup that contains a single-host TPU v5e in a 2x4 topology.
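Because envsubst substitutes an empty string for any variable that is not set, it can help to confirm that every variable referenced in a manifest has a value before you apply it. The following Python sketch is illustrative only and is not part of the tutorial's sample repository. The function names and the sample manifest text are hypothetical, and the pattern only recognizes the plain `$NAME` and `${NAME}` forms.

```python
import re

# Matches ${NAME} or $NAME, the two forms that envsubst substitutes.
_VAR = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def referenced_variables(manifest_text):
    """Return the variable names referenced in the text, in first-seen order."""
    names = []
    for braced, bare in _VAR.findall(manifest_text):
        name = braced or bare
        if name not in names:
            names.append(name)
    return names


def missing_variables(manifest_text, env):
    """Return the referenced variables that are unset or empty in env."""
    return [name for name in referenced_variables(manifest_text) if not env.get(name)]


# Hypothetical excerpt that mirrors a few lines of the manifest.
sample_manifest = """image: $VLLM_IMAGE
serviceAccountName: $KSA_NAME
bucketName: $GSBUCKET
MODEL_ID: "$MODEL_ID"
"""
```

In practice you might call `missing_variables(open(path).read(), os.environ)` and stop if the returned list is not empty.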

    Mistral-7B

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

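      The envsubst command replaces the environment variables in the manifest.

      GKE creates a RayService with a workergroup that contains a single-host TPU v5e in a 2x4 topology.

    Llama 3.1 70B

    1. Inspect the ray-service.tpu-v6e-singlehost.yaml manifest: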

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

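      The envsubst command replaces the environment variables in the manifest.

    GKE creates a RayCluster custom resource on which the Ray Serve application is deployed, and then creates the RayService custom resource.

  3. Check the status of the RayService resource: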

    kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu
    

    Wait for the Service status to change to Running:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          1
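If you want to script the wait instead of rerunning the command by hand, one option is to parse the table that kubectl prints. The following Python sketch assumes the column layout shown above and assumes that the status value is a single word. The function names are hypothetical.

```python
def parse_rayservice_table(output):
    """Map each RayService name to the value in its SERVICE STATUS column."""
    lines = [line for line in output.splitlines() if line.strip()]
    statuses = {}
    for line in lines[1:]:  # The first non-empty line is the header row.
        fields = line.split()
        if len(fields) >= 2:
            statuses[fields[0]] = fields[1]
    return statuses


def is_running(output, name):
    """Return True when the named RayService reports the Running status."""
    return parse_rayservice_table(output).get(name) == "Running"


# Sample text copied from the expected output of the kubectl command.
sample_output = """NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          1
"""
```

A polling loop could capture the command output with `subprocess.run(..., capture_output=True, text=True)` and sleep between attempts until `is_running` returns True.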
    
  4. Retrieve the name of the RayCluster head service:

    SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
        --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
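The template in the preceding command walks a nested path in the RayService status. As a rough illustration, the same lookup can be done in Python on the JSON form of the resource (for example, the output of `kubectl get rayservices/vllm-tpu -o json`). The function name and the sample service name `example-head-svc` are placeholders, not values taken from a real cluster.

```python
import json

# Field path taken from the --template expression in the kubectl command.
HEAD_SERVICE_PATH = ("status", "activeServiceStatus", "rayClusterStatus", "head", "serviceName")


def head_service_name(rayservice):
    """Return the head service name, or None if any field on the path is missing."""
    node = rayservice
    for key in HEAD_SERVICE_PATH:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical, heavily trimmed RayService document.
sample_doc = json.loads(
    '{"status": {"activeServiceStatus": {"rayClusterStatus": '
    '{"head": {"serviceName": "example-head-svc"}}}}}'
)
```

Returning None instead of raising an error makes it easier to retry while the RayService is still starting and the status fields are not yet populated.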
    
  5. Establish a port-forwarding session to the Ray head to view the Ray dashboard:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    
  6. View the Ray dashboard.

  7. Serve the model.

  8. Clean up the RayService resource:

    kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu
    

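Serve multiple models with model composition

Model composition is a technique for combining multiple models into a single application.

In this section, you use a GKE cluster to compose two models, Llama 3 8B IT and Gemma 7B IT, into a single application:

  • The first model is the assistant model, which answers the questions asked in the prompt.
  • The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is a summarized version of the assistant model's response.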
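The chaining described in the preceding list can be pictured as plain function composition. The following Python sketch is only a conceptual illustration: it uses stub functions in place of real models and is not the implementation that the tutorial's Ray Serve application uses. All names in it are hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]


def compose(assistant: Model, summarizer: Model) -> Model:
    """Return a pipeline that feeds the assistant's answer to the summarizer."""
    def pipeline(prompt: str) -> str:
        answer = assistant(prompt)   # First model: answer the question.
        return summarizer(answer)    # Second model: condense that answer.
    return pipeline


def stub_assistant(prompt: str) -> str:
    """Stand-in for the assistant model; returns a fixed multi-sentence answer."""
    return f"Python is popular. It has many libraries. (asked: {prompt})"


def stub_summarizer(text: str) -> str:
    """Stand-in for the summarizer model; keeps only the first sentence."""
    return text.split(". ")[0] + "."


pipeline = compose(stub_assistant, stub_summarizer)
```

The caller sees a single endpoint and a single response, which is why the application exposes one route even though two models run behind it.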
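  1. Complete the following steps to get access to the Gemma model:

    1. Sign in to the Kaggle platform, sign the license consent agreement, and get a Kaggle API token. In this tutorial, you use a Kubernetes Secret for the Kaggle credentials.
    2. Go to the model consent page on Kaggle.com.
    3. Sign in to Kaggle if you haven't done so already.
    4. Click Request Access.
    5. In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for consent.
    6. Accept the model Terms and Conditions.
  2. Set up your environment: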

    export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
    export SUMMARIZER_MODEL_ID=google/gemma-7b-it
    
  3. For Standard clusters, create an additional single-host TPU slice node pool:

    gcloud container node-pools create tpu-2 \
      --location=${COMPUTE_ZONE} \
      --cluster=${CLUSTER_NAME} \
      --machine-type=MACHINE_TYPE \
      --num-nodes=1
    

    Replace MACHINE_TYPE with one of the following machine types:

    • ct5lp-hightpu-8t, to provision TPU v5e.
    • ct6e-standard-8t, to provision TPU v6e.

    Autopilot clusters automatically provision the required nodes.

  4. Deploy the RayService resource based on the TPU version that you want to use:

    TPU v5e

    1. Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
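The manifest requests 8 TPU chips per worker (`google.com/tpu: "8"`) on a 2x4 topology, runs two worker replicas, and sets `TPU_CHIPS` to 16. Those numbers are consistent with multiplying the topology dimensions and then the replica count. The following Python sketch shows that arithmetic. It assumes that the chip count of a single-host slice equals the product of its topology dimensions, and the function names are hypothetical.

```python
from math import prod


def chips_in_topology(topology: str) -> int:
    """Multiply the dimensions of a topology string such as '2x4'."""
    return prod(int(dim) for dim in topology.split("x"))


def total_chips(topology: str, replicas: int) -> int:
    """Chips across all worker replicas that each own one single-host slice."""
    return chips_in_topology(topology) * replicas
```

With the values from the manifest, `chips_in_topology("2x4")` gives the per-worker request and `total_chips("2x4", 2)` gives the value passed as `TPU_CHIPS`.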
      

    TPU v6e

    1. Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Apply the manifest:

      envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      
  5. Wait for the status of the RayService resource to change to Running:

    kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
    

    The output is similar to the following:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          2
    

    In this output, the Running status indicates that the RayService resource is ready.

  6. Confirm that GKE created the Service for the Ray Serve application:

    kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc
    

    The output is similar to the following:

    NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
    vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###
    
  7. Establish port-forwarding sessions to the Ray head:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &
    
  8. Send a request to the model:

    curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'
    

    The output is similar to the following:

      {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}
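If you prefer to call the endpoint from a program instead of curl, the same request can be built with the Python standard library. This is a minimal sketch that assumes the port-forwarding session on port 8000 is active and that the response has the `{"text": [...]}` shape shown above. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request(prompt, max_tokens, url="http://localhost:8000/"):
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(body):
    """Return the list of generated strings from a JSON response body."""
    return json.loads(body)["text"]


# Sending the request needs the forwarded port, so it is left commented out:
# request = build_request("What is the most popular programming language "
#                         "for machine learning and why?", 1000)
# with urllib.request.urlopen(request) as response:
#     print(extract_text(response.read()))
```

Separating request construction from sending keeps the payload easy to inspect without a running cluster.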
    

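Build and deploy the TPU image

This tutorial uses a hosted TPU image from vLLM. vLLM provides a Dockerfile.tpu file that builds vLLM on top of the required PyTorch XLA image, which includes the TPU dependencies. However, you can also build and deploy your own TPU image for finer-grained control over the contents of the Docker image.

  1. Create a Docker repository to store the container images for this guide: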

    gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
    gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev
    
  2. Clone the vLLM repository:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    
  3. Build the image:

    docker build -f ./docker/Dockerfile.tpu . -t vllm-tpu
    
  4. Tag the TPU image with your Artifact Registry name:

    export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
    docker tag vllm-tpu ${VLLM_IMAGE}
    

    Replace TAG with the name of the tag that you want to define. If you don't specify a tag, Docker applies the default latest tag.
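The image name in the export command follows the pattern `REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG`. The following Python sketch composes a name of that form. The function name is hypothetical, the default repository and image names mirror the ones used in this section, and the `latest` default mirrors the tag that Docker applies when none is given.

```python
def artifact_registry_image(region, project_id, repository="vllm-tpu",
                            image="vllm-tpu", tag="latest"):
    """Compose an Artifact Registry image name in the form used by VLLM_IMAGE."""
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"
```

For example, a region of `us-central1` and a project ID of `my-project` (both placeholders) with the tag `v1` produce `us-central1-docker.pkg.dev/my-project/vllm-tpu/vllm-tpu:v1`.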

  5. Push the image to Artifact Registry:

    docker push ${VLLM_IMAGE}
    

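Delete the individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources that you no longer need.

  1. Delete the RayCluster custom resource: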

    kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu
    
  2. Delete the Cloud Storage bucket:

    gcloud storage rm -r gs://${GSBUCKET}
    
  3. Delete the Artifact Registry repository:

    gcloud artifacts repositories delete vllm-tpu \
        --location=${COMPUTE_REGION}
    
  4. Delete the cluster:

    gcloud container clusters delete ${CLUSTER_NAME} \
        --location=LOCATION
    

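    Replace LOCATION with one of the following environment variables:

    • For Autopilot clusters, use COMPUTE_REGION.
    • For Standard clusters, use COMPUTE_ZONE.

Delete the project

If you deployed the tutorial in a new Google Cloud project and you no longer need the project, delete it by completing the following steps: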

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next