透過 KubeRay 在 GKE 上使用 TPU 提供 LLM

本教學課程說明如何使用 Ray Operator 外掛程式vLLM 服務框架,在 Google Kubernetes Engine (GKE) 上,透過張量處理單元 (TPU) 提供大型語言模型 (LLM) 服務。

在本教學課程中,您可以在 TPU v5e 或 TPU Trillium (v6e) 上提供 LLM 模型,方法如下:

本指南適用於生成式 AI 客戶、新舊 GKE 使用者、機器學習工程師、MLOps (DevOps) 工程師,或對使用 Kubernetes 容器自動化調度管理功能,透過 vLLM 在 TPU 上使用 Ray 服務模型感興趣的平台管理員。

背景

本節說明本指南中使用的重要技術。

GKE 代管 Kubernetes 服務

Google Cloud 提供各種服務,包括 GKE,非常適合部署及管理 AI/機器學習工作負載。GKE 是代管 Kubernetes 服務,可簡化容器化應用程式的部署、擴充及管理作業。GKE 提供必要的基礎架構,包括可擴充的資源、分散式運算和高效率網路,可處理大型語言模型的運算需求。

如要進一步瞭解 Kubernetes 的重要概念,請參閱「開始學習 Kubernetes」。如要進一步瞭解 GKE,以及如何協助您自動處理、管理 Kubernetes 及調度資源,請參閱 GKE 總覽

Ray 運算子

GKE 上的 Ray Operator 外掛程式提供端對端 AI/機器學習平台,可服務、訓練及微調機器學習工作負載。在本教學課程中,您將使用 Ray Serve (Ray 中的框架),從 Hugging Face 提供熱門的 LLM。

TPU

TPU 是 Google 開發的客製化特殊應用積體電路 (ASIC),用於加速機器學習和 AI 模型,這些模型是使用 TensorFlowPyTorchJAX 等架構建構而成。

本教學課程說明如何在 TPU v5e 或 TPU Trillium (v6e) 節點上提供 LLM 模型,並根據每個模型的需求設定 TPU 拓撲,以低延遲時間提供提示。

vLLM

vLLM 是經過高度最佳化的開放原始碼 LLM 服務提供架構,可提升 TPU 的服務處理量,並提供下列功能:

  • 使用 PagedAttention 實作最佳化轉換器
  • 持續批次處理,提升整體放送輸送量
  • 在多個 GPU 上進行張量平行處理和分散式服務

詳情請參閱 vLLM 說明文件

目標

本教學課程包含下列步驟:

  1. 建立含有 TPU 節點集區的 GKE 叢集。
  2. 部署具有單一主機 TPU 配量的 RayCluster 自訂資源。GKE 會將 RayCluster 自訂資源部署為 Kubernetes Pod。
  3. 提供 LLM。
  4. 與模型互動。

您可以視需要設定 Ray Serve 架構支援的下列提供模型資源和技術:

  • 部署 RayService 自訂資源。
  • 使用模型組合功能,組合多個模型。

事前準備

開始之前,請確認您已完成下列工作:

  • 啟用 Google Kubernetes Engine API。
  • 啟用 Google Kubernetes Engine API
  • 如要使用 Google Cloud CLI 執行這項工作,請安裝初始化 gcloud CLI。如果您先前已安裝 gcloud CLI,請執行 gcloud components update 指令,取得最新版本。較舊的 gcloud CLI 版本可能不支援執行本文件中的指令。
  • 如果您尚未建立 Hugging Face 帳戶,請先完成此步驟。
  • 確認您擁有 Hugging Face 權杖
  • 確認您有權存取要使用的 Hugging Face 模型。您通常需要簽署協議,並在 Hugging Face 模型頁面向模型擁有者申請存取權,才能取得這項權限。
  • 確認您具備下列 IAM 角色
    • roles/container.admin
    • roles/iam.serviceAccountAdmin
    • roles/container.clusterAdmin
    • roles/artifactregistry.writer

準備環境

  1. 確認 Google Cloud 專案有足夠的配額,可供單一主機 TPU v5e 或單一主機 TPU Trillium (v6e) 使用。如要管理配額,請參閱「TPU 配額」。

  2. 在 Google Cloud 控制台中啟動 Cloud Shell 執行個體:
    開啟 Cloud Shell

  3. 複製範例存放區:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
    
  4. 前往工作目錄:

    cd ai-ml/gke-ray/rayserve/llm
    
  5. 設定 GKE 叢集建立作業的預設環境變數:

    Llama-3-8B-Instruct

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    更改下列內容:

    • HUGGING_FACE_TOKEN:您的 Hugging Face 存取權杖。
    • REGION:您擁有 TPU 配額的區域。確認您要使用的 TPU 版本是否適用於這個區域。詳情請參閱「GKE 中的 TPU 可用性」。
    • ZONE:具有可用 TPU 配額的可用區。
    • VLLM_IMAGE:vLLM TPU 映像檔。您可以使用公開 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 映像檔,或自行建構 TPU 映像檔

    Mistral-7B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
    export TOKENIZER_MODE=mistral
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    更改下列內容:

    • HUGGING_FACE_TOKEN:您的 Hugging Face 存取權杖。
    • REGION:您擁有 TPU 配額的區域。確認您要使用的 TPU 版本是否適用於這個區域。詳情請參閱「GKE 中的 TPU 可用性」。
    • ZONE:具有可用 TPU 配額的可用區。
    • VLLM_IMAGE:vLLM TPU 映像檔。您可以使用公開 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 映像檔,或自行建構 TPU 映像檔

    Llama 3.1 70B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Llama-3.1-70B"
    export MAX_MODEL_LEN=8192
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    更改下列內容:

    • HUGGING_FACE_TOKEN:您的 Hugging Face 存取權杖。
    • REGION:您擁有 TPU 配額的區域。確認您要使用的 TPU 版本是否適用於這個區域。詳情請參閱「GKE 中的 TPU 可用性」。
    • ZONE:具有可用 TPU 配額的可用區。
    • VLLM_IMAGE:vLLM TPU 映像檔。您可以使用公開 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 映像檔,或自行建構 TPU 映像檔
  6. 拉取 vLLM 容器映像檔:

    sudo usermod -aG docker ${USER}
    newgrp docker
    docker pull ${VLLM_IMAGE}
    

建立叢集

您可以使用 Ray Operator 外掛程式,在 GKE Autopilot 或標準叢集中,透過 Ray 在 TPU 上提供 LLM。

最佳做法

使用 Autopilot 叢集,享有全代管 Kubernetes 體驗。如要選擇最適合工作負載的 GKE 作業模式,請參閱「選擇 GKE 作業模式」。

使用 Cloud Shell 建立 Autopilot 或 Standard 叢集:

Autopilot

  1. 建立啟用 Ray Operator 外掛程式的 GKE Autopilot 叢集:

    gcloud container clusters create-auto ${CLUSTER_NAME}  \
        --enable-ray-operator \
        --release-channel=rapid \
        --location=${COMPUTE_REGION}
    

標準

  1. 建立啟用 Ray Operator 外掛程式的標準叢集:

    gcloud container clusters create ${CLUSTER_NAME} \
        --release-channel=rapid \
        --location=${COMPUTE_ZONE} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --machine-type="n1-standard-4" \
        --addons=RayOperator,GcsFuseCsiDriver
    
  2. 建立單一主機 TPU 配量節點集區:

    Llama-3-8B-Instruct

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE 會建立機器類型為 ct5lp-hightpu-8t 的 TPU v5e 節點集區。

    Mistral-7B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE 會建立機器類型為 ct5lp-hightpu-8t 的 TPU v5e 節點集區。

    Llama 3.1 70B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct6e-standard-8t \
        --num-nodes=1
    

    GKE 會建立機器類型為 ct6e-standard-8t 的 TPU v6e 節點集區。

設定 kubectl 與叢集通訊

如要設定 kubectl 與叢集通訊,請執行下列指令:

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

標準

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}

建立 Hugging Face 憑證的 Kubernetes Secret

如要建立包含 Hugging Face 權杖的 Kubernetes Secret,請執行下列指令:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -

建立 Cloud Storage bucket

如要縮短 vLLM 部署作業的啟動時間,並盡量減少每個節點所需的磁碟空間,請使用 Cloud Storage FUSE CSI 驅動程式,將下載的模型和編譯快取掛接到 Ray 節點。

在 Cloud Shell 中執行下列指令:

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

這個指令會建立 Cloud Storage bucket,用來儲存從 Hugging Face 下載的模型檔案。

設定 Kubernetes ServiceAccount 以存取 bucket

  1. 建立 Kubernetes ServiceAccount:

    kubectl create serviceaccount ${KSA_NAME} \
        --namespace ${NAMESPACE}
    
  2. 授予 Kubernetes ServiceAccount Cloud Storage 值區的讀寫權限:

    gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"
    

    GKE 會為 LLM 建立下列資源:

    1. Cloud Storage 值區,用於儲存下載的模型和編譯快取。Cloud Storage FUSE CSI 驅動程式會讀取 bucket 的內容。
    2. 啟用檔案快取的磁碟區,以及 Cloud Storage FUSE 的平行下載功能。
    最佳做法

    視模型內容 (例如權重檔案) 的預期大小,使用 tmpfsHyperdisk / Persistent Disk 支援的檔案快取。在本教學課程中,您將使用以 RAM 為後端的 Cloud Storage FUSE 檔案快取。

部署 RayCluster 自訂資源

部署 RayCluster 自訂資源,通常包含一個系統 Pod 和多個工作站 Pod。

Llama-3-8B-Instruct

如要部署 Llama 3 8B 指令微調模型,請完成下列步驟,建立 RayCluster 自訂資源:

  1. 檢查 ray-cluster.tpu-v5e-singlehost.yaml 資訊清單:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. 套用資訊清單:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    envsubst 指令會取代資訊清單中的環境變數。

GKE 會建立 RayCluster 自訂資源,其中包含 2x4 拓撲中的 TPU v5e 單一主機。workergroup

Mistral-7B

完成下列步驟,建立 RayCluster 自訂資源來部署 Mistral-7B 模型:

  1. 檢查 ray-cluster.tpu-v5e-singlehost.yaml 資訊清單:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. 套用資訊清單:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    envsubst 指令會取代資訊清單中的環境變數。

GKE 會建立 RayCluster 自訂資源,其中包含 2x4 拓撲中的 TPU v5e 單一主機。workergroup

Llama 3.1 70B

完成下列步驟,建立 RayCluster 自訂資源來部署 Llama 3.1 70B 模型:

  1. 檢查 ray-cluster.tpu-v6e-singlehost.yaml 資訊清單:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 2x4
  2. 套用資訊清單:

    envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    envsubst 指令會取代資訊清單中的環境變數。

GKE 會建立 RayCluster 自訂資源,其中包含 workergroup,並在 2x4 拓撲中包含 TPU v6e 單一主機。

連線至 RayCluster 自訂資源

建立 RayCluster 自訂資源後,您就可以連線至 RayCluster 資源,並開始提供模型服務。

  1. 確認 GKE 是否已建立 RayCluster 服務:

    kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
        --output wide
    

    輸出結果會與下列內容相似:

    NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
    vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###
    

    請等到 STATUS 變成 ready,且 HEAD POD IPHEAD SERVICE IP 欄位顯示 IP 位址。

  2. 建立與 Ray 首節點的 port-forwarding 工作階段:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 10001:10001"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
    
  3. 確認 Ray 用戶端可以連線至遠端 RayCluster 自訂資源:

    docker run --net=host -it ${VLLM_IMAGE} \
    ray list nodes --address http://localhost:8265
    

    輸出結果會與下列內容相似:

    ======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
    Stats:
    ------------------------------
    Total: 2
    
    Table:
    ------------------------------
        NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
    0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                               memory: #.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               node:__internal_head__: 1.0
                                                                                               object_store_memory: #.### GiB
    1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                               TPU: 8.0
                                                                                               TPU-v#e-8-head: 1.0
                                                                                               accelerator_type:TPU-V#E: 1.0
                                                                                               memory: ###.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               object_store_memory: ##.### GiB
                                                                                               tpu-group-0: 1.0
    

使用 vLLM 部署模型

如要使用 vLLM 部署特定模型,請按照這些操作說明操作。

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"}}'

查看 Ray 資訊主頁

您可以透過 Ray 資訊主頁查看 Ray Serve 部署作業和相關記錄。

  1. 按一下 Cloud Shell 工作列右上方的「Web Preview」「網頁預覽」圖示按鈕。
  2. 按一下「變更通訊埠」,然後將通訊埠編號設為 8265
  3. 按一下「變更並預覽」
  4. 在 Ray 資訊主頁中,按一下「Serve」分頁標籤。

當「服務」部署作業的狀態為 HEALTHY 時,模型即可開始處理輸入內容。

提供模型

本指南著重介紹支援文字生成的模型,這項技術可根據提示詞建立文字內容。

Llama-3-8B-Instruct

  1. 設定通訊埠轉送至伺服器:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. 將提示傳送至「提供服務」端點:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Mistral-7B

  1. 設定通訊埠轉送至伺服器:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. 將提示傳送至「提供服務」端點:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Llama 3.1 70B

  1. 設定通訊埠轉送至伺服器:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. 將提示傳送至「提供服務」端點:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

額外設定

您可以視需要設定 Ray Serve 架構支援的下列提供模型資源和技術:

部署 RayService

您可以使用 RayService 自訂資源,部署本教學課程中的相同模型。

  1. 刪除您在本教學課程中建立的 RayCluster 自訂資源:

    kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
    
  2. 建立 RayService 自訂資源,以部署模型:

    Llama-3-8B-Instruct

    1. 檢查 ray-service.tpu-v5e-singlehost.yaml 資訊清單:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 套用資訊清單:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      envsubst 指令會取代資訊清單中的環境變數。

      GKE 會建立 RayService,其中包含 workergroup,而 workergroup 拓撲中的 TPU v5e 單一主機。2x4

    Mistral-7B

    1. 檢查 ray-service.tpu-v5e-singlehost.yaml 資訊清單:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 套用資訊清單:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      envsubst 指令會取代資訊清單中的環境變數。

      GKE 會建立 RayService,其中包含 2x4 拓撲中的 TPU v5e 單一主機。workergroup

    Llama 3.1 70B

    1. 檢查 ray-service.tpu-v6e-singlehost.yaml 資訊清單:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 套用資訊清單:

      envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      envsubst 指令會取代資訊清單中的環境變數。

    GKE 會建立 RayCluster 自訂資源,並在其中部署 Ray Serve 應用程式,然後建立後續的 RayService 自訂資源。

  3. 驗證 RayService 資源的狀態:

    kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu
    

    等待服務狀態變更為 Running

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          1
    
  4. 擷取 RayCluster 標頭服務的名稱:

    SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
        --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
    
  5. 建立 port-forwarding 工作階段,連線至 Ray 首節點,查看 Ray 資訊主頁:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    
  6. 查看 Ray 資訊主頁

  7. 提供模型

  8. 清理 RayService 資源:

    kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu
    

使用模型組合功能撰寫多個模型

模型組合是一種技術,可將多個模型組合成單一應用程式。

在本節中,您會使用 GKE 叢集,將 Llama 3 8B IT 和 Gemma 7B IT 這兩個模型組合成單一應用程式:

  • 第一個模型是助理模型,會回答提示中提出的問題。
  • 第二個模型是摘要模型。助理模型輸出內容會串連至摘要模型輸入內容。最終結果是助理模型回覆的摘要版本。
  1. 如要存取 Gemma 模型,請完成下列步驟:

    1. 登入 Kaggle 平台、簽署授權同意書,然後取得 Kaggle API 權杖。在本教學課程中,您會使用 Kubernetes Secret 儲存 Kaggle 憑證。
    2. 前往 Kaggle.com 的模型同意聲明頁面
    3. 登入 Kaggle (如果尚未登入)。
    4. 按一下「要求存取權」
    5. 在「選擇同意聲明適用的帳戶」部分,選取「透過 Kaggle 帳戶驗證」,使用 Kaggle 帳戶授予同意聲明。
    6. 接受模型條款及細則
  2. 設定環境:

    export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
    export SUMMARIZER_MODEL_ID=google/gemma-7b-it
    
  3. 如果是標準叢集,請建立額外的單一主機 TPU 配量節點集區:

    gcloud container node-pools create tpu-2 \
      --location=${COMPUTE_ZONE} \
      --cluster=${CLUSTER_NAME} \
      --machine-type=MACHINE_TYPE \
      --num-nodes=1
    

    MACHINE_TYPE 替換為下列任一機器類型:

    • ct5lp-hightpu-8t 來佈建 TPU v5e。
    • ct6e-standard-8t 佈建 TPU v6e。

    Autopilot 叢集會自動佈建必要節點。

  4. 根據要使用的 TPU 版本部署 RayService 資源:

    TPU v5e

    1. 檢查 ray-service.tpu-v5e-singlehost.yaml 資訊清單:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 套用資訊清單:

      envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

    TPU v6e

    1. 檢查 ray-service.tpu-v6e-singlehost.yaml 資訊清單:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. 套用資訊清單:

      envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      
  5. 等待 RayService 資源的狀態變更為 Running

    kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
    

    輸出結果會與下列內容相似:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          2
    

    在這個輸出內容中,RUNNING 狀態表示 RayService 資源已就緒。

  6. 確認 GKE 是否已為 Ray Serve 應用程式建立 Service:

    kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc
    

    輸出結果會與下列內容相似:

    NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
    vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###
    
  7. 建立與 Ray 首節點的 port-forwarding 工作階段:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &
    
  8. 向模型傳送要求:

    curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'
    

    輸出結果會與下列內容相似:

      {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}
    

建構及部署 TPU 映像檔

本教學課程使用 vLLM 的代管 TPU 映像檔。vLLM 提供 Dockerfile.tpu 映像檔,可根據包含 TPU 依附元件的必要 PyTorch XLA 映像檔建構 vLLM。不過,您也可以自行建構及部署 TPU 映像檔,進一步控管 Docker 映像檔的內容。

  1. 建立 Docker 存放區,用於儲存本指南的容器映像檔:

    gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
    gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev
    
  2. 複製 vLLM 存放區:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    
  3. 建構映像檔:

    docker build -f ./docker/Dockerfile.tpu . -t vllm-tpu
    
  4. 使用 Artifact Registry 名稱標記 TPU 映像檔:

    export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
    docker tag vllm-tpu ${VLLM_IMAGE}
    

    TAG 替換為要定義的標記名稱。如未指定,Docker 會套用預設的最新標記。

  5. 將映像檔推送至 Artifact Registry:

    docker push ${VLLM_IMAGE}
    

刪除個別資源

如果您使用現有專案,但不想刪除專案,可以刪除個別資源。

  1. 刪除 RayCluster 自訂資源:

    kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu
    
  2. 刪除 Cloud Storage bucket:

    gcloud storage rm -r gs://${GSBUCKET}
    
  3. 刪除 Artifact Registry 存放區:

    gcloud artifacts repositories delete vllm-tpu \
        --location=${COMPUTE_REGION}
    
  4. 刪除叢集:

    gcloud container clusters delete ${CLUSTER_NAME} \
        --location=LOCATION
    

    LOCATION 替換為下列任一環境變數:

    • 如果是 Autopilot 叢集,請使用 COMPUTE_REGION
    • 如果是 Standard 叢集,請使用 COMPUTE_ZONE

刪除專案

如果您是在新 Google Cloud 專案中部署本教學課程,且已不再需要該專案,請完成下列步驟來刪除專案:

  1. 前往 Google Cloud 控制台的「Manage resources」(管理資源) 頁面。

    前往「Manage resources」(管理資源)

  2. 在專案清單中選取要刪除的專案,然後點選「Delete」(刪除)
  3. 在對話方塊中輸入專案 ID,然後按一下 [Shut down] (關閉) 以刪除專案。

後續步驟