
Serve an LLM using TPUs on GKE with KubeRay


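This tutorial shows you how to serve large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Ray Operator add-on and the vLLM serving framework.

In this tutorial, you can serve LLMs on TPU v5e or TPU Trillium (v6e) as follows:

Llama 3 8B Instruct on a single-host TPU v5e.
Mistral 7B Instruct v0.3 on a single-host TPU v5e.
Llama 3.1 70B on a single-host TPU Trillium (v6e).

This guide is for generative AI customers, new and existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve models using Ray, on TPUs, with vLLM.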

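Background

This section describes the key technologies used in this guide.

GKE managed Kubernetes service

Google Cloud offers a wide range of services, including GKE, which is well suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure, including scalable resources, distributed computing, and efficient networking, to handle the computational demands of LLMs.

To learn more about key Kubernetes concepts, see "Start learning about Kubernetes". To learn more about GKE and how it helps you scale, automate, and manage Kubernetes, see the GKE overview.

Ray Operator

The Ray Operator add-on on GKE provides an end-to-end AI/ML platform for serving, training, and fine-tuning machine learning workloads. In this tutorial, you use Ray Serve, a framework in Ray, to serve popular LLMs from Hugging Face.

TPU

TPUs are Google's custom-developed application-specific integrated circuits (ASICs) that accelerate machine learning and AI models built with frameworks such as TensorFlow, PyTorch, and JAX.

This tutorial covers serving LLMs on TPU v5e or TPU Trillium (v6e) nodes, with the TPU topology configured based on each model's requirements for serving prompts with low latency.

vLLM

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on TPUs, with features such as the following:

Optimized transformer implementation with PagedAttention
Continuous batching to improve the overall serving throughput
Tensor parallelism and distributed serving on multiple GPUs

To learn more, see the vLLM documentation.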

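Objectives

This tutorial covers the following steps:

Create a GKE cluster with a TPU node pool.
Deploy a RayCluster custom resource with a single-host TPU slice. GKE deploys the RayCluster custom resource as Kubernetes Pods.
Serve an LLM.
Interact with the model.

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService custom resource.
Compose multiple models with model composition.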

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.


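If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set compute/zone instead. By setting a default location, you can avoid gcloud CLI errors like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

Create a Hugging Face account, if you don't already have one.
Ensure that you have a Hugging Face token.
Ensure that you have access to the Hugging Face model that you want to use. You usually gain this access by signing an agreement and requesting access from the model owner on the Hugging Face model page.
Ensure that you have the following IAM roles: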
- roles/container.admin
- roles/iam.serviceAccountAdmin
- roles/container.clusterAdmin
- roles/artifactregistry.writer

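Prepare the environment

Check that your Google Cloud project has enough quota for a single-host TPU v5e or a single-host TPU Trillium (v6e). To manage your quota, see "TPU quotas".
In the Google Cloud console, start a Cloud Shell instance.

Clone the sample repository: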

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples

Navigate to the working directory:
```
cd ai-ml/gke-ray/rayserve/llm
```

Set the default environment variables for GKE cluster creation:

Llama-3-8B-Instruct

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

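Replace the following:

HUGGING_FACE_TOKEN: your Hugging Face access token.
REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see "TPU availability in GKE".
ZONE: the zone with available TPU quota.
VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.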

Mistral-7B

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
export TOKENIZER_MODE=mistral
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

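Replace the following:

HUGGING_FACE_TOKEN: your Hugging Face access token.
REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see "TPU availability in GKE".
ZONE: the zone with available TPU quota.
VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.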

Llama 3.1 70B

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="meta-llama/Llama-3.1-70B"
export MAX_MODEL_LEN=8192
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

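Replace the following:

HUGGING_FACE_TOKEN: your Hugging Face access token.
REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see "TPU availability in GKE".
ZONE: the zone with available TPU quota.
VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.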
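
Before you create the cluster, it can help to confirm that the placeholder values in the export commands were actually replaced. The following is a minimal Python sketch and is not part of the sample repository. The function name find_unreplaced_placeholders is hypothetical, and the check covers only the three variables that this section asks you to replace.

```python
import os

# Placeholder strings that the export commands in this section contain
# verbatim until you replace them with real values.
PLACEHOLDERS = {
    "COMPUTE_REGION": "REGION",
    "COMPUTE_ZONE": "ZONE",
    "HF_TOKEN": "HUGGING_FACE_TOKEN",
}


def find_unreplaced_placeholders(env):
    """Return the names of variables that are unset or still hold a placeholder."""
    problems = []
    for name, placeholder in PLACEHOLDERS.items():
        value = env.get(name, "")
        if not value or value == placeholder:
            problems.append(name)
    return sorted(problems)


if __name__ == "__main__":
    missing = find_unreplaced_placeholders(os.environ)
    if missing:
        print("Set these variables before continuing:", ", ".join(missing))
    else:
        print("All placeholder variables are set.")
```

Run the script in the same shell session where you exported the variables, so that it sees the same environment.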

Pull the vLLM container image:

sudo usermod -aG docker ${USER}
newgrp docker
docker pull ${VLLM_IMAGE}

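Create a cluster

You can serve an LLM on TPUs with Ray in a GKE Autopilot or Standard cluster by using the Ray Operator add-on.

Best practice:

Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see "Choose a GKE mode of operation".

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

Create a GKE Autopilot cluster with the Ray Operator add-on enabled: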

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --release-channel=rapid \
    --location=${COMPUTE_REGION}

Standard

Create a Standard cluster with the Ray Operator add-on enabled:

gcloud container clusters create ${CLUSTER_NAME} \
    --release-channel=rapid \
    --location=${COMPUTE_ZONE} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --machine-type="n1-standard-4" \
    --addons=RayOperator,GcsFuseCsiDriver

Create a single-host TPU slice node pool:

Llama-3-8B-Instruct

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1

GKE creates a TPU v5e node pool with the ct5lp-hightpu-8t machine type.

Mistral-7B

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1

GKE creates a TPU v5e node pool with the ct5lp-hightpu-8t machine type.

Llama 3.1 70B

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct6e-standard-8t \
    --num-nodes=1

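GKE creates a TPU v6e node pool with the ct6e-standard-8t machine type.

Configure kubectl to communicate with your cluster

To configure kubectl to communicate with your cluster, run the following command: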

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

Standard

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -
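
Kubernetes stores Secret values base64-encoded in the data field, so the token in the generated manifest is encoded, not encrypted. As an illustration only, the following Python sketch builds a manifest dictionary similar to what the command pipes to kubectl apply. The helper names build_hf_secret and read_token are hypothetical, and the token in the usage example is a made-up value.

```python
import base64
import json


def build_hf_secret(token, namespace="default"):
    """Build a Secret manifest similar to the one the kubectl command generates."""
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "hf-secret", "namespace": namespace},
        "type": "Opaque",
        # The key name matches the secretKeyRef used by the RayCluster manifests.
        "data": {"hf_api_token": encoded},
    }


def read_token(secret):
    """Decode the token back out of a Secret manifest dictionary."""
    return base64.b64decode(secret["data"]["hf_api_token"]).decode("utf-8")


if __name__ == "__main__":
    manifest = build_hf_secret("hf_example_token")  # made-up token
    print(json.dumps(manifest, indent=2))
```

Because decoding is trivial, treat any file or log that contains this manifest as sensitive.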

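Create a Cloud Storage bucket

To speed up the vLLM deployment startup time and minimize the required disk space per node, use the Cloud Storage FUSE CSI driver to mount the downloaded model and the compilation cache on the Ray nodes.

In Cloud Shell, run the following command: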

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

This command creates a Cloud Storage bucket to store the model files that you download from Hugging Face.

Set up a Kubernetes ServiceAccount to access the bucket

Create the Kubernetes ServiceAccount:

kubectl create serviceaccount ${KSA_NAME} \
    --namespace ${NAMESPACE}

Grant the Kubernetes ServiceAccount read-write access to the Cloud Storage bucket:
```
gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
    --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
    --role "roles/storage.objectUser"
```
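
The --member value in the preceding command identifies the Kubernetes ServiceAccount as an IAM principal through Workload Identity Federation for GKE. The following Python sketch only shows how that string is assembled from the variables set earlier. The helper name workload_identity_principal and the project values in the example are hypothetical.

```python
def workload_identity_principal(project_number, project_id, namespace, ksa_name):
    """Return the IAM principal identifier for a Kubernetes ServiceAccount."""
    return (
        "principal://iam.googleapis.com/"
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{project_id}.svc.id.goog/"
        f"subject/ns/{namespace}/sa/{ksa_name}"
    )


if __name__ == "__main__":
    # Made-up project number and project ID, with the tutorial's default
    # namespace and ServiceAccount name.
    print(workload_identity_principal("123456789012", "example-project", "default", "vllm-sa"))
```

Note that the identifier uses the project number in the projects/ segment and the project ID in the workload pool name.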
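GKE creates the following resources for the LLM:
1. A Cloud Storage bucket to store the downloaded model and the compilation cache. A Cloud Storage FUSE CSI driver reads the content of the bucket.
2. Volumes with file caching enabled and the parallel download feature of Cloud Storage FUSE.
Best practice:
Use a file cache backed by tmpfs or by Hyperdisk / Persistent Disk, depending on the expected size of the model contents, for example, the weight files. In this tutorial, you use a Cloud Storage FUSE file cache backed by RAM.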
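
The file cache and parallel download settings are passed to the driver as a single comma-separated mountOptions string in the RayCluster manifests later in this tutorial. To make that string easier to read, the following Python sketch splits it into a nested dictionary. The parser is an illustration written for this page, not a Cloud Storage FUSE API, and the function name parse_mount_options is hypothetical.

```python
# The mountOptions value used by the RayCluster manifests in this tutorial.
MOUNT_OPTIONS = (
    "implicit-dirs,"
    "file-cache:enable-parallel-downloads:true,"
    "file-cache:parallel-downloads-per-file:100,"
    "file-cache:max-parallel-downloads:-1,"
    "file-cache:download-chunk-size-mb:10,"
    "file-cache:max-size-mb:-1"
)


def parse_mount_options(options):
    """Split 'flag,section:key:value' items into a nested dictionary."""
    parsed = {}
    for item in options.split(","):
        parts = item.split(":")
        if len(parts) == 1:
            # A bare flag such as implicit-dirs.
            parsed[parts[0]] = True
            continue
        *path, value = parts
        node = parsed
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return parsed


if __name__ == "__main__":
    for section, settings in parse_mount_options(MOUNT_OPTIONS).items():
        print(section, settings)
```

Values are kept as strings, exactly as they appear in the manifest.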

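Deploy a RayCluster custom resource

Deploy a RayCluster custom resource, which typically consists of one head (system) Pod and multiple worker Pods.

Llama-3-8B-Instruct

Complete the following steps to create the RayCluster custom resource that deploys the Llama 3 8B instruction-tuned model:

Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest: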

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v5e single host in a 2x4 topology.
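
To show what the envsubst step does to the manifest, the following Python sketch performs the same kind of $NAME substitution with string.Template from the standard library. It is an analogy, not a replacement for envsubst: unlike envsubst, which substitutes an empty string for an unset variable, Template.substitute raises KeyError, which makes a forgotten export easier to notice. The snippet and the image value in the usage example are shortened, made-up stand-ins for the real manifest.

```python
from string import Template

# A shortened stand-in for the lines of the manifest that contain variables.
MANIFEST_SNIPPET = """\
serviceAccountName: $KSA_NAME
image: $VLLM_IMAGE
bucketName: $GSBUCKET
"""


def substitute(manifest_text, variables):
    """Replace $NAME references; raise KeyError if a variable is missing."""
    return Template(manifest_text).substitute(variables)


if __name__ == "__main__":
    rendered = substitute(
        MANIFEST_SNIPPET,
        {
            "KSA_NAME": "vllm-sa",
            "VLLM_IMAGE": "example-image",  # made-up value
            "GSBUCKET": "vllm-tpu-bucket",
        },
    )
    print(rendered)
```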

Mistral-7B

Complete the following steps to create the RayCluster custom resource that deploys the Mistral-7B model:

Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v5e single host in a 2x4 topology.

Llama 3.1 70B

Complete the following steps to create the RayCluster custom resource that deploys the Llama 3.1 70B model:

Inspect the ray-cluster.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v6e single host in a 2x4 topology.
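
Each manifest in this section selects a 2x4 topology and requests google.com/tpu: "8" on the worker. The two values agree because the topology string describes the grid of TPU chips in the slice. The following Python sketch makes that arithmetic explicit; the function name chips_in_topology is hypothetical.

```python
from math import prod


def chips_in_topology(topology):
    """Return the number of TPU chips described by a topology such as '2x4'."""
    return prod(int(dimension) for dimension in topology.split("x"))


if __name__ == "__main__":
    topology = "2x4"       # value of cloud.google.com/gke-tpu-topology
    requested_chips = 8    # value of google.com/tpu in the worker container
    print(chips_in_topology(topology) == requested_chips)
```

If you change the topology in the nodeSelector, a check like this can flag a google.com/tpu request that no longer matches it.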

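Connect to the RayCluster custom resource

After the RayCluster custom resource is created, you can connect to the RayCluster resource and start serving the model.

Verify that GKE created the RayCluster and its head Service: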

kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
    --output wide

The output is similar to the following:

NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###

Wait until STATUS is ready and the HEAD POD IP and HEAD SERVICE IP columns show an IP address.
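
If you want to script this wait, the following Python sketch checks the readiness condition against the two lines of the wide output shown above. It is an illustration under stated assumptions: it assumes that columns are separated by at least two spaces and that every column is populated, and it reports "not ready" whenever the row does not line up with the header. The function names and the IP addresses in the example are made up.

```python
import re


def parse_wide_row(header_line, row_line):
    """Map column names to values, or return None if the columns do not line up."""
    headers = re.split(r"\s{2,}", header_line.strip())
    values = re.split(r"\s{2,}", row_line.strip())
    if len(headers) != len(values):
        return None
    return dict(zip(headers, values))


def is_ready(row):
    """Return True when STATUS is ready and both head IP columns are filled in."""
    if row is None:
        return False
    return (
        row.get("STATUS") == "ready"
        and bool(row.get("HEAD POD IP"))
        and bool(row.get("HEAD SERVICE IP"))
    )


if __name__ == "__main__":
    header = ("NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   "
              "TPUS   STATUS   AGE   HEAD POD IP   HEAD SERVICE IP")
    # Made-up sample values.
    row = "vllm-tpu   1   1   102   208G   0   8   ready   5m   10.0.0.5   10.1.0.9"
    print(is_ready(parse_wide_row(header, row)))
```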

Establish port-forwarding sessions to the Ray head:

pkill -f "kubectl .* port-forward .* 8265:8265"
pkill -f "kubectl .* port-forward .* 10001:10001"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
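
Because the port-forward commands run in the background with their standard output discarded, a failed session is easy to miss. The following Python sketch, which is not part of the sample repository, checks whether something is listening on the forwarded local ports. The function name port_is_open is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8265 is the Ray Dashboard port and 10001 is the Ray client port.
    for port in (8265, 10001):
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"localhost:{port} is {state}")
```

An open port only shows that the local forwarder is listening; the Ray client check in the next step confirms that the cluster itself responds.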

Verify that the Ray client can connect to the remote RayCluster custom resource:

docker run --net=host -it ${VLLM_IMAGE} \
ray list nodes --address http://localhost:8265

The output is similar to the following:

======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                           memory: #.### GiB
                                                                                           node:###.###.###.###: 1.0
                                                                                           node:__internal_head__: 1.0
                                                                                           object_store_memory: #.### GiB
1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                           TPU: 8.0
                                                                                           TPU-v#e-8-head: 1.0
                                                                                           accelerator_type:TPU-V#E: 1.0
                                                                                           memory: ###.### GiB
                                                                                           node:###.###.###.###: 1.0
                                                                                           object_store_memory: ##.### GiB
                                                                                           tpu-group-0: 1.0

Deploy the model with vLLM

To deploy a specific model with vLLM, follow the instructions for that model.

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Llama-3.1-70B"}}'
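
The --runtime-env-json values in the preceding commands repeat, as literal strings, the settings that you exported earlier. As a minimal sketch, the following Python code builds the same flag from a dictionary of variables, so that the exported values and the flag cannot drift apart. The helper name runtime_env_arg is hypothetical. Values are converted to strings because Ray expects the env_vars entries of a runtime environment to be strings.

```python
import json
import os
import shlex


def runtime_env_arg(env_vars):
    """Return a shell-quoted --runtime-env-json flag for the given variables."""
    string_vars = {name: str(value) for name, value in env_vars.items()}
    payload = json.dumps({"env_vars": string_vars}, sort_keys=True)
    return "--runtime-env-json=" + shlex.quote(payload)


if __name__ == "__main__":
    # Forward only the variables that this tutorial sets for the model.
    names = ["MODEL_ID", "MAX_MODEL_LEN", "TOKENIZER_MODE"]
    selected = {name: os.environ[name] for name in names if name in os.environ}
    print(runtime_env_arg(selected))
```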

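View the Ray Dashboard

You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.

Click the Web Preview button at the top right of the Cloud Shell taskbar.
Click Change port and set the port number to 8265.
Click Change and Preview.
On the Ray Dashboard, click the Serve tab.

After the Serve deployment shows a HEALTHY status, the model is ready to process inputs.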

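Serve the model

This guide highlights models that support text generation, a technique that creates text content from a prompt.

Llama-3-8B-Instruct

Set up port forwarding to the server: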

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Send a prompt to the Serve endpoint:

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Example output:

{"prompt": "What
are the top 5 most popular programming languages? Be brief.", "text": " (Note:
This answer may change over time.)\n\nAccording to the TIOBE Index, a widely
followed measure of programming language popularity, the top 5 languages
are:\n\n1. JavaScript\n2. Python\n3. Java\n4. C++\n5. C#\n\nThese rankings are
based on a combination of search engine queries, web traffic, and online
courses. Keep in mind that other sources may have slightly different rankings.
(Source: TIOBE Index, August 2022)", "token_ids": [320, 9290, 25, 1115, 4320,
1253, 2349, 927, 892, 9456, 11439, 311, 279, 350, 3895, 11855, 8167, 11, 264,
13882, 8272, 6767, 315, 15840, 4221, 23354, 11, 279, 1948, 220, 20, 15823,
527, 1473, 16, 13, 13210, 198, 17, 13, 13325, 198, 18, 13, 8102, 198, 19, 13,
356, 23792, 20, 13, 356, 27585, 9673, 33407, 527, 3196, 389, 264, 10824, 315,
2778, 4817, 20126, 11, 3566, 9629, 11, 323, 2930, 14307, 13, 13969, 304, 4059,
430, 1023, 8336, 1253, 617, 10284, 2204, 33407, 13, 320, 3692, 25, 350, 3895,
11855, 8167, 11, 6287, 220, 2366, 17, 8, 128009]}

Mistral-7B

Set up port forwarding to the server:

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Send a prompt to the Serve endpoint:

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Example output:

{"prompt": "What are the top 5 most popular programming languages? Be brief.",
"text": "\n\n1. JavaScript: Widely used for web development, particularly for
client-side scripting and building dynamic web page content.\n\n2. Python:
Known for its simplicity and readability, it's widely used for web
development, machine learning, data analysis, and scientific computing.\n\n3.
Java: A general-purpose programming language used in a wide range of
applications, including Android app development, web services, and
enterprise-level applications.\n\n4. C#: Developed by Microsoft, it's often
used for Windows desktop apps, game development (Unity), and web development
(ASP.NET).\n\n5. TypeScript: A superset of JavaScript that adds optional
static typing and other features for large-scale, maintainable JavaScript
applications.", "token_ids": [781, 781, 29508, 29491, 27049, 29515, 1162,
1081, 1491, 2075, 1122, 5454, 4867, 29493, 7079, 1122, 4466, 29501, 2973,
7535, 1056, 1072, 4435, 11384, 5454, 3652, 3804, 29491, 781, 781, 29518,
29491, 22134, 29515, 1292, 4444, 1122, 1639, 26001, 1072, 1988, 3205, 29493,
1146, 29510, 29481, 13343, 2075, 1122, 5454, 4867, 29493, 6367, 5936, 29493,
1946, 6411, 29493, 1072, 11237, 22031, 29491, 781, 781, 29538, 29491, 12407,
29515, 1098, 3720, 29501, 15460, 4664, 17060, 4610, 2075, 1065, 1032, 6103,
3587, 1070, 9197, 29493, 3258, 13422, 1722, 4867, 29493, 5454, 4113, 29493,
1072, 19123, 29501, 5172, 9197, 29491, 781, 781, 29549, 29491, 1102, 29539,
29515, 9355, 1054, 1254, 8670, 29493, 1146, 29510, 29481, 3376, 2075, 1122,
9723, 25470, 14189, 29493, 2807, 4867, 1093, 2501, 1240, 1325, 1072, 5454,
4867, 1093, 2877, 29521, 29491, 12466, 1377, 781, 781, 29550, 29491, 6475,
7554, 29515, 1098, 26434, 1067, 1070, 27049, 1137, 14401, 12052, 1830, 25460,
1072, 1567, 4958, 1122, 3243, 29501, 6473, 29493, 9855, 1290, 27049, 9197,
29491, 2]}

Llama 3.1 70B

Set up port forwarding to the server:

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Send a prompt to the Serve endpoint:

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Example output:

{"prompt": "What are
the top 5 most popular programming languages? Be brief.", "text": " This is a
very subjective question, but there are some general guidelines to follow when
selecting a language. For example, if you\u2019re looking for a language
that\u2019s easy to learn, you might want to consider Python. It\u2019s one of
the most popular languages in the world, and it\u2019s also relatively easy to
learn. If you\u2019re looking for a language that\u2019s more powerful, you
might want to consider Java. It\u2019s a more complex language, but it\u2019s
also very popular. Whichever language you choose, make sure you do your
research and pick one that\u2019s right for you.\nThe most popular programming
languages are:\nWhy is C++ so popular?\nC++ is a powerful and versatile
language that is used in many different types of software. It is also one of
the most popular programming languages, with a large community of developers
who are always creating new and innovative ways to use it. One of the reasons
why C++ is so popular is because it is a very efficient language. It allows
developers to write code that is both fast and reliable, which is essential
for many types of software. Additionally, C++ is very flexible, meaning that
it can be used for a wide range of different purposes. Finally, C++ is also
very popular because it is easy to learn. There are many resources available
online and in books that can help anyone get started with learning the
language.\nJava is a versatile language that can be used for a variety of
purposes. It is one of the most popular programming languages in the world and
is used by millions of people around the globe. Java is used for everything
from developing desktop applications to creating mobile apps and games. It is
also a popular choice for web development. One of the reasons why Java is so
popular is because it is a platform-independent language. This means that it
can be used on any type of computer or device, regardless of the operating
system. Java is also very versatile and can be used for a variety of different
purposes.", "token_ids": [1115, 374, 264, 1633, 44122, 3488, 11, 719, 1070,
527, 1063, 4689, 17959, 311, 1833, 994, 27397, 264, 4221, 13, 1789, 3187, 11,
422, 499, 3207, 3411, 369, 264, 4221, 430, 753, 4228, 311, 4048, 11, 499,
2643, 1390, 311, 2980, 13325, 13, 1102, 753, 832, 315, 279, 1455, 5526, 15823,
304, 279, 1917, 11, 323, 433, 753, 1101, 12309, 4228, 311, 4048, 13, 1442,
499, 3207, 3411, 369, 264, 4221, 430, 753, 810, 8147, 11, 499, 2643, 1390,
311, 2980, 8102, 13, 1102, 753, 264, 810, 6485, 4221, 11, 719, 433, 753, 1101,
1633, 5526, 13, 1254, 46669, 4221, 499, 5268, 11, 1304, 2771, 499, 656, 701,
3495, 323, 3820, 832, 430, 753, 1314, 369, 499, 627, 791, 1455, 5526, 15840,
15823, 527, 512, 10445, 374, 356, 1044, 779, 5526, 5380, 34, 1044, 374, 264,
8147, 323, 33045, 4221, 430, 374, 1511, 304, 1690, 2204, 4595, 315, 3241, 13,
1102, 374, 1101, 832, 315, 279, 1455, 5526, 15840, 15823, 11, 449, 264, 3544,
4029, 315, 13707, 889, 527, 2744, 6968, 502, 323, 18699, 5627, 311, 1005, 433,
13, 3861, 315, 279, 8125, 3249, 356, 1044, 374, 779, 5526, 374, 1606, 433,
374, 264, 1633, 11297, 4221, 13, 1102, 6276, 13707, 311, 3350, 2082, 430, 374,
2225, 5043, 323, 15062, 11, 902, 374, 7718, 369, 1690, 4595, 315, 3241, 13,
23212, 11, 356, 1044, 374, 1633, 19303, 11, 7438, 430, 433, 649, 387, 1511,
369, 264, 7029, 2134, 315, 2204, 10096, 13, 17830, 11, 356, 1044, 374, 1101,
1633, 5526, 1606, 433, 374, 4228, 311, 4048, 13, 2684, 527, 1690, 5070, 2561,
2930, 323, 304, 6603, 430, 649, 1520, 5606, 636, 3940, 449, 6975, 279, 4221,
627, 15391, 374, 264, 33045, 4221, 430, 649, 387, 1511, 369, 264, 8205, 315,
10096, 13, 1102, 374, 832, 315, 279, 1455, 5526, 15840, 15823, 304, 279, 1917,
323, 374, 1511, 555, 11990, 315, 1274, 2212, 279, 24867, 13, 8102, 374, 1511,
369, 4395, 505, 11469, 17963, 8522, 311, 6968, 6505, 10721, 323, 3953, 13,
1102, 374, 1101, 264, 5526, 5873, 369, 3566, 4500, 13, 3861, 315, 279, 8125,
3249, 8102, 374, 779, 5526, 374, 1606, 433, 374, 264, 5452, 98885, 4221, 13,
1115, 3445, 430, 433, 649, 387, 1511, 389, 904, 955, 315, 6500, 477, 3756, 11,
15851, 315, 279, 10565, 1887, 13, 8102, 374, 1101, 1633, 33045, 323, 649, 387,
1511, 369, 264, 8205, 315, 2204, 10096, 13, 128001]}
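
The response above carries the generated text together with a `token_ids` list. As a minimal sketch, assuming the response is a JSON object with a `token_ids` array as in the sample, and assuming that ID 128001 is the end-of-text token of the Llama 3 tokenizer (this tutorial does not state that), the following helper reports how many tokens were returned and whether the sequence ends with that marker. The function name `summarize_generation` is hypothetical.

```python
import json

# Assumption: 128001 is the end-of-text token ID in the Llama 3 tokenizer.
LLAMA3_END_OF_TEXT_ID = 128001


def summarize_generation(response_body, eos_token_id=LLAMA3_END_OF_TEXT_ID):
    """Summarize a JSON response that contains a `token_ids` list."""
    payload = json.loads(response_body)
    token_ids = payload.get("token_ids", [])
    return {
        "num_tokens": len(token_ids),
        # True when the last token is the assumed end-of-text marker, which
        # suggests the model stopped on its own instead of being cut off.
        "finished_naturally": bool(token_ids) and token_ids[-1] == eos_token_id,
    }


if __name__ == "__main__":
    sample = '{"text": "Java is also very versatile.", "token_ids": [8102, 374, 128001]}'
    print(summarize_generation(sample))
```

A response whose last ID is not the end-of-text marker may have been truncated, for example by the `max_tokens` limit in the request.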

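Additional configuration

You can optionally configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService custom resource. In the preceding steps of this tutorial, you used RayCluster instead of RayService. We recommend RayService for production environments.
Compose multiple models with model composition. Configure the model multiplexing and model composition that the Ray Serve framework supports. Model composition lets you chain together the inputs and outputs of multiple LLMs and scale your models as a single application.
Build and deploy your own TPU image. We recommend this option if you need finer-grained control over the contents of your Docker image.

Deploy a RayService

You can deploy the same models from this tutorial by using a RayService custom resource.

Delete the RayCluster custom resource that you created in this tutorial: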
```
kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
```

Create the RayService custom resource to deploy a model:

Llama-3-8B-Instruct

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.

GKE creates a RayService with a workergroup that contains a TPU v5e single-host in a 2x4 topology.
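
Because envsubst typically substitutes an empty string for any variable that is not set, a missing export can silently produce a broken manifest. The following is a minimal sketch, not part of the tutorial's tooling, that lists the `$NAME` and `${NAME}` references in a manifest and reports the ones that are unset. The helper names `find_placeholders` and `missing_variables` are hypothetical.

```python
import os
import re

# Matches $NAME and ${NAME}, the two reference forms that envsubst expands.
_PLACEHOLDER = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def find_placeholders(manifest_text):
    """Return the set of shell-style variable names referenced in the text."""
    return {braced or bare for braced, bare in _PLACEHOLDER.findall(manifest_text)}


def missing_variables(manifest_text, env=None):
    """Return the referenced variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return sorted(
        name for name in find_placeholders(manifest_text) if not env.get(name)
    )


if __name__ == "__main__":
    # A short excerpt in the style of the manifest above.
    snippet = 'MODEL_ID: "$MODEL_ID"\nimage: $VLLM_IMAGE\nbucketName: $GSBUCKET\n'
    print(missing_variables(snippet, env={"MODEL_ID": "example/model"}))
```

Running a check like this against the full manifest before piping it through envsubst makes it easier to spot a variable such as GSBUCKET or KSA_NAME that was never exported in the current shell.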

Mistral-7B

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.

GKE creates a RayService with a workergroup that contains a TPU v5e single-host in a 2x4 topology.

Llama 3.1 70B

Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

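The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource where the Ray Serve application is deployed, and then creates the subsequent RayService custom resource.

Verify the status of the RayService resource:

kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu

Wait for the Service status to change to Running: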

NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          1
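
Waiting for the Running status can be scripted. The following is a minimal sketch, assuming that kubectl is on the PATH and that the table layout matches the sample output above. The function names, the timeout, and the polling interval are hypothetical choices.

```python
import subprocess
import time


def parse_service_status(table_text):
    """Map each RayService name to the value in its SERVICE STATUS column."""
    lines = [line for line in table_text.strip().splitlines() if line.strip()]
    statuses = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2:
            statuses[fields[0]] = fields[1]
    return statuses


def wait_until_running(namespace, name="vllm-tpu", timeout_s=1800.0, interval_s=15.0):
    """Poll `kubectl get rayservices/<name>` until the status is Running.

    Returns True once the status is Running, or False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["kubectl", "--namespace", namespace, "get", f"rayservices/{name}"],
            capture_output=True, text=True, check=False,
        )
        if parse_service_status(result.stdout).get(name) == "Running":
            return True
        time.sleep(interval_s)
    return False
```

The parsing step is kept separate from the polling loop so that it can be checked against sample output without access to a cluster.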

Retrieve the name of the RayCluster head service:
```
SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
    --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
```
Note: If the RayCluster head service value was not retrieved, run the kubectl get services --namespace ${NAMESPACE} command and set the SERVICE_NAME value manually.
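
For the manual fallback, the following sketch picks a candidate head service from a list of Service names. It assumes that the list comes from `kubectl get services --output name` and that the head service name ends in `-head-svc`, which is a KubeRay naming convention that this tutorial does not confirm. The function name `pick_head_service` and the sample names are hypothetical.

```python
def pick_head_service(names_output, suffix="-head-svc"):
    """Return the first Service name ending in `suffix`, or None if absent.

    `names_output` is expected to contain one `service/<name>` entry per line.
    """
    for line in names_output.splitlines():
        # Drop the "service/" resource prefix if it is present.
        name = line.strip().split("/", 1)[-1]
        if name.endswith(suffix):
            return name
    return None


if __name__ == "__main__":
    # Hypothetical Service names, for illustration only.
    listing = "service/vllm-tpu-serve-svc\nservice/vllm-tpu-raycluster-abcde-head-svc\n"
    print(pick_head_service(listing))
```

If the function returns None, inspect the full Service list by hand instead of relying on the suffix.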

Establish a port-forwarding session to the Ray head to view the Ray Dashboard:

pkill -f "kubectl .* port-forward .* 8265:8265"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &

View the Ray Dashboard.
Serve the model.

Clean up the RayService resource:

kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu

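Compose multiple models with model composition

Model composition is a technique for composing multiple models into a single application.

In this section, you use a GKE cluster to compose two models, Llama 3 8B IT and Gemma 7B IT, into a single application:

The first model is the assistant model, which answers the questions asked in the prompt.
The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is a summarized version of the assistant model's response.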
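
The chaining described above can be sketched in plain Python. This is an illustration of the data flow only, not the Ray Serve application that this tutorial deploys: each model is represented by an ordinary callable, and the summarization prompt wording is an assumption.

```python
from typing import Callable

# A "model" here is any callable that maps a prompt string to a text response.
Model = Callable[[str], str]


def compose(assistant: Model, summarizer: Model) -> Model:
    """Chain two models: the assistant's output becomes the summarizer's input."""

    def pipeline(prompt: str) -> str:
        answer = assistant(prompt)
        # Hypothetical instruction text; the deployed application may differ.
        return summarizer(f"Summarize the following text in a single sentence: {answer}")

    return pipeline


if __name__ == "__main__":
    # Stand-in models for illustration; real ones would call an LLM.
    def fake_assistant(prompt: str) -> str:
        return f"[long answer to: {prompt}]"

    def fake_summarizer(text: str) -> str:
        return f"[summary of: {text}]"

    app = compose(fake_assistant, fake_summarizer)
    print(app("What is the most popular programming language?"))
```

The caller sees a single entry point, which is what lets the combined models be exposed and scaled as one application.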

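To get access to the Gemma model, complete the following steps:
1. Sign in to the Kaggle platform, sign the license consent agreement, and get a Kaggle API token. In this tutorial, you use a Kubernetes Secret to store the Kaggle credentials.
2. Go to the model consent page on Kaggle.com.
3. Sign in to Kaggle, if you haven't done so already.
4. Click Request Access.
5. In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account to grant consent.
6. Accept the model Terms and Conditions.

Set up your environment: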

export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
export SUMMARIZER_MODEL_ID=google/gemma-7b-it

For Standard clusters, create an additional single-host TPU slice node pool:
```
gcloud container node-pools create tpu-2 \
  --location=${COMPUTE_ZONE} \
  --cluster=${CLUSTER_NAME} \
  --machine-type=MACHINE_TYPE \
  --num-nodes=1
```
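Replace MACHINE_TYPE with one of the following machine types:
- ct5lp-hightpu-8t to provision TPU v5e.
- ct6e-standard-8t to provision TPU v6e.
Autopilot clusters automatically provision the required nodes.

Deploy the RayService resource based on the TPU version that you want to use:

TPU v5e

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest: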

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
      deployments:
      - name: MultiModelDeployment
        num_replicas: 1
      runtime_env:
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
          SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
          TPU_CHIPS: "16"
          TPU_HEADS: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 2
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: llm
            image: $VLLM_IMAGE
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

TPU v6e

Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
      deployments:
      - name: MultiModelDeployment
        num_replicas: 1
      runtime_env:
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
          SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
          TPU_CHIPS: "16"
          TPU_HEADS: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 2
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: llm
            image: $VLLM_IMAGE
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
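
Both model composition manifests set TPU_CHIPS to "16" and TPU_HEADS to "2", which lines up with two worker replicas that each request eight `google.com/tpu` chips. The following sketch recomputes those values. It assumes that TPU_CHIPS equals replicas × numOfHosts × chips per host and that TPU_HEADS equals the replica count. The tutorial does not state either rule, so treat this as a consistency check under those assumptions. The function names are hypothetical.

```python
def expected_tpu_chips(replicas, num_of_hosts, chips_per_host):
    """Total chips across all worker Pods in one worker group."""
    return replicas * num_of_hosts * chips_per_host


def check_tpu_env(replicas, num_of_hosts, chips_per_host, env_vars):
    """Return human-readable mismatches; an empty list means consistent."""
    problems = []
    chips = expected_tpu_chips(replicas, num_of_hosts, chips_per_host)
    if int(env_vars["TPU_CHIPS"]) != chips:
        problems.append(f"TPU_CHIPS is {env_vars['TPU_CHIPS']}, expected {chips}")
    # Assumption: TPU_HEADS counts the worker replicas (one per TPU slice).
    if int(env_vars["TPU_HEADS"]) != replicas:
        problems.append(f"TPU_HEADS is {env_vars['TPU_HEADS']}, expected {replicas}")
    return problems


if __name__ == "__main__":
    # replicas, numOfHosts, google.com/tpu, and env_vars as set in the manifests above.
    print(check_tpu_env(2, 1, 8, {"TPU_CHIPS": "16", "TPU_HEADS": "2"}))
```

If you change `replicas` or the `google.com/tpu` request in the worker group, a check like this flags env_vars that no longer match.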

Wait for the status of the RayService resource to change to Running:
```
kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
```
The output is similar to the following:
```
NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          2
```
In this output, the Running status indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc

The output is similar to the following:

NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###

Establish port-forwarding sessions to the Ray head:

pkill -f "kubectl .* port-forward .* 8265:8265"
pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8265:8265 2>&1 >/dev/null &
kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &

Send a request to the model:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'

The output is similar to the following:

  {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}
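
The same request can be sent from Python. The following is a minimal sketch that uses only the standard library and assumes the port-forwarding session to port 8000 is active. The request fields mirror the curl command above, while the function names and the timeout are hypothetical.

```python
import json
import urllib.request


def build_request(prompt, max_tokens=1000, url="http://localhost:8000/"):
    """Build the POST request that mirrors the curl command above."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_tokens=1000, timeout_s=300.0):
    """Send the prompt and return the decoded JSON response as a dict."""
    request = build_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires the port-forwarding session to be running):
#   reply = generate("What is the most popular programming language "
#                    "for machine learning and why?")
#   print(reply["text"][0])  # "text" holds a list of strings in the sample output
```

Building the request separately from sending it makes the payload easy to inspect or log before it goes to the model.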

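Build and deploy the TPU image

This tutorial uses hosted TPU images from vLLM. vLLM provides a Dockerfile.tpu image that builds vLLM on top of the required PyTorch XLA image, which includes the TPU dependencies. However, you can also build and deploy your own TPU image for finer-grained control over the contents of your Docker image.

Create a Docker repository to store the container images for this guide: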

gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev

Clone the vLLM repository:

git clone https://github.com/vllm-project/vllm.git
cd vllm

Build the image:

docker build -f ./docker/Dockerfile.tpu . -t vllm-tpu

Tag the TPU image with your Artifact Registry name:
```
export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
docker tag vllm-tpu ${VLLM_IMAGE}
```
Replace TAG with the name of the tag that you want to define. If you don't specify a tag, Docker applies the default latest tag.

Push the image to Artifact Registry:
```
docker push ${VLLM_IMAGE}
```
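
The image name used in the tag and push steps follows a fixed pattern. As a small sketch, the following helper assembles it from its parts, with the repository and image both named vllm-tpu as in the commands above. The function name and the sample region and project values are hypothetical.

```python
def artifact_registry_image(region, project_id, tag="latest",
                            repository="vllm-tpu", image="vllm-tpu"):
    """Build the image path in the form used by the VLLM_IMAGE variable."""
    if not tag:
        # Mirror Docker's behavior of falling back to the "latest" tag.
        tag = "latest"
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


if __name__ == "__main__":
    # Hypothetical region and project ID, for illustration only.
    print(artifact_registry_image("us-central1", "my-project", "v1"))
```

Using an explicit tag instead of latest makes it clearer which build a RayService manifest is running.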

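Delete the individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources.

Delete the RayCluster custom resource: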

kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu

Delete the Cloud Storage bucket:
```
gcloud storage rm -r gs://${GSBUCKET}
```

Delete the Artifact Registry repository:

gcloud artifacts repositories delete vllm-tpu \
    --location=${COMPUTE_REGION}

Delete the cluster:
```
gcloud container clusters delete ${CLUSTER_NAME} \
    --location=LOCATION
```
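Replace LOCATION with either of the following environment variables:
- For Autopilot clusters, use COMPUTE_REGION.
- For Standard clusters, use COMPUTE_ZONE.

Delete the project

If you deployed this tutorial in a new Google Cloud project, and you no longer need the project, delete it by completing the following steps:

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Learn how to use Ray Serve on GKE by viewing the sample code in GitHub.
Learn how to collect and view metrics for Ray clusters running on GKE by completing the steps in "Collect and view logs and metrics for Ray clusters on GKE".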
