Serve scalable LLMs on GKE with TorchServe

Autopilot

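This tutorial shows you how to deploy and serve a scalable machine learning (ML) model on a Google Kubernetes Engine (GKE) cluster by using the TorchServe framework. You serve a pre-trained PyTorch model that generates predictions based on user requests. After you deploy the model, you get a prediction URL that your application uses to send prediction requests. This approach lets you scale the model and the web application independently. When you deploy the ML workload and application on Autopilot, GKE chooses the most efficient underlying machine type and size to run the workloads.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who want to use GKE Autopilot to reduce the administrative overhead of node configuration, scaling, and upgrades. To learn more about the common roles and example tasks referenced in Google Cloud content, see "Common GKE user roles and tasks".

Before you read this page, make sure that you're familiar with GKE Autopilot mode.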

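About the tutorial application

The application is a small Python web application built with the Fast Dash framework. You use the application to send prediction requests to the T5 model. The application captures the text and the language pair that the user enters and sends them to the model. The model translates the text and returns the result to the application, which displays it to the user. For more information about Fast Dash, see the Fast Dash documentation.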

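Objectives

- Prepare a pre-trained T5 model from the Hugging Face repository for serving by packaging it as a container image and pushing it to Artifact Registry
- Deploy the model to an Autopilot cluster
- Deploy the Fast Dash application that communicates with the model
- Autoscale the model based on Prometheus metrics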

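Costs

This document uses billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the Pricing Calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see the "Clean up" section.

Before you begin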

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

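Note: If you previously installed the gcloud CLI, run gcloud components update to make sure that you have the latest version.

If you use an external identity provider (IdP), first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command: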

gcloud init

Create or select a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Kubernetes Engine, Cloud Storage, Artifact Registry, and Cloud Build APIs:

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

gcloud services enable container.googleapis.com storage.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com

Prepare your environment

Clone the example repository and open the tutorial directory:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples/ai-ml/t5-model-serving

Create a cluster

Run the following command:

gcloud container clusters create-auto ml-cluster \
    --release-channel=RELEASE_CHANNEL \
    --cluster-version=CLUSTER_VERSION \
    --location=us-central1

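Replace the following:

RELEASE_CHANNEL: the release channel for your cluster. Must be one of rapid, regular, or stable. To use L4 GPUs, choose a channel that offers GKE version 1.28.3-gke.1203000 or later. To see the versions available in a specific channel, see "View the default and available versions for release channels".
CLUSTER_VERSION: the GKE version to use. Must be 1.28.3-gke.1203000 or later.

This operation takes several minutes to complete.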
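
The minimum-version requirement above can be checked before you run the command. The following is a minimal sketch, assuming that GKE version strings follow the MAJOR.MINOR.PATCH-gke.BUILD pattern shown in this tutorial and that comparing the four numbers in order reflects release ordering. The function names are hypothetical and are not part of any Google Cloud library.

```python
import re
from typing import Tuple

# Minimum GKE version that this tutorial states for L4 GPU support.
MIN_VERSION = "1.28.3-gke.1203000"


def parse_gke_version(version: str) -> Tuple[int, ...]:
    """Split 'MAJOR.MINOR.PATCH-gke.BUILD' into a tuple of four integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version: str, minimum: str = MIN_VERSION) -> bool:
    """Return True if version is the same as or newer than minimum."""
    # Assumption: tuple comparison (major, minor, patch, build) matches
    # the order in which GKE versions are released.
    return parse_gke_version(version) >= parse_gke_version(minimum)
```

For example, `meets_minimum("1.27.8-gke.1067000")` is false, so that version would not satisfy the CLUSTER_VERSION requirement.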

Create an Artifact Registry repository

Create a new Artifact Registry standard repository with the Docker format in the same region as your cluster:

gcloud artifacts repositories create models \
    --repository-format=docker \
    --location=us-central1 \
    --description="Repo for T5 serving image"

Verify the repository name:

gcloud artifacts repositories describe models \
    --location=us-central1

The output is similar to the following:

Encryption: Google-managed key
Repository Size: 0.000MB
createTime: '2023-06-14T15:48:35.267196Z'
description: Repo for T5 serving image
format: DOCKER
mode: STANDARD_REPOSITORY
name: projects/PROJECT_ID/locations/us-central1/repositories/models
updateTime: '2023-06-14T15:48:35.267196Z'

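Package the model

In this section, you use Cloud Build to package the model and the serving framework in a single container image, and then push the resulting image to the Artifact Registry repository.

Review the Dockerfile for the container image: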

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG BASE_IMAGE=pytorch/torchserve:0.12.0-cpu

FROM alpine/git

ARG MODEL_NAME=t5-small
ARG MODEL_REPO=https://huggingface.co/${MODEL_NAME}
ENV MODEL_NAME=${MODEL_NAME}
ENV MODEL_VERSION=${MODEL_VERSION}

RUN git clone "${MODEL_REPO}" /model

FROM ${BASE_IMAGE}

ARG MODEL_NAME=t5-small
ARG MODEL_VERSION=1.0
ENV MODEL_NAME=${MODEL_NAME}
ENV MODEL_VERSION=${MODEL_VERSION}

COPY --from=0 /model/. /home/model-server/
COPY handler.py \
     model.py \
     requirements.txt \
     setup_config.json /home/model-server/

RUN  torch-model-archiver \
     --model-name="${MODEL_NAME}" \
     --version="${MODEL_VERSION}" \
     --model-file="model.py" \
     --serialized-file="pytorch_model.bin" \
     --handler="handler.py" \
     --extra-files="config.json,spiece.model,tokenizer.json,setup_config.json" \
     --runtime="python" \
     --export-path="model-store" \
     --requirements-file="requirements.txt"

FROM ${BASE_IMAGE}

ENV PATH /home/model-server/.local/bin:$PATH
ENV TS_CONFIG_FILE /home/model-server/config.properties
# CPU inference will throw a warning cuda warning (not error)
# Could not load dynamic library 'libnvinfer_plugin.so.7'
# This is expected behaviour. see: https://stackoverflow.com/a/61137388
ENV TF_CPP_MIN_LOG_LEVEL 2

COPY --from=1 /home/model-server/model-store/ /home/model-server/model-store
COPY config.properties /home/model-server/

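This Dockerfile defines the following multi-stage build process:

- Download the model artifacts from the Hugging Face repository.
- Package the model by using the Torch Model Archiver tool (torch-model-archiver). This creates a model archive (.mar) file that the inference server uses to load the model.
- Build the final image with TorchServe.

Build and push the image by using Cloud Build: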

gcloud builds submit model/ \
    --region=us-central1 \
    --config=model/cloudbuild.yaml \
    --substitutions=_LOCATION=us-central1,_MACHINE=gpu,_MODEL_NAME=t5-small,_MODEL_VERSION=1.0

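The build process takes several minutes to complete. If you use a model larger than t5-small, the build can take significantly longer.

Verify that the image is in the repository: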

gcloud artifacts docker images list us-central1-docker.pkg.dev/PROJECT_ID/models

Replace PROJECT_ID with your Google Cloud project ID.

The output is similar to the following:

IMAGE                                                     DIGEST         CREATE_TIME          UPDATE_TIME
us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small     sha256:0cd...  2023-06-14T12:06:38  2023-06-14T12:06:38
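
The image path in the output above follows the Artifact Registry pattern LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE. The following sketch composes that path in Python. The tag format VERSION-MACHINE is an assumption inferred from the 1.0-gpu tag in the Deployment manifest and the _MODEL_VERSION and _MACHINE substitutions; the cloudbuild.yaml file that actually sets the tag is not shown in this tutorial. The function name is hypothetical.

```python
def image_uri(location: str, project_id: str, repository: str,
              model_name: str, model_version: str, machine: str) -> str:
    """Compose an Artifact Registry Docker image path with a tag."""
    registry_host = f"{location}-docker.pkg.dev"
    # Assumption: the build tags the image as VERSION-MACHINE, for example 1.0-gpu.
    tag = f"{model_version}-{machine}"
    return f"{registry_host}/{project_id}/{repository}/{model_name}:{tag}"


# PROJECT_ID is left as the tutorial's placeholder here.
print(image_uri("us-central1", "PROJECT_ID", "models", "t5-small", "1.0", "gpu"))
```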

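Deploy the packaged model to GKE

To deploy the image, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Modify the Kubernetes manifest in the example repository to match your environment.

Review the manifest for the inference workload: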

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: t5-inference
  labels:
    model: t5
    version: v1.0
    machine: gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      model: t5
      version: v1.0
      machine: gpu
  template:
    metadata:
      labels:
        model: t5
        version: v1.0
        machine: gpu
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: inference
          image: us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small:1.0-gpu
          imagePullPolicy: IfNotPresent
          args: ["torchserve", "--start", "--foreground"]
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "3000m"
              memory: 16Gi
              ephemeral-storage: 10Gi
            requests:
              nvidia.com/gpu: "1"
              cpu: "3000m"
              memory: 16Gi
              ephemeral-storage: 10Gi
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 8081
              name: management
            - containerPort: 8082
              name: metrics
          readinessProbe:
            httpGet:
              path: /ping
              port: http
            initialDelaySeconds: 120
            failureThreshold: 10
          livenessProbe:
            httpGet:
              path: /models/t5-small
              port: management
            initialDelaySeconds: 150
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: t5-inference
  labels:
    model: t5
    version: v1.0
    machine: gpu
spec:
  type: ClusterIP
  selector:
    model: t5
    version: v1.0
    machine: gpu
  ports:
    - port: 8080
      name: http
      targetPort: http
    - port: 8081
      name: management
      targetPort: management
    - port: 8082
      name: metrics
      targetPort: metrics

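Replace PROJECT_ID in the manifest with your Google Cloud project ID. In the following command, substitute your project ID for the second PROJECT_ID. As written, the command replaces the placeholder with itself and changes nothing: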
```
sed -i "s/PROJECT_ID/PROJECT_ID/g" "kubernetes/serving-gpu.yaml"
```
This ensures that the container image path in the Deployment specification matches the path to your T5 model image in Artifact Registry.
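
The substitution that the sed command performs can be sketched in Python as follows. This is an illustration only, not part of the sample repository. The function name and the project ID my-project are hypothetical.

```python
def fill_project_id(manifest_text: str, project_id: str,
                    placeholder: str = "PROJECT_ID") -> str:
    """Replace every occurrence of the placeholder with a real project ID."""
    if not project_id or project_id == placeholder:
        # Guard against running the substitution with the placeholder itself,
        # which would leave the manifest unchanged.
        raise ValueError("pass your real project ID, not the placeholder")
    return manifest_text.replace(placeholder, project_id)


# The image line from the Deployment manifest in this tutorial.
line = "image: us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small:1.0-gpu"
print(fill_project_id(line, "my-project"))  # "my-project" is a hypothetical ID
```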

Create the Kubernetes resources:

kubectl create -f kubernetes/serving-gpu.yaml

To verify that the model deployed successfully, do the following:

Get the status of the Deployment and the Service:

kubectl get -f kubernetes/serving-gpu.yaml

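Wait until the output shows ready Pods, similar to the following. Depending on the size of the container image, the first image pull might take several minutes.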

NAME                            READY   UP-TO-DATE    AVAILABLE   AGE
deployment.apps/t5-inference    1/1     1             0           66s

NAME                    TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)                       AGE
service/t5-inference    ClusterIP   10.48.131.86    <none>        8080/TCP,8081/TCP,8082/TCP    66s

Open a local port for the t5-inference Service:

kubectl port-forward svc/t5-inference 8080

Open a new terminal window and send a test request to the Service:
```
curl -v -X POST -H 'Content-Type: application/json' -d '{"text": "this is a test sentence", "from": "en", "to": "fr"}' "http://localhost:8080/predictions/t5-small/1.0"
```
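If the test request fails and the Pod connection closes, check the logs:
```
kubectl logs deployments/t5-inference
```
If the output is similar to the following, TorchServe failed to install some model dependencies:
```
org.pytorch.serve.archive.model.ModelException: Custom pip package installation failed for t5-small
```
To resolve this issue, restart the Deployment:
```
kubectl rollout restart deployment t5-inference
```
The Deployment controller creates a new Pod. Repeat the previous steps to open a port on the new Pod.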
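
The curl test request in this section can also be built in Python. The following sketch uses only the standard library and mirrors the payload fields (text, from, to) and the URL path from the curl command. It constructs the request without sending it, so it runs even when no port-forward is active. The function name is hypothetical.

```python
import json
import urllib.request


def build_translation_request(text: str, source_lang: str, target_lang: str,
                              base_url: str = "http://localhost:8080",
                              model_name: str = "t5-small",
                              model_version: str = "1.0") -> urllib.request.Request:
    """Build the same POST request that the curl command in this section sends."""
    payload = {"text": text, "from": source_lang, "to": target_lang}
    url = f"{base_url}/predictions/{model_name}/{model_version}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_translation_request("this is a test sentence", "en", "fr")

# To send the request while the port-forward is active, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```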

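Access the deployed model using the web application

To access the deployed model with the Fast Dash web application, complete the following steps:

Build the Fast Dash web application as a container image and push it to Artifact Registry: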

gcloud builds submit client-app/ \
    --region=us-central1 \
    --config=client-app/cloudbuild.yaml

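Open kubernetes/application.yaml in a text editor and replace PROJECT_ID in the image: field with your project ID. Alternatively, run the following command, substituting your project ID for the second PROJECT_ID: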
```
sed -i "s/PROJECT_ID/PROJECT_ID/g" "kubernetes/application.yaml"
```
Create the Kubernetes resources:
```
kubectl create -f kubernetes/application.yaml
```
The Deployment and Service might take some time to fully provision.

To check the status, run the following command:

kubectl get -f kubernetes/application.yaml

Wait until the output shows ready Pods, similar to the following:

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/fastdash   1/1     1            0           1m

NAME               TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/fastdash   NodePort   203.0.113.12    <none>        8050/TCP         1m

The web application is now running, but it isn't exposed on an external IP address. To access the web application, open a local port:
```
kubectl port-forward service/fastdash 8050
```
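Open the web interface in a browser:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8050.
- If you're using Cloud Shell, click "Web preview", and then select "Change port". Specify port 8050.

To send a request to the T5 model, specify values in the "TEXT", "FROM LANG", and "TO LANG" fields in the web interface, and then click "Submit". For a list of available languages, see the T5 documentation.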

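Enable autoscaling for the model

This section shows you how to enable autoscaling for the model based on metrics from Google Cloud Managed Service for Prometheus by doing the following:

- Install the Custom Metrics Stackdriver Adapter
- Apply PodMonitoring and HorizontalPodAutoscaling configurations

Google Cloud Managed Service for Prometheus is enabled by default in Autopilot clusters running version 1.25 and later.

Install the Custom Metrics Stackdriver Adapter

This adapter lets your cluster use metrics from Prometheus to make Kubernetes autoscaling decisions.

Deploy the adapter: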

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

Create an IAM service account for the adapter to use:

gcloud iam service-accounts create monitoring-viewer

Grant the IAM service account the monitoring.viewer role on the project and the iam.workloadIdentityUser role:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/monitoring.viewer
gcloud iam service-accounts add-iam-policy-binding monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"

Replace PROJECT_ID with your Google Cloud project ID.

Annotate the adapter's Kubernetes ServiceAccount to let it impersonate the IAM service account:

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
    --namespace custom-metrics \
    iam.gke.io/gcp-service-account=monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com

Restart the adapter to propagate the changes:

kubectl rollout restart deployment custom-metrics-stackdriver-adapter \
    --namespace=custom-metrics
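
The IAM binding and annotation commands in this section share two identifiers: the email address of the IAM service account and the Workload Identity member that names the Kubernetes ServiceAccount. The following sketch shows how both strings are composed, using the formats that appear in those commands. The function names are hypothetical.

```python
def iam_service_account_email(account_name: str, project_id: str) -> str:
    """Email address of an IAM service account in a project."""
    return f"{account_name}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(project_id: str, namespace: str,
                             ksa_name: str) -> str:
    """IAM member string that identifies a Kubernetes ServiceAccount."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


# PROJECT_ID is left as the tutorial's placeholder.
print(iam_service_account_email("monitoring-viewer", "PROJECT_ID"))
print(workload_identity_member("PROJECT_ID", "custom-metrics",
                               "custom-metrics-stackdriver-adapter"))
```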

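Apply PodMonitoring and HorizontalPodAutoscaling configurations

PodMonitoring is a Google Cloud Managed Service for Prometheus custom resource that enables metrics ingestion and target scraping in a specific namespace.

Deploy the PodMonitoring resource in the same namespace as the TorchServe Deployment: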
```
kubectl apply -f kubernetes/pod-monitoring.yaml
```

Review the HorizontalPodAutoscaler manifest:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: t5-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: t5-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|ts_queue_latency_microseconds|counter
      target:
        type: AverageValue
        averageValue: "30000"

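The HorizontalPodAutoscaler scales the number of T5 model Pods based on the cumulative duration of the request queue. Autoscaling is based on the ts_queue_latency_microseconds metric, which reports the cumulative queue duration in microseconds.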
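
As a rough illustration of how the manifest values interact, the following sketch applies the basic scaling formula from the Kubernetes Horizontal Pod Autoscaler documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), and clamps the result to minReplicas and maxReplicas. It is a simplification: it ignores the autoscaler's tolerance band, stabilization windows, and Pod readiness handling. The function name and the sample metric values are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_average: float,
                     target_average: float = 30000,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Basic HPA formula, clamped to the replica bounds from the manifest."""
    if target_average <= 0:
        raise ValueError("target_average must be positive")
    raw = math.ceil(current_replicas * current_average / target_average)
    return max(min_replicas, min(max_replicas, raw))


# One Pod reporting three times the target suggests three replicas.
print(desired_replicas(1, 90000))
# A very high value is capped at maxReplicas.
print(desired_replicas(2, 200000))
```

The defaults mirror the averageValue of 30000, minReplicas of 1, and maxReplicas of 5 in the manifest above.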

Create the HorizontalPodAutoscaler:
```
kubectl apply -f kubernetes/hpa.yaml
```

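Verify autoscaling using a load generator

To test your autoscaling configuration, generate load for the serving application. This tutorial uses a Locust load generator to send requests to the model's prediction endpoint.

Create the load generator: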
```
kubectl apply -f kubernetes/loadgenerator.yaml
```
Wait for the load generator Pods to become ready.
Expose the load generator web interface locally:
```
kubectl port-forward svc/loadgenerator 8080
```
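If you see an error message, try again when the Pod is running.
Open the load generator web interface in a browser:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8080.
- If you're using Cloud Shell, click "Web preview", and then select "Change port". Enter port 8080.
Click the "Charts" tab to observe performance over time.
Open a new terminal window and watch the replica count of the Horizontal Pod Autoscaler: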
```
kubectl get hpa -w
```
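The number of replicas increases as the load increases. Scaling up might take approximately ten minutes. As new replicas start, the number of successful requests in the Locust chart increases.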
```
NAME           REFERENCE                 TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
t5-inference   Deployment/t5-inference   71352001470m/7M   1         5        1           2m11s
```
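
The TARGETS column in the output above uses Kubernetes quantity notation, in which a lowercase m suffix means thousandths and an uppercase M suffix means millions. The following sketch converts such values to plain numbers so that the current and target values can be compared. It handles only a small subset of decimal suffixes (m, k, M, G), and the function name is hypothetical.

```python
from decimal import Decimal

# Subset of decimal suffixes used by Kubernetes quantities.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "": Decimal(1),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as '71352001470m' or '7M' to a Decimal."""
    number = text.rstrip("mkMG")
    suffix = text[len(number):]
    if not number or suffix not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {text!r}")
    return Decimal(number) * _SUFFIXES[suffix]


current, target = (parse_quantity(part) for part in "71352001470m/7M".split("/"))
print(current, target, current > target)
```

In the sample output, the current value is about 71.4 million against a target of 7 million, so the autoscaler has reason to add replicas.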

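Recommendations

- Build your model with the same version of the base Docker image that you use for serving.
- If your model has special package dependencies, or if the dependencies are large, create a custom version of your base Docker image.
- Check the version tree of your model's dependency packages and confirm that the packages support each other's versions. For example, pandas 2.0.3 supports NumPy 1.20.3 and later.
- Run GPU-intensive models on GPU nodes and CPU-intensive models on CPU nodes. This can improve the stability of model serving and helps you use node resources efficiently.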

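Monitor model performance

To monitor model performance, you can use the TorchServe dashboard integration in Cloud Monitoring. With this dashboard, you can view critical performance metrics such as token throughput, request latency, and error rates.

To use the TorchServe dashboard, you must enable Google Cloud Managed Service for Prometheus, which collects the metrics from TorchServe, in your GKE cluster. TorchServe exposes metrics in Prometheus format by default, so you don't need to install an additional exporter.

You can then view the metrics by using the TorchServe dashboard. To learn how to use Google Cloud Managed Service for Prometheus to collect metrics from your model, see the TorchServe observability guidance in the Cloud Monitoring documentation.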

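Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project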

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete individual resources

Delete the Kubernetes resources:

kubectl delete -f kubernetes/loadgenerator.yaml
kubectl delete -f kubernetes/hpa.yaml
kubectl delete -f kubernetes/pod-monitoring.yaml
kubectl delete -f kubernetes/application.yaml
kubectl delete -f kubernetes/serving-gpu.yaml
kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

Delete the GKE cluster:

gcloud container clusters delete "ml-cluster" \
    --location="us-central1" --quiet

Delete the IAM service account and the IAM policy bindings:

gcloud projects remove-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/monitoring.viewer
gcloud iam service-accounts remove-iam-policy-binding monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"
gcloud iam service-accounts delete monitoring-viewer

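Delete the images in Artifact Registry. Optionally, delete the entire repository. For instructions, see "Delete images" in the Artifact Registry documentation.

Component overview

This section describes the components used in this tutorial, such as the model, the web application, the framework, and the cluster.

About the T5 model

This tutorial uses a pre-trained multilingual T5 model. T5 is a text-to-text transformer that converts text from one language to another. In T5, inputs and outputs are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The T5 model can also be used for tasks such as summarization, question answering, or text classification. The model is trained on a large quantity of text from the Colossal Clean Crawled Corpus (C4) and Wiki-DPR.

For more information, see the T5 model documentation.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu presented the T5 model in "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", published in the Journal of Machine Learning Research.

The T5 model supports various model sizes, with different levels of complexity that suit specific use cases. This tutorial uses the default size, t5-small, but you can also choose a different size. The following T5 sizes are distributed under the Apache 2.0 license: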

- t5-small: 60 million parameters
- t5-base: 220 million parameters
- t5-large: 770 million parameters. 3 GB download.
- t5-3b: 3 billion parameters. 11 GB download.
- t5-11b: 11 billion parameters. 45 GB download.
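
The download sizes listed above are roughly what you would expect if each parameter is stored as a 32-bit float. The following sketch shows that arithmetic. The 4 bytes per parameter figure is an assumption that this tutorial does not state, and actual download sizes also depend on the checkpoint format and on extra files in the repository.

```python
# Parameter counts as listed in this tutorial.
PARAMETER_COUNTS = {
    "t5-small": 60_000_000,
    "t5-base": 220_000_000,
    "t5-large": 770_000_000,
    "t5-3b": 3_000_000_000,
    "t5-11b": 11_000_000_000,
}

BYTES_PER_PARAMETER = 4  # assumption: weights stored as 32-bit floats


def estimated_weights_gb(model_name: str) -> float:
    """Estimate the size of the model weights in gigabytes (10**9 bytes)."""
    return PARAMETER_COUNTS[model_name] * BYTES_PER_PARAMETER / 1e9


for name in PARAMETER_COUNTS:
    print(f"{name}: about {estimated_weights_gb(name):.2f} GB of weights")
```

Under this assumption, t5-large comes to about 3.08 GB and t5-11b to 44 GB, which is close to the 3 GB and 45 GB download sizes in the list. An estimate like this can help you size the ephemeral-storage and memory requests before you switch to a larger model.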

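For other available T5 models, see the Hugging Face repository.

About TorchServe

TorchServe is a flexible tool for serving PyTorch models. It provides out-of-the-box support for all major deep learning frameworks, including PyTorch, TensorFlow, and ONNX. TorchServe can be used to deploy models in production, or for rapid prototyping and experimentation.

What's next

Serve an LLM with multiple GPUs.
Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.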
