Serve an LLM using TPUs on GKE with JetStream and PyTorch

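This guide shows you how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with JetStream through PyTorch. In this guide, you download model weights to Cloud Storage and deploy them on a GKE Autopilot or Standard cluster using a container that runs JetStream.

This guide is a good starting point if you need the scalability, resilience, and cost-effectiveness that Kubernetes features offer when you deploy your model on JetStream.

This guide is intended for generative AI customers who use PyTorch, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve LLMs.

Background

By serving an LLM using TPUs on GKE with JetStream, you can build a robust, production-ready serving solution with all the benefits of managed Kubernetes, including cost efficiency, scalability, and high availability. This section describes the key technologies used in this tutorial.

About TPUs

TPUs are custom application-specific integrated circuits (ASICs) that Google developed to accelerate machine learning and AI models built with frameworks such as TensorFlow, PyTorch, and JAX.

Before you use TPUs in GKE, we recommend that you complete the following learning path:

Learn about the currently available TPU versions in Cloud TPU system architecture.
Learn about TPUs in GKE.

This tutorial covers serving various LLM models. GKE deploys the model on single-host TPU v5e nodes, with TPU topologies configured according to the model requirements, so that prompts are served with low latency.

About JetStream

JetStream is an open source inference serving framework developed by Google. It runs high-performance, high-throughput, memory-optimized inference on TPUs and GPUs. JetStream provides advanced performance optimizations, including continuous batching, KV cache optimizations, and quantization techniques, to simplify LLM deployment. JetStream lets PyTorch/XLA and JAX TPU serving reach optimal performance.

Continuous batching

Continuous batching is a technique that dynamically groups incoming inference requests into batches, which reduces latency and increases throughput.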
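
The effect of continuous batching can be illustrated with a toy simulation. The sketch below is not JetStream's scheduler; it only assumes that each request needs a known number of decode steps and that a batch has a fixed number of slots. The function names and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    slots = []  # remaining decode steps for each active request
    steps = 0
    while waiting or slots:
        # Admit waiting requests into any free slots.
        while waiting and len(slots) < batch_size:
            slots.append(waiting.popleft())
        steps += 1
        # Requests with one step left finish now and release their slot.
        slots = [remaining - 1 for remaining in slots if remaining > 1]
    return steps


lengths = [2, 8, 3, 7, 4, 6]  # hypothetical output lengths, in decode steps
print(static_batching_steps(lengths, batch_size=2))      # 21
print(continuous_batching_steps(lengths, batch_size=2))  # 18
```

In this toy run the short requests no longer wait for the longest request in their batch, so the same work finishes in fewer steps.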

KV cache quantization

KV cache quantization compresses the key-value cache used in attention mechanisms, which reduces memory requirements.
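
A rough estimate shows why the size of the cache matters. The sketch below assumes a hypothetical model shape (the layer count, head count, head dimension, sequence length, and batch size are not taken from this guide) and compares a 16-bit cache with an 8-bit one. It ignores the small extra storage that quantization scales would need.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Bytes needed to hold keys and values for every layer, head, and position."""
    keys_and_values = 2
    return (keys_and_values * layers * kv_heads * head_dim
            * seq_len * batch * bytes_per_value)


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=2048, batch=30)

bf16_bytes = kv_cache_bytes(**shape, bytes_per_value=2)
int8_bytes = kv_cache_bytes(**shape, bytes_per_value=1)

gib = 1024 ** 3
print(f"16-bit cache: {bf16_bytes / gib:.2f} GiB")
print(f"int8 cache:   {int8_bytes / gib:.2f} GiB")
```

Under these assumptions, halving the bytes per cached value halves the cache, which leaves room for a larger batch or longer sequences on the same chips.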

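Int8 weight quantization

Int8 weight quantization reduces the precision of model weights from 32-bit floating point to 8-bit integers, which speeds up computation and reduces memory usage.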
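
The idea can be sketched in a few lines of plain Python. This is a minimal symmetric, per-tensor scheme chosen for illustration; it is an assumption, not necessarily the scheme JetStream uses, and the weight values are invented.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127.0 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.031, 0.0]  # hypothetical weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer bounds the error by half a scale step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, worst_error)
```

Each weight now needs one byte instead of four, at the cost of an error of at most half a scale step per weight.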

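To learn more about these optimizations, see the JetStream PyTorch and JetStream MaxText project repositories.

About PyTorch

PyTorch is an open source machine learning framework developed by Meta that is now part of the Linux Foundation. PyTorch provides high-level features such as tensor computation and deep neural networks.

Objectives

Prepare a GKE Autopilot or Standard cluster with the recommended TPU topology based on the model characteristics.
Deploy the JetStream components on GKE.
Get and publish your model.
Serve and interact with the published model.

Architecture

This section describes the GKE architecture used in this tutorial. The architecture consists of a GKE Autopilot or Standard cluster that provisions TPUs and hosts the JetStream components that deploy and serve the models.

The following diagram shows the components of this architecture:

Figure: Architecture of the GKE cluster, with single-host TPU node pools that contain the JetStream-PyTorch and JetStream HTTP components.

This architecture includes the following components:

A GKE Autopilot or Standard regional cluster.
Two single-host TPU slice node pools that host the JetStream deployment.
The Service component, which spreads inbound traffic across all JetStream HTTP replicas.
JetStream HTTP, an HTTP server that accepts requests, wraps them in the format that JetStream requires, and sends them to JetStream's gRPC client.
JetStream-PyTorch, a JetStream server that performs inference with continuous batching.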

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin

Check for the roles

In the Google Cloud console, go to the IAM page.
Select the project.
In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

In the Google Cloud console, go to the IAM page.
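Select the project.
Click Grant access.
In the New principals field, enter your user identifier. This is typically the email address of a Google Account.
In the Select a role list, select a role.
To grant additional roles, click Add another role and add each additional role.
Click Save.

Ensure that you have sufficient quota for eight TPU v5e PodSlice Lite chips. In this tutorial, you use on-demand instances.
Create a Hugging Face token if you don't already have one.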

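Get access to the model

Get access to various models on Hugging Face for deployment to GKE.

Gemma 7B-it

To get access to the Gemma model for deployment to GKE, you must first sign the license consent agreement.

Go to the Gemma model consent page on Hugging Face.
Sign in to Hugging Face if you haven't done so already.
Review and accept the model Terms and Conditions.

Llama 3 8B

To get access to the Llama 3 model for deployment to GKE, you must first sign the license consent agreement.

Go to the Llama 3 model consent page on Hugging Face.
Sign in to Hugging Face if you haven't done so already.
Review and accept the model Terms and Conditions.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, click Activate Cloud Shell to launch a Cloud Shell session. The session opens in the bottom pane of the Google Cloud console.
Set the default environment variables: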
```
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export CONTROL_PLANE_LOCATION=CONTROL_PLANE_LOCATION
export NODE_LOCATION=NODE_LOCATION
export CLUSTER_VERSION=CLUSTER_VERSION
export BUCKET_NAME=BUCKET_NAME
```
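Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- CONTROL_PLANE_LOCATION: the Compute Engine region of the cluster's control plane. The region must contain zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4). For Autopilot clusters, make sure that the region you choose has sufficient TPU v5e zonal resources.
- (Standard clusters only) NODE_LOCATION: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify this value.
- CLUSTER_VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not be available for your target TPU. For a list of the minimum GKE versions available for each TPU machine type, see TPU availability in GKE.
- BUCKET_NAME: the name of the Cloud Storage bucket used to store the JAX compilation cache.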
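
Because NODE_LOCATION is a zone and CONTROL_PLANE_LOCATION is a region, it can help to check that the two agree before you create resources. The helper below is a hypothetical sketch that relies only on the naming convention in which a zone name is its region name followed by a hyphen and a short suffix.

```python
def zone_in_region(zone, region):
    """Return True when a zone name such as us-west4-a belongs to the region."""
    parent, _, suffix = zone.rpartition("-")
    return parent == region and suffix != ""


print(zone_in_region("us-west4-a", "us-west4"))     # True
print(zone_in_region("us-west4-a", "us-central1"))  # False
```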

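Create and configure Google Cloud resources

Follow these instructions to create the required resources.

Create a GKE cluster

You can serve these models on TPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see Choose a GKE mode of operation.

Autopilot

Create an Autopilot GKE cluster: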

gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --cluster-version=CLUSTER_VERSION

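Replace CLUSTER_VERSION with the appropriate cluster version. For an Autopilot GKE cluster, use a version from the Regular release channel.

Standard

Create a regional GKE Standard cluster that uses Workload Identity Federation for GKE: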

gcloud container clusters create CLUSTER_NAME \
    --enable-ip-alias \
    --machine-type=e2-standard-4 \
    --num-nodes=2 \
    --cluster-version=CLUSTER_VERSION \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --location=CONTROL_PLANE_LOCATION

Creating the cluster might take several minutes.

Replace CLUSTER_VERSION with the appropriate cluster version.

Create a TPU v5e node pool with a 2x4 topology and two nodes:

gcloud container node-pools create tpu-nodepool \
  --cluster=CLUSTER_NAME \
  --machine-type=ct5lp-hightpu-8t \
  --project=PROJECT_ID \
  --num-nodes=2 \
  --location=CONTROL_PLANE_LOCATION \
  --node-locations=NODE_LOCATION
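
The number of chips in a slice is the product of the dimensions in its topology string, so a 2x4 topology holds eight chips. The sketch below shows that arithmetic for the node pool created above; the helper name is made up for illustration.

```python
from math import prod


def chips_in_topology(topology):
    """Multiply the dimensions of a topology string such as 2x4."""
    return prod(int(dimension) for dimension in topology.split("x"))


chips_per_node = chips_in_topology("2x4")
num_nodes = 2  # matches --num-nodes in the node pool command
print(chips_per_node)              # 8
print(chips_per_node * num_nodes)  # 16
```

Each single-host node therefore exposes eight chips, and the two-node pool holds 16 chips in total.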

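Generate your Hugging Face CLI token in Cloud Shell

If you don't already have one, generate a new Hugging Face token:

Click Your Profile > Settings > Access Tokens.
Click New Token.
Specify a name of your choice and a role of at least Read.
Click Generate a token.
Edit the permissions of your access token so that it has read access to your model's Hugging Face repository.
Copy the generated token to your clipboard.

Create a Kubernetes Secret for Hugging Face credentials

In Cloud Shell, do the following:

Configure kubectl to communicate with your cluster: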

gcloud container clusters get-credentials CLUSTER_NAME --location=CONTROL_PLANE_LOCATION

Create a Secret to store the Hugging Face credentials:

kubectl create secret generic huggingface-secret \
    --from-literal=HUGGINGFACE_TOKEN=HUGGINGFACE_TOKEN

Replace HUGGINGFACE_TOKEN with your Hugging Face token.
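
Kubernetes keeps the values in a Secret's data field base64-encoded, which is an encoding and not encryption. The sketch below shows the round trip for a placeholder value so that you can recognize what `kubectl get secret -o yaml` would display; the token string and the helper names are hypothetical.

```python
import base64


def encode_secret_value(value):
    """Encode a string the way it appears in a Secret's data field."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded):
    """Recover the original string from its base64 form."""
    return base64.b64decode(encoded).decode("utf-8")


placeholder = "hf_example_token"  # hypothetical value, not a real token
encoded = encode_secret_value(placeholder)
print(encoded)
print(decode_secret_value(encoded))
```

Anyone who can read the Secret can decode it, so access to the Secret should be restricted accordingly.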

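Configure workload access using Workload Identity Federation for GKE

Assign a Kubernetes ServiceAccount to the application and configure that Kubernetes ServiceAccount to act as an IAM service account.

Create an IAM service account for your application: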

gcloud iam service-accounts create wi-jetstream

Add IAM policy bindings for your IAM service account so that it can manage Cloud Storage:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:wi-jetstream@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/storage.objectUser

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:wi-jetstream@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/storage.insightsCollectorService

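Add an IAM policy binding between the two service accounts to allow the Kubernetes ServiceAccount to impersonate the IAM service account. This binding lets the Kubernetes ServiceAccount act as the IAM service account: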

gcloud iam service-accounts add-iam-policy-binding wi-jetstream@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[default/default]"

Annotate the Kubernetes ServiceAccount with the email address of the IAM service account:

kubectl annotate serviceaccount default \
    iam.gke.io/gcp-service-account=wi-jetstream@PROJECT_ID.iam.gserviceaccount.com
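
The identifiers in the preceding commands follow fixed patterns. The sketch below assembles them from their parts, which can help when you use a namespace or Kubernetes ServiceAccount other than default; the project ID is a hypothetical placeholder and the helper names are made up for illustration.

```python
def workload_identity_member(project_id, namespace, ksa_name):
    """Build the --member value that refers to a Kubernetes ServiceAccount."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def service_account_email(name, project_id):
    """Build the email address of an IAM service account."""
    return f"{name}@{project_id}.iam.gserviceaccount.com"


project_id = "my-project"  # hypothetical project ID
print(workload_identity_member(project_id, "default", "default"))
print(service_account_email("wi-jetstream", project_id))
```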

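Deploy JetStream

Deploy the JetStream container to serve your model. This tutorial uses Kubernetes Deployment manifests. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Save the following manifest as jetstream-pytorch-deployment.yaml: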

Gemma 7B-it

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jetstream-pytorch-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jetstream-pytorch-server
  template:
    metadata:
      labels:
        app: jetstream-pytorch-server
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: jetstream-pytorch-server
        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-pytorch-server:v0.2.4
        args:
        - --model_id=google/gemma-7b-it
        - --override_batch_size=30
        - --enable_model_warmup=True
        volumeMounts:
        - name: huggingface-credentials
          mountPath: /huggingface
          readOnly: true
        ports:
        - containerPort: 9000
        resources:
          requests:
            google.com/tpu: 8
          limits:
            google.com/tpu: 8
        startupProbe:
          httpGet:
            path: /healthcheck
            port: 8000
            scheme: HTTP
          periodSeconds: 60
          initialDelaySeconds: 90
          failureThreshold: 50
        livenessProbe:
          httpGet:
            path: /healthcheck
            port: 8000
            scheme: HTTP
          periodSeconds: 60
          failureThreshold: 30
        readinessProbe:
          httpGet:
            path: /healthcheck
            port: 8000
            scheme: HTTP
          periodSeconds: 60
          failureThreshold: 30
      - name: jetstream-http
        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.3
        ports:
        - containerPort: 8000
      volumes:
      - name: huggingface-credentials
        secret:
          defaultMode: 0400
          secretName: huggingface-secret
---
apiVersion: v1
kind: Service
metadata:
  name: jetstream-svc
spec:
  selector:
    app: jetstream-pytorch-server
  ports:
  - protocol: TCP
    name: jetstream-http
    port: 8000
    targetPort: 8000

Llama 3 8B

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jetstream-pytorch-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jetstream-pytorch-server
  template:
    metadata:
      labels:
        app: jetstream-pytorch-server
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: jetstream-pytorch-server
        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-pytorch-server:v0.2.4
        args:
        - --model_id=meta-llama/Meta-Llama-3-8B
        - --override_batch_size=30
        - --enable_model_warmup=True
        volumeMounts:
        - name: huggingface-credentials
          mountPath: /huggingface
          readOnly: true
        ports:
        - containerPort: 9000
        resources:
          requests:
            google.com/tpu: 8
          limits:
            google.com/tpu: 8
        startupProbe:
          httpGet:
            path: /healthcheck
            port: 8000
            scheme: HTTP
          periodSeconds: 60
          initialDelaySeconds: 90
          failureThreshold: 50
        livenessProbe:
          httpGet:
            path: /healthcheck
            port: 8000
            scheme: HTTP
          periodSeconds: 60
          failureThreshold: 30
        readinessProbe:
          httpGet:
            path: /healthcheck
            port: 8000
            scheme: HTTP
          periodSeconds: 60
          failureThreshold: 30
      - name: jetstream-http
        image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.3
        ports:
        - containerPort: 8000
      volumes:
      - name: huggingface-credentials
        secret:
          defaultMode: 0400
          secretName: huggingface-secret
---
apiVersion: v1
kind: Service
metadata:
  name: jetstream-svc
spec:
  selector:
    app: jetstream-pytorch-server
  ports:
  - protocol: TCP
    name: jetstream-http
    port: 8000
    targetPort: 8000

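The manifest sets the following key properties:

model_id: the model name from Hugging Face (google/gemma-7b-it, meta-llama/Meta-Llama-3-8B) (see the supported models).
override_batch_size: the decode batch size per device, where one TPU chip equals one device. This value defaults to 30.
enable_model_warmup: when enabled, the model is warmed up after the model server starts. This value defaults to False.

Optionally, you can set the following properties:

max_input_length: the maximum input sequence length. This value defaults to 1024.
max_output_length: the maximum output decode length. This value defaults to 1024.
quantize_weights: whether the checkpoint is quantized. This value defaults to 0; set it to 1 to enable int8 quantization.
internal_jax_compilation_cache: the directory for the JAX compilation cache. This value defaults to ~/jax_cache; for remote caching, set it to gs://BUCKET_NAME/jax_cache.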
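
The manifest passes model_id, override_batch_size, and enable_model_warmup to the container as --name=value arguments. The sketch below generates an args list in that form from a dictionary of settings. Passing the optional properties in the same form is an assumption based on how the manifest passes the key properties, and the helper name is made up for illustration.

```python
def build_server_args(settings):
    """Render settings as --name=value strings for the container args list."""
    return [f"--{name}={value}" for name, value in settings.items()]


settings = {
    "model_id": "google/gemma-7b-it",
    "override_batch_size": 30,
    "enable_model_warmup": True,
    # Optional properties, assumed to use the same --name=value form.
    "max_input_length": 1024,
    "max_output_length": 1024,
    "quantize_weights": 1,
}

for arg in build_server_args(settings):
    print(f"- {arg}")
```

The printed lines can be pasted under args in the manifest, with the values adjusted for your model.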

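In the manifest, a startup probe is configured so that the model server is marked Ready only after the model has loaded and warmup has completed. Liveness and readiness probes are configured to confirm that the model server stays healthy.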
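
The probe settings in the manifest translate into time budgets. The sketch below estimates them as the initial delay plus the period multiplied by the failure threshold. It is an approximation that ignores per-request timeouts, and the helper name is made up for illustration.

```python
def probe_budget_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate time before a continuously failing probe takes effect."""
    return initial_delay_seconds + period_seconds * failure_threshold


# Values copied from the startupProbe and livenessProbe in the manifest.
startup = probe_budget_seconds(60, 50, initial_delay_seconds=90)
liveness = probe_budget_seconds(60, 30)

print(startup, startup / 60)    # 3090 51.5
print(liveness, liveness / 60)  # 1800 30.0
```

With these values the model server has roughly 51 minutes to load weights and finish warmup before the startup probe gives up.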

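Apply the manifest:

kubectl apply -f jetstream-pytorch-deployment.yaml

Verify the Deployment:

kubectl get deployment

The output is similar to the following:

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
jetstream-pytorch-server          0/2     2            0           ##s

For Autopilot clusters, it might take a few minutes to provision the required TPU resources.

View the JetStream-PyTorch server logs to confirm that the model weights have been loaded and that model warmup has completed. It might take the server a few minutes to complete this operation.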
```
kubectl logs deploy/jetstream-pytorch-server -f -c jetstream-pytorch-server
```
The output is similar to the following:
```
Started jetstream_server....
2024-04-12 04:33:37,128 - root - INFO - ---------Generate params 0 loaded.---------
```

Verify that the Deployment is ready:

kubectl get deployment

The output is similar to the following:

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
jetstream-pytorch-server          2/2     2            2           ##s

It might take several minutes for the healthcheck endpoint to register.
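
If you script this check, the READY column is the value to inspect. The sketch below parses a value such as 0/2 or 2/2 and reports whether every desired replica is ready. The helper names are hypothetical, and the code assumes that you have already extracted the column from the kubectl output.

```python
def parse_ready(ready_column):
    """Split a READY value such as 2/2 into (ready, desired) integers."""
    ready, desired = ready_column.split("/")
    return int(ready), int(desired)


def deployment_is_ready(ready_column):
    """Return True when all desired replicas report ready."""
    ready, desired = parse_ready(ready_column)
    return desired > 0 and ready == desired


print(deployment_is_ready("0/2"))  # False
print(deployment_is_ready("2/2"))  # True
```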

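Serve the model

In this section, you interact with the model.

Set up port forwarding

You can access the JetStream Deployment through the ClusterIP Service that you created in the preceding step. ClusterIP Services are reachable only from within the cluster. Therefore, to access the Service from outside the cluster, complete the following steps:

To establish a port forwarding session, run the following command: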

kubectl port-forward svc/jetstream-svc 8000:8000

Interact with the model using curl

Open a new terminal and run the following command to verify that you can access the JetStream HTTP server:

curl --request POST \
--header "Content-type: application/json" \
-s \
localhost:8000/generate \
--data \
'{
    "prompt": "What are the top 5 programming languages",
    "max_tokens": 200
}'

The initial request can take several seconds to complete because of model warmup. The output is similar to the following:

{
    "response": " for data science in 2023?\n\n**1. Python:**\n- Widely used for data science due to its readability, extensive libraries (pandas, scikit-learn), and integration with other tools.\n- High demand for Python programmers in data science roles.\n\n**2. R:**\n- Popular choice for data analysis and visualization, particularly in academia and research.\n- Extensive libraries for statistical modeling and data wrangling.\n\n**3. Java:**\n- Enterprise-grade platform for data science, with strong performance and scalability.\n- Widely used in data mining and big data analytics.\n\n**4. SQL:**\n- Essential for data querying and manipulation, especially in relational databases.\n- Used for data analysis and visualization in various industries.\n\n**5. Scala:**\n- Scalable and efficient for big data processing and machine learning models.\n- Popular in data science for its parallelism and integration with Spark and Spark MLlib."
}
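
The same request can be sent from Python with only the standard library. The sketch below mirrors the curl command: it posts prompt and max_tokens as JSON to /generate and reads the response field shown in the sample output. The function names and the timeout value are assumptions made for illustration.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens, base_url="http://localhost:8000"):
    """Create a POST request with the same JSON body as the curl example."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_tokens=200, timeout=120):
    """Send the prompt and return the text in the response field."""
    request = build_generate_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

While the port forwarding session is active, calling generate("What are the top 5 programming languages") should return the generated text as a string.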

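You've successfully done the following:

Deployed the JetStream-PyTorch model server on GKE using TPUs.
Served and interacted with the model.

Observe model performance

To observe model performance, you can use the JetStream dashboard integration in Cloud Monitoring. With this dashboard, you can view critical performance metrics such as token throughput, request latency, and error rates.

To use the JetStream dashboard, you must enable Google Cloud Managed Service for Prometheus, which collects the metrics from JetStream, in your GKE cluster.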

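Troubleshoot issues

If you get the message Empty reply from server, the container might not have finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
If you see Connection refused, verify that your port forwarding is active.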
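
To tell these two cases apart quickly, you can test whether anything is listening on the forwarded port. The sketch below is a hypothetical helper that attempts a TCP connection and reports the result; it does not check that the model itself is ready.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If port_is_open("localhost", 8000) returns False, restart the port forwarding session. If it returns True but requests still fail, check the Pod's logs.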

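Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands and follow the prompts: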

gcloud container clusters delete CLUSTER_NAME --location=CONTROL_PLANE_LOCATION

gcloud iam service-accounts delete wi-jetstream@PROJECT_ID.iam.gserviceaccount.com

What's next

Learn how to run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Learn more about TPUs in GKE.
Explore the JetStream GitHub repository.
Explore Vertex AI Model Garden.
