Serve open source LLMs on GKE with TPUs by using a pre-configured architecture

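This page shows you how to quickly deploy and serve popular open large language models (LLMs) on GKE with TPUs for inference by using a pre-configured GKE inference reference architecture. This approach uses Infrastructure as Code (IaC), with Terraform wrapped in CLI scripts, to create a standardized, secure, and scalable GKE environment designed for AI inference workloads.

In this guide, you deploy and serve LLMs on single-host TPU nodes on GKE with the vLLM serving framework. The guide provides instructions and configurations for deploying the following open models: Gemma 3 1B-it, Gemma 3 4B-it, and Gemma 3 27B-it.

This guide is intended for machine learning (ML) engineers and for data and AI specialists who are interested in exploring Kubernetes container orchestration capabilities for serving open models for inference. To learn more about the common roles and example tasks referenced in Google Cloud content, see "Common GKE user roles and tasks".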

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the required APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

Make sure that you have the following role or roles on the project: roles/artifactregistry.admin, roles/browser, roles/compute.networkAdmin, roles/container.clusterAdmin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin, and roles/serviceusage.serviceUsageAdmin

Check for the roles

In the Google Cloud console, go to the IAM page.
Go to IAM
Select the project.
In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

In the Google Cloud console, go to the IAM page.
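Go to IAM
Select the project.
Click Grant access.
In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
Click Select a role, then search for the role.
To grant additional roles, click Add another role and add each additional role.
Click Save.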
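
The roles listed above can also be granted from the command line. The following is a minimal sketch that only prints one `gcloud projects add-iam-policy-binding` command per role, so that you can review the commands before running any of them. `PROJECT_ID` and `USER_EMAIL` are placeholders, and `print_grant_commands` is a hypothetical helper name introduced for illustration.

```shell
# Dry run: print one gcloud command per required role instead of executing it.
# PROJECT_ID and USER_EMAIL are placeholders (assumptions), not real values.

REQUIRED_ROLES="
roles/artifactregistry.admin
roles/browser
roles/compute.networkAdmin
roles/container.clusterAdmin
roles/iam.serviceAccountAdmin
roles/resourcemanager.projectIamAdmin
roles/serviceusage.serviceUsageAdmin
"

# print_grant_commands PROJECT MEMBER
print_grant_commands() {
  local project="$1" member="$2" role
  # REQUIRED_ROLES is left unquoted on purpose so that it splits on whitespace.
  for role in ${REQUIRED_ROLES}; do
    printf 'gcloud projects add-iam-policy-binding %s --member=%s --role=%s\n' \
      "${project}" "${member}" "${role}"
  done
}

print_grant_commands "PROJECT_ID" "user:USER_EMAIL"
```

Printing the commands first keeps the sketch safe to run anywhere; pipe the output to a shell only after checking the project and member values.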

Create a Hugging Face account.
Ensure that your project has sufficient TPU quota for GKE Standard or for GKE Autopilot. For more information, see "Plan TPUs in GKE".

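Get access to the model

Accept the license terms for any gated models that you want to use (such as Gemma) on their respective Hugging Face model pages.

To access the model through Hugging Face, you need a Hugging Face token.

If you don't already have a token, follow these steps to generate one:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a Name of your choice and a Role of at least Read.
Select Generate a token.
Copy the generated token to your clipboard.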

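Provision the GKE inference environment

In this section, you deploy the infrastructure required to serve your model.

Launch Cloud Shell

This guide uses Cloud Shell to run commands. Cloud Shell comes preinstalled with the necessary tools, including gcloud, kubectl, and git.

In the Google Cloud console, start a Cloud Shell instance:

Open Cloud Shell

A session opens in the bottom pane of the Google Cloud console.

Deploy the infrastructure

To provision the GKE cluster and the resources needed to access models from Hugging Face, follow these steps:

In Cloud Shell, clone the following repository: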

git clone https://github.com/GoogleCloudPlatform/accelerated-platforms --branch hf-model-vllm-tpu-tutorial && \
cd accelerated-platforms && \
export ACP_REPO_DIR="$(pwd)"

Set the environment variables:
```
export TF_VAR_platform_default_project_id=PROJECT_ID
export HF_TOKEN_READ=HF_TOKEN
```
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- HF_TOKEN: the Hugging Face token that you generated earlier.
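
Before continuing, it can help to confirm that both placeholders were actually replaced. The following is a minimal sketch and is not part of the repository's scripts. The helper names are hypothetical, and the `hf_` prefix check is an assumption about the usual shape of Hugging Face user access tokens.

```shell
# Sanity checks for the two values set above. All helper names are illustrative.

# Project IDs are 6-30 characters: lowercase letters, digits, and hyphens,
# starting with a letter and not ending with a hyphen.
is_valid_project_id() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# Assumption: Hugging Face user access tokens start with "hf_".
looks_like_hf_token() {
  case "$1" in
    hf_?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report every problem found, then return non-zero if there was at least one.
check_tutorial_inputs() {
  local failed=0
  if ! is_valid_project_id "${TF_VAR_platform_default_project_id:-}"; then
    echo "TF_VAR_platform_default_project_id does not look like a project ID" >&2
    failed=1
  fi
  if ! looks_like_hf_token "${HF_TOKEN_READ:-}"; then
    echo "HF_TOKEN_READ does not look like a Hugging Face token" >&2
    failed=1
  fi
  return "${failed}"
}
```

Calling `check_tutorial_inputs` right after the two `export` lines catches a literal `PROJECT_ID` or `HF_TOKEN` that was left in place.
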
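This guide requires Terraform version 1.8.0 or later. Cloud Shell has Terraform v1.5.7 installed by default.

To update the Terraform version in Cloud Shell, you can run the following script. The script installs the tfswitch tool and installs Terraform v1.8.0 in your home directory. Follow the instructions that the script prints to set the necessary environment variable, or pass the --modify-rc-file flag to the script.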
```
"${ACP_REPO_DIR}/tools/bin/install_terraform.sh" && \
export PATH=${HOME}/bin:${HOME}/.local/bin:${PATH}
```
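
To confirm that the shell now picks up a new enough Terraform, a version comparison along the following lines can be used. This is a sketch rather than part of the repository: `version_ge` is a hypothetical helper, and it assumes GNU `sort` with the `-V` (version sort) option, which is the usual case on Linux.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED_TERRAFORM_VERSION="1.8.0"

if command -v terraform >/dev/null 2>&1; then
  # The first line of `terraform version` has the form "Terraform v1.5.7".
  installed="$(terraform version | head -n 1 | sed 's/^Terraform v//')"
  if version_ge "${installed}" "${REQUIRED_TERRAFORM_VERSION}"; then
    echo "Terraform ${installed} is new enough."
  else
    echo "Terraform ${installed} is older than ${REQUIRED_TERRAFORM_VERSION}." >&2
  fi
else
  echo "terraform was not found on PATH." >&2
fi
```

Version sort is used instead of a plain string comparison because, as strings, "1.10.0" sorts before "1.8.0".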
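Run the following deployment script. The script enables the required Google Cloud APIs and provisions the infrastructure for this guide, including a new VPC network, a GKE cluster with private nodes, and other supporting resources. The script can take several minutes to complete.

You can serve models with TPUs in a GKE Autopilot or Standard cluster. An Autopilot cluster provides a fully managed Kubernetes experience. For more information about choosing the GKE mode of operation that best fits your workloads, see "About GKE modes of operation".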
Autopilot
```
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-tpu-model/deploy-ap.sh"
```
Standard
```
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-tpu-model/deploy-standard.sh"
```
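
The deploy scripts above and the teardown scripts used during cleanup follow the same naming pattern, so the choice of mode can be kept in one place. A minimal sketch, assuming `ACP_REPO_DIR` is set as shown earlier; the `tutorial_script` function and its argument values are hypothetical names, not part of the repository.

```shell
# tutorial_script ACTION MODE: print the path of the matching tutorial script.
# ACTION is "deploy" or "teardown"; MODE is "autopilot" or "standard".
# The function name and both argument vocabularies are hypothetical.
tutorial_script() {
  local action="$1" mode="$2" suffix
  case "${mode}" in
    autopilot) suffix="ap" ;;
    standard)  suffix="standard" ;;
    *) echo "unknown mode: ${mode}" >&2; return 1 ;;
  esac
  case "${action}" in
    deploy|teardown) ;;
    *) echo "unknown action: ${action}" >&2; return 1 ;;
  esac
  echo "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-tpu-model/${action}-${suffix}.sh"
}

# Example: run the deployment for a Standard cluster.
# "$(tutorial_script deploy standard)"
```

Using the same mode value for deployment and teardown avoids running the teardown script for the wrong cluster type.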
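After the script completes, you have a GKE cluster that is ready for inference workloads.

Run the following command to set environment variables from the shared configuration: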

source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"

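The deployment script creates a secret in Secret Manager to store your Hugging Face token. You must manually add your token to this secret before you download and deploy the model. In Cloud Shell, run the following command to add the token to Secret Manager.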
```
printf '%s' "${HF_TOKEN_READ}" | gcloud secrets versions add "${huggingface_hub_access_token_read_secret_manager_secret_name}" \
--data-file=- \
--project=${huggingface_secret_manager_project_id}
```
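
One detail worth checking when a token is piped into `gcloud secrets versions add` with `--data-file=-`: `echo` appends a newline, which becomes part of the stored secret version, whereas `printf '%s'` sends only the token bytes. Whether a trailing newline matters depends on how the secret is consumed, which this guide doesn't specify. The sketch below only counts bytes; `SAMPLE_TOKEN` is a made-up value.

```shell
# SAMPLE_TOKEN is a made-up value used only to count bytes.
SAMPLE_TOKEN="hf_exampletoken"

# Count the bytes each command would send to the next stage of a pipeline.
bytes_with_echo="$(echo "${SAMPLE_TOKEN}" | wc -c | tr -d ' ')"
bytes_with_printf="$(printf '%s' "${SAMPLE_TOKEN}" | wc -c | tr -d ' ')"

echo "echo sends ${bytes_with_echo} bytes, printf sends ${bytes_with_printf} bytes"
```

The one-byte difference is the newline that `echo` adds.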

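Deploy an open model

You are now ready to download and deploy the model.

Select a model

Set the environment variables for the model that you want to deploy: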

Gemma 3 1B-it

export ACCELERATOR_TYPE="v5e"
export HF_MODEL_ID="google/gemma-3-1b-it"

Gemma 3 4B-it

export ACCELERATOR_TYPE="v5e"
export HF_MODEL_ID="google/gemma-3-4b-it"

Gemma 3 27B-it

export ACCELERATOR_TYPE="v5e"
export HF_MODEL_ID="google/gemma-3-27b-it"
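
The three variants above differ only in the size segment of the model ID, so the selection can be wrapped in a small function. This is a sketch for convenience: `select_model` and its size shorthand ("1b", "4b", "27b") are hypothetical and not part of the guide or the repository.

```shell
# select_model SIZE: export the variables for one of the Gemma 3 variants above.
# SIZE ("1b", "4b", or "27b") is a hypothetical shorthand, not part of the guide.
select_model() {
  case "$1" in
    1b|4b|27b)
      export ACCELERATOR_TYPE="v5e"
      export HF_MODEL_ID="google/gemma-3-$1-it"
      ;;
    *)
      echo "usage: select_model 1b|4b|27b" >&2
      return 1
      ;;
  esac
}

select_model 4b && echo "HF_MODEL_ID=${HF_MODEL_ID}"
```

Rejecting unknown sizes up front avoids exporting a model ID that has no matching manifest.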

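For additional configurations, including other model variants and TPU types, see the manifests in the accelerated-platforms GitHub repository.

Download the model

Source the environment variables from your deployment. These environment variables contain the configuration details of the infrastructure that you provisioned.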

source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"

Run the following script to configure the Hugging Face model download resources, which download the model to Cloud Storage:

"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"

Apply the Hugging Face model download resources:

kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"

Monitor the Hugging Face model download job until it completes.

until kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} wait job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --for=condition=complete --timeout=10s >/dev/null; do
    clear
    kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' grep -E --color=always -e '^' -e 'Complete'
    echo -e "\nhf-model-to-gcs logs(last 10 lines):"
    kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} logs job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --container=hf-model-to-gcs --tail 10
done
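
The loop above keeps polling until the job reports completion, so it never returns if the job fails. A bounded variant of the same polling pattern can be sketched as follows. `retry_until` is a hypothetical helper, and the attempt count in the usage comment is an arbitrary choice.

```shell
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most MAX_ATTEMPTS times, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success, 1 when attempts run out.
retry_until() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${delay}"
  done
  return 0
}

# Example: give the download job up to 60 checks. The delay is 0 because
# `kubectl wait --timeout=10s` already blocks for up to 10 seconds per check.
# retry_until 60 0 kubectl --namespace="${huggingface_hub_downloader_kubernetes_namespace_name}" \
#   wait "job/${HF_MODEL_ID_HASH}-hf-model-to-gcs" --for=condition=complete --timeout=10s
```

When the helper returns non-zero, inspect the job's logs instead of continuing to wait.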

Verify that the Hugging Face model download job has completed.

kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' grep -E --color=always -e '^' -e 'Complete'

Delete the Hugging Face model download resources.

kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"

Deploy the model

Source the environment variables from your deployment.

source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"

Verify that the Hugging Face model name is set.
```
echo "HF_MODEL_NAME=${HF_MODEL_NAME}"
```
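
The `echo` above prints an empty value when the variable is missing, which is easy to overlook. A stricter check can be sketched as follows; `require_vars` is a hypothetical helper, and the variable names in the usage comment are the ones this guide already uses.

```shell
# require_vars NAME...: fail if any named variable is unset or empty.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example: stop before configuring vLLM if anything is missing.
# require_vars ACCELERATOR_TYPE HF_MODEL_ID HF_MODEL_NAME || exit 1
```

Because the helper uses `eval`, pass only literal variable names to it, never untrusted input.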

Configure the vLLM resources.

"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-tpu/vllm/configure_vllm.sh"

Deploy the inference workload to your GKE cluster.

kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-tpu/vllm/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"
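
The kustomize directory in the command above is composed from `ACCELERATOR_TYPE` and `HF_MODEL_NAME`, so a combination without a manifest fails only when `kubectl` runs. A minimal pre-check, assuming the repository layout shown in the command; `vllm_overlay_dir` is a hypothetical helper name.

```shell
# vllm_overlay_dir: print the kustomize directory for the current selection,
# or fail if that directory does not exist in the cloned repository.
vllm_overlay_dir() {
  local base dir
  base="${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-tpu/vllm"
  dir="${base}/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"
  if [ ! -d "${dir}" ]; then
    echo "no manifest directory: ${dir}" >&2
    return 1
  fi
  echo "${dir}"
}

# Example: apply only when the directory exists.
# dir="$(vllm_overlay_dir)" && kubectl apply --kustomize "${dir}"
```

The same helper can supply the path for the `kubectl delete` command during cleanup.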

Test your deployment

Monitor the inference workload deployment until it is available.

until kubectl --namespace=${ira_online_tpu_kubernetes_namespace_name} wait deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --for=condition=available --timeout=10s >/dev/null; do
    clear
    kubectl --namespace=${ira_online_tpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} | GREP_COLORS='mt=01;92' grep -E --color=always -e '^' -e '1/1     1            1'
    echo -e "\nfetch-safetensors logs(last 10 lines):"
    kubectl --namespace=${ira_online_tpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=fetch-safetensors --tail 10
    echo -e "\ninference-server logs(last 10 lines):"
    kubectl --namespace=${ira_online_tpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=inference-server --tail 10
done

Verify that the inference workload deployment is available.

kubectl --namespace=${ira_online_tpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} | GREP_COLORS='mt=01;92' grep -E --color=always -e '^' -e '1/1     1            1'
echo -e "\nfetch-safetensors logs(last 10 lines):"
kubectl --namespace=${ira_online_tpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=fetch-safetensors --tail 10
echo -e "\ninference-server logs(last 10 lines):"
kubectl --namespace=${ira_online_tpu_kubernetes_namespace_name} logs deployment/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} --container=inference-server --tail 10

Run the following script to set up port forwarding and send a sample request to the model.

kubectl --namespace=${ira_online_tpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${HF_MODEL_NAME} 8000:8000 >/dev/null &
PF_PID=$!
while ! echo -e '\x1dclose\x0d' | telnet localhost 8000 >/dev/null 2>&1; do
    sleep 0.1
done
curl http://127.0.0.1:8000/v1/chat/completions \
--data '{
"model": "/gcs/'${HF_MODEL_ID}'",
"messages": [ { "role": "user", "content": "What is GKE?" } ]
}' \
--header "Content-Type: application/json" \
--request POST \
--show-error \
--silent | jq
kill -9 ${PF_PID}

The model should return a JSON response that answers the question.
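
The request body in the script above is assembled by splicing a shell variable into a quoted JSON string, which works for simple prompts but breaks when the prompt contains quotes or newlines. Because the script already depends on `jq`, the body can also be built with it. The following is a sketch only; `build_chat_payload` is a hypothetical helper name.

```shell
# build_chat_payload MODEL_ID PROMPT: print a chat completions request body.
# jq handles JSON escaping, so quotes or newlines in PROMPT stay valid JSON.
build_chat_payload() {
  jq --null-input --compact-output \
    --arg model "/gcs/$1" \
    --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}'
}

# Example (requires the port forward from the script above to still be running):
# build_chat_payload "${HF_MODEL_ID}" 'What is GKE?' |
#   curl http://127.0.0.1:8000/v1/chat/completions \
#     --header "Content-Type: application/json" \
#     --data @- --silent --show-error | jq
```

The `/gcs/` prefix mirrors the model path used in the original request.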

Clean up

To avoid incurring charges, delete all the resources that you created.

Delete the inference workload:

kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-tpu/vllm/${ACCELERATOR_TYPE}-${HF_MODEL_NAME}"

Clean up the resources:

Autopilot

"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-tpu-model/teardown-ap.sh"

Standard

"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-tpu-model/teardown-standard.sh"

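What's next

Learn more about AI/ML model inference on GKE.
Analyze model inference performance and costs with the GKE Inference Quickstart tool.
Explore the accelerated-platforms GitHub repository used to build this architecture.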
