Serve open source models using TPUs on GKE with Optimum TPU

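This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.

This guide provides a starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.

This tutorial is intended for generative AI customers in the Hugging Face ecosystem, new or existing users of GKE, ML engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities to serve LLMs.

Google Cloud products such as GKE, Vertex AI, and Compute Engine support a range of serving libraries, such as JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from that project. If you prefer Hugging Face options, you can use Optimum TPU.

Optimum TPU supports the following features:

Continuous batching
Token streaming
Greedy search and multinomial sampling using transformers

Objectives

Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
Deploy Optimum TPU on GKE.
Use Optimum TPU to serve the supported models through curl.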

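Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.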

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

If you're using an existing project for this guide, verify that you have the permissions required to complete this guide. If you created a new project, then you already have the required permissions.

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

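Create a Hugging Face account, if you don't already have one.
Ensure that your project has sufficient quota for Cloud TPU in GKE.

Required roles

To get the permissions that you need to configure the cluster and workloads, ask your administrator to grant you the following IAM roles on your project:

Service Account Admin (roles/iam.serviceAccountAdmin)
Manage GKE clusters: Kubernetes Engine Admin (roles/container.admin)
Build and push images to Artifact Registry: Artifact Registry Administrator (roles/artifactregistry.admin)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

A Cloud Shell session starts at the bottom of the Google Cloud console and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Set the default environment variables: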
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION_NAME
export ZONE=ZONE
export HF_TOKEN=HF_TOKEN
```
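Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION_NAME: the region where your GKE cluster, Cloud Storage buckets, and TPU nodes are located. The region must contain zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4).
- (Standard clusters only) ZONE: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify the zone, only the region.
- HF_TOKEN: your Hugging Face token.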
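The later commands depend on these values, so it can help to confirm that none of them was left empty before you continue. The following is a minimal sketch and not part of the original tutorial: `require_vars` is a hypothetical helper name, and it only checks that each variable is non-empty, not that its value is valid.

```shell
# Hypothetical helper (not part of gcloud or kubectl): succeeds only if every
# named variable is set to a non-empty value.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the variables that this tutorial exports.
require_vars PROJECT_ID CLUSTER_NAME REGION ZONE HF_TOKEN \
  || echo "Set the missing variables before you continue."
```

The helper reports each missing name on standard error, so the check can be rerun after every correction until it passes silently.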

Clone the Optimum TPU repository:

git clone https://github.com/huggingface/optimum-tpu.git

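Get access to the model

You can use the Gemma 2B or Llama3 8B models. This tutorial focuses on these two models, but Optimum TPU supports more models.

Gemma 2B

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement and then generate a Hugging Face access token.

You must sign the consent agreement to use Gemma. Follow these instructions:

Access the model consent page.
Verify consent using your Hugging Face account.
Accept the model terms.

Generate an access token

If you don't already have a Hugging Face token, generate a new one:

Click Your Profile > Settings > Access Tokens.
Click New Token.
Specify a name of your choice and a role of at least Read.
Click Generate a token.
Copy the generated token to your clipboard.

Llama3 8B

You must sign the consent agreement to use Llama3 8B in the Hugging Face repository.

Generate an access token

If you don't already have a Hugging Face token, generate a new one:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a name of your choice and a role of at least Read.
Select Generate a token.
Copy the generated token to your clipboard.

Create a GKE cluster

Create a GKE Standard cluster with 1 CPU node: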

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --num-nodes=1 \
    --location=REGION_NAME

Create a TPU node pool

Create a v5e TPU node pool with 1 node and 8 chips:

gcloud container node-pools create tpunodepool \
    --location=REGION_NAME \
    --num-nodes=1 \
    --machine-type=ct5lp-hightpu-8t \
    --node-locations=ZONE \
    --cluster=CLUSTER_NAME

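If TPU resources are available, GKE provisions the node pool. If TPU resources are temporarily unavailable, the output shows a GCE_STOCKOUT error message. To troubleshoot resource availability errors, see "Insufficient TPU resources to satisfy the TPU request".

Build the container

Run the make command to build the image: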

cd optimum-tpu && make tpu-tgi

Push the image to Artifact Registry:

gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -

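Deploy Optimum TPU

To deploy Optimum TPU, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Gemma 2B

Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml: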

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        securityContext:
            privileged: true
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120

---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

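This manifest describes an Optimum TPU Deployment and a Service that exposes it inside the cluster on TCP port 8080. The Service forwards traffic to port 80 of the container.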
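The image field in the manifest still contains the REGION_NAME and PROJECT_ID placeholders, which need real values before the manifest is applied. One way to fill them in is sketched here with sed. This is an illustration rather than a step from the original tutorial: `render_manifest` is a hypothetical helper, and the region and project ID in the demonstration are made-up values.

```shell
# Hypothetical helper: replace the REGION_NAME and PROJECT_ID placeholders
# in text read from standard input.
# Arguments: region, project ID.
render_manifest() {
  sed -e "s/REGION_NAME/$1/g" -e "s/PROJECT_ID/$2/g"
}

# Demonstration on a single manifest line with example values.
echo "image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest" \
  | render_manifest us-west4 example-project
# prints image: us-west4-docker.pkg.dev/example-project/optimum-tpu/tgi-tpu:latest
```

To rewrite a saved manifest, you could redirect the file into the helper and write the result to a new file (for example, `render_manifest "$REGION" "$PROJECT_ID" < optimum-tpu-gemma-2b-2x4.yaml > rendered.yaml`, where rendered.yaml is an arbitrary name), then apply the rendered file instead.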

Apply the manifest:

kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml

Llama3 8B

Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=meta-llama/Meta-Llama-3-8B
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

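This manifest describes an Optimum TPU Deployment and a Service that exposes it inside the cluster on TCP port 8080. The Service forwards traffic to port 80 of the container.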
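Both manifests set --max-input-length=8191 and --max-total-tokens=8192. Assuming the usual text-generation-inference semantics, where the total-token limit covers the prompt plus the generated tokens, the room left for generation depends on how long the prompt is. The arithmetic can be sketched as follows; `remaining_tokens` is a hypothetical helper, not a flag or command of the server.

```shell
# Hypothetical helper: tokens left for generation for a given prompt length.
# Arguments: value of --max-total-tokens, number of tokens in the prompt.
# Assumes prompt tokens + generated tokens must not exceed the total limit.
remaining_tokens() {
  echo $(( $1 - $2 ))
}

# A 100-token prompt under the manifest's limit of 8192 total tokens.
remaining_tokens 8192 100
# prints 8092
```

Under the same assumption, a prompt that uses the full 8191 input tokens would leave room for only one generated token, so a request's max_new_tokens value is best chosen with the prompt length in mind.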

Apply the manifest:

kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml

View the logs from the running Deployment:

kubectl logs -f -l app=tgi-tpu

The output should be similar to the following:

2024-07-09T22:39:34.365472Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0

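Make sure that the model is fully downloaded before you proceed to the next section.

Serve the model

Set up port forwarding to the model: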

kubectl port-forward svc/service 8080:8080
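Because the model can take several minutes to download and warm up, the server might not answer right away. A small retry loop is one way to wait for it. The following is a sketch under stated assumptions: `retry` is a hypothetical helper, and the example call in the comment relies on the /health path that the manifest's liveness probe uses, reached through the forwarded port.

```shell
# Hypothetical helper: run a command until it succeeds.
# Arguments: maximum attempts, delay in seconds between attempts, command...
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example, to run in a second terminal while the port forward is active:
#   retry 30 10 curl -sf -o /dev/null 127.0.0.1:8080/health
```

The helper returns a non-zero status if every attempt fails, so it can gate the requests that follow.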

Interact with the model server using curl

Verify your deployed model:

In a new terminal session, use curl to chat with the model:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' \
    -H 'Content-Type: application/json'

The output should be similar to the following:

{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}
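The response is a JSON object with a single generated_text field. To print only the text, the response can be piped through a JSON parser. The sketch below assumes that python3 is available on the PATH; `extract_text` is a hypothetical helper name, and the demonstration uses a canned response rather than a live server.

```shell
# Hypothetical helper: read a /generate response on standard input and print
# only the value of its generated_text field. Assumes python3 is installed.
extract_text() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["generated_text"])'
}

# Demonstration on a canned response rather than a live server.
echo '{"generated_text":"Deep learning is a subset of machine learning."}' \
  | extract_text
# prints Deep learning is a subset of machine learning.
```

With the port forward active, the same helper could be appended to the curl request as `| extract_text`.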

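Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command: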

gcloud container clusters delete CLUSTER_NAME \
  --location=REGION_NAME

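What's next

Explore the Optimum TPU documentation.
Learn how to run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Learn more about TPUs in GKE.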
