Serve an LLM using TPU Trillium on GKE with vLLM

Standard Autopilot

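This tutorial shows you how to serve large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the vLLM serving framework. In this tutorial, you serve Llama 3.1 70b on TPU Trillium and set up horizontal Pod autoscaling using vLLM server metrics.

This document is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when you deploy and serve your AI/ML workloads.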

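Background

By using TPU Trillium on GKE, you can implement a robust, production-ready serving solution with all the benefits of managed Kubernetes, including efficient scalability and higher availability. This section describes the key technologies used in this guide.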

TPU Trillium

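TPUs are Google's custom-developed application-specific integrated circuits (ASICs). TPUs accelerate machine learning and AI models built with frameworks such as TensorFlow, PyTorch, and JAX. This tutorial uses TPU Trillium, Google's sixth-generation TPU.

Before you use TPUs in GKE, we recommend that you complete the following learning path:

Learn about TPU Trillium system architecture.
Learn about TPUs in GKE.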

vLLM

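vLLM is a highly optimized open source framework for serving LLMs. vLLM can increase serving throughput on TPUs, with features such as the following:

Optimized transformer implementation with PagedAttention.
Continuous batching to improve the overall serving throughput.
Tensor parallelism and distributed serving on multiple TPUs.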
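
To illustrate why tensor parallelism across several TPU chips matters for a model of this size, the following back-of-the-envelope sketch estimates the weight memory for a 70-billion-parameter model. The per-chip HBM capacity (32 GB), the bfloat16 weight format, and the 8-chip example are assumptions made for illustration; the estimate covers weights only and ignores the KV cache, activations, and runtime overhead, so real requirements are higher.

```python
import math

GB = 1e9  # decimal gigabytes, used for simplicity


def min_chips_for_weights(num_params: float, bytes_per_param: int,
                          hbm_gb_per_chip: float) -> int:
    """Smallest chip count whose combined HBM can hold the weights alone."""
    total_gb = num_params * bytes_per_param / GB
    return math.ceil(total_gb / hbm_gb_per_chip)


def weight_gb_per_chip(num_params: float, bytes_per_param: int,
                       num_chips: int) -> float:
    """Weight memory per chip if the weights are sharded evenly."""
    return num_params * bytes_per_param / num_chips / GB


# Assumed values, not taken from the tutorial text:
PARAMS = 70e9          # approximate parameter count for a "70b" model
BYTES_PER_PARAM = 2    # bfloat16
HBM_GB_PER_CHIP = 32   # assumed per-chip HBM capacity
EXAMPLE_CHIPS = 8      # hypothetical tensor-parallel degree

if __name__ == "__main__":
    print(min_chips_for_weights(PARAMS, BYTES_PER_PARAM, HBM_GB_PER_CHIP))
    print(weight_gb_per_chip(PARAMS, BYTES_PER_PARAM, EXAMPLE_CHIPS))
```

Under these assumptions the weights alone do not fit on a single chip, which is why the model has to be sharded; the memory left over on each chip after sharding is what the KV cache and batching can use.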

To learn more, refer to the vLLM documentation.

Cloud Storage FUSE

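Cloud Storage FUSE gives your GKE cluster access to model weights that reside in Cloud Storage object storage buckets. In this tutorial, the Cloud Storage bucket that you create is initially empty. When vLLM starts, GKE downloads the model from Hugging Face and caches the weights in the Cloud Storage bucket. When a Pod restarts or the Deployment scales up, subsequent model loads read the cached data from the Cloud Storage bucket and use parallel downloads for optimal performance.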
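
The caching behavior described above follows a common "download once, reuse afterwards" pattern. The sketch below illustrates that pattern against an ordinary directory, which stands in for the mounted bucket. It is not how vLLM or the Cloud Storage FUSE CSI driver is implemented; `ensure_weights`, the `.complete` marker file, and the fake downloader are all hypothetical names introduced for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(cache_dir: Path, model_id: str,
                   download: Callable[[str, Path], None]) -> tuple[Path, bool]:
    """Return (weights_dir, was_cached), downloading only on a cache miss."""
    target = cache_dir / model_id.replace("/", "--")
    marker = target / ".complete"  # hypothetical marker written after a full download
    if marker.exists():
        return target, True
    target.mkdir(parents=True, exist_ok=True)
    download(model_id, target)
    marker.touch()
    return target, False


def demo() -> tuple[bool, bool, int]:
    """Simulate a first start (empty bucket) followed by a Pod restart."""
    calls = []

    def fake_download(model_id: str, dest: Path) -> None:
        # Stand-in for fetching weights from a model hub.
        calls.append(model_id)
        (dest / "weights.bin").write_bytes(b"\x00")

    with tempfile.TemporaryDirectory() as tmp:
        bucket_mount = Path(tmp)  # stands in for the mounted bucket path
        _, first = ensure_weights(bucket_mount, "example-org/example-model", fake_download)
        _, second = ensure_weights(bucket_mount, "example-org/example-model", fake_download)
    return first, second, len(calls)


if __name__ == "__main__":
    print(demo())
```

In the sketch, the first call misses the cache and triggers the download, while the second call finds the marker and skips it, which mirrors the difference between the initial start and a later restart or scale-up.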

To learn more, refer to the Cloud Storage FUSE CSI driver documentation.

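Objectives

This tutorial is intended for MLOps or DevOps engineers or platform administrators who want to use GKE orchestration capabilities to serve LLMs.

This tutorial covers the following steps:

Create a GKE cluster with the recommended TPU Trillium topology based on the model characteristics.
Deploy the vLLM framework on a node pool in your cluster.
Use the vLLM framework to deploy Llama 3.1 70b behind a load balancer.
Set up horizontal Pod autoscaling using vLLM server metrics.
Serve the model.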
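
As a rough illustration of the autoscaling step, the sketch below applies the replica calculation that Kubernetes documents for the HorizontalPodAutoscaler: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance of 1.0. The metric values, the replica bounds, and the function name are hypothetical; the real controller also handles stabilization windows, unready Pods, and missing metrics, which this sketch omits.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Simplified HPA-style calculation for a per-Pod average metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical example: 2 replicas, 30 queued requests per Pod, target of 10.
    print(desired_replicas(2, 30, 10))
```

With a queue-depth style metric from the model server, a sustained backlog above the target increases the replica count, and a backlog well below the target reduces it, subject to the minimum and maximum bounds.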

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/iam.securityAdmin, roles/artifactregistry.writer, roles/container.clusterAdmin
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
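
The manual check above can also be approximated programmatically. The sketch below compares the roles listed earlier against an IAM policy in its JSON form, for example the output of `gcloud projects get-iam-policy PROJECT_ID --format=json`. The sample policy and the user email are made up for illustration, and the check only sees direct bindings for the given member: roles granted through group membership, inherited from a folder or organization, or limited by conditions are not accounted for.

```python
REQUIRED_ROLES = {
    "roles/container.admin",
    "roles/iam.serviceAccountAdmin",
    "roles/iam.securityAdmin",
    "roles/artifactregistry.writer",
    "roles/container.clusterAdmin",
}


def missing_roles(policy: dict, member: str, required=REQUIRED_ROLES) -> set:
    """Return the required roles that are not directly bound to `member`."""
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return set(required) - granted


# Hypothetical policy and principal, for illustration only.
SAMPLE_POLICY = {
    "bindings": [
        {"role": "roles/container.admin", "members": ["user:alex@example.com"]},
        {"role": "roles/iam.securityAdmin", "members": ["user:alex@example.com"]},
        {"role": "roles/artifactregistry.writer", "members": ["user:sam@example.com"]},
    ]
}

if __name__ == "__main__":
    print(sorted(missing_roles(SAMPLE_POLICY, "user:alex@example.com")))
```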
Grant the roles
1. In the Google Cloud console, go to the IAM page.
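  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.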
