Serve Gemma open models using GPUs on GKE with vLLM

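To serve Gemma 4 large language models (LLMs) on GPUs with the vLLM framework on Google Kubernetes Engine (GKE), you must provision a GKE cluster with supported accelerators, such as NVIDIA B200, H100, RTX PRO 6000, or L4 GPUs.

To obtain the Gemma 4 model weights, you can configure the prebuilt vLLM container to download them from the Hugging Face repository. Alternatively, the container can load model weights from existing persistent storage, for example by caching a Cloud Storage model bucket on a Google Cloud Managed Lustre instance.

After the weights are loaded, the vLLM container exposes an OpenAI-compatible API endpoint for high-throughput inference.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who want to use Kubernetes container orchestration capabilities to serve AI/ML workloads on H200, H100, A100, and L4 GPU hardware. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

If you need a unified managed AI platform that's designed to rapidly build and serve ML models cost effectively, we recommend that you try our Vertex AI deployment solution.

Before reading this page, make sure that you're familiar with the following: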

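Background

This section describes the key technologies used in this guide.

GPUs

GPUs let you accelerate specific workloads running on your nodes, such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H200, H100, L4, and A100 GPUs.

vLLM

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:

- Optimized transformer implementation with PagedAttention
- Continuous batching to improve the overall serving throughput
- Tensor parallelism and distributed serving on multiple GPUs

For more information, refer to the vLLM documentation.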

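Objectives

This tutorial provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment.

1. Prepare your environment with a GKE cluster in Autopilot or Standard mode.
2. Deploy a vLLM container to your cluster.
3. Use vLLM to serve the Gemma 4 model through curl and a web chat interface.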

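Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.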

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

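Make sure that you have the following roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. Click Select a role, then search for the role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Make sure that your project has sufficient GPU quota for the accelerator type that you plan to use, such as L4 or RTX PRO 6000 GPUs. To learn more, see About GPUs and Allocation quotas.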

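Prepare your environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, click Activate Cloud Shell to launch a Cloud Shell session. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables: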
```
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=REGION
export ZONE=ZONE
export CLUSTER_NAME=CLUSTER_NAME
```
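Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPUs. You can check which regions offer which GPUs.
- ZONE: a zone that supports the accelerator type you want to use, for example, us-central1-b or us-central1-f for RTX PRO 6000 GPUs. You can check which zones offer which GPUs.
- CLUSTER_NAME: the name of your cluster.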
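
As a quick sanity check before you create resources, the following Python sketch verifies that a zone name belongs to the chosen region. It assumes the usual Compute Engine naming pattern, where a zone is the region name plus a single-letter suffix (for example, us-central1-b in us-central1). The helper name zone_in_region is hypothetical, and the check does not confirm that the zone actually offers a particular GPU type.

```python
def zone_in_region(zone: str, region: str) -> bool:
    """Return True if `zone` looks like `<region>-<letter>`."""
    # Split on the last hyphen: "us-central1-b" -> ("us-central1", "-", "b").
    prefix, _, suffix = zone.rpartition("-")
    return prefix == region and len(suffix) == 1 and suffix.isalpha()


if __name__ == "__main__":
    print(zone_in_region("us-central1-b", "us-central1"))
    print(zone_in_region("us-east4-a", "us-central1"))
```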

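Create and configure Google Cloud resources

Follow these instructions to create the required resources.

Create a GKE cluster and node pool

You can serve Gemma on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

In Cloud Shell, run the following command:

```
gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --release-channel=rapid
```

Replace the following values:

- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region for the cluster control plane. Provide a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.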

Standard

In Cloud Shell, run the following command to create a Standard cluster:
```
gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --release-channel=rapid \
    --num-nodes=1
```
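Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CONTROL_PLANE_LOCATION: the Compute Engine region for the cluster control plane. Provide a region that supports the accelerator type you want to use, for example, us-central1 for L4 GPUs.
- CLUSTER_NAME: the name of your cluster.
The cluster creation might take several minutes.

To create a GPU node pool for your cluster, run the following command:

```
gcloud container node-pools create gpupool \
    --accelerator type=nvidia-rtx-pro-6000,count=1,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=REGION \
    --node-locations=ZONE \
    --cluster=CLUSTER_NAME \
    --machine-type=g4-standard-48 \
    --num-nodes=1
```

GKE creates a node pool in which each node has one RTX PRO 6000 GPU.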

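Deploy a Gemma 4 model on vLLM using Hugging Face weights

To deploy a Gemma 4 model using Hugging Face weights, apply the Kubernetes Deployment manifest for your chosen model size. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Procedure

Applying this manifest pulls the vLLM container image, requests an NVIDIA GPU, and automatically downloads the weights from Hugging Face to start the vLLM inference engine.

Gemma 4 E2B-it

Follow these instructions to deploy the Gemma 4 E2B instruction-tuned model (text-only input).

Create the following vllm-4-e2b-it.yaml manifest: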

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-e2b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "2"
            memory: "10Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
          - --model=$(MODEL_ID)
          - --host=0.0.0.0
          - --port=8000
          - --tensor-parallel-size=1
          - --enable-log-requests
          - --enable-chunked-prefill
          - --enable-prefix-caching
          - --enable-auto-tool-choice
          - --generation-config=auto
          - --tool-call-parser=gemma4
          - --dtype=bfloat16
          - --max-num-seqs=16
          - --max-model-len=32768
          - --gpu-memory-utilization=0.95
          - --reasoning-parser=gemma4
          - --trust-remote-code
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-4-E2B-it
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-rtx-pro-6000
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```

Apply the manifest:
```
kubectl apply -f vllm-4-e2b-it.yaml
```

Gemma 4 E4B-it

Follow these instructions to deploy the Gemma 4 E4B instruction-tuned model.

Create the following vllm-4-e4b-it.yaml manifest:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-e4b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "4"
            memory: "20Gi"
            ephemeral-storage: "20Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: "20Gi"
            ephemeral-storage: "20Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
          - --model=$(MODEL_ID)
          - --host=0.0.0.0
          - --port=8000
          - --tensor-parallel-size=1
          - --enable-log-requests
          - --enable-chunked-prefill
          - --enable-prefix-caching
          - --enable-auto-tool-choice
          - --generation-config=auto
          - --tool-call-parser=gemma4
          - --dtype=bfloat16
          - --max-num-seqs=16
          - --max-model-len=32768
          - --gpu-memory-utilization=0.95
          - --reasoning-parser=gemma4
          - --trust-remote-code
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-4-E4B-it
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-rtx-pro-6000
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```

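Apply the manifest:
```
kubectl apply -f vllm-4-e4b-it.yaml
```
In our example, we limit the context window to 32K by using the vLLM option --max-model-len=32768. If you want a larger context window size (up to 128K), adjust your manifest and the node pool configuration with more GPU capacity.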

Gemma 4 26B-A4B-it

Follow these instructions to deploy the Gemma 4 26B-A4B instruction-tuned model.

Create the following vllm-4-26b-a4b-it.yaml manifest:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-26b-a4b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "20"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
          - --model=$(MODEL_ID)
          - --host=0.0.0.0
          - --port=8000
          - --tensor-parallel-size=1
          - --enable-log-requests
          - --enable-chunked-prefill
          - --enable-prefix-caching
          - --enable-auto-tool-choice
          - --generation-config=auto
          - --tool-call-parser=gemma4
          - --dtype=bfloat16
          - --max-num-seqs=16
          - --max-model-len=16384
          - --gpu-memory-utilization=0.95
          - --reasoning-parser=gemma4
          - --trust-remote-code
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-4-26B-A4B-it
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-rtx-pro-6000
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```

Apply the manifest:
```
kubectl apply -f vllm-4-26b-a4b-it.yaml
```
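In our example, we limit the context window size to 16K by using the vLLM option --max-model-len=16384. If you want a larger context window size (up to 128K), adjust your manifest and the node pool configuration with more GPU capacity.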

Gemma 4 31B-it

Follow these instructions to deploy the Gemma 4 31B instruction-tuned model.

Create the following vllm-4-31b-it.yaml manifest:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-4-31b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
        resources:
          requests:
            cpu: "22"
            memory: "100Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "22"
            memory: "100Gi"
            ephemeral-storage: "80Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
          - --model=$(MODEL_ID)
          - --host=0.0.0.0
          - --port=8000
          - --tensor-parallel-size=1
          - --enable-log-requests
          - --enable-chunked-prefill
          - --enable-prefix-caching
          - --enable-auto-tool-choice
          - --generation-config=auto
          - --tool-call-parser=gemma4
          - --dtype=bfloat16
          - --max-num-seqs=16
          - --max-model-len=16384
          - --gpu-memory-utilization=0.95
          - --reasoning-parser=gemma4
          - --trust-remote-code
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: google/gemma-4-31B-it
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-rtx-pro-6000
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```

Apply the manifest:
```
kubectl apply -f vllm-4-31b-it.yaml
```
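In our example, we limit the context window size to 16K by using the vLLM option --max-model-len=16384. If you want a larger context window size (up to 128K), adjust your manifest and the node pool configuration with more GPU capacity.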
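
To reason about how the context window relates to GPU memory, the following Python sketch estimates key-value (KV) cache size with the standard full-attention formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The layer, head, and dimension values are hypothetical placeholders, not the actual Gemma 4 architecture; read the real values from the model's config.json. Models that use sliding-window attention on some layers need less than this upper bound.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Upper-bound KV cache size in bytes for one sequence with full attention.

    bytes_per_elem=2 corresponds to bfloat16, as set by --dtype=bfloat16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for context_len in (16384, 32768, 131072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context_len) / 2**30
    print(f"{context_len:>7} tokens per sequence: {gib:.1f} GiB")
```

In the worst case, where every concurrent sequence allowed by --max-num-seqs reaches the full context length, the total is this per-sequence figure multiplied by the number of sequences. That is why a larger --max-model-len can call for more GPU capacity.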

Verify

Wait for the Deployment to be available:

```
kubectl wait --for=condition=Available --timeout=1800s deployment/vllm-gemma-deployment
```

View the logs from the running Deployment:

```
kubectl logs -f -l app=gemma-server
```

The Deployment resource downloads the Gemma model data. This process can take a few minutes. The output is similar to the following:

```
...
...
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
```
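
If you prefer to script the readiness check instead of watching the logs, a minimal Python sketch is to scan the log lines for the startup message shown in the sample output. The function name server_ready is hypothetical, and the sketch assumes that the message text stays the same across vLLM versions.

```python
READY_MARKER = "Application startup complete."


def server_ready(log_lines):
    """Return True once any log line contains the startup-complete marker."""
    return any(READY_MARKER in line for line in log_lines)


sample = [
    "(APIServer pid=1) INFO:     Started server process [1]",
    "(APIServer pid=1) INFO:     Waiting for application startup.",
    "(APIServer pid=1) INFO:     Application startup complete.",
]
print(server_ready(sample[:2]), server_ready(sample))
```

You could feed the function the output of kubectl logs, for example by piping the logs into a script that iterates over sys.stdin.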

After the Hugging Face deployment is available, set up port forwarding to interact with the model.

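Deploy a fine-tuned Gemma model from Managed Lustre

To serve a fine-tuned Gemma model (for example, Gemma 3 27B) that's already stored on a Google Cloud Managed Lustre instance, you must mount the corresponding PersistentVolumeClaim (PVC) to the vLLM container.

Prerequisites

Make sure that your GKE cluster already has a PVC that's connected to the Lustre instance. In this example, the PVC is named gemma-lustre-pvc.

To learn how to create a PVC and PersistentVolume (PV) for an existing instance, see Access an existing Managed Lustre instance.

Procedure

Save the following YAML manifest as vllm-lustre-gemma.yaml. In this example, the Deployment mounts the Lustre PVC at /data and directs vLLM to load the model weights from that local path.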

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-lustre
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=/data/gemma-3-27b
        - --tensor-parallel-size=1
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: model-weights
          mountPath: /data
      volumes:
      - name: model-weights
        persistentVolumeClaim:
          claimName: gemma-lustre-pvc
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-gpu-driver-version: latest
```

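Apply the manifest to your GKE cluster:

```
kubectl apply -f vllm-lustre-gemma.yaml
```

Verify

To confirm that the model loaded successfully from the Lustre volume, check the Pod logs for the vLLM startup sequence:

```
kubectl logs -l app=gemma-server
```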
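
If the logs show that vLLM cannot find the model, one thing to check is whether the directory passed to --model looks like a Hugging Face-format checkpoint. The Python sketch below assumes that such a checkpoint contains a config.json file plus at least one .safetensors weight file. The helper name and the exact file set are assumptions, and your fine-tuned model might be laid out differently.

```python
import tempfile
from pathlib import Path


def looks_like_hf_checkpoint(model_dir):
    """Heuristic check: config.json plus at least one .safetensors file."""
    path = Path(model_dir)
    has_config = (path / "config.json").is_file()
    has_weights = any(path.glob("*.safetensors"))
    return has_config and has_weights


# Demonstrate on a temporary directory instead of a real /data mount.
with tempfile.TemporaryDirectory() as tmp:
    print(looks_like_hf_checkpoint(tmp))
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model-00001-of-00002.safetensors").write_bytes(b"")
    print(looks_like_hf_checkpoint(tmp))
```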

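Serve the model

In this section, you interact with the model. Make sure that the model is fully downloaded before you continue.

Set up port forwarding

Run the following command to set up port forwarding to the model:

```
kubectl port-forward service/llm-service 8000:8000
```

The output is similar to the following:

```
Forwarding from 127.0.0.1:8000 -> 8000
```

Interact with the model using curl

This section shows how to perform a basic smoke test to verify your deployed Gemma 4 instruction-tuned model. For other models, replace google/gemma-4-26B-A4B-it with the name of the respective model.

This example shows how to test the Gemma 4 26B-A4B instruction-tuned model with text-only input.

In a new terminal session, use curl to chat with your model: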

```
curl http://127.0.0.1:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {
          "role": "user",
          "content": "Why is the sky blue?"
        }
    ],
    "chat_template_kwargs": {
         "enable_thinking": true
    },
    "skip_special_tokens": false
}'
```

The output is similar to the following:

```
{
  "id": "chatcmpl-be75ccfcbdf753d1",
  "object": "chat.completion",
  "created": 1775006187,
  "model": "google/gemma-4-26B-A4B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The short answer is a phenomenon called **Rayleigh scattering**.\n\nTo understand how it works, you have to look at three things: sunlight, the Earth's atmosphere, and how light travels.\n\n### 1. Sunlight is a Rainbow\nAlthough sunlight looks white to us, it is actually made up of all the colors of the rainbow (red, orange, yellow, green, blue, indigo, and violet). Light travels as **waves**, and each color has a different wavelength:\n*   **Red light** travels in long, lazy, wide waves.\n*   **Blue and violet light** travel in short, choppy, tight waves.\n\n### 2. The Atmosphere is an Obstacle Course\nEarth's atmosphere is filled with gases (mostly nitrogen and oxygen). As sunlight travels through the atmosphere, it strikes the molecules of these gases. \n\nBecause the gas molecules are very small, they affect the colors differently based on their wavelength:\n*   The **long waves** (reds and yellows) pass through the atmosphere mostly straight, without hitting much. They are like large ocean waves that roll right over small pebbles.\n*   The **short waves** (blues and violets) strike the gas molecules and get scattered in every direction. They are like small ripples that hit a pebble and splash everywhere.\n\nBecause this blue light is being scattered in every direction by the air, when you look up, your eyes are catching that scattered blue light coming from every part of the sky.\n\n### 3. Why isn't the sky violet?\nIf violet light has an even shorter wavelength than blue light, you might wonder why the sky doesn't look purple. There are two main reasons:\n1.  **The Sun's output:** The Sun emits much more blue light than violet light.\n2.  **Human Biology:** Human eyes are much more sensitive to blue than to violet. Our eyes interpret the scatter of mixed blue and violet light simply as pale blue.\n\n---\n\n### Bonus: Why are sunsets red?\nWhen the sun is setting, it is much lower on the horizon. This means the sunlight has to travel through a much **thicker** layer of the atmosphere to reach your eyes. \n\nBy the time the light gets to you, the blue light has been scattered away completely. Only the long-wavelength colors—the reds, oranges, and pinks—are able to make it through that thick layer of air without being scattered away, creating the beautiful colors of a sunset.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "\"Why is the sky blue?\"\nScience/Physics (Atmospheric scattering).\nExplain the phenomenon clearly, accurately, and engagingly.\n\n    *   Sunlight (white light) is made of a spectrum of colors (ROYGBIV).\n    *   Earth's atmosphere is filled with gases (nitrogen, oxygen) and particles.\n    *   Rayleigh Scattering: Shorter wavelengths (blue/violet) scatter more easily than longer wavelengths (red/yellow) when hitting small gas molecules.\n    *   The human eye's sensitivity: Why isn't it violet?\n\n    A good scientific explanation should follow a logical flow:\n    *   *Direct Answer:* The core mechanism (Rayleigh Scattering).\n    *   *The Components:* Sunlight and the Atmosphere.\n    *   *The Mechanism:* How light interacts with gas molecules.\n    *   *The Wavelength Factor:* Comparing colors.\n    *   *The \"Wait, why not violet?\" question:* Addressing human perception.\n    *   *Bonus/Related concept:* Why sunsets are red.\n\n        *   Use the term **Rayleigh Scattering**.\n        *   Summarize: Its how sunlight interacts with the Earth's atmosphere.\n\n        *   Sunlight looks white, but it's actually a mix of all colors (the rainbow).\n        *   Each color travels as a different wavelength. Red = long/lazy waves; Blue/Violet = short/choppy waves.\n\n        *   The atmosphere is mostly Nitrogen and Oxygen.\n        *   When sunlight hits these tiny gas molecules, the light gets scattered in all directions.\n\n        *   Blue light travels in shorter, smaller waves.\n        *   Because these waves are small, they strike the gas molecules more frequently and get scattered more easily than the longer red/yellow waves.\n        *   Result: When you look up, your eyes are catching this \"scattered\" blue light coming from every direction.\n\n        *   *Technically*, violet light has an even shorter wavelength than blue, so it scatters *even more*. Why isn't the sky violet?\n        *   Two reasons: 1. The Sun emits more blue light than violet light. 2. Human eyes are much more sensitive to blue than violet.\n\n        *   Briefly mention sunsets to provide a complete picture.\n        *   At sunset, light travels through *more* atmosphere. The blue is scattered away completely, leaving only the long red/orange waves to reach your eyes.\n\n    *   *Tone Check:* Is it too academic? Use analogies (like waves in water or skipping stones) if needed, but keep it concise.\n    *   *Clarity:* Ensure the distinction between wavelength and scattering is clear."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 106,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 1122,
    "completion_tokens": 1101,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
```
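
The same request can be sent from Python. The following sketch uses only the standard library and assumes that the port forward from the previous step is active on 127.0.0.1:8000. The helper names are hypothetical. It reads the answer from message.content and the model's thinking output from message.reasoning, matching the fields in the sample response.

```python
import json
import urllib.request


def build_chat_request(model, prompt, enable_thinking=True):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
        "skip_special_tokens": False,
    }


def extract_answer(response):
    """Return (content, reasoning) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    return message["content"], message.get("reasoning")


def chat(prompt, model="google/gemma-4-26B-A4B-it",
         url="http://127.0.0.1:8000/v1/chat/completions"):
    """Send one chat request; requires the port forward to be running."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_answer(json.load(reply))
```

Calling chat("Why is the sky blue?") returns a (content, reasoning) tuple. Pass a different model argument if you deployed another Gemma 4 variant.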

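(Optional) Interact with the model through a Gradio chat interface

In this section, you build a web chat application that lets you interact with your instruction-tuned model. For simplicity, this section describes only the testing approach that uses the E4B-it model.

Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.

Deploy the chat interface

In Cloud Shell, save the following manifest as gradio.yaml. Change google/gemma-4-E4B-it to the name of the Gemma 4 model that you used in your deployment.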

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.7
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/v1/chat/completions"
        - name: HOST
          value: "http://llm-service:8000"
        - name: LLM_ENGINE
          value: "openai-chat"
        - name: MODEL_ID
          value: "google/gemma-4-E4B-it"
        - name: DISABLE_SYSTEM_MESSAGE
          value: "true"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio
spec:
  selector:
    app: gradio
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 7860
  type: ClusterIP
```

Apply the manifest:
```
kubectl apply -f gradio.yaml
```

Wait for the Deployment to be available:

```
kubectl wait --for=condition=Available --timeout=900s deployment/gradio
```

Use the chat interface

In Cloud Shell, run the following command:
```
kubectl port-forward service/gradio 8080:8080
```
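This creates a port forward from Cloud Shell to the Gradio service.
Click the Web Preview button at the top right of the Cloud Shell taskbar. Click Preview on port 8080. A new tab opens in your browser.
Interact with Gemma using the Gradio chat interface. Add a prompt and click Submit.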

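Troubleshoot issues

- If you get the message Empty reply from server, the container might not have finished downloading the model data. Check the Pod's logs again for the Application startup complete message, which indicates that the model is ready to serve.
- If you see Connection refused, verify that your port forwarding is active.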
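
When you call the endpoint from Python rather than curl, the same two failure modes surface as exceptions. The sketch below maps them to the troubleshooting hints in this section. It assumes that an empty reply appears as http.client.RemoteDisconnected and that a refused connection appears as ConnectionRefusedError, possibly wrapped in urllib.error.URLError. The function name diagnose is hypothetical.

```python
import http.client
import urllib.error


def diagnose(error):
    """Map a connection error to a troubleshooting hint."""
    # urllib wraps low-level socket errors in URLError.reason.
    if isinstance(error, urllib.error.URLError) and isinstance(error.reason, Exception):
        error = error.reason
    if isinstance(error, ConnectionRefusedError):
        return "Connection refused: verify that port forwarding is active."
    if isinstance(error, http.client.RemoteDisconnected):
        return "Empty reply: the model may still be loading; check the Pod logs."
    return f"Unrecognized error: {error!r}"


print(diagnose(ConnectionRefusedError()))
```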

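Observe model performance

To view the dashboards for a model's observability metrics, follow these steps:

In the Google Cloud console, go to the Deployed Models page.

Go to Deployed Models
To view details about a specific deployment, including its metrics, logs, and dashboards, click the model name in the list.
In the model details page, click the Observability tab to view the following dashboards. If prompted, click Enable to enable metrics collection for the cluster.
- The Infrastructure usage dashboard displays utilization metrics.
- The DCGM dashboard displays DCGM metrics.
- If you're using vLLM, the Model performance dashboard is available and displays metrics for vLLM model performance.

You can also view metrics in the vLLM dashboard integration in Cloud Monitoring. These metrics are aggregated across all vLLM deployments, with no preset filters.

To use the dashboard in Cloud Monitoring, you must enable Google Cloud Managed Service for Prometheus in your GKE cluster, which collects the metrics from vLLM. vLLM exposes metrics in Prometheus format by default; you don't need to install an additional exporter. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the vLLM observability guide in the Cloud Monitoring documentation.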
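
Because vLLM exposes metrics in the Prometheus text format, you can also inspect them directly. The following Python sketch parses a small, made-up sample of that format. The metric names and values in the sample are illustrative assumptions, so check the metrics output of your own deployment for the names that your vLLM version emits. The parser is deliberately simple and assumes that lines carry no trailing timestamps.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format lines into {series: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last whitespace-separated field on the line.
        series, _, value = line.rpartition(" ")
        metrics[series] = float(value)
    return metrics


# Illustrative sample only; not captured from a real deployment.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example"} 2.0
vllm:num_requests_waiting{model_name="example"} 0.0
"""

for series, value in parse_metrics(SAMPLE).items():
    print(series, value)
```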

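Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

```
gcloud container clusters delete CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION
```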