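Deploy Gemma and run inference with Model Garden and Vertex AI GPU-backed endpoints

In this tutorial, you use Model Garden to deploy the Gemma 1B open model to a GPU-backed Vertex AI endpoint. You must deploy a model to an endpoint before you can use it to serve online predictions. Deploying a model associates physical resources with it so that it can serve online predictions with low latency.

After you deploy the Gemma 1B model, you use PredictionServiceClient to get online predictions from it. Online predictions are synchronous requests made to a model that is deployed to an endpoint.

Objectives

This tutorial shows you how to perform the following tasks:

Deploy the Gemma 1B open model to a GPU-backed endpoint by using Model Garden
Use PredictionServiceClient to get online predictions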

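Costs

In this document, you use billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.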

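Before you begin

This tutorial requires you to do the following:

Set up a Google Cloud project and enable the Vertex AI API
On your local machine:
- Install, initialize, and authenticate with the Google Cloud CLI
- Install the SDK for your programming language

Set up a Google Cloud project

Set up your Google Cloud project and enable the Vertex AI API.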

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

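Set up the Google Cloud CLI

On your local machine, set up the Google Cloud CLI.

Install and initialize the Google Cloud CLI.
If you previously installed the gcloud CLI, make sure that your gcloud components are up to date by running this command.
```
gcloud components update
```
To authenticate with the gcloud CLI, generate a local Application Default Credentials (ADC) file by running this command. The web flow that the command launches is used to provide your user credentials.
```
gcloud auth application-default login
```
For more information, see gcloud CLI authentication configuration and ADC configuration.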

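Set up the SDK for your programming language

To set up the environment used in this tutorial, install the Vertex AI SDK for your language and the protocol buffers library. The code samples use functions from the protocol buffers library to convert the input dictionary to the JSON format that the API expects.

On your local machine, follow the instructions for your programming language to install the SDK.

Python

Run this command to install and update the Vertex AI SDK for Python.
```
pip3 install --upgrade "google-cloud-aiplatform>=1.64"
```
Run this command to install the protocol buffers library for Python.
```
pip3 install --upgrade "protobuf>=5.28"
```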

Node.js

Run this command to install or update the aiplatform SDK for Node.js.

npm install @google-cloud/aiplatform

Java

To add google-cloud-aiplatform as a dependency, add the appropriate code for your environment.

Maven with BOM

Add the following XML to your pom.xml:

<dependencyManagement>
<dependencies>
  <dependency>
    <artifactId>libraries-bom</artifactId>
    <groupId>com.google.cloud</groupId>
    <scope>import</scope>
    <type>pom</type>
    <version>26.34.0</version>
  </dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-aiplatform</artifactId>
</dependency>
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java-util</artifactId>
</dependency>
<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
</dependency>
</dependencies>

Maven without BOM

Add the following to your pom.xml:

<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-aiplatform</artifactId>
  <version>1.1.0</version>
</dependency>
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java-util</artifactId>
  <version>4.28.2</version>
</dependency>
<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
  <version>2.11.0</version>
</dependency>

Gradle without BOM

Add the following to your build.gradle:

implementation 'com.google.cloud:google-cloud-aiplatform:1.1.0'

Go

Run these commands to install the following Go packages.

go get cloud.google.com/go/aiplatform
go get google.golang.org/protobuf
go get github.com/googleapis/gax-go/v2

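Deploy Gemma using Model Garden

You can deploy Gemma 1B by using its model card in the Google Cloud console or programmatically.

For more information about setting up the Google Gen AI SDK or the Google Cloud CLI, see the Google Gen AI SDK overview or Install the Google Cloud CLI.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

List the models that you can deploy and record the ID of the model to deploy. You can optionally list the supported Hugging Face models in Model Garden and filter them by model name. The output doesn't include any tuned models.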


import vertexai
from vertexai import model_garden

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

# List deployable models, optionally list Hugging Face models only or filter by model name.
deployable_models = model_garden.list_deployable_models(list_hf_models=False, model_filter="gemma")
print(deployable_models)
# Example response:
# ['google/gemma2@gemma-2-27b','google/gemma2@gemma-2-27b-it', ...]

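View the deployment specifications for a model by using the model ID from the previous step. You can view the machine types, accelerator types, and container image URIs that Model Garden has verified for a particular model.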


import vertexai
from vertexai import model_garden

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
# model = "google/gemma3@gemma-3-1b-it"
vertexai.init(project=PROJECT_ID, location="us-central1")

# For Hugging Face models, the format is the Hugging Face model name, as in
# "meta-llama/Llama-3.3-70B-Instruct".
# Go to https://console.cloud.google.com/vertex-ai/model-garden to find all deployable
# model names.

model = model_garden.OpenModel(model)
deploy_options = model.list_deploy_options()
print(deploy_options)
# Example response:
# [
#   dedicated_resources {
#     machine_spec {
#       machine_type: "g2-standard-12"
#       accelerator_type: NVIDIA_L4
#       accelerator_count: 1
#     }
#   }
#   container_spec {
#     ...
#   }
#   ...
# ]

Deploy a model to an endpoint. Model Garden uses the default deployment configuration unless you specify additional arguments and values.


import vertexai
from vertexai import model_garden

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

open_model = model_garden.OpenModel("google/gemma3@gemma-3-12b-it")
endpoint = open_model.deploy(
    machine_type="g2-standard-48",
    accelerator_type="NVIDIA_L4",
    accelerator_count=4,
    accept_eula=True,
)

# Optional. Run predictions on the deployed endpoint.
# endpoint.predict(instances=[{"prompt": "What is Generative AI?"}])

gcloud

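Before you begin, specify a quota project to run the following commands. The commands that you run are counted against the quotas for that project. For more information, see Set the quota project.

Run the gcloud ai model-garden models list command to list the models that you can deploy. This command lists all model IDs and shows which ones you can self-deploy.

gcloud ai model-garden models list --model-filter=gemma

In the output, find the ID of the model to deploy. The following example shows abbreviated output.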

MODEL_ID                                      CAN_DEPLOY  CAN_PREDICT
google/gemma2@gemma-2-27b                     Yes         No
google/gemma2@gemma-2-27b-it                  Yes         No
google/gemma2@gemma-2-2b                      Yes         No
google/gemma2@gemma-2-2b-it                   Yes         No
google/gemma2@gemma-2-9b                      Yes         No
google/gemma2@gemma-2-9b-it                   Yes         No
google/gemma3@gemma-3-12b-it                  Yes         No
google/gemma3@gemma-3-12b-pt                  Yes         No
google/gemma3@gemma-3-1b-it                   Yes         No
google/gemma3@gemma-3-1b-pt                   Yes         No
google/gemma3@gemma-3-27b-it                  Yes         No
google/gemma3@gemma-3-27b-pt                  Yes         No
google/gemma3@gemma-3-4b-it                   Yes         No
google/gemma3@gemma-3-4b-pt                   Yes         No
google/gemma3n@gemma-3n-e2b                   Yes         No
google/gemma3n@gemma-3n-e2b-it                Yes         No
google/gemma3n@gemma-3n-e4b                   Yes         No
google/gemma3n@gemma-3n-e4b-it                Yes         No
google/gemma@gemma-1.1-2b-it                  Yes         No
google/gemma@gemma-1.1-2b-it-gg-hf            Yes         No
google/gemma@gemma-1.1-7b-it                  Yes         No
google/gemma@gemma-1.1-7b-it-gg-hf            Yes         No
google/gemma@gemma-2b                         Yes         No
google/gemma@gemma-2b-gg-hf                   Yes         No
google/gemma@gemma-2b-it                      Yes         No
google/gemma@gemma-2b-it-gg-hf                Yes         No
google/gemma@gemma-7b                         Yes         No
google/gemma@gemma-7b-gg-hf                   Yes         No
google/gemma@gemma-7b-it                      Yes         No
google/gemma@gemma-7b-it-gg-hf                Yes         No

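The output doesn't include any tuned models or Hugging Face models. To view which Hugging Face models are supported, add the --can-deploy-hugging-face-models flag.

To view the deployment specifications for a model, run the gcloud ai model-garden models list-deployment-config command. You can view the machine types, accelerator types, and container image URIs that Model Garden supports for a particular model.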
```
gcloud ai model-garden models list-deployment-config \
    --model=MODEL_ID
```
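Replace MODEL_ID with the model ID from the previous list command, such as google/gemma@gemma-2b or stabilityai/stable-diffusion-xl-base-1.0.

Run the gcloud ai model-garden models deploy command to deploy the model to an endpoint. Model Garden generates a display name for your endpoint and uses the default deployment configuration unless you specify additional arguments and values.

To run the command asynchronously, include the --asynchronous flag.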

gcloud ai model-garden models deploy \
    --model=MODEL_ID \
    [--machine-type=MACHINE_TYPE] \
    [--accelerator-type=ACCELERATOR_TYPE] \
    [--endpoint-display-name=ENDPOINT_NAME] \
    [--hugging-face-access-token=HF_ACCESS_TOKEN] \
    [--reservation-affinity reservation-affinity-type=any-reservation] \
    [--reservation-affinity reservation-affinity-type=specific-reservation,key="compute.googleapis.com/reservation-name",values=RESERVATION_RESOURCE_NAME] \
    [--asynchronous]

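Replace the following placeholders:

MODEL_ID: The model ID from the previous list command. For Hugging Face models, use the Hugging Face model URL format, such as stabilityai/stable-diffusion-xl-base-1.0.
MACHINE_TYPE: Defines the set of resources to deploy for your model, such as g2-standard-4.
ACCELERATOR_TYPE: Specifies the accelerators to add to your deployment, such as NVIDIA_L4, to help improve performance for intensive workloads.
ENDPOINT_NAME: A name for the deployed Vertex AI endpoint.
HF_ACCESS_TOKEN: For Hugging Face models, if the model is gated, provide an access token.
RESERVATION_RESOURCE_NAME: To use a specific Compute Engine reservation, specify the name of your reservation. If you specify a specific reservation, you can't specify any-reservation.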
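
If you script deployments, you can assemble the optional flags programmatically. The following Python sketch is an illustration only: build_deploy_command is a hypothetical helper, not part of the gcloud CLI or any SDK. It uses only flags that appear in the command above and returns an argument list that you could pass to subprocess.run.

```python
import shlex
from typing import List, Optional


def build_deploy_command(
    model_id: str,
    machine_type: Optional[str] = None,
    accelerator_type: Optional[str] = None,
    endpoint_display_name: Optional[str] = None,
    asynchronous: bool = False,
) -> List[str]:
    """Build the argv list for `gcloud ai model-garden models deploy`.

    Hypothetical helper: only flags set by the caller are added, so
    Model Garden falls back to its defaults for everything else.
    """
    cmd = ["gcloud", "ai", "model-garden", "models", "deploy", f"--model={model_id}"]
    if machine_type:
        cmd.append(f"--machine-type={machine_type}")
    if accelerator_type:
        cmd.append(f"--accelerator-type={accelerator_type}")
    if endpoint_display_name:
        cmd.append(f"--endpoint-display-name={endpoint_display_name}")
    if asynchronous:
        cmd.append("--asynchronous")
    return cmd


# Print the command line instead of running it.
print(shlex.join(build_deploy_command(
    "google/gemma3@gemma-3-1b-it",
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
)))
```

Returning a list rather than a single string avoids shell quoting problems when a value contains spaces.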

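The output includes the deployment configuration that Model Garden used, the endpoint ID, and the deployment operation ID, which you can use to check the deployment status.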

Using the default deployment configuration:
 Machine type: g2-standard-12
 Accelerator type: NVIDIA_L4
 Accelerator count: 1

The project has enough quota. The current usage of quota for accelerator type NVIDIA_L4 in region us-central1 is 0 out of 28.

Deploying the model to the endpoint. To check the deployment status, you can try one of the following methods:
1) Look for endpoint `ENDPOINT_DISPLAY_NAME` at the [Vertex AI] -> [Online prediction] tab in Cloud Console
2) Use `gcloud ai operations describe OPERATION_ID --region=LOCATION` to find the status of the deployment long-running operation

To see details about your deployment, run the gcloud ai endpoints list --list-model-garden-endpoints-only command:
```
gcloud ai endpoints list --list-model-garden-endpoints-only \
    --region=LOCATION_ID
```
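Replace LOCATION_ID with the region where you deployed the model.

The output includes all endpoints that were created from Model Garden, along with information such as the endpoint ID, the endpoint name, and whether the endpoint is associated with a deployed model. To find your deployment, look for the endpoint name that was returned by the previous command.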

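REST

List all deployable models and then get the ID of the model to deploy. You can then deploy the model with its default configuration and endpoint. Alternatively, you can customize your deployment, for example by setting a specific machine type or using a dedicated endpoint.

List models that you can deploy

Before using any of the request data, make the following replacements:

PROJECT_ID: Your Google Cloud project ID.
QUERY_PARAMETERS: To list Model Garden models, add the following query parameters listAllVersions=True&filter=can_deploy(true). To list Hugging Face models, set the filter to alt=json&is_hf_wildcard(true)+AND+labels.VERIFIED_DEPLOYMENT_CONFIG%3DVERIFIED_DEPLOYMENT_SUCCEED&listAllVersions=True.

HTTP method and URL: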

GET https://us-central1-aiplatform.googleapis.com/v1/publishers/*/models?QUERY_PARAMETERS
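
As a minimal sketch of how the Model Garden query parameters combine with the URL above, the following Python snippet builds the full request URL by using only the standard library. The build_list_models_url helper and its region argument are assumptions made for illustration; the source only shows us-central1.

```python
from urllib.parse import urlencode


def build_list_models_url(region: str = "us-central1") -> str:
    """Return the URL that lists deployable Model Garden models.

    Hypothetical helper. The query parameters are the ones given for
    QUERY_PARAMETERS: listAllVersions=True and filter=can_deploy(true).
    """
    base = f"https://{region}-aiplatform.googleapis.com/v1/publishers/*/models"
    params = {"listAllVersions": "True", "filter": "can_deploy(true)"}
    # Keep the parentheses unescaped so the result matches the documented form.
    return f"{base}?{urlencode(params, safe='()')}"


print(build_list_models_url())
```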

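To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Execute the following command: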

curl -X GET \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "x-goog-user-project: PROJECT_ID" \
     "https://us-central1-aiplatform.googleapis.com/v1/publishers/*/models?QUERY_PARAMETERS"

PowerShell

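Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Execute the following command: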

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "PROJECT_ID" }

Invoke-WebRequest `
    -Method GET `
    -Headers $headers `
    -Uri "https://us-central1-aiplatform.googleapis.com/v1/publishers/*/models?QUERY_PARAMETERS" | Select-Object -Expand Content

You receive a JSON response similar to the following.

{
  "publisherModels": [
    {
      "name": "publishers/google/models/gemma3",
      "versionId": "gemma-3-1b-it",
      "openSourceCategory": "GOOGLE_OWNED_OSS_WITH_GOOGLE_CHECKPOINT",
      "supportedActions": {
        "openNotebook": {
          "references": {
            "us-central1": {
              "uri": "https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gradio_streaming_chat_completions.ipynb"
            }
          },
          "resourceTitle": "Notebook",
          "resourceUseCase": "Chat Completion Playground",
          "resourceDescription": "Chat with deployed Gemma 2 endpoints via Gradio UI."
        },
        "deploy": {
          "modelDisplayName": "gemma-3-1b-it",
          "containerSpec": {
            "imageUri": "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01",
            "args": [
              "python",
              "-m",
              "vllm.entrypoints.api_server",
              "--host=0.0.0.0",
              "--port=8080",
              "--model=gs://vertex-model-garden-restricted-us/gemma3/gemma-3-1b-it",
              "--tensor-parallel-size=1",
              "--swap-space=16",
              "--gpu-memory-utilization=0.95",
              "--disable-log-stats"
            ],
            "env": [
              {
                "name": "MODEL_ID",
                "value": "google/gemma-3-1b-it"
              },
              {
                "name": "DEPLOY_SOURCE",
                "value": "UI_NATIVE_MODEL"
              }
            ],
            "ports": [
              {
                "containerPort": 8080
              }
            ],
            "predictRoute": "/generate",
            "healthRoute": "/ping"
          },
          "dedicatedResources": {
            "machineSpec": {
              "machineType": "g2-standard-12",
              "acceleratorType": "NVIDIA_L4",
              "acceleratorCount": 1
            }
          },
          "publicArtifactUri": "gs://vertex-model-garden-restricted-us/gemma3/gemma3.tar.gz",
          "deployTaskName": "vLLM 128K context",
          "deployMetadata": {
            "sampleRequest": "{\n    \"instances\": [\n        {\n          \"@requestFormat\": \"chatCompletions\",\n          \"messages\": [\n              {\n                  \"role\": \"user\",\n                  \"content\": \"What is machine learning?\"\n              }\n          ],\n          \"max_tokens\": 100\n        }\n    ]\n}\n"
          }
        },
        ...

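Deploy a model

Deploy a model from Model Garden or a model from Hugging Face. You can also customize the deployment by specifying additional JSON fields.

Deploy a model with its default configuration.

Before using any of the request data, make the following replacements:

LOCATION: The region where the model is deployed.
PROJECT_ID: Your Google Cloud project ID.
MODEL_ID: The ID of the model to deploy, which you can get by listing all the deployable models. The ID uses the following format: publishers/PUBLISHER_NAME/models/MODEL_NAME@MODEL_VERSION.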
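
The list commands print short IDs such as google/gemma3@gemma-3-1b-it, while this request expects the full publishers/... form. The following Python sketch shows one way to convert between the two; to_publisher_model_name is a hypothetical helper, and it assumes that short IDs always look like PUBLISHER_NAME/MODEL_NAME@MODEL_VERSION.

```python
def to_publisher_model_name(model_id: str) -> str:
    """Convert a short Model Garden ID to the publisher model resource name.

    Hypothetical helper. IDs that already start with "publishers/" are
    returned unchanged.
    """
    if model_id.startswith("publishers/"):
        return model_id
    publisher, slash, rest = model_id.partition("/")
    model, at, version = rest.partition("@")
    if not (publisher and slash and model and at and version):
        raise ValueError(f"Expected PUBLISHER/MODEL@VERSION, got {model_id!r}")
    return f"publishers/{publisher}/models/{model}@{version}"


print(to_publisher_model_name("google/gemma3@gemma-3-1b-it"))
```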

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy

Request JSON body:

{
  "publisher_model_name": "MODEL_ID",
  "model_config": {
    "accept_eula": "true"
  }
}

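To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory: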

cat > request.json << 'EOF'
{
  "publisher_model_name": "MODEL_ID",
  "model_config": {
    "accept_eula": "true"
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy"

PowerShell

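Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory: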

@'
{
  "publisher_model_name": "MODEL_ID",
  "model_config": {
    "accept_eula": "true"
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy" | Select-Object -Expand Content

You receive a JSON response similar to the following.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployOperationMetadata",
    "genericMetadata": {
      "createTime": "2025-03-13T21:44:44.538780Z",
      "updateTime": "2025-03-13T21:44:44.538780Z"
    },
    "publisherModel": "publishers/google/models/gemma3@gemma-3-1b-it",
    "destination": "projects/PROJECT_ID/locations/LOCATION",
    "projectNumber": "PROJECT_ID"
  }
}
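
The name field in this response identifies the long-running deployment operation. As a hedged sketch, the following Python snippet derives a status URL from that name. It assumes that the operation can be read with a GET request at /v1/{name} on the regional endpoint, which the source doesn't show; operation_status_url is a hypothetical helper. You can also check the status with gcloud ai operations describe.

```python
import re

# Matches the operation name format shown in the response above.
_OPERATION_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/operations/(?P<operation>[^/]+)$"
)


def operation_status_url(operation_name: str) -> str:
    """Build a status URL for a deployment operation (assumed GET /v1/{name})."""
    match = _OPERATION_RE.match(operation_name)
    if match is None:
        raise ValueError(f"Unexpected operation name: {operation_name!r}")
    # The regional host is taken from the location segment of the name.
    location = match.group("location")
    return f"https://{location}-aiplatform.googleapis.com/v1/{operation_name}"


print(operation_status_url("projects/my-project/locations/us-central1/operations/123"))
```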

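Deploy a Hugging Face model

Before using any of the request data, make the following replacements:

LOCATION: The region where the model is deployed.
PROJECT_ID: Your Google Cloud project ID.
MODEL_ID: The ID of the Hugging Face model to deploy, which you can get by listing all the deployable models. The ID uses the following format: PUBLISHER_NAME/MODEL_NAME.
ACCESS_TOKEN: If the model is gated, provide an access token.

HTTP method and URL: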

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy

Request JSON body:

{
  "hugging_face_model_id": "MODEL_ID",
  "hugging_face_access_token": "ACCESS_TOKEN",
  "model_config": {
    "accept_eula": "true"
  }
}

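To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory: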

cat > request.json << 'EOF'
{
  "hugging_face_model_id": "MODEL_ID",
  "hugging_face_access_token": "ACCESS_TOKEN",
  "model_config": {
    "accept_eula": "true"
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "hugging_face_model_id": "MODEL_ID",
  "hugging_face_access_token": "ACCESS_TOKEN",
  "model_config": {
    "accept_eula": "true"
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy" | Select-Object -Expand Content

You receive a JSON response similar to the following.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployOperationMetadata",
    "genericMetadata": {
      "createTime": "2025-03-13T21:44:44.538780Z",
      "updateTime": "2025-03-13T21:44:44.538780Z"
    },
    "publisherModel": "publishers/PUBLISHER_NAME/models/MODEL_NAME",
    "destination": "projects/PROJECT_ID/locations/LOCATION",
    "projectNumber": "PROJECT_ID"
  }
}

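Deploy a model with customizations

Before using any of the request data, make the following replacements:

LOCATION: The region where the model is deployed.
PROJECT_ID: Your Google Cloud project ID.
MODEL_ID: The ID of the model to deploy, which you can get by listing all the deployable models. The ID uses the following format: publishers/PUBLISHER_NAME/models/MODEL_NAME@MODEL_VERSION, such as google/gemma@gemma-2b or stabilityai/stable-diffusion-xl-base-1.0.
MACHINE_TYPE: Defines the set of resources to deploy for your model, such as g2-standard-4.
ACCELERATOR_TYPE: Specifies the accelerators to add to your deployment, such as NVIDIA_L4, to help improve performance for intensive workloads.
ACCELERATOR_COUNT: The number of accelerators to use in your deployment.
reservation_affinity_type: To use an existing Compute Engine reservation for your deployment, specify any reservation or a specific one. If you specify this value, don't specify spot.
spot: Whether to use Spot VMs for your deployment.
IMAGE_URI: The location of the container image to use, such as us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241016_0916_RC00_maas.
CONTAINER_ARGS: Arguments to pass to the container during the deployment.
CONTAINER_PORT: A port number for your container.
fast_tryout_enabled: When testing a model, you can choose to use a faster deployment. This option is available only for commonly used models on certain machine types. If it's enabled, you can't specify model or deployment configurations.

HTTP method and URL: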

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy

Request JSON body:

{
  "publisher_model_name": "MODEL_ID",
  "deploy_config": {
    "dedicated_resources": {
      "machine_spec": {
        "machine_type": "MACHINE_TYPE",
        "accelerator_type": "ACCELERATOR_TYPE",
        "accelerator_count": ACCELERATOR_COUNT,
        "reservation_affinity": {
          "reservation_affinity_type": "ANY_RESERVATION"
        }
      },
      "spot": "false"
    },
    "fast_tryout_enabled": false
  },
  "model_config": {
    "accept_eula": "true",
    "container_spec": {
      "image_uri": "IMAGE_URI",
      "args": [CONTAINER_ARGS],
      "ports": [
        {
          "container_port": CONTAINER_PORT
        }
      ]
    }
  }
}
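
Because a reservation affinity and spot shouldn't be specified together, it can help to build the request body in code and check that constraint before sending. The following Python sketch is one possible approach, not an official client: build_custom_deploy_body is a hypothetical helper, it omits container_spec for brevity, and it writes JSON booleans where the template above uses the strings "true" and "false".

```python
import json
from typing import Any, Dict, Optional


def build_custom_deploy_body(
    model_id: str,
    machine_type: str,
    accelerator_type: str,
    accelerator_count: int,
    *,
    reservation_affinity_type: Optional[str] = None,
    spot: bool = False,
) -> Dict[str, Any]:
    """Build a custom deploy request body (hypothetical helper)."""
    if spot and reservation_affinity_type is not None:
        raise ValueError("Specify either a reservation affinity or spot, not both.")
    machine_spec: Dict[str, Any] = {
        "machine_type": machine_type,
        "accelerator_type": accelerator_type,
        "accelerator_count": accelerator_count,
    }
    if reservation_affinity_type is not None:
        machine_spec["reservation_affinity"] = {
            "reservation_affinity_type": reservation_affinity_type
        }
    dedicated_resources: Dict[str, Any] = {"machine_spec": machine_spec}
    if spot:
        dedicated_resources["spot"] = True
    return {
        "publisher_model_name": model_id,
        "deploy_config": {"dedicated_resources": dedicated_resources},
        "model_config": {"accept_eula": True},
    }


body = build_custom_deploy_body(
    "publishers/google/models/gemma3@gemma-3-1b-it",
    "g2-standard-12",
    "NVIDIA_L4",
    1,
    reservation_affinity_type="ANY_RESERVATION",
)
print(json.dumps(body, indent=2))
```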

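To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory: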

cat > request.json << 'EOF'
{
  "publisher_model_name": "MODEL_ID",
  "deploy_config": {
    "dedicated_resources": {
      "machine_spec": {
        "machine_type": "MACHINE_TYPE",
        "accelerator_type": "ACCELERATOR_TYPE",
        "accelerator_count": ACCELERATOR_COUNT,
        "reservation_affinity": {
          "reservation_affinity_type": "ANY_RESERVATION"
        }
      },
      "spot": "false"
    },
    "fast_tryout_enabled": false
  },
  "model_config": {
    "accept_eula": "true",
    "container_spec": {
      "image_uri": "IMAGE_URI",
      "args": [CONTAINER_ARGS],
      "ports": [
        {
          "container_port": CONTAINER_PORT
        }
      ]
    }
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "publisher_model_name": "MODEL_ID",
  "deploy_config": {
    "dedicated_resources": {
      "machine_spec": {
        "machine_type": "MACHINE_TYPE",
        "accelerator_type": "ACCELERATOR_TYPE",
        "accelerator_count": ACCELERATOR_COUNT,
        "reservation_affinity": {
          "reservation_affinity_type": "ANY_RESERVATION"
        }
      },
      "spot": "false"
    },
    "fast_tryout_enabled": false
  },
  "model_config": {
    "accept_eula": "true",
    "container_spec": {
      "image_uri": "IMAGE_URI",
      "args": [CONTAINER_ARGS],
      "ports": [
        {
          "container_port": CONTAINER_PORT
        }
      ]
    }
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION:deploy" | Select-Object -Expand Content

You receive a JSON response similar to the following.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployOperationMetadata",
    "genericMetadata": {
      "createTime": "2025-03-13T21:44:44.538780Z",
      "updateTime": "2025-03-13T21:44:44.538780Z"
    },
    "publisherModel": "publishers/google/models/gemma3@gemma-3-1b-it",
    "destination": "projects/PROJECT_ID/locations/LOCATION",
    "projectNumber": "PROJECT_ID"
  }
}

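Console

In the Google Cloud console, go to the Model Garden page.

Go to Model Garden
Find a supported model that you want to deploy, and click its model card.
Click Deploy to open the Deploy model pane.
In the Deploy model pane, specify the details for your deployment.
1. Use or modify the generated model and endpoint names.
2. Select a location to create your model endpoint in.
3. Select a machine type to use for each node of your deployment.
4. To use a Compute Engine reservation, under the Deployment settings section, select Advanced.
  
  In the Reservation type field, select a reservation type. The reservation must match the machine specs that you specified.
  - Automatically use created reservation: Vertex AI automatically selects an allowed reservation with matching properties. If the automatically selected reservation has no capacity, Vertex AI uses the general Google Cloud resource pool.
  - Select specific reservations: Vertex AI uses a specific reservation. If the selected reservation has no capacity, an error is thrown.
  - Don't use (default): Vertex AI uses the general Google Cloud resource pool. This value has the same effect as not specifying a reservation.
Click Deploy.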

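Terraform

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands. For more information, see the Terraform provider reference documentation.

Deploy a model

The following example deploys the gemma-3-1b-it model to a new Vertex AI endpoint in us-central1 by using the default configuration.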

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      version = "6.45.0"
    }
  }
}

provider "google" {
  region  = "us-central1"
}

resource "google_vertex_ai_endpoint_with_model_garden_deployment" "gemma_deployment" {
  publisher_model_name = "publishers/google/models/gemma3@gemma-3-1b-it"
  location = "us-central1"
  model_config {
    accept_eula = true
  }
}

To deploy a model with custom settings, see Vertex AI Endpoint with Model Garden Deployment for details.

Apply the configuration

terraform init
terraform plan
terraform apply

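After you apply the configuration, Terraform provisions a new Vertex AI endpoint and deploys the specified open model.

Clean up

To delete the endpoint and the model deployment, run the following command: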

terraform destroy

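Run inference on Gemma 1B with PredictionServiceClient

After you deploy Gemma 1B, you use PredictionServiceClient to get online predictions for the prompt "Why is the sky blue?"

Code parameters

The PredictionServiceClient code samples require you to update the following values.

PROJECT_ID: To find your project ID, follow these steps.
1. Go to the Welcome page in the Google Cloud console.
  Go to Welcome
2. From the project picker at the top of the page, select your project.
  
  The project name, project number, and project ID appear after the Welcome heading.
ENDPOINT_REGION: The region where you deployed the endpoint.
ENDPOINT_ID: To find your endpoint ID, view it in the console or run the gcloud ai endpoints list command. You need the endpoint name and region from the Deploy model pane.
Console
You can view the endpoint details by clicking Online prediction > Endpoints and selecting your region. Note the number that appears in the ID column.

Go to Endpoints
gcloud
You can view the endpoint details by running the gcloud ai endpoints list command.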
```
gcloud ai endpoints list \
  --region=ENDPOINT_REGION \
  --filter=display_name=ENDPOINT_NAME
```
The output looks like the following.
```
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID: 1234567891234567891
DISPLAY_NAME: gemma2-2b-it-mg-one-click-deploy
```
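
If you want to pick up the endpoint ID in a script, you can parse this output. The following Python sketch assumes that the output keeps the KEY: value layout shown above, which might differ across gcloud versions or output formats; parse_endpoints is a hypothetical helper.

```python
from typing import Dict, List


def parse_endpoints(output: str) -> List[Dict[str, str]]:
    """Parse `gcloud ai endpoints list` text output into one dict per endpoint.

    Hypothetical helper that assumes each endpoint starts with an
    ENDPOINT_ID line followed by more "KEY: value" lines.
    """
    endpoints: List[Dict[str, str]] = []
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # Skip lines such as "Using endpoint [...]".
        if key == "ENDPOINT_ID":
            endpoints.append({})
        if endpoints:
            endpoints[-1][key] = value.strip()
    return endpoints


sample = """Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID: 1234567891234567891
DISPLAY_NAME: gemma2-2b-it-mg-one-click-deploy
"""
print(parse_endpoints(sample)[0]["ENDPOINT_ID"])
```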

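Sample code

In the sample code for your language, update PROJECT_ID, ENDPOINT_REGION, and ENDPOINT_ID. Then run the code.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.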

"""
Sample to run inference on a Gemma2 model deployed to a Vertex AI endpoint with GPU accelerators.
"""

from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

# TODO(developer): Update & uncomment lines below
# PROJECT_ID = "your-project-id"
# ENDPOINT_REGION = "your-vertex-endpoint-region"
# ENDPOINT_ID = "your-vertex-endpoint-id"

# Default configuration
config = {"max_tokens": 1024, "temperature": 0.9, "top_p": 1.0, "top_k": 1}

# Prompt used in the prediction
prompt = "Why is the sky blue?"

# Encapsulate the prompt in a correct format for GPUs
# Example format: [{'inputs': 'Why is the sky blue?', 'parameters': {'temperature': 0.9}}]
input = {"inputs": prompt, "parameters": config}

# Convert input message to a list of GAPIC instances for model input
instances = [json_format.ParseDict(input, Value())]

# Create a client
api_endpoint = f"{ENDPOINT_REGION}-aiplatform.googleapis.com"
client = aiplatform.gapic.PredictionServiceClient(
    client_options={"api_endpoint": api_endpoint}
)

# Call the Gemma2 endpoint
gemma2_end_point = (
    f"projects/{PROJECT_ID}/locations/{ENDPOINT_REGION}/endpoints/{ENDPOINT_ID}"
)
response = client.predict(
    endpoint=gemma2_end_point,
    instances=instances,
)
text_responses = response.predictions
print(text_responses[0])
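
To make the request shape easier to inspect without calling the API, the following standard-library sketch builds the same endpoint path and instance payload as the sample above. build_predict_request is a hypothetical helper, and the keyword overrides are an assumption about how you might vary the sampling parameters.

```python
from typing import Any, Dict, List, Tuple

# Same defaults as the config dictionary in the sample above.
DEFAULT_PARAMETERS: Dict[str, Any] = {
    "max_tokens": 1024,
    "temperature": 0.9,
    "top_p": 1.0,
    "top_k": 1,
}


def build_predict_request(
    project_id: str,
    region: str,
    endpoint_id: str,
    prompt: str,
    **overrides: Any,
) -> Tuple[str, List[Dict[str, Any]]]:
    """Return the endpoint resource path and plain-dict instances."""
    parameters = {**DEFAULT_PARAMETERS, **overrides}
    endpoint = f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}"
    instances = [{"inputs": prompt, "parameters": parameters}]
    return endpoint, instances


endpoint, instances = build_predict_request(
    "your-project-id", "us-central1", "1234567891234567891",
    "Why is the sky blue?", temperature=0.2,
)
print(endpoint)
print(instances)
```

Each dictionary in instances can then be converted with json_format.ParseDict, as the sample does, before it's passed to client.predict.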

Node.js

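Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.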

async function gemma2PredictGpu(predictionServiceClient) {
  // Imports the Google Cloud Prediction Service Client library
  const {
    // TODO(developer): Uncomment PredictionServiceClient before running the sample.
    // PredictionServiceClient,
    helpers,
  } = require('@google-cloud/aiplatform');
  /**
   * TODO(developer): Update these variables before running the sample.
   */
  const projectId = 'your-project-id';
  const endpointRegion = 'your-vertex-endpoint-region';
  const endpointId = 'your-vertex-endpoint-id';

  // Default configuration
  const config = {maxOutputTokens: 1024, temperature: 0.9, topP: 1.0, topK: 1};
  // Prompt used in the prediction
  const prompt = 'Why is the sky blue?';

  // Encapsulate the prompt in a correct format for GPUs
  // Example format: [{inputs: 'Why is the sky blue?', parameters: {temperature: 0.9}}]
  const input = {
    inputs: prompt,
    parameters: config,
  };

  // Convert input message to a list of GAPIC instances for model input
  const instances = [helpers.toValue(input)];

  // TODO(developer): Uncomment apiEndpoint and predictionServiceClient before running the sample.
  // const apiEndpoint = `${endpointRegion}-aiplatform.googleapis.com`;

  // Create a client
  // predictionServiceClient = new PredictionServiceClient({apiEndpoint});

  // Call the Gemma2 endpoint
  const gemma2Endpoint = `projects/${projectId}/locations/${endpointRegion}/endpoints/${endpointId}`;

  const [response] = await predictionServiceClient.predict({
    endpoint: gemma2Endpoint,
    instances,
  });

  const predictions = response.predictions;
  const text = predictions[0].stringValue;

  console.log('Predictions:', text);
  return text;
}

module.exports = gemma2PredictGpu;

// TODO(developer): Uncomment below lines before running the sample.
// gemma2PredictGpu(...process.argv.slice(2)).catch(err => {
//   console.error(err.message);
//   process.exitCode = 1;
// });

Java

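Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.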


import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.PredictResponse;
import com.google.cloud.aiplatform.v1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1.PredictionServiceSettings;
import com.google.gson.Gson;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Gemma2PredictGpu {

  private final PredictionServiceClient predictionServiceClient;

  // Constructor to inject the PredictionServiceClient
  public Gemma2PredictGpu(PredictionServiceClient predictionServiceClient) {
    this.predictionServiceClient = predictionServiceClient;
  }

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String endpointRegion = "us-east4";
    String endpointId = "YOUR_ENDPOINT_ID";

    PredictionServiceSettings predictionServiceSettings =
        PredictionServiceSettings.newBuilder()
            .setEndpoint(String.format("%s-aiplatform.googleapis.com:443", endpointRegion))
            .build();
    PredictionServiceClient predictionServiceClient =
        PredictionServiceClient.create(predictionServiceSettings);
    Gemma2PredictGpu creator = new Gemma2PredictGpu(predictionServiceClient);

    creator.gemma2PredictGpu(projectId, endpointRegion, endpointId);
  }

  // Demonstrates how to run inference on a Gemma2 model
  // deployed to a Vertex AI endpoint with GPU accelerators.
  public String gemma2PredictGpu(String projectId, String region,
               String endpointId) throws IOException {
    Map<String, Object> paramsMap = new HashMap<>();
    paramsMap.put("temperature", 0.9);
    paramsMap.put("maxOutputTokens", 1024);
    paramsMap.put("topP", 1.0);
    paramsMap.put("topK", 1);
    Value parameters = mapToValue(paramsMap);

    // Prompt used in the prediction
    String instance = "{ \"inputs\": \"Why is the sky blue?\"}";
    Value.Builder instanceValue = Value.newBuilder();
    JsonFormat.parser().merge(instance, instanceValue);
    // Encapsulate the prompt in the format expected by GPU-backed endpoints.
    // Instance sent here: {'inputs': 'Why is the sky blue?'}; parameters are passed separately.
    List<Value> instances = new ArrayList<>();
    instances.add(instanceValue.build());

    EndpointName endpointName = EndpointName.of(projectId, region, endpointId);

    PredictResponse predictResponse = this.predictionServiceClient
        .predict(endpointName, instances, parameters);
    String textResponse = predictResponse.getPredictions(0).getStringValue();
    System.out.println(textResponse);
    return textResponse;
  }

  private static Value mapToValue(Map<String, Object> map) throws InvalidProtocolBufferException {
    Gson gson = new Gson();
    String json = gson.toJson(map);
    Value.Builder builder = Value.newBuilder();
    JsonFormat.parser().merge(json, builder);
    return builder.build();
  }
}

Go

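Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.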

import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/aiplatform/apiv1/aiplatformpb"

	"google.golang.org/protobuf/types/known/structpb"
)

// predictGPU demonstrates how to run inference on a Gemma2 model deployed to a Vertex AI endpoint with GPU accelerators.
func predictGPU(w io.Writer, client PredictionsClient, projectID, location, endpointID string) error {
	ctx := context.Background()

	// Note: client can be initialized in the following way:
	// apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	// client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	// if err != nil {
	// 	return fmt.Errorf("unable to create prediction client: %v", err)
	// }
	// defer client.Close()

	gemma2Endpoint := fmt.Sprintf("projects/%s/locations/%s/endpoints/%s", projectID, location, endpointID)
	prompt := "Why is the sky blue?"
	parameters := map[string]interface{}{
		"temperature":     0.9,
		"maxOutputTokens": 1024,
		"topP":            1.0,
		"topK":            1,
	}

	// Encapsulate the prompt in a correct format for GPUs.
	// Pay attention that prompt should be set in "inputs" field.
	// Example format: [{'inputs': 'Why is the sky blue?', 'parameters': {'temperature': 0.9}}]
	promptValue, err := structpb.NewValue(map[string]interface{}{
		"inputs":     prompt,
		"parameters": parameters,
	})
	if err != nil {
		fmt.Fprintf(w, "unable to convert prompt to Value: %v", err)
		return err
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint:  gemma2Endpoint,
		Instances: []*structpb.Value{promptValue},
	}

	resp, err := client.Predict(ctx, req)
	if err != nil {
		return err
	}

	prediction := resp.GetPredictions()
	value := prediction[0].GetStringValue()
	fmt.Fprintf(w, "%v", value)

	return nil
}
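
Whichever client library you use, the request comes down to a regional API host, an endpoint resource name, and a list of instances. The following Python sketch uses only the standard library to assemble those pieces in the shape the Go sample sends. The helper name build_predict_request is hypothetical, and nothing here calls Vertex AI.

```python
import json


def build_predict_request(project_id, location, endpoint_id, prompt, parameters):
    """Assemble the pieces of an online prediction request (no network calls).

    Hypothetical helper for illustration; it only mirrors the strings and
    structures that the samples above build.
    """
    return {
        # Regional API host, as used when creating the client in the samples.
        "api_endpoint": f"{location}-aiplatform.googleapis.com:443",
        # Fully qualified endpoint resource name.
        "endpoint": f"projects/{project_id}/locations/{location}/endpoints/{endpoint_id}",
        # The prompt goes in the "inputs" field of each instance.
        "instances": [{"inputs": prompt, "parameters": parameters}],
    }


request = build_predict_request(
    "YOUR_PROJECT_ID",
    "us-east4",
    "YOUR_ENDPOINT_ID",
    "Why is the sky blue?",
    {"temperature": 0.9, "maxOutputTokens": 1024, "topP": 1.0, "topK": 1},
)
print(json.dumps(request, indent=2))
```

Printing the structure before sending it is a quick way to confirm that the prompt sits under "inputs" and that the endpoint name uses the same region as the API host.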

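Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project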

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

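Delete individual resources

If you want to keep your project, delete the resources used in this tutorial:

Undeploy the model and delete the endpoint
Delete the model from Model Registry

Undeploy the model and delete the endpoint

Use one of the following methods to undeploy the model and delete the endpoint.

Console

In the Google Cloud console, click Online prediction, and then click Endpoints.

Go to the Endpoints page
In the Region drop-down list, select the region where you deployed your endpoint.
Click the endpoint name to open the details page. For example: gemma2-2b-it-mg-one-click-deploy.
In the row for the Gemma 2 (Version 1) model, click Actions, and then click Undeploy model from endpoint.
In the Undeploy model from endpoint dialog, click Undeploy.
Click the Back button to return to the Endpoints page.

Go to the Endpoints page
At the end of the gemma2-2b-it-mg-one-click-deploy row, click Actions, and then select Delete endpoint.
In the confirmation prompt, click Confirm.

gcloud

To undeploy the model and delete the endpoint by using the Google Cloud CLI, follow these steps.

In these commands, replace the following:

Replace PROJECT_ID with your project ID
Replace LOCATION_ID with the region where you deployed the model and endpoint
Replace ENDPOINT_ID with the endpoint ID
Replace DEPLOYED_MODEL_NAME with the display name of the model
Replace DEPLOYED_MODEL_ID with the deployed model ID

Run the gcloud ai endpoints list command to get the endpoint ID. This command lists the IDs of all endpoints in your project. Note the ID of the endpoint used in this tutorial.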

gcloud ai endpoints list \
    --project=PROJECT_ID \
    --region=LOCATION_ID

The output is similar to the following. In the output, the ID is labeled ENDPOINT_ID.

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID: 1234567891234567891
DISPLAY_NAME: gemma2-2b-it-mg-one-click-deploy
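
If a project has several endpoints, picking the right ID by eye is error-prone. The following Python sketch looks up an endpoint ID by display name, assuming the command output has exactly the key: value layout shown above. The helper names parse_endpoint_list and find_endpoint_id are hypothetical, and the real layout may differ between gcloud versions, so treat this as an illustration rather than a stable parser.

```python
def parse_endpoint_list(output):
    """Parse 'ENDPOINT_ID: ...' / 'DISPLAY_NAME: ...' lines into records."""
    records = []
    current = {}
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        # Skip the "Using endpoint [...]" banner, blank lines, and other fields.
        if not sep or key not in ("ENDPOINT_ID", "DISPLAY_NAME"):
            continue
        # A new ENDPOINT_ID line starts the next record.
        if key == "ENDPOINT_ID" and current:
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def find_endpoint_id(output, display_name):
    """Return the ID of the first endpoint with this display name, or None."""
    for record in parse_endpoint_list(output):
        if record.get("DISPLAY_NAME") == display_name:
            return record.get("ENDPOINT_ID")
    return None


# Sample text copied from the example output in this tutorial.
sample = """Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID: 1234567891234567891
DISPLAY_NAME: gemma2-2b-it-mg-one-click-deploy
"""
print(find_endpoint_id(sample, "gemma2-2b-it-mg-one-click-deploy"))
```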

Run the gcloud ai models describe command to get the deployed model ID. Note the ID of the model deployed in this tutorial.

gcloud ai models describe DEPLOYED_MODEL_NAME \
    --project=PROJECT_ID \
    --region=LOCATION_ID

The abbreviated output looks like the following. In the output, the ID is labeled deployedModelId.

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
artifactUri: [URI removed]
baseModelSource:
  modelGardenSource:
    publicModelName: publishers/google/models/gemma2
...
deployedModels:
- deployedModelId: '1234567891234567891'
  endpoint: projects/12345678912/locations/us-central1/endpoints/12345678912345
displayName: gemma2-2b-it-12345678912345
etag: [ETag removed]
modelSourceInfo:
  sourceType: MODEL_GARDEN
name: projects/123456789123/locations/us-central1/models/gemma2-2b-it-12345678912345
...
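
A model can be deployed to more than one endpoint, so it helps to select the deployedModelId that belongs to the endpoint you plan to delete. The following Python sketch assumes you ran the describe command with the global --format=json flag and that the JSON uses the same field names as the output above (deployedModels, deployedModelId, endpoint). The helper name deployed_model_ids and the sample document are hypothetical.

```python
import json


def deployed_model_ids(describe_json, endpoint_id=None):
    """Return deployedModelId values, optionally limited to one endpoint."""
    model = json.loads(describe_json)
    ids = []
    for deployed in model.get("deployedModels", []):
        endpoint = deployed.get("endpoint", "")
        # Endpoint names end in ".../endpoints/ENDPOINT_ID".
        if endpoint_id is None or endpoint.endswith(f"/endpoints/{endpoint_id}"):
            ids.append(deployed["deployedModelId"])
    return ids


# Hypothetical JSON that mirrors the fields in the abbreviated output above.
sample_json = json.dumps({
    "displayName": "gemma2-2b-it-12345678912345",
    "deployedModels": [
        {
            "deployedModelId": "1234567891234567891",
            "endpoint": "projects/12345678912/locations/us-central1/endpoints/12345678912345",
        }
    ],
})
print(deployed_model_ids(sample_json, endpoint_id="12345678912345"))
```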

Undeploy the model from the endpoint. You need the endpoint ID and the deployed model ID that you obtained from the preceding commands.

gcloud ai endpoints undeploy-model ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION_ID \
    --deployed-model-id=DEPLOYED_MODEL_ID

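This command doesn't produce any output.

Run the gcloud ai endpoints delete command to delete the endpoint.

gcloud ai endpoints delete ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION_ID

When prompted, enter y to confirm. This command doesn't produce any output.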

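Delete the model

Console

In the Google Cloud console, go to the Model Registry page from the Vertex AI section.

Go to the Model Registry page
In the Region drop-down list, select the region where you deployed your model.
At the end of the gemma2-2b-it-1234567891234 row, click Actions.
Select Delete model.

When you delete the model, all associated model versions and evaluations are also deleted from your Google Cloud project.
In the confirmation prompt, click Delete.

gcloud

To delete the model by using the Google Cloud CLI, provide the model's display name and region to the gcloud ai models delete command.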

gcloud ai models delete DEPLOYED_MODEL_NAME \
    --project=PROJECT_ID \
    --region=LOCATION_ID

Replace DEPLOYED_MODEL_NAME with the display name of the model. Replace PROJECT_ID with your project ID. Replace LOCATION_ID with the region where you deployed the model.
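
The order of the gcloud cleanup steps matters: undeploy the model, delete the endpoint, then delete the model. The following Python sketch assembles the three commands from this section in that order and prints them for review. The helper name cleanup_commands and the placeholder values are hypothetical, and the script doesn't run anything.

```python
import shlex


def cleanup_commands(project_id, location_id, endpoint_id, deployed_model_id, model_name):
    """Return the cleanup commands from this section as argv lists, in order."""
    common = [f"--project={project_id}", f"--region={location_id}"]
    return [
        # 1. Undeploy the model from the endpoint.
        ["gcloud", "ai", "endpoints", "undeploy-model", endpoint_id, *common,
         f"--deployed-model-id={deployed_model_id}"],
        # 2. Delete the now-empty endpoint.
        ["gcloud", "ai", "endpoints", "delete", endpoint_id, *common],
        # 3. Delete the model from Model Registry.
        ["gcloud", "ai", "models", "delete", model_name, *common],
    ]


commands = cleanup_commands(
    "PROJECT_ID", "LOCATION_ID", "ENDPOINT_ID", "DEPLOYED_MODEL_ID", "DEPLOYED_MODEL_NAME"
)
for argv in commands:
    # shlex.join quotes arguments so each line can be pasted into a shell.
    print(shlex.join(argv))
```

To execute the commands instead of printing them, you could pass each argv list to subprocess.run with check=True. The delete commands still prompt for confirmation unless you choose to suppress prompts.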

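What's next

Learn more about Gemma open models.
Read the Gemma Terms of Use.
Learn more about open models.
Learn how to deploy a tuned model.
Learn how to deploy Gemma 2 to Google Kubernetes Engine by using Hugging Face Text Generation Inference (TGI).
Learn more about the PredictionServiceClient in your preferred language: Python, Node.js, Java, or Go.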
