Call MaaS APIs for open models

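Many open models on Vertex AI are offered as fully managed, serverless models that you call as APIs through the Vertex AI Chat Completions API. You don't need to provision or manage infrastructure for these models.

You can stream responses to reduce the latency that end users perceive. A streamed response uses server-sent events (SSE) to deliver the response incrementally.

This page shows how to make streaming and non-streaming calls to open models that support the OpenAI Chat Completions API. For Llama-specific considerations, see Request Llama predictions.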

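Before you begin

To use open models with Vertex AI, complete the following steps. Using Vertex AI requires that the Vertex AI API (aiplatform.googleapis.com) is enabled. If you already have a project with the Vertex AI API enabled, you can use that project instead of creating a new one.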

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Go to the Model Garden model card for the model that you want to use, and then click Enable to enable the model in your project.
Go to Model Garden

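Make a streaming call to an open model

The following sample makes a streaming call to an open model:

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Before running this sample, make sure to set the OPENAI_BASE_URL environment variable. For more information, see Authentication and credentials.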

from openai import OpenAI
client = OpenAI()

stream = client.chat.completions.create(
    model="MODEL",
    messages=[{"role": "ROLE", "content": "CONTENT"}],
    max_tokens=MAX_OUTPUT_TOKENS,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

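MODEL: The name of the model that you want to use, for example deepseek-ai/deepseek-v3.1-maas.
ROLE: The role associated with a message. You can specify user or assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, the response content continues immediately from the content of that message. You can use this to constrain part of the model's response.
CONTENT: The content, such as text, of the user or assistant message.
MAX_OUTPUT_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60 to 80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.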
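
The loop in the sample prints each delta as it arrives. If you also need the complete reply afterward, one option is to accumulate the deltas. The following is a minimal sketch, not part of the official sample: collect_stream_text and fake_chunk are hypothetical helpers, and the SimpleNamespace objects only imitate the shape of a streamed chunk (choices[0].delta.content) so that the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream_text(stream):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        # Defensive check (an assumption): skip chunks that carry no choices.
        if not chunk.choices:
            continue
        # delta.content can be None, so fall back to an empty string.
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
full_text = collect_stream_text(fake_stream)
print(full_text)
```

With a real client, you would pass the stream object returned by client.chat.completions.create(..., stream=True) instead of fake_stream.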

REST

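After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

LOCATION: A region that supports open models.
PROJECT_ID: Your Google Cloud project ID.
MODEL: The name of the model that you want to use, for example deepseek-ai/deepseek-v2.
ROLE: The role associated with a message. You can specify user or assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, the response content continues immediately from the content of that message. You can use this to constrain part of the model's response.
CONTENT: The content, such as text, of the user or assistant message.
MAX_OUTPUT_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60 to 80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
STREAM: A boolean that specifies whether the response is streamed. Stream the response to reduce the latency that end users perceive. Set to true to stream the response, or to false to return the response all at once.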
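
To pick a starting value for MAX_OUTPUT_TOKENS, the rule of thumb in the list above (about four characters per token) can be turned into a rough estimate. This is only a sketch: estimate_tokens is a hypothetical helper, the ratio is an approximation, and actual counts depend on the tokenizer of the model that you use.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Roughly estimate a token count from the character count."""
    # Round up so that a short, non-empty text never estimates to zero tokens.
    return math.ceil(len(text) / chars_per_token)


sample = "Streaming reduces perceived latency."
print(estimate_tokens(sample))
```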

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions

Request JSON body:

{
  "model": "MODEL",
  "messages": [
    {
      "role": "ROLE",
      "content": "CONTENT"
    }
  ],
  "max_tokens": MAX_OUTPUT_TOKENS,
  "stream": true
}
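
The request body can also be generated from code rather than written by hand. The sketch below checks the role rules described on this page (the first message uses the user role, and turns alternate) before writing request.json. build_request_body is a hypothetical helper, and the prompt text and token limit are example values only.

```python
import json


def build_request_body(model, messages, max_tokens, stream):
    """Return a chat completions request body after checking the role order."""
    if not messages or messages[0]["role"] != "user":
        raise ValueError("The first message must use the user role.")
    for previous, current in zip(messages, messages[1:]):
        if previous["role"] == current["role"]:
            raise ValueError("user and assistant turns must alternate.")
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
    }


body = build_request_body(
    model="deepseek-ai/deepseek-v3.1-maas",  # example model name from this page
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=256,  # example value, not a recommendation
    stream=True,
)

# Write the body to the file that the curl and PowerShell commands read.
with open("request.json", "w", encoding="utf-8") as f:
    json.dump(body, f, indent=2)
```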

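To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: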

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"

PowerShell

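Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: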

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

data: {
  "choices": [
    {
      "delta": {
        "content": "CONTENT",
        "role": "assistant"
      },
      "index": 0,
      "logprobs": null
    }
  ],
  "created": 1234567890,
  "id": "2025-06-11|10:00:00.292195-07|9.7.144.202|-123456789",
  "model": "MODEL",
  "object": "chat.completion.chunk",
  "system_fingerprint": ""
}

data: {
  "choices": [
    {
      "delta": {
        "content": "CONTENT",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "created": 1234567890,
  "id": "2025-06-11|10:00:00.292195-07|9.7.144.202|-123456789",
  "model": "MODEL",
  "object": "chat.completion.chunk",
  "system_fingerprint": "",
  "usage": {
    "completion_tokens": 131,
    "prompt_tokens": 14,
    "total_tokens": 145
  }
}

data: [DONE]
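
If you read the stream over REST without a client library, you need to separate the data: events yourself. The following sketch assumes that each event arrives on a single line prefixed with data: (treating the multi-line layout above as formatting for readability) and that the stream ends with data: [DONE]. parse_sse_lines is a hypothetical helper, and the sample payloads are abbreviated.

```python
import json


def parse_sse_lines(lines):
    """Yield the decoded JSON payload of each SSE data line until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # Ignore blank separators and non-data fields.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Abbreviated stand-in for the lines of a streamed HTTP response body.
raw = [
    'data: {"choices": [{"delta": {"content": "Hel", "role": "assistant"}, "index": 0}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo", "role": "assistant"}, '
    '"finish_reason": "stop", "index": 0}], '
    '"usage": {"completion_tokens": 2, "prompt_tokens": 14, "total_tokens": 16}}',
    "",
    "data: [DONE]",
]

text = ""
usage = None
for event in parse_sse_lines(raw):
    text += event["choices"][0]["delta"].get("content") or ""
    # In the sample above, usage appears only on the final chunk.
    usage = event.get("usage", usage)

print(text, usage)
```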

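Make a non-streaming call to an open model

The following sample makes a non-streaming call to an open model:

Python

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Before running this sample, make sure to set the OPENAI_BASE_URL environment variable. For more information, see Authentication and credentials.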

from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
    model="MODEL",
    messages=[{"role": "ROLE", "content": "CONTENT"}],
    max_tokens=MAX_OUTPUT_TOKENS,
    stream=False,
)
print(completion.choices[0].message)

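MODEL: The name of the model that you want to use, for example deepseek-ai/deepseek-v3.1-maas.
ROLE: The role associated with a message. You can specify user or assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, the response content continues immediately from the content of that message. You can use this to constrain part of the model's response.
CONTENT: The content, such as text, of the user or assistant message.
MAX_OUTPUT_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60 to 80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.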
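
As an illustration of the ROLE behavior described above, the sketch below builds a message list that ends with a partial assistant message, so that the reply would continue from that text. with_prefill is a hypothetical helper and the prompt is an example. How closely a given model follows the prefill is an assumption to verify for the model that you use.

```python
def with_prefill(user_prompt, prefill):
    """Return messages whose final assistant turn seeds the start of the reply."""
    return [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": prefill},
    ]


messages = with_prefill("List three primary colors as a JSON array.", "[")

# Pass `messages` as the messages argument of client.chat.completions.create().
# Assuming the reply contains only the continuation, prepend the prefill
# when you rebuild the full text:
#   full_text = "[" + completion.choices[0].message.content
print(messages)
```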

REST

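After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

LOCATION: A region that supports open models.
PROJECT_ID: Your Google Cloud project ID.
MODEL: The name of the model that you want to use, for example deepseek-ai/deepseek-v2.
ROLE: The role associated with a message. You can specify user or assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, the response content continues immediately from the content of that message. You can use this to constrain part of the model's response.
CONTENT: The content, such as text, of the user or assistant message.
MAX_OUTPUT_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60 to 80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
STREAM: A boolean that specifies whether the response is streamed. Stream the response to reduce the latency that end users perceive. Set to true to stream the response, or to false to return the response all at once.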

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions

Request JSON body:

{
  "model": "MODEL",
  "messages": [
    {
      "role": "ROLE",
      "content": "CONTENT"
    }
  ],
  "max_tokens": MAX_OUTPUT_TOKENS,
  "stream": false
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"

PowerShell

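Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: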

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "CONTENT",
        "role": "assistant"
      }
    }
  ],
  "created": 1234567890,
  "id": "2025-06-11|10:00:00.292195-07|9.7.144.202|-123456789",
  "model": "MODEL",
  "object": "chat.completion",
  "system_fingerprint": "",
  "usage": {
    "completion_tokens": 367,
    "prompt_tokens": 14,
    "total_tokens": 381
  }
}
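
To show how the fields of a non-streaming response fit together, the following sketch parses an abbreviated copy of the response above and reads the reply text, the finish reason, and the token usage. It uses only the standard json module. The variable names are illustrative, and finish_reason values other than stop are not covered on this page.

```python
import json

# Abbreviated copy of the sample response shown above.
response_text = """
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {"content": "CONTENT", "role": "assistant"}
    }
  ],
  "object": "chat.completion",
  "usage": {"completion_tokens": 367, "prompt_tokens": 14, "total_tokens": 381}
}
"""

response = json.loads(response_text)
choice = response["choices"][0]
reply = choice["message"]["content"]
usage = response["usage"]

# "stop" is the finish reason that the sample shows for a completed reply.
finished_normally = choice["finish_reason"] == "stop"
print(reply, finished_normally, usage["total_tokens"])
```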

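Regional and global endpoints

With a regional endpoint, requests are served from the region that you specify. Use a regional endpoint if you have data residency requirements or if the model doesn't support the global endpoint.

With the global endpoint, Google can process and serve your requests from any region that the model you use supports. In some cases this can result in higher latency. The global endpoint helps improve overall availability and reduce errors.

The global endpoint is priced the same as the regional endpoints. However, quotas and supported model capabilities for the global endpoint can differ from those of the regional endpoints. For more information, see the related third-party model page.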

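Specify the global endpoint

To use the global endpoint, set the region to global.

For example, the request URL for a curl command uses the following format: https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/endpoints/openapi

The Vertex AI SDK uses a regional endpoint by default. Set the region to GLOBAL to use the global endpoint.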
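
The regional and global URL formats on this page differ only in the host name and the location segment. The sketch below builds either form, for example to use as the OPENAI_BASE_URL value. build_base_url is a hypothetical helper, and my-project and us-central1 are placeholder values. The regional form is inferred from the REST URL shown earlier, without the /chat/completions suffix.

```python
def build_base_url(project_id, location):
    """Return the OpenAI-compatible base URL for a regional or global endpoint."""
    # The global endpoint has no region prefix in its host name.
    if location == "global":
        host = "aiplatform.googleapis.com"
    else:
        host = f"{location}-aiplatform.googleapis.com"
    return (
        f"https://{host}/v1/projects/{project_id}"
        f"/locations/{location}/endpoints/openapi"
    )


print(build_base_url("my-project", "us-central1"))
print(build_base_url("my-project", "global"))
```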

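Restrict global API endpoint usage

To help enforce the use of regional endpoints, use the constraints/gcp.restrictEndpointUsage organization policy constraint to block requests to the global API endpoint. For more information, see Restricting endpoint usage.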

What's next

Learn how to use function calling.
Learn about structured output.
Learn about batch predictions.
