Vertex AI 說明文件不再更新

Vertex AI 的服務現已併入 Gemini Enterprise Agent Platform。如要查看最新資訊，請參閱 Agent Platform 說明文件。

Google uses AI technology to translate content into your preferred language. AI translations can contain errors.

Priority PayGo

優先即付即用 (優先 PayGo) 是一種用量選項，與標準即付即用相比，可提供更穩定的效能，且不必預先承諾佈建輸送量。

使用 Priority PayGo 時，系統會依較高的費率，按權杖用量收費。如需定價資訊，請參閱 Vertex AI 定價頁面。

使用 Priority PayGo 的時機

Priority PayGo 非常適合流量模式起伏不定或無法預測的重要業務工作負載。以下是範例用途：

客戶服務虛擬助理
代理工作流程和跨代理互動
研究模擬

支援的機型和地點

下列模型僅支援 global 端點的優先 PayGo。Priority PayGo 不支援區域或多區域端點。

使用 Priority PayGo

如要使用 Priority PayGo 將要求傳送至 Vertex AI 中的 Gemini API，您必須在要求中加入 X-Vertex-AI-LLM-Shared-Request-Type 標頭。您可以使用 Priority PayGo 執行下列操作：

使用佈建輸送量配額 (如有)，並溢出至優先 PayGo。
僅使用 Priority PayGo。

以佈建輸送量做為預設值時，使用 Priority PayGo

如要先使用任何可用的佈建輸送量配額，再使用優先順序隨用隨付方案，請在要求中加入 X-Vertex-AI-LLM-Shared-Request-Type: priority 標頭，如下列範例所示。

Python

安裝

pip install --upgrade google-genai

詳情請參閱 SDK 參考文件。

設定環境變數，透過 Vertex AI 使用 Gen AI SDK：

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

初始化 GenAI 用戶端，即可使用 Priority PayGo。完成這個步驟後，您不需要進一步調整程式碼，即可在同一個用戶端上，使用 Priority PayGo 與 Gemini API 互動。

from google import genai
from google.genai.types import HttpOptions
client = genai.Client(
  vertexai=True, project='your_project_id', location='global',
  http_options=HttpOptions(
    api_version="v1",
      headers={
        "X-Vertex-AI-LLM-Shared-Request-Type": "priority"
      },
  )
)

REST

設定環境後，您可以使用 REST 測試文字提示。下列範例會將要求傳送至發布商模型端點。

使用任何要求資料之前，請先替換以下項目：

PROJECT_ID：您的專案 ID。
MODEL_ID：您要為其初始化 Priority PayGo 的模型 ID。如需支援優先隨用即付方案的機型清單，請參閱「機型版本」。
PROMPT_TEXT：要加入提示的文字指令。 JSON。

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -H "X-Vertex-AI-LLM-Shared-Request-Type: priority" \
  "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
  $'{
      "contents": {
        "role": "model",
        "parts": { "text": "PROMPT_TEXT" }
    }
  }'

您應該會收到類似如下的 JSON 回應。

{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "text": "Response to sample request."
          }
        ]
      },
      "finishReason": "STOP"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 3,
    "candidatesTokenCount": 900,
    "totalTokenCount": 1957,
    "trafficType": "ON_DEMAND_PRIORITY",
    "thoughtsTokenCount": 1054
  }
}

使用 generateContent 方法要求在完整生成回覆後再傳回。如要減少人類觀眾的延遲感，請使用 streamGenerateContent 方法，在生成回覆的同時串流回覆內容。
多模態模型 ID 位於網址尾端，方法之前 (例如 gemini-2.0-flash)。這個範例也可能支援其他模型。

僅使用 Priority PayGo

如要只使用 Priority PayGo，請在要求中加入 X-Vertex-AI-LLM-Request-Type: shared 和 X-Vertex-AI-LLM-Shared-Request-Type: priority 標頭，如下列範例所示。

Python

安裝

pip install --upgrade google-genai

詳情請參閱 SDK 參考文件。

設定環境變數，透過 Vertex AI 使用 Gen AI SDK：

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

初始化 GenAI 用戶端，即可使用 Priority PayGo。完成這個步驟後，您不需要進一步調整程式碼，即可在同一個用戶端上，使用 Priority PayGo 與 Gemini API 互動。

from google import genai
from google.genai.types import HttpOptions
client = genai.Client(
  vertexai=True, project='your_project_id', location='global',
  http_options=HttpOptions(
    api_version="v1",
      headers={
        "X-Vertex-AI-LLM-Request-Type": "shared",
        "X-Vertex-AI-LLM-Shared-Request-Type": "priority"
      },
  )
)

REST

使用任何要求資料之前，請先替換以下項目：

PROJECT_ID：您的專案 ID。
MODEL_ID：您要為其初始化 Priority PayGo 的模型 ID。如需支援優先隨用即付方案的機型清單，請參閱「機型版本」。
PROMPT_TEXT：要加入提示的文字指令。 JSON。

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -H "X-Vertex-AI-LLM-Request-Type: shared" \
  -H "X-Vertex-AI-LLM-Shared-Request-Type: priority" \
  "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
  $'{
      "contents": {
        "role": "model",
        "parts": { "text": "PROMPT_TEXT" }
    }
  }'

您應該會收到類似如下的 JSON 回應。

{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "text": "Response to sample request."
          }
        ]
      },
      "finishReason": "STOP"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 3,
    "candidatesTokenCount": 900,
    "totalTokenCount": 1957,
    "trafficType": "ON_DEMAND_PRIORITY",
    "thoughtsTokenCount": 1054
  }
}

使用 generateContent 方法要求在完整生成回覆後再傳回。如要減少人類觀眾的延遲感，請使用 streamGenerateContent 方法，在生成回覆的同時串流回覆內容。
多模態模型 ID 位於網址尾端，方法之前 (例如 gemini-2.0-flash)。這個範例也可能支援其他模型。

驗證 Priority PayGo 用量

如要確認要求是否使用了優先 PayGo，請查看回應中的流量類型，如下列範例所示。

Python

您可以透過回應中的 traffic_type 欄位，確認要求是否使用 Priority PayGo。如果您的要求是使用 Priority PayGo 處理，則 traffic_type 欄位會設為 ON_DEMAND_PRIORITY。

sdk_http_response=HttpResponse(
  headers=<dict len=9>
) candidates=[Candidate(
  avg_logprobs=-0.539712212302468,
  content=Content(
    parts=[
      Part(
        text="""Response to sample request.
        """
      ),
    ],
    role='model'
  ),
  finish_reason=<FinishReason.STOP: 'STOP'>
)] create_time=datetime.datetime(2025, 12, 3, 20, 32, 55, 916498, tzinfo=TzInfo(0)) model_version='gemini-2.5-flash' prompt_feedback=None response_id='response_id' usage_metadata=GenerateContentResponseUsageMetadata(
  candidates_token_count=1408,
  candidates_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=1408
    ),
  ],
  prompt_token_count=5,
  prompt_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=5
    ),
  ],
  thoughts_token_count=1356,
  total_token_count=2769,
  traffic_type=<TrafficType.ON_DEMAND_PRIORITY: 'ON_DEMAND_PRIORITY'>
) automatic_function_calling_history=[] parsed=None

REST

您可以透過回應中的 trafficType 欄位，確認要求是否使用 Priority PayGo。如果您的要求是使用 Priority PayGo 處理，則 trafficType 欄位會設為 ON_DEMAND_PRIORITY。

{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "text": "Response to sample request."
          }
        ]
      },
      "finishReason": "STOP"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 3,
    "candidatesTokenCount": 900,
    "totalTokenCount": 1957,
    "trafficType": "ON_DEMAND_PRIORITY",
    "thoughtsTokenCount": 1054
  }
}

斜坡限制

即付即用優先方案會在機構層級設定斜坡限制。斜坡限制有助於提供可預測且一致的效能。起始限制取決於模型，如下所示：

Gemini Flash 和 Flash-Lite 模型：每分鐘 400 萬個權杖。
Gemini Pro 模型：每分鐘 100 萬個符記。

每持續使用 10 分鐘，斜坡限制就會增加 50%。

如果要求超出升級限制，或系統因流量負載過高而暫時超出容量上限，要求可能會降級為標準隨用隨付方案，並以標準隨用隨付方案費率計費。

為盡量避免降級，請逐步增加用量，確保不超過限制。如果仍需要提升效能，建議購買額外的佈建輸送量配額。

您可以從回應中確認要求是否遭到降級。如果要求降級為標準即付即用，流量類型會設為 ON_DEMAND。詳情請參閱「驗證 Priority PayGo 使用情形」。

後續步驟

如要進一步瞭解佈建輸送量，請參閱佈建輸送量。
如要瞭解 Vertex AI 的配額和限制，請參閱「Vertex AI 配額和限制」。
如要進一步瞭解配額和系統限制，請參閱 Cloud Quotas 說明文件。 Google Cloud

Vertex AI 說明文件不再更新

Priority PayGo 透過集合功能整理內容 你可以依據偏好儲存及分類內容。

使用 Priority PayGo 的時機

支援的機型和地點

使用 Priority PayGo

以佈建輸送量做為預設值時，使用 Priority PayGo

Python

安裝

REST

僅使用 Priority PayGo

Python

安裝

REST

驗證 Priority PayGo 用量

Python

REST

斜坡限制

後續步驟

Priority PayGo