Priority pay-as-you-go (Priority PayGo) is a consumption option that provides more consistent performance than Standard PayGo without the upfront commitment of Provisioned Throughput.
When you use Priority PayGo, you're charged per token at a higher rate than Standard PayGo. For information about pricing, see the Vertex AI pricing page.
When to use Priority PayGo
Priority PayGo is ideal for latency-sensitive and critical workloads with fluctuating or unpredictable traffic patterns. The following are example use cases:
Customer-facing virtual assistants
Latency-sensitive document and data processing
Agentic workflows and cross-agent interactions
Research simulations
Supported models and locations
The following models support Priority PayGo in the global
endpoint only. Priority PayGo doesn't support regional or
multi-regional endpoints.
Use Priority PayGo
To send requests to the Gemini API in Vertex AI using Priority PayGo,
you must include the X-Vertex-AI-LLM-Shared-Request-Type header in your
request. You can use Priority PayGo in two ways:
Use Provisioned Throughput quota (if available) and spill over to Priority PayGo.
Use only Priority PayGo.
Use Priority PayGo while using Provisioned Throughput as default
To use any available Provisioned Throughput quota before using
Priority PayGo, include the header
X-Vertex-AI-LLM-Shared-Request-Type: priority in your requests, as shown in the
following samples.
Python
Install
pip install --upgrade google-genai
To learn more, see the SDK reference documentation.
Set environment variables to use the Gen AI SDK with Vertex AI:
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True
Initialize your Gen AI client to use Priority PayGo. After this step, you don't need to make any further code changes to interact with the Gemini API using Priority PayGo on the same client.
from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    vertexai=True,
    project='your_project_id',
    location='global',
    http_options=HttpOptions(
        api_version="v1",
        headers={
            "X-Vertex-AI-LLM-Shared-Request-Type": "priority"
        },
    )
)
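After the client is initialized, you send requests the same way as with any other client. The following is a minimal sketch of sending a prompt and inspecting the traffic type in the response; the model ID gemini-2.5-flash and the prompt text are illustrative, so substitute a model that supports Priority PayGo in your project.
# Minimal usage sketch. Assumes the `client` initialized above and that
# gemini-2.5-flash is available to your project.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain Priority PayGo in one sentence.",
)

print(response.text)
# The traffic type shows whether the request used Provisioned Throughput,
# Priority PayGo (ON_DEMAND_PRIORITY), or Standard PayGo (ON_DEMAND).
print(response.usage_metadata.traffic_type)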
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- MODEL_ID: The model ID of the model for which you want to initialize Priority PayGo. For a list of models that support Priority PayGo, see Model versions.
- PROMPT_TEXT: The text instructions to include in the prompt.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-Vertex-AI-LLM-Shared-Request-Type: priority" \
"https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
$'{
"contents": {
"role": "model",
"parts": { "text": "PROMPT_TEXT" }
}
}'
You should receive a JSON response similar to the following.
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Response to sample request."
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 3,
"candidatesTokenCount": 900,
"totalTokenCount": 1957,
"trafficType": "ON_DEMAND_PRIORITY",
"thoughtsTokenCount": 1054
}
}
- Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method.
- The multimodal model ID is located at the end of the URL before the method (for example, gemini-2.0-flash). This sample might support other models as well.
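If you want to stream responses from Python, the Gen AI SDK exposes a streaming variant of the call. The following is a minimal sketch using the SDK's generate_content_stream method (the SDK counterpart of the streamGenerateContent REST method); the model ID and prompt are illustrative, and it assumes the Priority PayGo client initialized earlier.
# Streaming sketch: prints text as chunks arrive instead of waiting for the
# full response. Assumes the `client` from the earlier initialization step.
for chunk in client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents="Summarize Priority PayGo in two sentences.",
):
    if chunk.text:
        print(chunk.text, end="")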
Use only Priority PayGo
To use only Priority PayGo, include the headers
X-Vertex-AI-LLM-Request-Type: shared and
X-Vertex-AI-LLM-Shared-Request-Type: priority in your requests, as shown in the
following samples.
Python
Install
pip install --upgrade google-genai
To learn more, see the SDK reference documentation.
Set environment variables to use the Gen AI SDK with Vertex AI:
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True
Initialize your Gen AI client to use Priority PayGo. After this step, you don't need to make any further code changes to interact with the Gemini API using Priority PayGo on the same client.
from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    vertexai=True,
    project='your_project_id',
    location='global',
    http_options=HttpOptions(
        api_version="v1",
        headers={
            "X-Vertex-AI-LLM-Request-Type": "shared",
            "X-Vertex-AI-LLM-Shared-Request-Type": "priority"
        },
    )
)
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- MODEL_ID: The model ID of the model for which you want to initialize Priority PayGo. For a list of models that support Priority PayGo, see Model versions.
- PROMPT_TEXT: The text instructions to include in the prompt.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-Vertex-AI-LLM-Request-Type: shared" \
-H "X-Vertex-AI-LLM-Shared-Request-Type: priority" \
"https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
$'{
"contents": {
"role": "model",
"parts": { "text": "PROMPT_TEXT" }
}
}'
You should receive a JSON response similar to the following.
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Response to sample request."
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 3,
"candidatesTokenCount": 900,
"totalTokenCount": 1957,
"trafficType": "ON_DEMAND_PRIORITY",
"thoughtsTokenCount": 1054
}
}
- Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method.
- The multimodal model ID is located at the end of the URL before the method (for example, gemini-2.0-flash). This sample might support other models as well.
Verify Priority PayGo usage
You can verify whether a request used Priority PayGo by checking the traffic type in the response, as shown in the following examples.
Python
You can verify whether Priority PayGo was used for a request by checking
the traffic_type field in the response. If your request was processed using
Priority PayGo, the traffic_type field is set to
ON_DEMAND_PRIORITY.
sdk_http_response=HttpResponse(
  headers=<...>
) candidates=[Candidate(
  avg_logprobs=-0.539712212302468,
  content=Content(
    parts=[
      Part(
        text="""Response to sample request.
"""
      ),
    ],
    role='model'
  ),
  finish_reason=<FinishReason.STOP: 'STOP'>
)] create_time=datetime.datetime(2025, 12, 3, 20, 32, 55, 916498, tzinfo=TzInfo(0)) model_version='gemini-2.5-flash' prompt_feedback=None response_id='response_id' usage_metadata=GenerateContentResponseUsageMetadata(
  candidates_token_count=1408,
  candidates_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=1408
    ),
  ],
  prompt_token_count=5,
  prompt_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=5
    ),
  ],
  thoughts_token_count=1356,
  total_token_count=2769,
  traffic_type=<TrafficType.ON_DEMAND_PRIORITY: 'ON_DEMAND_PRIORITY'>
) automatic_function_calling_history=[] parsed=None
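If you want to branch on this field in code, the following is a minimal sketch. It assumes `response` is the object returned by client.models.generate_content in the earlier examples.
# Minimal sketch: check whether a response was served with Priority PayGo.
# Assumes `response` was returned by client.models.generate_content.
usage = response.usage_metadata
traffic = getattr(usage.traffic_type, "value", usage.traffic_type)

if traffic == "ON_DEMAND_PRIORITY":
    print("Request was served with Priority PayGo.")
else:
    print(f"Request was served as: {traffic}")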
REST
You can verify whether Priority PayGo was used for a request by checking
the trafficType field in the response. If your request was processed using
Priority PayGo, the trafficType field is set to
ON_DEMAND_PRIORITY.
{ "candidates": [ { "content": { "role": "model", "parts": [ { "text": "Response to sample request." } ] }, "finishReason": "STOP" } ], "usageMetadata": { "promptTokenCount": 3, "candidatesTokenCount": 900, "totalTokenCount": 1957, "trafficType": "ON_DEMAND_PRIORITY", "thoughtsTokenCount": 1054 } }
Ramp limits
Priority PayGo sets ramp limits at the organization level. Ramp limits help provide predictable and consistent performance. The starting limit depends on the model, as follows:
- Gemini Flash and Flash-Lite models: 4M tokens/min.
- Gemini Pro models: 1M tokens/min.
The ramp limit increases by 50% for every 10 minutes of sustained usage.
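As an illustration only, the following sketch projects how the limit could grow for a Gemini Flash workload, assuming the 50% increase compounds on the current limit every 10 minutes of sustained usage (the exact ramp mechanics may differ from this assumption).
# Illustrative sketch: project the ramp limit over time, assuming the 50%
# increase compounds on the current limit every 10 minutes of sustained usage.
starting_limit = 4_000_000  # tokens/min for Gemini Flash and Flash-Lite models

limit = starting_limit
for minutes in range(0, 31, 10):
    print(f"After {minutes} min of sustained usage: {limit:,.0f} tokens/min")
    limit *= 1.5
# Prints 4,000,000 at 0 min, 6,000,000 at 10 min, 9,000,000 at 20 min,
# and 13,500,000 at 30 min under this assumption.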
If a request exceeds the ramp limit and the system is over capacity due to high traffic, the request is downgraded to Standard PayGo and is charged at Standard PayGo rates.
To minimize downgrades, scale usage incrementally to stay within the limit. If you still require better performance, consider purchasing additional Provisioned Throughput quota.
You can verify whether a request was downgraded from the response. For
requests downgraded to Standard PayGo, the traffic type is set
to ON_DEMAND. For more information, see
Verify Priority PayGo usage.
What's next
- To learn more about Provisioned Throughput, see Provisioned Throughput.
- To learn about quotas and limits for Vertex AI, see Vertex AI quotas and limits.
- To learn more about Google Cloud quotas and system limits, see the Cloud Quotas documentation.