Flex pay-as-you-go (Flex PayGo) is a cost-effective option for accessing Gemini models for non-critical workloads that can tolerate longer response times and higher throttling. Flex PayGo offers a 50% discount compared to Standard PayGo.
When to use Flex PayGo
Flex PayGo is ideal for synchronous, latency-tolerant, non-critical tasks that aren't time-sensitive. The following are example use cases:
- Offline analysis of text, document, image, audio, and video files
- Model quality evaluation
- Data annotation and labeling
- Document translation
- Building a product catalog
Supported models and locations
The following preview Gemini models support Flex PayGo in the global endpoint only. Flex PayGo doesn't support regional or multi-regional endpoints.
Use Flex PayGo
To send requests to the Gemini API using Flex PayGo, you must include the X-Vertex-AI-LLM-Shared-Request-Type header in your request. You can use Flex PayGo in two ways:
- Use Provisioned Throughput quota (if available) and then use Flex PayGo.
- Use only Flex PayGo.
Note that requests that use Flex PayGo have higher latency than Standard PayGo requests. The default timeout is 20 minutes, which you can override (in milliseconds) using the timeout parameter. The maximum allowed value is 30 minutes.
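For example, here is a minimal sketch of overriding the timeout with the Gen AI SDK's HttpOptions (shown in the samples that follow); 1,800,000 milliseconds corresponds to the 30-minute maximum:

from google.genai.types import HttpOptions

# Raise the default 20-minute timeout to the 30-minute maximum.
http_options = HttpOptions(
    headers={"X-Vertex-AI-LLM-Shared-Request-Type": "flex"},
    timeout=1800000,  # Timeout in milliseconds
)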
Use Flex PayGo while using Provisioned Throughput as default
To use any available Provisioned Throughput quota before using Flex PayGo, include the header X-Vertex-AI-LLM-Shared-Request-Type: flex in your requests, as shown in the following samples.
Python
Install
pip install --upgrade google-genai
To learn more, see the SDK reference documentation.
Set environment variables to use the Gen AI SDK with Vertex AI:
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True
Initialize your GenAI client to use Flex PayGo. After performing this step, you won't need to make further adjustments to your code to interact with the Gemini API using Flex PayGo on the same client.
from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    vertexai=True,
    project='your_project_id',
    location='global',
    http_options=HttpOptions(
        api_version="v1",
        headers={
            "X-Vertex-AI-LLM-Shared-Request-Type": "flex"
        },
        # timeout=600000  # Timeout in milliseconds
    )
)
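With the client configured, sending a request looks the same as any other Gen AI SDK call. A minimal sketch, assuming the client above and placeholder model ID and prompt values:

# Uses the Flex PayGo header configured on the client above.
response = client.models.generate_content(
    model="gemini-2.0-flash",  # Placeholder; use a model that supports Flex PayGo
    contents="PROMPT_TEXT",
)
print(response.text)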
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint. The optional X-Server-Timeout header sets the server-side timeout in seconds.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- MODEL_ID: The model ID of the model for which you want to initialize Flex PayGo. For a list of models that support Flex PayGo, see Model versions.
- PROMPT_TEXT: The text instructions to include in the prompt.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-Server-Timeout: 600" \ # Timeout in milliseconds
-H "X-Vertex-AI-LLM-Shared-Request-Type: flex" \
"https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
$'{
"contents": {
"role": "model",
"parts": { "text": "PROMPT_TEXT" }
}
}'
You should receive a JSON response similar to the following:
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Response to sample request."
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 3,
"candidatesTokenCount": 900,
"totalTokenCount": 1957,
"trafficType": "ON_DEMAND_FLEX",
"thoughtsTokenCount": 1054
}
}
- Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method (see the sketch after this list).
- The multimodal model ID is located at the end of the URL before the method (for example, gemini-2.0-flash). This sample might support other models as well.
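For the streaming variant mentioned in the first item above, the Gen AI SDK exposes generate_content_stream. A minimal sketch, assuming the Flex-configured Python client from the earlier step:

# Print each chunk as the model generates it.
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="PROMPT_TEXT",
):
    print(chunk.text, end="")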
Use only Flex PayGo
To use only Flex PayGo, include the headers X-Vertex-AI-LLM-Request-Type: shared and X-Vertex-AI-LLM-Shared-Request-Type: flex in your requests, as shown in the following samples.
Python
Install
pip install --upgrade google-genai
To learn more, see the SDK reference documentation.
Set environment variables to use the Gen AI SDK with Vertex AI:
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True
Initialize your GenAI client to use Flex PayGo. After performing this step, you won't need to make further adjustments to your code to interact with the Gemini API using Flex PayGo on the same client.
from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    vertexai=True,
    project='your_project_id',
    location='global',
    http_options=HttpOptions(
        api_version="v1",
        headers={
            "X-Vertex-AI-LLM-Request-Type": "shared",
            "X-Vertex-AI-LLM-Shared-Request-Type": "flex"
        },
        # timeout=600000  # Timeout in milliseconds
    )
)
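Alternatively, if you'd rather keep a default client and opt individual requests into Flex PayGo, the Gen AI SDK's GenerateContentConfig accepts per-request http_options. The following sketch assumes that per-request override behavior and a client created without the Flex headers; it reuses the header values documented above:

from google.genai.types import GenerateContentConfig, HttpOptions

# Opt a single request into Flex PayGo; other requests on this
# client keep their default routing.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="PROMPT_TEXT",
    config=GenerateContentConfig(
        http_options=HttpOptions(
            headers={
                "X-Vertex-AI-LLM-Request-Type": "shared",
                "X-Vertex-AI-LLM-Shared-Request-Type": "flex",
            },
        ),
    ),
)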
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- MODEL_ID: The model ID of the model for which you want to initialize Flex PayGo. For a list of models that support Flex PayGo, see Model versions.
- PROMPT_TEXT: The text instructions to include in the prompt.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-Server-Timeout: 600" \ # Timeout in milliseconds
-H "X-Vertex-AI-LLM-Request-Type: shared" \
-H "X-Vertex-AI-LLM-Shared-Request-Type: flex" \
"https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
$'{
"contents": {
"role": "model",
"parts": { "text": "PROMPT_TEXT" }
}
}'
You should receive a JSON response similar to the following:
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Response to sample request."
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 3,
"candidatesTokenCount": 900,
"totalTokenCount": 1957,
"trafficType": "ON_DEMAND_FLEX",
"thoughtsTokenCount": 1054
}
}
- Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method.
- The multimodal model ID is located at the end of the URL before the method (for example, gemini-2.0-flash). This sample might support other models as well.
Verify Flex PayGo usage
You can verify whether a request used Flex PayGo by checking the traffic type in the response, as shown in the following examples.
Python
You can verify whether Flex PayGo was used for a request by checking the traffic_type field in the response. If your request was processed using Flex PayGo, the traffic_type field is set to ON_DEMAND_FLEX.
sdk_http_response=HttpResponse(
  headers=
)
candidates=[Candidate(
  avg_logprobs=-0.539712212302468,
  content=Content(
    parts=[
      Part(
        text="""Response to sample request.
"""
      ),
    ],
    role='model'
  ),
  finish_reason=<FinishReason.STOP: 'STOP'>
)]
create_time=datetime.datetime(2025, 12, 3, 20, 32, 55, 916498, tzinfo=TzInfo(0))
model_version='gemini-2.5-flash'
prompt_feedback=None
response_id='response_id'
usage_metadata=GenerateContentResponseUsageMetadata(
  candidates_token_count=1408,
  candidates_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=1408
    ),
  ],
  prompt_token_count=5,
  prompt_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=5
    ),
  ],
  thoughts_token_count=1356,
  total_token_count=2769,
  traffic_type=<TrafficType.ON_DEMAND_FLEX: 'ON_DEMAND_FLEX'>
)
automatic_function_calling_history=[]
parsed=None
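To branch on this programmatically, you can compare the field against the SDK's TrafficType enum. A minimal sketch, assuming a response object from the samples above:

from google.genai.types import TrafficType

# True only when the request was served by Flex PayGo.
if response.usage_metadata.traffic_type == TrafficType.ON_DEMAND_FLEX:
    print("Request was served by Flex PayGo.")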
REST
You can verify whether Flex PayGo was used for a request by checking the trafficType field in the response. If your request was processed using Flex PayGo, the trafficType field is set to ON_DEMAND_FLEX.
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Response to sample request."
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 3,
"candidatesTokenCount": 900,
"totalTokenCount": 1957,
"trafficType": "ON_DEMAND_FLEX",
"thoughtsTokenCount": 1054
}
}
Additional quota for Flex PayGo
In addition to the available quotas for content generation requests (including Provisioned Throughput quota for spillover traffic), requests that use Flex PayGo are subject to the following quota:
| Description | Value |
|---|---|
| QPM for each base model in a project for requests that use Flex PayGo | 3,000 |
What's next
- To learn about quotas and limits for Vertex AI, see Vertex AI quotas and limits.
- To learn more about Google Cloud quotas and system limits, see the Cloud Quotas documentation.