Flex pay-as-you-go (Flex PayGo) is a cost-effective option for accessing Gemini models for non-critical workloads that can tolerate longer response times and higher throttling. Flex PayGo offers a 50% discount compared to Standard PayGo.
When to use Flex PayGo
Flex PayGo is ideal for synchronous, latency-tolerant, non-critical tasks that aren't time-sensitive. The following are example use cases:
- Offline analysis of text, document, image, audio, and video files
- Model quality evaluation
- Data annotation and labeling
- Document translation
- Building a product catalog
Supported models and locations
The following preview Gemini models support Flex PayGo in the global endpoint only. Flex PayGo doesn't support regional or multi-regional endpoints.
Use Flex PayGo
To send requests to the Gemini API using Flex PayGo, you must include the X-Vertex-AI-LLM-Shared-Request-Type header in your request. You can use Flex PayGo in two ways:
- Use Provisioned Throughput quota (if available) and then use Flex PayGo.
- Use only Flex PayGo.
Note that requests that use Flex PayGo have higher latency than Standard PayGo requests. The default timeout is 20 minutes, which you can override (in milliseconds) using the timeout parameter. The maximum allowed value is 30 minutes.
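For example, here is a minimal sketch of overriding the timeout with the Gen AI SDK's HttpOptions (shown in the samples that follow); 1,800,000 milliseconds corresponds to the 30-minute maximum:

from google.genai.types import HttpOptions

# Raise the default 20-minute timeout to the 30-minute maximum.
http_options = HttpOptions(
    headers={"X-Vertex-AI-LLM-Shared-Request-Type": "flex"},
    timeout=1800000,  # Timeout in milliseconds
)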
Use Flex PayGo while using Provisioned Throughput as default
To use any available Provisioned Throughput quota before using Flex PayGo, include the header X-Vertex-AI-LLM-Shared-Request-Type: flex in your requests, as shown in the following samples.
Python
Install
pip install --upgrade google-genai
To learn more, see the SDK reference documentation.
Set environment variables to use the Gen AI SDK with Vertex AI:
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True
Initialize your GenAI client to use Flex PayGo. After performing this step, you won't need to make further adjustments to your code to interact with the Gemini API using Flex PayGo on the same client.
from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    vertexai=True,
    project='your_project_id',
    location='global',
    http_options=HttpOptions(
        api_version="v1",
        headers={
            "X-Vertex-AI-LLM-Shared-Request-Type": "flex"
        },
        # timeout=600000  # Timeout in milliseconds
    )
)
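With the client configured, sending a request looks the same as any other Gen AI SDK call. A minimal sketch, assuming the client above and placeholder model ID and prompt values:

# Uses the Flex PayGo header configured on the client above.
response = client.models.generate_content(
    model="gemini-2.0-flash",  # Placeholder; use a model that supports Flex PayGo
    contents="PROMPT_TEXT",
)
print(response.text)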
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint. The optional X-Server-Timeout header sets the server-side timeout in seconds.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- MODEL_ID: The model ID of the model for which you want to initialize Flex PayGo. For a list of models that support Flex PayGo, see Model versions.
- PROMPT_TEXT: The text instructions to include in the prompt.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-Server-Timeout: 600" \ # Timeout in milliseconds
-H "X-Vertex-AI-LLM-Shared-Request-Type: flex" \
"https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
$'{
"contents": {
"role": "model",
"parts": { "text": "PROMPT_TEXT" }
}
}'
You should receive a JSON response similar to the following:
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Response to sample request."
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 3,
"candidatesTokenCount": 900,
"totalTokenCount": 1957,
"trafficType": "ON_DEMAND_FLEX",
"thoughtsTokenCount": 1054
}
}
- Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method (see the sketch after this list).
- The multimodal model ID is located at the end of the URL before the method (for example, gemini-2.0-flash). This sample might support other models as well.
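For the streaming variant mentioned in the first item above, the Gen AI SDK exposes generate_content_stream. A minimal sketch, assuming the Flex-configured Python client from the earlier step:

# Print each chunk as the model generates it.
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="PROMPT_TEXT",
):
    print(chunk.text, end="")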
Use only Flex PayGo
To use only Flex PayGo, include the headers X-Vertex-AI-LLM-Request-Type: shared and X-Vertex-AI-LLM-Shared-Request-Type: flex in your requests, as shown in the following samples.
Python
Install
pip install --upgrade google-genai
To learn more, see the SDK reference documentation.
Set environment variables to use the Gen AI SDK with Vertex AI:
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True
Initialize your GenAI client to use Flex PayGo. After performing this step, you won't need to make further adjustments to your code to interact with the Gemini API using Flex PayGo on the same client.
from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    vertexai=True,
    project='your_project_id',
    location='global',
    http_options=HttpOptions(
        api_version="v1",
        headers={
            "X-Vertex-AI-LLM-Request-Type": "shared",
            "X-Vertex-AI-LLM-Shared-Request-Type": "flex"
        },
        # timeout=600000  # Timeout in milliseconds
    )
)
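Alternatively, if you'd rather keep a default client and opt individual requests into Flex PayGo, the Gen AI SDK's GenerateContentConfig accepts per-request http_options. The following sketch assumes that per-request override behavior and a client created without the Flex headers; it reuses the header values documented above:

from google.genai.types import GenerateContentConfig, HttpOptions

# Opt a single request into Flex PayGo; other requests on this
# client keep their default routing.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="PROMPT_TEXT",
    config=GenerateContentConfig(
        http_options=HttpOptions(
            headers={
                "X-Vertex-AI-LLM-Request-Type": "shared",
                "X-Vertex-AI-LLM-Shared-Request-Type": "flex",
            },
        ),
    ),
)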
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- MODEL_ID: The model ID of the model for which you want to initialize Flex PayGo. For a list of models that support Flex PayGo, see Model versions.
- PROMPT_TEXT: The text instructions to include in the prompt.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-H "X-Server-Timeout: 600" \ # Timeout in milliseconds
-H "X-Vertex-AI-LLM-Request-Type: shared" \
-H "X-Vertex-AI-LLM-Shared-Request-Type: flex" \
"https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/MODEL_ID:generateContent" -d \
$'{
"contents": {
"role": "model",
"parts": { "text": "PROMPT_TEXT" }
}
}'
You should receive a JSON response similar to the following:
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Response to sample request."
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 3,
"candidatesTokenCount": 900,
"totalTokenCount": 1957,
"trafficType": "ON_DEMAND_FLEX",
"thoughtsTokenCount": 1054
}
}
- Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method.
- The multimodal model ID is located at the end of the URL before the method (for example, gemini-2.0-flash). This sample might support other models as well.
Verify Flex PayGo usage
You can verify whether a request used Flex PayGo by checking the traffic type in the response, as shown in the following examples.
Python
You can verify whether Flex PayGo was used for a request by checking the traffic_type field in the response. If your request was processed using Flex PayGo, the traffic_type field is set to ON_DEMAND_FLEX.
sdk_http_response=HttpResponse(
  headers=
)
candidates=[Candidate(
  avg_logprobs=-0.539712212302468,
  content=Content(
    parts=[
      Part(
        text="""Response to sample request.
"""
      ),
    ],
    role='model'
  ),
  finish_reason=<FinishReason.STOP: 'STOP'>
)]
create_time=datetime.datetime(2025, 12, 3, 20, 32, 55, 916498, tzinfo=TzInfo(0))
model_version='gemini-2.5-flash'
prompt_feedback=None
response_id='response_id'
usage_metadata=GenerateContentResponseUsageMetadata(
  candidates_token_count=1408,
  candidates_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=1408
    ),
  ],
  prompt_token_count=5,
  prompt_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=5
    ),
  ],
  thoughts_token_count=1356,
  total_token_count=2769,
  traffic_type=<TrafficType.ON_DEMAND_FLEX: 'ON_DEMAND_FLEX'>
)
automatic_function_calling_history=[]
parsed=None
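To branch on this programmatically, you can compare the field against the SDK's TrafficType enum. A minimal sketch, assuming a response object from the samples above:

from google.genai.types import TrafficType

# True only when the request was served by Flex PayGo.
if response.usage_metadata.traffic_type == TrafficType.ON_DEMAND_FLEX:
    print("Request was served by Flex PayGo.")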
REST
You can verify whether Flex PayGo was used for a request by checking the trafficType field in the response. If your request was processed using Flex PayGo, the trafficType field is set to ON_DEMAND_FLEX.
{
"candidates": [
{
"content": {
"role": "model",
"parts": [
{
"text": "Response to sample request."
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 3,
"candidatesTokenCount": 900,
"totalTokenCount": 1957,
"trafficType": "ON_DEMAND_FLEX",
"thoughtsTokenCount": 1054
}
}
Additional quota for Flex PayGo
In addition to the available quotas for content generation requests (including Provisioned Throughput quota for spillover traffic), requests that use Flex PayGo are subject to the following quota:
| Description | Value |
|---|---|
| QPM for each base model in a project for requests that use Flex PayGo | 3,000 |
What's next
- To learn about quotas and limits for Vertex AI, see Vertex AI quotas and limits.
- To learn more about Google Cloud quotas and system limits, see the Cloud Quotas documentation.