Google models
Provisioned Throughput supports only models that you call directly from your project. To make API calls that count against Provisioned Throughput, you must use the specific model version ID (for example, gemini-2.0-flash-001), not a model version alias.
Moreover, Provisioned Throughput doesn't support models that are called by other Vertex AI products, such as Vertex AI Agents and Vertex AI Search. For example, if you make API calls to Gemini 2.0 Flash through Vertex AI Search, your Provisioned Throughput order for Gemini 2.0 Flash doesn't cover the calls made by Vertex AI Search.
Provisioned Throughput doesn't support batch prediction calls.
The following table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. Your per-second throughput is defined as the total of your prompt input and generated output, across all requests, per second.
To estimate how many tokens your workload requires, use the SDK tokenizer or the countTokens API.
| Model | Per-second throughput per GSU | Units | Minimum GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Gemini 2.5 Flash with Live API<br>Latest supported version: | 1,620 | Tokens | 1 | 1 input text token = 1 token<br>1 input audio token = 6 tokens<br>1 input video token = 6 tokens<br>1 input session memory token = 1 token<br>1 output text token = 4 tokens<br>1 output audio token = 24 tokens |
| Latest supported version: | 2,690 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 9 tokens<br>1 output image token = 100 tokens |
| Latest supported version (GA):<br>Latest supported version (preview): | 8,070 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 3 tokens<br>1 output response text token = 4 tokens<br>1 output reasoning text token = 4 tokens |
| Gemini 2.5 Flash with Live API native audio<br>Latest supported version: | 1,620 | Tokens | 1 | 1 input text token = 1 token<br>1 input audio token = 6 tokens<br>1 input video token = 6 tokens<br>1 input image token = 6 tokens<br>1 input session memory token = 1 token<br>1 output text token = 4 tokens<br>1 output audio token = 24 tokens |
| Latest supported version: | 650 | Tokens | 1 | Less than or equal to 200,000 input tokens:<br>1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 1 token<br>1 output response text token = 8 tokens<br>1 output reasoning text token = 8 tokens<br>Greater than 200,000 input tokens:<br>1 input text token = 2 tokens<br>1 input image token = 2 tokens<br>1 input video token = 2 tokens<br>1 input audio token = 2 tokens<br>1 output response text token = 12 tokens<br>1 output reasoning text token = 12 tokens |
| Latest supported version (GA):<br>Latest supported version (preview): | 2,690 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 4 tokens<br>1 output response text token = 9 tokens<br>1 output reasoning text token = 9 tokens |
| Latest supported version: | 3,360 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 7 tokens<br>1 output text token = 4 tokens |
| Latest supported version: | 6,720 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 0.0040 | Video seconds | 34 | 1 output video second = 1 output video second |
| | | Video+audio seconds | 67 | 1 output video+audio second = 2 output video seconds |
| Latest supported version: | 0.0080 | Video seconds | 17 | 1 output video second = 1 output video second |
| | | Video+audio seconds | 25 | 1 output video+audio second = 1.45 output video seconds |
| Latest supported version: | 0.0040 | Video seconds | 34 | 1 output video second = 1 output video second |
| | | Video+audio seconds | 67 | 1 output video+audio second = 2 output video seconds |
| Latest supported version: | 0.0080 | Video seconds | 17 | 1 output video second = 1 output video second |
| | | Video+audio seconds | 25 | 1 output video+audio second = 1.45 output video seconds |
| | 0.015 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.02 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.04 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.02 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.025 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| Imagen 3 Fast | 0.05 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
For information about a model's capabilities and input or output limits, see the documentation for the model.
You can upgrade to new models as they are made available. For information about model availability and discontinuation dates, see Google models.
For more information about supported locations, see Available locations.
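To illustrate how the burndown rates in the table above work, the following sketch converts a mixed-modality workload into input-text-token equivalents per second and estimates the GSUs to order, using the Gemini 2.5 Flash with Live API rates (1,620 tokens per second per GSU). The function name and workload numbers are illustrative, not part of any API.

```python
import math

# Burndown rates for Gemini 2.5 Flash with Live API (from the table above):
# each token type burns this many input-text-token equivalents.
BURNDOWN = {
    "input_text": 1,
    "input_audio": 6,
    "input_video": 6,
    "input_session_memory": 1,
    "output_text": 4,
    "output_audio": 24,
}
PER_GSU_THROUGHPUT = 1620  # token equivalents per second per GSU

def gsus_needed(tokens_per_second):
    """Return (burned tokens/sec, GSUs to purchase) for a workload dict."""
    burned = sum(BURNDOWN[kind] * rate for kind, rate in tokens_per_second.items())
    return burned, math.ceil(burned / PER_GSU_THROUGHPUT)

# Example: 500 input text, 200 input audio, and 100 output text tokens/sec
# burns 500 + 1,200 + 400 = 2,100 tokens/sec, which needs 2 GSUs.
burned, gsus = gsus_needed({"input_text": 500, "input_audio": 200, "output_text": 100})
```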
Partner models
The following table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, defined as the total of input and output tokens across all requests per second.
| Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Anthropic's Claude Sonnet 4.5 | 350 | 25 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens<br>Greater than or equal to 200,000 input tokens:<br>1 input token = 2 tokens<br>1 output token = 7.5 tokens<br>1 cache write token = 2.5 tokens<br>1 cache hit token = 0.2 tokens |
| Anthropic's Claude Opus 4.1 | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Haiku 4.5 | 1,050 | 8 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Opus 4 | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Sonnet 4 | 350 | 25 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens<br>Greater than or equal to 200,000 input tokens:<br>1 input token = 2 tokens<br>1 output token = 7.5 tokens<br>1 cache write token = 2.5 tokens<br>1 cache hit token = 0.2 tokens |
| Anthropic's Claude 3.7 Sonnet | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Sonnet v2 (deprecated) | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Haiku | 2,000 | 10 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3 Opus | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3 Haiku | 4,200 | 5 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Sonnet (deprecated) | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
For information about supported locations, see Anthropic Claude region availability. To order Provisioned Throughput for Anthropic models, contact your Google Cloud account representative.
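As a worked example of the tiered Claude burndown rates above, this sketch computes the tokens a single request burns against a Provisioned Throughput order, using the Claude Sonnet 4.5 rates; the function name is illustrative.

```python
# Burndown rates for Anthropic's Claude Sonnet 4.5 (from the table above).
# The higher tier applies when a request has 200,000 or more input tokens.
RATES_UNDER_200K = {"input": 1.0, "output": 5.0, "cache_write": 1.25, "cache_hit": 0.1}
RATES_200K_AND_UP = {"input": 2.0, "output": 7.5, "cache_write": 2.5, "cache_hit": 0.2}

def claude_burn(input_tokens, output_tokens, cache_write_tokens=0, cache_hit_tokens=0):
    """Tokens burned against Provisioned Throughput for one request."""
    rates = RATES_200K_AND_UP if input_tokens >= 200_000 else RATES_UNDER_200K
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]
            + cache_write_tokens * rates["cache_write"]
            + cache_hit_tokens * rates["cache_hit"])

# A 1,000-token prompt with 200 output tokens and 5,000 cache-hit tokens
# burns 1,000 + 1,000 + 500 = 2,500 tokens.
small_request = claude_burn(1_000, 200, cache_hit_tokens=5_000)
```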
Open models
The following table shows the throughput, purchase increment, and burndown rates for open models that support Provisioned Throughput.
| Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Latest supported version: | 6,725 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 8 tokens |
| Latest supported version: | 6,725 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 8 tokens |
| Latest supported version: | 1,010 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 4,035 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
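As a sizing sketch for the open-model rates above, the sustained request rate that an order supports can be estimated from the per-GSU throughput and the burndown rates. This uses the 6,725 tokens/sec rows (1 input token = 1 token, 1 output token = 8 tokens); the function and workload numbers are illustrative.

```python
PER_GSU_THROUGHPUT = 6725  # tokens/sec per GSU (example row from the table)
INPUT_BURNDOWN = 1         # 1 input text token = 1 token
OUTPUT_BURNDOWN = 8        # 1 output text token = 8 tokens

def sustained_requests_per_second(gsus, avg_input_tokens, avg_output_tokens):
    """Estimate the requests/sec a Provisioned Throughput order sustains."""
    capacity = gsus * PER_GSU_THROUGHPUT
    burned_per_request = (avg_input_tokens * INPUT_BURNDOWN
                          + avg_output_tokens * OUTPUT_BURNDOWN)
    return capacity / burned_per_request

# 2 GSUs give 13,450 tokens/sec of capacity; a request averaging 1,000 input
# and 500 output tokens burns 5,000 tokens, so about 2.69 requests/sec.
rate = sustained_requests_per_second(2, 1_000, 500)
```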
Available capabilities for Google and open models
The following table lists the capabilities that are available with Provisioned Throughput for Google models and open models:
| Capability | Google models | Open models (preview) |
|---|---|---|
| Order through Google Cloud console | Yes | Yes |
| Supports global endpoints | See Global endpoint model support. | See Global endpoint model support. |
| Supports supervised fine-tuned models | Yes | No |
| Supports API key usage | Yes | No |
| Integrated with implicit context caching | Yes | Not applicable |
| Integrated with explicit context caching | No | Not applicable |
| ML processing | Available in specific regions. For details, see Single Zone Provisioned Throughput. | Not applicable |
| Available order terms | 1 week, 1 month, 3 months, and 1 year | 1 month, 3 months, and 1 year |
| Change order from the console | Yes | No |
| Order statuses: pending review, approved, active, expired | Yes | Yes |
| Overages spillover to pay-as-you-go by default | Yes | Yes |
| API header control: use "dedicated" to use only Provisioned Throughput or "shared" to use only pay-as-you-go | Yes | Yes |
| Monitoring: metrics, dashboards, and alerting | Yes | Yes |
Global endpoint model support
Provisioned Throughput supports the global endpoint for Google models and open models.
Traffic that exceeds the Provisioned Throughput quota uses the global endpoint by default.
To assign Provisioned Throughput to the global endpoint of a model,
select global as the region when you place a Provisioned Throughput order.
Google models with global endpoint support
The following table lists the Google models for which Provisioned Throughput supports the global endpoint:
| Model | Latest supported model version |
|---|---|
| Gemini 2.5 Flash Image | gemini-2.5-flash-image |
| Gemini 2.5 Flash-Lite | |
| Gemini 2.5 Pro | gemini-2.5-pro |
| Gemini 2.5 Flash | |
| Gemini 2.0 Flash | gemini-2.0-flash-001 |
| Gemini 2.0 Flash-Lite | gemini-2.0-flash-lite-001 |
Open models with global endpoint support
The following table lists the open models for which Provisioned Throughput supports the global endpoint:
| Model | Latest supported model version |
|---|---|
| Qwen3-Next-80B Instruct | qwen3-next-80b-a3b-instruct-maas |
| Qwen3-Next-80B Thinking | qwen3-next-80b-a3b-thinking-maas |
| Qwen3 Coder | qwen3-coder-480b-a35b-instruct-maas |
Supervised fine-tuned model support
The following is supported for Google models that support supervised fine-tuning:
Provisioned Throughput can be applied to both base models and supervised fine-tuned versions of those base models.
Supervised fine-tuned model endpoints and their corresponding base model count toward the same Provisioned Throughput quota.
For example, Provisioned Throughput purchased for gemini-2.0-flash-lite-001 in a specific project prioritizes requests made from supervised fine-tuned versions of gemini-2.0-flash-lite-001 created within that project. Use the appropriate header to control traffic behavior.
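The header mentioned above controls whether a call draws only on Provisioned Throughput ("dedicated") or only on pay-as-you-go ("shared"). Here's a minimal sketch for building such a request; the header name X-Vertex-AI-LLM-Request-Type is an assumption based on the capabilities table, and the helper function is illustrative.

```python
from typing import Optional

# Assumed REST header name for the "dedicated"/"shared" traffic control
# described in the capabilities table; confirm against the Provisioned
# Throughput usage documentation.
REQUEST_TYPE_HEADER = "X-Vertex-AI-LLM-Request-Type"

def build_headers(access_token: str, request_type: Optional[str] = None) -> dict:
    """Build REST headers, optionally pinning traffic to one quota pool."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    if request_type is not None:
        if request_type not in ("dedicated", "shared"):
            raise ValueError("request_type must be 'dedicated' or 'shared'")
        # "dedicated": only Provisioned Throughput; "shared": only pay-as-you-go.
        headers[REQUEST_TYPE_HEADER] = request_type
    return headers
```

Omitting the header keeps the default behavior, where overages spill over to pay-as-you-go.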