Control costs with token quotas
This document describes how you can define and manage daily limits on the number of input and output tokens consumed by generative AI functions. BigQuery generative AI functions use large language models (LLMs) to perform advanced analysis within your SQL queries. Because LLM usage is typically billed based on the number of tokens processed, BigQuery provides token quotas to help you manage and control the costs associated with using these functions.
The token quotas apply to BigQuery SQL functions designed for all generative AI inference tasks that use Gemini LLMs, such as the AI.CLASSIFY and AI.GENERATE functions.
Quota details
BigQuery provides the following daily quotas based on LLM token usage. Token usage directly correlates with your Vertex AI billing for BigQuery generative AI functions that use Gemini models. These quotas are tracked globally across all regions.
These token quotas govern the number of input and output tokens processed by the LLMs for generative AI functions:
- Input tokens: Tokens sent to the model for processing. This includes tokens in prompt text and any other data provided to the model as input.
- Output tokens: Tokens generated by the model in its response. This includes tokens in generated text (candidate tokens) and tokens generated during internal reasoning steps (thought tokens).
| Quota name | Metric | Scope | Default value |
|---|---|---|---|
| GenAiInputTokensPerDay | Input tokens used by the LLM | Per day per project | 200,000,000,000 |
| GenAiInputTokensPerUserPerDay | Input tokens used by the LLM | Per day per user | 40,000,000,000 |
| GenAiOutputTokensPerDay | Output and thought tokens used by the LLM | Per day per project | 20,000,000,000 |
| GenAiOutputTokensPerUserPerDay | Output and thought tokens used by the LLM | Per day per user | 4,000,000,000 |
These quotas are tracked in increments of millions of tokens. While you can set precise limits, values smaller than a few million tokens might not be reflected with perfect accuracy because of the nature of token reporting and aggregation.
Cached tokens don't count towards the quotas.
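The accounting rules above can be sketched in Python. This is a rough illustration, not a BigQuery API: the function names `counted_tokens` and `fits_within` are hypothetical, the limits are the default values from the table, and the sketch assumes that cached tokens are a subset of the prompt tokens.

```python
# Default daily limits copied from the quota table (tokens per day).
DEFAULT_QUOTAS = {
    "GenAiInputTokensPerDay": 200_000_000_000,
    "GenAiInputTokensPerUserPerDay": 40_000_000_000,
    "GenAiOutputTokensPerDay": 20_000_000_000,
    "GenAiOutputTokensPerUserPerDay": 4_000_000_000,
}


def counted_tokens(prompt_tokens, candidate_tokens, thought_tokens, cached_tokens=0):
    """Return (input, output) token counts that would count toward the quotas.

    Assumption: cached tokens are part of the prompt and are excluded from
    the input count. Output includes candidate and thought tokens.
    """
    counted_input = max(prompt_tokens - cached_tokens, 0)
    counted_output = candidate_tokens + thought_tokens
    return counted_input, counted_output


def fits_within(quota_name, used_today, needed, quotas=DEFAULT_QUOTAS):
    """Return True if `needed` more tokens stay within the named daily limit."""
    return used_today + needed <= quotas[quota_name]


# Hypothetical workload: 1M prompt tokens (100K cached), 200K candidate
# tokens, and 50K thought tokens.
input_used, output_used = counted_tokens(1_000_000, 200_000, 50_000, cached_tokens=100_000)
ok = fits_within("GenAiOutputTokensPerUserPerDay", 3_999_900_000, output_used)
```

Because the quotas are tracked in increments of millions of tokens, a check like this is only approximate for small workloads.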
Manage quotas
Depending on your resource usage, you might want to view or adjust your token quota values up or down. You can use the Google Cloud console to perform these tasks:
1. In the Google Cloud console, go to the IAM & Admin > Quotas & System Limits page.
2. Filter the quotas by entering Service: BigQuery API.
3. Search for a specific quota from the list of quotas (for example, search for GenAiInputTokensPerDay).
4. Click Edit.
5. Increase or decrease the quota in the Quota changes pane by entering a new value.
   - If your workloads require more capacity than the default limit provides, you can request a quota increase.
   - If you want to place a stricter limit on your usage to prevent budget overruns, you can create a quota override to cap your usage.
6. Click Submit request.
Quota enforcement behavior
BigQuery monitors your token consumption at multiple stages of query execution:
- Pre-execution check: BigQuery checks the available token quota before executing a query that contains generative AI functions. If the relevant quota (for example, project daily input tokens) is already exhausted, the query is rejected with a QuotaExceeded error.
- During execution: If a query is running and consumes tokens such that it exhausts any of the configured quotas (input or output, per project or per user), new LLM calls within that query are rejected.
  - Any remaining rows that depend on LLM calls encounter a quota exhaustion error.
  - The query's outcome depends on the max_error_ratio argument if used in functions like AI.IF. If the error ratio remains within the allowed limit, partial results might be returned. Otherwise, the entire query fails.
- Subsequent queries attempting to use generative AI functions will fail with a QuotaExceeded error until the daily quota resets.
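The enforcement stages described above can be modeled with a small Python simulation. This is a simplified sketch and not BigQuery's actual implementation: `run_query` and the `QuotaExceeded` class are hypothetical stand-ins, a single counter replaces the four separate quotas, and each row is assumed to make exactly one LLM call.

```python
class QuotaExceeded(Exception):
    """Stand-in for the QuotaExceeded error returned by BigQuery."""


def run_query(row_costs, remaining_tokens, max_error_ratio=0.0):
    """Simulate one query in which each row costs a number of tokens.

    Returns (results, remaining_tokens). Rows whose LLM call is rejected
    are reported as None.
    """
    # Pre-execution check: reject the query if the quota is already exhausted.
    if remaining_tokens <= 0:
        raise QuotaExceeded("daily quota already exhausted")

    results = []
    errors = 0
    for cost in row_costs:
        if remaining_tokens <= 0:
            # During execution: new LLM calls are rejected after exhaustion.
            results.append(None)
            errors += 1
            continue
        # The call that exhausts the quota still completes in this model.
        remaining_tokens -= cost
        results.append("ok")

    # The outcome depends on the share of failed rows.
    if row_costs and errors / len(row_costs) > max_error_ratio:
        raise QuotaExceeded("too many rows failed after quota exhaustion")
    return results, max(remaining_tokens, 0)


# One of four rows fails, which is within a 25% error allowance.
partial, left = run_query([100, 100, 100, 100], 250, max_error_ratio=0.25)
```

In this model, calling `run_query` again with the returned `left` value of 0 raises `QuotaExceeded` immediately, which mirrors how subsequent queries fail until the daily quota resets.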
Important considerations
- Global quotas: The defined quotas are global. Token usage is aggregated across all regions where your project operates, providing a unified cost control mechanism. This prevents unexpected charges from usage in different regions.
- Provisioned throughput: If you are using Vertex AI models with provisioned throughput, billing is not based on token usage. You should set these BigQuery token quotas to a high value to avoid unnecessarily blocking your queries.
What's next
- Learn more about optimizing AI function costs.
- Read an overview of generative AI in BigQuery.