Control costs with token quotas

This document describes how you can define and manage daily limits on the number of input and output tokens consumed by generative AI functions.

BigQuery generative AI functions use large language models (LLMs) to perform advanced analysis within your SQL queries. Because LLM usage is typically billed based on the number of tokens processed, BigQuery provides token quotas to help you manage and control the costs associated with using these functions.

The token quotas apply to all BigQuery SQL functions that perform generative AI inference with Gemini LLMs, such as the AI.CLASSIFY and AI.GENERATE functions.

Quota details

BigQuery provides the following daily quotas based on LLM token usage. Token usage directly correlates with your Vertex AI billing for BigQuery generative AI functions that use Gemini models. These quotas are tracked globally across all regions.

These token quotas govern the number of input and output tokens processed by the LLMs for generative AI functions:

  • Input tokens: Tokens sent to the model for processing. This includes tokens in prompt text and any other data provided to the model as input.
  • Output tokens: Tokens generated by the model in its response. This includes tokens in generated text (candidate tokens) and tokens generated during internal reasoning steps (thought tokens).

Quota name                      Metric                                     Scope                Default value
GenAiInputTokensPerDay          Input tokens used by the LLM               Per day per project  200,000,000,000
GenAiInputTokensPerUserPerDay   Input tokens used by the LLM               Per day per user     40,000,000,000
GenAiOutputTokensPerDay         Output and thought tokens used by the LLM  Per day per project  20,000,000,000
GenAiOutputTokensPerUserPerDay  Output and thought tokens used by the LLM  Per day per user     4,000,000,000

These quotas are tracked in increments of millions of tokens. While you can set precise limits, values smaller than a few million tokens might not be reflected with perfect accuracy because of the nature of token reporting and aggregation.

Cached tokens don't count towards the quotas.
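
To get a feel for what the default values allow, the following Python sketch does the arithmetic for a workload. It is an illustration only: the helper names and the figure of 500 input tokens per row are assumptions, not part of BigQuery, and rounding a custom limit up to a whole million is one possible way to account for the million-token tracking granularity described above.

```python
# Default daily quotas from the table above, in tokens.
DEFAULT_QUOTAS = {
    "GenAiInputTokensPerDay": 200_000_000_000,
    "GenAiInputTokensPerUserPerDay": 40_000_000_000,
    "GenAiOutputTokensPerDay": 20_000_000_000,
    "GenAiOutputTokensPerUserPerDay": 4_000_000_000,
}

MILLION = 1_000_000


def max_rows_per_day(quota_tokens: int, tokens_per_row: int) -> int:
    """Return how many rows fit in a daily quota at a given per-row token cost.

    Hypothetical helper: tokens_per_row is an estimate that you supply.
    """
    if tokens_per_row <= 0:
        raise ValueError("tokens_per_row must be positive")
    return quota_tokens // tokens_per_row


def round_up_to_millions(tokens: int) -> int:
    """Round a desired limit up to the next whole million tokens.

    Hypothetical helper: quotas are tracked in increments of millions,
    so a limit expressed in whole millions is easier to reason about.
    """
    return -(-tokens // MILLION) * MILLION


if __name__ == "__main__":
    # Assumed workload: about 500 input tokens per row, run by a single user.
    rows = max_rows_per_day(DEFAULT_QUOTAS["GenAiInputTokensPerUserPerDay"], 500)
    print(f"Rows per user per day at 500 input tokens each: {rows:,}")
    print(f"Custom limit rounded up: {round_up_to_millions(2_500_001):,}")
```

Because cached tokens don't count towards the quotas, an estimate like this is an upper bound on quota consumption for workloads that benefit from caching.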

Manage quotas

Depending on your resource usage, you might want to view your token quota values or adjust them up or down. You can use the Google Cloud console to perform these tasks:

  1. In the Google Cloud console, go to the IAM & Admin > Quotas & System Limits page.

  2. Filter the quotas by entering Service: BigQuery API.

  3. Search for a specific quota from the list of quotas (for example, search for GenAiInputTokensPerDay).

  4. Click Edit.

  5. Increase or decrease the quota in the Quota changes pane by entering a new value.

    • If your workloads require more capacity than the default limit provides, you can request a quota increase.
    • If you want to place a stricter limit on your usage to prevent budget overruns, you can create a quota override to cap your usage.

  6. Click Submit request.

Quota enforcement behavior

BigQuery monitors your token consumption at multiple stages of query execution:

  • Pre-execution check: BigQuery checks the available token quota before executing a query that contains generative AI functions. If the relevant quota (for example, project daily input tokens) is already exhausted, the query is rejected with a QuotaExceeded error.
  • During execution: If a running query exhausts any of the configured quotas (input or output, per project or per user), new LLM calls within that query are rejected.
    • Any remaining rows that depend on LLM calls encounter a quota exhaustion error.
    • If a function such as AI.IF is called with the max_error_ratio argument, the query's outcome depends on that value. If the error ratio remains within the allowed limit, partial results might be returned. Otherwise, the entire query fails.
    • Subsequent queries attempting to use generative AI functions will fail with a QuotaExceeded error until the daily quota resets.
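
The enforcement stages above can be modeled with a small Python simulation. This is a simplified sketch, not BigQuery's implementation: the function and class names are hypothetical, it tracks a single quota, it assumes a call is admitted whenever any quota remains, and it treats "within the allowed limit" as less than or equal to max_error_ratio.

```python
from dataclasses import dataclass


@dataclass
class QueryOutcome:
    succeeded: bool
    rows_returned: int
    rows_failed: int


def simulate_query(rows: int, tokens_per_row: int, remaining_quota: int,
                   max_error_ratio: float = 0.0) -> QueryOutcome:
    """Model the quota checks for one query (hypothetical, simplified)."""
    # Pre-execution check: reject the query if the quota is already exhausted.
    if remaining_quota <= 0:
        raise RuntimeError("QuotaExceeded: daily token quota is exhausted")

    ok = 0
    failed = 0
    for _ in range(rows):
        if remaining_quota > 0:
            # Assumption: an LLM call is admitted while any quota remains.
            remaining_quota -= tokens_per_row
            ok += 1
        else:
            # During execution: new LLM calls are rejected once quota runs out.
            failed += 1

    error_ratio = failed / rows if rows else 0.0
    if error_ratio <= max_error_ratio:
        # Partial results: rows that completed before exhaustion are returned.
        return QueryOutcome(True, ok, failed)
    # Too many failed rows: the entire query fails.
    return QueryOutcome(False, 0, failed)


if __name__ == "__main__":
    print(simulate_query(10, 100, 800, max_error_ratio=0.2))
    print(simulate_query(10, 100, 800, max_error_ratio=0.1))
```

In this model, a 10-row query with quota for only 8 rows returns partial results when max_error_ratio is 0.2 and fails when it is 0.1. A later query would then hit the pre-execution check and be rejected until the daily quota resets.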

Important considerations

  • Global quotas: The defined quotas are global. Token usage is aggregated across all regions where your project operates, providing a unified cost control mechanism. This prevents unexpected charges from usage in different regions.
  • Provisioned throughput: If you are using Vertex AI models with provisioned throughput, billing is not based on token usage. You should set these BigQuery token quotas to a high value to avoid unnecessarily blocking your queries.

What's next