This page introduces two ways to consume generative AI services, provides a list of quotas by region and model, and shows you how to view and edit your quotas in the Google Cloud console.
Overview
There are two ways to consume generative AI services. You can choose pay-as-you-go (PayGo), or you can pay in advance using Provisioned Throughput.
If you're using PayGo, your usage of generative AI features is subject to one of the following quota systems, depending on which model you're using:
- Models earlier than Gemini 2.0 use a standard quota system for each generative AI model to help ensure fairness and to reduce spikes in resource use and availability. Quotas apply to Generative AI on Vertex AI requests for a given Google Cloud project and supported region.
- Newer models use Dynamic shared quota (DSQ), which dynamically distributes available PayGo capacity among all customers for a specific model and region, removing the need to set quotas and to submit quota increase requests. There are no quotas with DSQ.
To help ensure high availability for your application and to get predictable service levels for your production workloads, see Provisioned Throughput.
Quota system by model
The following models support Dynamic shared quota (DSQ):
- Gemini 2.5 Flash (Preview)
- Gemini 2.5 Flash-Lite (Preview)
- Gemini 2.5 Flash Image
- Gemini 2.5 Flash-Lite
- Gemini 2.0 Flash with Live API (Preview)
- Gemini 2.0 Flash with image generation (Preview)
- Gemini 2.5 Pro
- Gemini 2.5 Flash
- Gemini 2.0 Flash
- Gemini 2.0 Flash-Lite
Non-Gemini and earlier Gemini models use the standard quota system. For more information, see Vertex AI quotas and limits.
MaaS third-party models use standard quotas. For more information, see each model's reference page: Use partner models.
Tuned model quotas
Tuned model inference shares the same quota as the base model. There is no separate quota for tuned model inference.
Text embedding limits
Each request can have up to 250 input texts (generating 1 embedding per input text) and 20,000 tokens per request. Only the first 2,048 tokens in each input text are used to compute the embeddings. For gemini-embedding-001, the quota is listed under the name gemini-embedding.
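The per-request limits above can be respected on the client side by grouping input texts before sending them. The following is a minimal sketch, not an official client: `batch_texts`, `will_be_truncated`, and the `count_tokens` callable are hypothetical names, and it assumes you supply your own token counter. It counts each text's full token length toward the 20,000-token limit, which is a conservative assumption because this page doesn't say how truncated texts are counted.

```python
from typing import Callable, Iterable, Iterator, List

MAX_TEXTS_PER_REQUEST = 250      # input texts per request (from this page)
MAX_TOKENS_PER_REQUEST = 20_000  # tokens per request (from this page)
MAX_TOKENS_PER_TEXT = 2_048      # only this many tokens per text are embedded


def will_be_truncated(text: str, count_tokens: Callable[[str], int]) -> bool:
    """Return True if the text is longer than the portion used for embedding."""
    return count_tokens(text) > MAX_TOKENS_PER_TEXT


def batch_texts(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],  # hypothetical: bring your own tokenizer
) -> Iterator[List[str]]:
    """Yield batches that stay within the per-request text and token limits."""
    batch: List[str] = []
    batch_tokens = 0
    for text in texts:
        tokens = count_tokens(text)
        would_overflow = (
            len(batch) >= MAX_TEXTS_PER_REQUEST
            or batch_tokens + tokens > MAX_TOKENS_PER_REQUEST
        )
        # Close the current batch before it would exceed either limit.
        if batch and would_overflow:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += tokens
    if batch:
        yield batch
```

A text that alone exceeds 20,000 tokens still ends up in its own batch in this sketch, so you might want to shorten such inputs before batching.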
Embed content input tokens per minute per base model
Unlike previous embedding models, which were primarily limited by requests per minute (RPM) quotas, the quota for the Gemini Embedding model limits the number of tokens that can be sent per minute per project.
| Quota | Value | 
|---|---|
| Embed content input tokens per minute | 5,000,000 | 
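
One way to pace requests against a tokens-per-minute quota on the client side is a token bucket. The sketch below is illustrative only: `TokenBudget` and its methods are hypothetical names, and the way the service itself measures the one-minute window is not described on this page, so treat the pacing as approximate.

```python
import time
from typing import Callable


class TokenBudget:
    """Token bucket that refills continuously up to a per-minute capacity."""

    def __init__(
        self,
        tokens_per_minute: int = 5_000_000,  # default quota listed on this page
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.available = self.capacity
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        # Add back the tokens earned since the last check, capped at capacity.
        now = self._clock()
        earned = (now - self._last) * self.refill_per_second
        self.available = min(self.capacity, self.available + earned)
        self._last = now

    def seconds_until_available(self, tokens: int) -> float:
        """Return how long to wait before `tokens` can be spent (0.0 if now)."""
        if tokens > self.capacity:
            raise ValueError("request exceeds the per-minute budget")
        self._refill()
        deficit = tokens - self.available
        return max(0.0, deficit / self.refill_per_second)

    def spend(self, tokens: int) -> bool:
        """Deduct `tokens` if the budget allows it and report whether it did."""
        self._refill()
        if tokens > self.available:
            return False
        self.available -= tokens
        return True
```

A caller could check `seconds_until_available(n)`, sleep for that long, and then call `spend(n)` before sending a request with `n` input tokens. Because a full bucket allows a burst followed by steady refill, a stricter budget (for example, a lower `tokens_per_minute`) might be needed if other clients share the same project quota.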
Vertex AI Agent Engine limits
The following limits apply to Vertex AI Agent Engine for a given project in each region:
| Description | Limit | 
|---|---|
| Create, delete, or update Vertex AI Agent Engine per minute | 10 | 
| Create, delete, or update Vertex AI Agent Engine sessions per minute | 100 | 
| Query or StreamQuery Vertex AI Agent Engine per minute | 90 | 
| Append event to Vertex AI Agent Engine sessions per minute | 300 | 
| Maximum number of Vertex AI Agent Engine resources | 100 | 
| Create, delete, or update Vertex AI Agent Engine memory resources per minute | 100 | 
| Get, list, or retrieve from Vertex AI Agent Engine Memory Bank per minute | 300 | 
| Sandbox environment (Code Execution) execute requests per minute | 1000 | 
| Sandbox environment (Code Execution) entities per region | 1000 | 
| A2A Agent post requests like sendMessage and cancelTask per minute | 60 | 
| A2A Agent get requests like getTask and getCard per minute | 600 | 
| Concurrent live bidirectional connections using the BidiStreamQuery API per minute | 10 | 
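
Per-minute request limits like those in the table can be mirrored in client code with a sliding-window counter so that calls are held back before the service rejects them. This is a minimal sketch under stated assumptions: `SlidingWindowLimiter` and the dictionary keys are hypothetical labels, only a few rows of the table are copied, and the service's own accounting of the one-minute window might differ.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict

# A few per-minute limits copied from the table; the keys are hypothetical labels.
AGENT_ENGINE_RPM: Dict[str, int] = {
    "query": 90,
    "append_event": 300,
    "a2a_post": 60,
    "a2a_get": 600,
}


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any trailing `window_seconds`."""

    def __init__(
        self,
        max_calls: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True, or return False if the window is full."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            return False
        self._stamps.append(now)
        return True


# One limiter per operation, sized from the table above.
limiters = {name: SlidingWindowLimiter(rpm) for name, rpm in AGENT_ENGINE_RPM.items()}
```

Before each call, code could check `limiters["query"].try_acquire()` and delay the request when it returns `False`. The limits apply per project and region, so a limiter local to one process is only a partial safeguard when several processes share a project.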
Batch prediction
The quotas and limits for batch inference jobs are the same across all regions.
Concurrent batch inference job limits for Gemini models
There are no predefined quota limits on batch inference for Gemini models. Instead, the batch service provides access to a large, shared pool of resources, dynamically allocated based on the model's real-time availability and demand across all customers for that model. When more customers are active and saturate the model's capacity, your batch requests might be queued for capacity.
Concurrent batch inference job quotas for non-Gemini models
The following table lists the quotas for the number of concurrent batch inference jobs, which don't apply to Gemini models:
| Quota | Value | 
|---|---|
| aiplatform.googleapis.com/textembedding_gecko_concurrent_batch_prediction_jobs | 4 | 
View and edit the quotas in the Google Cloud console
To view and edit the quotas in the Google Cloud console, do the following:
- Go to the Quotas and System Limits page.
- To adjust the quota, copy and paste the property aiplatform.googleapis.com/textembedding_gecko_concurrent_batch_prediction_jobs in the Filter. Press Enter.
- Click the three dots at the end of the row, and select Edit quota.
- Enter a new quota value in the pane, and click Submit request.
Vertex AI RAG Engine
For each service to perform retrieval-augmented generation (RAG) using RAG Engine, the following quotas apply, with the quota measured as requests per minute (RPM).
| Service | Quota | Metric | 
|---|---|---|
| RAG Engine data management APIs | 60 RPM | VertexRagDataService requests per minute per region | 
| RetrievalContexts API | 600 RPM | VertexRagService retrieve requests per minute per region | 
| base_model: textembedding-gecko | 1,500 RPM | Online prediction requests per base model per minute per region per base_model. An additional filter for you to specify is base_model: textembedding-gecko | 

The following limits apply:
| Service | Limit | Metric | 
|---|---|---|
| Concurrent ImportRagFiles requests | 3 RPM | VertexRagService concurrent import requests per region | 
| Maximum number of files per ImportRagFiles request | 10,000 | VertexRagService import rag files requests per region | 
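
To stay within the two import limits in the table, a large set of files can be split into chunks of at most 10,000 and imported with at most three requests in flight. The following sketch assumes a caller-supplied `import_batch` callable that wraps whatever client you use to send an ImportRagFiles request; `chunk`, `import_all`, and `import_batch` are hypothetical names and not part of any SDK.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

MAX_FILES_PER_IMPORT = 10_000  # files per ImportRagFiles request (from this page)
MAX_CONCURRENT_IMPORTS = 3     # concurrent ImportRagFiles requests (from this page)


def chunk(items: Sequence[str], size: int = MAX_FILES_PER_IMPORT) -> List[Sequence[str]]:
    """Split `items` into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def import_all(
    uris: Sequence[str],
    import_batch: Callable[[Sequence[str]], T],  # hypothetical wrapper around your client
) -> List[T]:
    """Run one import per chunk, never more than three at the same time."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_IMPORTS) as pool:
        # pool.map preserves chunk order and re-raises the first failure.
        return list(pool.map(import_batch, chunk(uris)))
```

The thread pool only bounds concurrency within one process. Because the limit is counted per region, other jobs in the same project that import files at the same time also count toward it.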
For more rate limits and quotas, see Generative AI on Vertex AI rate limits.
Gen AI evaluation service
The Gen AI evaluation service uses gemini-2.0-flash as a default judge model for model-based metrics. A single evaluation request for a model-based metric might result in multiple underlying requests to the Gen AI evaluation service. Each model's quota is calculated on a per-project basis, which means that any requests directed to gemini-2.0-flash for model inference and model-based evaluation contribute to the quota.
Quotas for the Gen AI evaluation service and the underlying judge model are shown in the following table:
| Request quota | Default quota | 
|---|---|
| Gen AI evaluation service requests per minute | 1,000 requests per project per region | 
| Online prediction requests per minute for base_model: gemini-2.0-flash | See Quotas by region and model. | 
If you receive an error related to quotas while using the Gen AI evaluation service, you might need to file a quota increase request. See View and manage quotas for more information.
| Limit | Value | 
|---|---|
| Gen AI evaluation service request timeout | 60 seconds | 
When you use the Gen AI evaluation service for the first time in a new project, you might experience an initial setup delay of up to two minutes. If your first request fails, wait a few minutes and then retry. Subsequent evaluation requests typically complete within 60 seconds.
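The wait-and-retry guidance for a first request in a new project can be wrapped in a small helper. This is a sketch under assumptions: `call_with_setup_retry` and the `request` callable are hypothetical names, the 120-second default wait is a guess based on the setup delay described above, and catching every `Exception` is a simplification that you would likely narrow to the error type your client raises.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_setup_retry(
    request: Callable[[], T],          # hypothetical: sends one evaluation request
    attempts: int = 3,
    wait_seconds: float = 120.0,       # assumption: roughly the initial setup delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, waiting and retrying if it fails, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return request()
        except Exception as error:  # simplification: narrow this in real code
            last_error = error
            # Wait only if another attempt remains.
            if attempt < attempts - 1:
                sleep(wait_seconds)
    assert last_error is not None
    raise last_error
```

The `sleep` parameter exists so that the wait can be replaced in tests. If requests keep failing after the retries, the cause is probably not the initial setup delay, and a quota error might call for a quota increase request instead.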
The maximum input and output tokens for model-based metrics depend on the model used as the judge model. See Google models for a list of models.
Vertex AI Pipelines quotas
Each tuning job uses Vertex AI Pipelines. For more information, see Vertex AI Pipelines quotas and limits.
What's next
- To learn more about dynamic shared quota, see Dynamic shared quota.
- To learn about quotas and limits for Vertex AI, see Vertex AI quotas and limits.
- To learn more about Google Cloud quotas and system limits, see the Cloud Quotas documentation.