The AI.AGG function
The AI.AGG function uses a Vertex AI Gemini model to
aggregate data based on natural language instructions that you provide. The
function can process text or image data and returns a single string.
Use cases
The following are common use cases for the AI.AGG function:
- Sentiment analysis of user reviews. What is the overall sentiment of user feedback? What are the common positive and negative aspects across reviews? How do users feel about specific aspects of the product (size, price, value)?
- Content summarization of multimodal data. What kinds of animals are captured in these images? What are the most common shoe brands in the user-uploaded pictures?
- Analysis of AI agent performance. What kinds of tasks did the agent complete the fastest? What kinds of tasks required the most user interactions before completion?
- Log analysis. What parts of the application did users interact with the most? What were the most common final user actions before disconnecting?
We recommend the AI.AGG function for aggregation tasks instead of the
AI.GENERATE
and
AI.GENERATE_TEXT
general-purpose AI functions for the following reasons:
- The
AI.AGGfunction automatically performs multi-level aggregation through batching data and aggregating the batch results. Because of this process, theAI.AGGfunction is capable of analyzing data that exceeds the size of the standard Gemini context window. The general-purpose AI functions require manually writing multi-level batching logic to achieve similar results. - The
AI.AGGfunction returns one string result per input group, while theAI.GENERATEfunction returns one result per input row and theAI.GENERATE_TEXTfunction returns a table with additional information.
Syntax
AI.AGG( [ DISTINCT ] INPUT, INSTRUCTION [, connection_id => 'CONNECTION'] [, endpoint => 'ENDPOINT'] )
Arguments
The AI.AGG function takes the following arguments:
INPUT: the data to be aggregated, as either aSTRINGvalue or aSTRUCTobject consisting ofSTRINGvalues,ObjectRefRuntimevalues, and arrays ofSTRINGandObjectRefRuntimevalues.ObjectRefRuntimevalues reference text or image data in Cloud Storage and are generated by theOBJ.GET_ACCESS_URLfunction. Text and image data must meet the technical specifications for the version of Gemini that you plan to use.For the best performance, we recommend an input of no more than 20 million rows per query and no more than 1,000 distinct groups. Queries that exceed these limits might time out and not return any results.
INSTRUCTION: the natural language aggregation prompt to apply to the input data, as either a string or query parameter.CONNECTION: a Cloud resource connection in the[PROJECT_ID.]LOCATION.CONNECTION_IDformat. This connection is used to communicate with Vertex AI and Cloud Storage. If you don't specify a connection, then your end-user credentials are used.ENDPOINT: the Vertex AI endpoint to use, which can consist of any Gemini model that doesn't require thinking budget. If you only include the model ID, BigQuery automatically identifies and uses the full endpoint in the region where the query is processed. If you don't specify any endpoint information, BigQuery chooses a model for you.
Output
The AI.AGG function returns a STRING value. If you use the AI.AGG
function with a GROUP_BY statement, then the function returns a STRING value
for each input group.
If a call to Vertex AI is unsuccessful for any reason, such as
exceeding a quota or model unavailability, then the function returns partial
results. If all calls to Vertex AI fail, or if all the input rows
are invalid, then the function returns the NULL value.
Function output is capped at 10,000 tokens per group.
Examples
These examples use the AI.AGG function for various sentiment analysis,
content summarization, and data exploration use cases.
User sentiment analysis
The following example uses the AI.AGG function to analyze viewer sentiment
for a set of movies:
SELECT title, movie_id, AI.AGG( review, 'You will be given user-provided reviews of a movie. Summarize the overall sentiment towards the movie.' ) AS sentiment FROM `bigquery-public-data.imdb.reviews` WHERE movie_id IN ('tt0339384', 'tt0084787', 'tt0029850') GROUP BY movie_id, title;
The result is similar to the following:
+---------------+-----------+--------------------------------------------------+ | title | movie_id | sentiment | +---------------+-----------+--------------------------------------------------+ | Mickey's | tt0339384 | The movie is overwhelmingly well-received, with | | PhilharMagic | | reviewers consistently praising its spectacular | | | | 3D effects as the best they've seen. Many | | | | found it incredibly joyful and exciting, | | | | bringing smiles to their faces and leaving them | | | | breathless. The musical numbers and beloved | | | | Disney characters also significantly contributed | | | | to this positive experience, with some viewers | | | | even watching the show multiple times. | +---------------+-----------+--------------------------------------------------+ | Alexander | tt0029850 | Overall, the sentiment towards | | Nevsky | | "Alexander Nevsky" is overwhelmingly positive, | | | | with many reviewers calling it a cinematic | | | | masterpiece and a work of art. People especially | | | | love the stunning visuals, particularly the | | | | iconic battle on the ice, and Prokofiev's | | | | magnificent musical score. The film is also | | | | highly regarded for its historical and artistic | | | | significance, with many noting how it transcends | | | | its propagandistic elements to become great art. | | | | While some acknowledge the film's age and | | | | its overt propaganda, these points are generally | | | | outweighed by its artistic achievements and | | | | lasting impact. | +---------------+-----------+--------------------------------------------------+ | The Thing | tt0084787 | The overall sentiment towards "The Thing" is | | | | overwhelmingly positive, with many reviewers | | | | hailing it as a masterpiece and one of the best | | | | horror/sci-fi films ever made. People | | | | consistently praise its intense tension and | | | | paranoia, the groundbreaking and enduring | | | | quality of its practical special effects, and | | | | its intelligent, suspenseful story. John | | | | Carpenter's direction and the strong | | | | ensemble acting, particularly Kurt | | | | Russell's performance, are also frequently | | | | highlighted. Reviewers often note that the film | | | | has aged incredibly well, remaining fresh and | | | | impactful decades later. While there were some | | | | initial criticisms regarding its box office | | | | performance and a few minor points about the | | | | score or character depth, these are largely | | | | overshadowed by the widespread acclaim for its | | | | psychological depth, terrifying effects, and | | | | lasting suspense. | +---------------+-----------+--------------------------------------------------+
Common categories
The following example uses the AI.AGG function to determine the most-used
languages in a set of comments:
WITH sample_data AS ( SELECT title, comment FROM `bigquery-public-data.samples.wikipedia` WHERE comment IS NOT NULL LIMIT 5000 ) SELECT AI.AGG( TO_JSON_STRING(t), 'These are wikipedia comments. What are the most used languages?' ) AS most_used_language FROM sample_data as t;
The result is similar to the following:
+-----------------------------------------------------------------------------+ | most_used_language | +-----------------------------------------------------------------------------+ | The most used language in the Wikipedia comments is English, by a | | significant margin. Other languages that appear frequently include French, | | German, Italian, and Dutch. | +-----------------------------------------------------------------------------+
Image content summarization
The following example uses the AI.AGG function to summarize the contents of a
set of images:
# Create a dataset CREATE SCHEMA IF NOT EXISTS cymbal_pets; # Create an object table CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images WITH CONNECTION us.example_connection OPTIONS ( object_metadata = 'SIMPLE', uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png'] ); # Summarize content of images in the object table SELECT AI.AGG( STRUCT(OBJ.GET_ACCESS_URL(ref, 'r')), 'What are the major categories of these images?' ) AS category_description FROM `cymbal_pets.product_images`;
The result is similar to the following:
+-----------------------------------------------------------------------------+ | category_description | +-----------------------------------------------------------------------------+ | The images primarily fall into the category of Pet Supplies. Within this | | main category, the most common types are Pet Food, Pet Accessories, | | Aquarium Supplies, Pet Care Products (non-food), Pet Toys, and Pet | | Habitats. There are also a few images in an "Other" category and one image | | of Craft Supplies. | +-----------------------------------------------------------------------------+
Most frequent items
The following example uses the AI.AGG function to find the most referenced
companies in a set of news articles:
SELECT AI.AGG( TO_JSON_STRING(t), 'In the following tech news articles, what are the top 3 most referenced companies, and what are they famous for?' ) AS tech_news_companies FROM (SELECT * FROM `bigquery-public-data.bbc_news.fulltext` LIMIT 30000) AS t WHERE t.category = 'tech';
The result is similar to the following:
+-----------------------------------------------------------------------------+ | tech_news_companies | +-----------------------------------------------------------------------------+ | Here are the top 3 most referenced companies across all the provided tech | | news, along with what they are famous for: | | | | 1. **Microsoft** (referenced 64 times) | | **Famous for:** Its dominant **Windows operating system**, the | | **Xbox gaming console** , and a wide array of **software products** | | (including Office, Internet Explorer, and security tools). Microsoft is | | consistently mentioned for its efforts in computer security, combating | | piracy, and its significant presence in both consumer electronics | | and enterprise solutions. | | | | 2. **Apple** (referenced 42 times) | | **Famous for:** Its innovative and design-led | | **consumer electronics**, particularly the highly successful | | **iPod digital music player** and **Macintosh computers** (including | | the Mac Mini and Powerbook). Apple is also known for its iTunes online | | store, its secrecy regarding new products, and its legal battles to | | protect intellectual property. | | | | 3. **Sony** (referenced 30 times) | | **Famous for:** Its prominent role in **consumer electronics** | | and the **gaming industry**, especially with its | | **PlayStation series of consoles** (PS2, PS3, PSP). Sony is also | | recognized for its involvement in DVD technology (Blu-ray Disc), | | humanoid robots (Qrio), and its broader range of consumer electronics | | products. | +-----------------------------------------------------------------------------+
Text summarization
The following example uses the AI.AGG function to summarize blog topics that
are grouped by author:
WITH sample_data AS ( SELECT * FROM `bigquery-public-data.hacker_news.full` AS t WHERE t.text IS NOT NULL AND t.type = 'story' AND t.by IN UNNEST(['amichail', 'kids_book', 'whoishiring', 'michaelmayer']) LIMIT 30000 ) SELECT t.by, AI.AGG( t.text, 'These are the hacker news blog posts of a single user. Tell me what topics the user talks about the most. Keep it concise.' ) AS common_topic FROM sample_data AS t GROUP BY t.by;
The result is similar to the following:
+--------------+---------------------------------------------------------------+
| by | common_topic |
+--------------+---------------------------------------------------------------+
| amichail | The user talks most about Artificial Intelligence and |
| | Machine Learning, followed by general Technology and |
| | Software Development (especially the Apple ecosystem), and |
| | Gaming and Game Development. They also frequently discuss |
| | social dynamics, human behavior, and education. |
+--------------+---------------------------------------------------------------+
| whoishiring | This user primarily discusses job-related topics on Hacker |
| | News. They frequently provide guidelines for job postings |
| | (for companies), job seekers ("Who wants to be hired?"), and |
| | freelancers. Remote work is a significant sub-theme, and |
| | they show a specific interest in connecting refugees with |
| | employment opportunities. They also touch on general Hacker |
| | News community guidelines and job search resources. |
+--------------+---------------------------------------------------------------+
| michaelmayer | The user primarily discusses mobile technology (especially |
| | Apple and Android), web design and development, social media |
| | (particularly Twitter and Facebook), giveaways, bundles, and |
| | inspiration and creativity. |
+--------------+---------------------------------------------------------------+
| kids_book | The user most frequently discusses anti-bullying, often |
| | promoting the children's book "Butterfly Me" as a resource. |
| | They also frequently talk about children's books in general, |
| | particularly those by Evon Latrail, including "When Mommy |
| | Went to Heaven" which addresses death and grief for children. |
| | Giveaways and promotions, especially for "Butterfly Me," |
| | are also a recurring theme. |
+--------------+---------------------------------------------------------------+
Locations
You can run the AI.AGG function in any BigQuery region as long
as your endpoint location supports the Gemini model that you're
using. If your query processing region doesn't have a supported
Gemini endpoint, you must specify a full endpoint that is
supported, such as the model's global endpoint. For more information, see
Google model endpoint locations.
Manage inference costs
Inference using the Vertex AI Gemini model can be a
relatively expensive operation. Due to the nature of query processing in
BigQuery, the actual number of rows processed by the model might differ
from what you expect, particularly when running complex queries, such as
JOIN or ORDER BY ... LIMIT clauses. To strictly control the number of rows
processed by your complex queries, we
recommended that you write the results of your query
to a separate table beforehand, and then
perform the Gemini inference directly on that materialized table.
For information about how to view inference charges that you incur in Vertex AI, see
Track costs.
Because the AI.AGG function uses batching and aggregation to analyze data that
exceeds the size of the standard Gemini context window, the total
number of tokens that are sent to Gemini is greater than the
number of tokens in the source data.
Quotas and limits
For quota and limit information, see Generative AI functions in the BigQuery quotas and limits reference.
Known issues
The AI.AGG function has the following known issues:
- If a single input row has 10 or more images, it might be skipped.
- Input rows with arrays of
ObjectRefRuntimeobjects that call theOBJ.GET_ACCESS_URLfunction might be skipped. - If you're using Workforce Identity Federation without a specified Cloud resource connection and your query takes more than a few minutes to run, it might fail unexpectedly.
- Queries that use Gemini 3.0 or 3.1 with connection-based authentication might receive an unauthenticated error. To resolve this issue, use end-user credentials.
What's next
- For more information about using generative AI functions in BigQuery, see Generative AI overview.
- For more information about how to apply generative AI models to your workflow, see End-to-end user journeys for generative AI models.
- For information about building task-specific solutions with generative AI functions through the Cloud AI APIs, see Task-specific solutions overview.