The AI.AGG function

The AI.AGG function uses a Vertex AI Gemini model to aggregate data based on natural language instructions that you provide. The function can process text or image data and returns a single string.

Use cases

The following are common use cases for the AI.AGG function:

  • Sentiment analysis of user reviews. What is the overall sentiment of user feedback? What are the common positive and negative aspects across reviews? How do users feel about specific aspects of the product (size, price, value)?
  • Content summarization of multimodal data. What kinds of animals are captured in these images? What are the most common shoe brands in the user-uploaded pictures?
  • Analysis of AI agent performance. What kinds of tasks did the agent complete the fastest? What kinds of tasks required the most user interactions before completion?
  • Log analysis. What parts of the application did users interact with the most? What were the most common final user actions before disconnecting?

We recommend the AI.AGG function for aggregation tasks instead of the AI.GENERATE and AI.GENERATE_TEXT general-purpose AI functions for the following reasons:

  • The AI.AGG function automatically performs multi-level aggregation through batching data and aggregating the batch results. Because of this process, the AI.AGG function is capable of analyzing data that exceeds the size of the standard Gemini context window. The general-purpose AI functions require manually writing multi-level batching logic to achieve similar results.
  • The AI.AGG function returns one string result per input group, while the AI.GENERATE function returns one result per input row and the AI.GENERATE_TEXT function returns a table with additional information.

Syntax

AI.AGG(
  [ DISTINCT ]
  INPUT,
  INSTRUCTION
  [, connection_id => 'CONNECTION']
  [, endpoint => 'ENDPOINT']
)

Arguments

The AI.AGG function takes the following arguments:

  • INPUT: the data to be aggregated, as either a STRING value or a STRUCT object consisting of STRING values, ObjectRefRuntime values, and arrays of STRING and ObjectRefRuntime values. ObjectRefRuntime values reference text or image data in Cloud Storage and are generated by the OBJ.GET_ACCESS_URL function. Text and image data must meet the technical specifications for the version of Gemini that you plan to use.

    For the best performance, we recommend an input of no more than 20 million rows per query and no more than 1,000 distinct groups. Queries that exceed these limits might time out and not return any results.

  • INSTRUCTION: the natural language aggregation prompt to apply to the input data, as either a string or query parameter.

  • CONNECTION: a Cloud resource connection in the [PROJECT_ID.]LOCATION.CONNECTION_ID format. This connection is used to communicate with Vertex AI and Cloud Storage. If you don't specify a connection, then your end-user credentials are used.

  • ENDPOINT: the Vertex AI endpoint to use, which can consist of any Gemini model that doesn't require thinking budget. If you only include the model ID, BigQuery automatically identifies and uses the full endpoint in the region where the query is processed. If you don't specify any endpoint information, BigQuery chooses a model for you.

Output

The AI.AGG function returns a STRING value. If you use the AI.AGG function with a GROUP_BY statement, then the function returns a STRING value for each input group.

If a call to Vertex AI is unsuccessful for any reason, such as exceeding a quota or model unavailability, then the function returns partial results. If all calls to Vertex AI fail, or if all the input rows are invalid, then the function returns the NULL value.

Function output is capped at 10,000 tokens per group.

Examples

These examples use the AI.AGG function for various sentiment analysis, content summarization, and data exploration use cases.

User sentiment analysis

The following example uses the AI.AGG function to analyze viewer sentiment for a set of movies:

SELECT
  title,
  movie_id,
  AI.AGG(
    review,
    'You will be given user-provided reviews of a movie. Summarize the overall sentiment towards the movie.'
  ) AS sentiment
FROM `bigquery-public-data.imdb.reviews`
WHERE movie_id IN ('tt0339384', 'tt0084787', 'tt0029850')
GROUP BY movie_id, title;

The result is similar to the following:

+---------------+-----------+--------------------------------------------------+
|     title     | movie_id  |           sentiment                              |
+---------------+-----------+--------------------------------------------------+
| Mickey's      | tt0339384 | The movie is overwhelmingly well-received, with  |
| PhilharMagic  |           | reviewers consistently praising its spectacular  |
|               |           | 3D effects as the best they've seen. Many        |
|               |           | found it incredibly joyful and exciting,         |
|               |           | bringing smiles to their faces and leaving them  |
|               |           | breathless. The musical numbers and beloved      |
|               |           | Disney characters also significantly contributed |
|               |           | to this positive experience, with some viewers   |
|               |           | even watching the show multiple times.           |
+---------------+-----------+--------------------------------------------------+
|   Alexander   | tt0029850 | Overall, the sentiment towards                   |
|   Nevsky      |           | "Alexander Nevsky" is overwhelmingly positive,   |
|               |           | with many reviewers calling it a cinematic       |
|               |           | masterpiece and a work of art. People especially |
|               |           | love the stunning visuals, particularly the      |
|               |           | iconic battle on the ice, and Prokofiev's        |
|               |           | magnificent musical score. The film is also      |
|               |           | highly regarded for its historical and artistic  |
|               |           | significance, with many noting how it transcends |
|               |           | its propagandistic elements to become great art. |
|               |           | While some acknowledge the film's age and        |
|               |           | its overt propaganda, these points are generally |
|               |           | outweighed by its artistic achievements and      |
|               |           | lasting impact.                                  |
+---------------+-----------+--------------------------------------------------+
|   The Thing   | tt0084787 | The overall sentiment towards "The Thing" is     |
|               |           | overwhelmingly positive, with many reviewers     |
|               |           | hailing it as a masterpiece and one of the best  |
|               |           | horror/sci-fi films ever made. People            |
|               |           | consistently praise its intense tension and      |
|               |           | paranoia, the groundbreaking and enduring        |
|               |           | quality of its practical special effects, and    |
|               |           | its intelligent, suspenseful story. John         |
|               |           | Carpenter's direction and the strong             |
|               |           | ensemble acting, particularly Kurt               |
|               |           | Russell's performance, are also frequently       |
|               |           | highlighted. Reviewers often note that the film  |
|               |           | has aged incredibly well, remaining fresh and    |
|               |           | impactful decades later. While there were some   |
|               |           | initial criticisms regarding its box office      |
|               |           | performance and a few minor points about the     |
|               |           | score or character depth, these are largely      |
|               |           | overshadowed by the widespread acclaim for its   |
|               |           | psychological depth, terrifying effects, and     |
|               |           | lasting suspense.                                |
+---------------+-----------+--------------------------------------------------+

Common categories

The following example uses the AI.AGG function to determine the most-used languages in a set of comments:

WITH sample_data AS (
  SELECT title, comment
  FROM `bigquery-public-data.samples.wikipedia`
  WHERE comment IS NOT NULL
  LIMIT 5000
  )
SELECT
  AI.AGG(
    TO_JSON_STRING(t),
    'These are wikipedia comments. What are the most used languages?'
  ) AS most_used_language
FROM sample_data as t;

The result is similar to the following:

+-----------------------------------------------------------------------------+
|                          most_used_language                                 |
+-----------------------------------------------------------------------------+
| The most used language in the Wikipedia comments is English, by a           |
| significant margin. Other languages that appear frequently include French,  |
| German, Italian, and Dutch.                                                 |
+-----------------------------------------------------------------------------+

Image content summarization

The following example uses the AI.AGG function to summarize the contents of a set of images:

# Create a dataset
CREATE SCHEMA IF NOT EXISTS cymbal_pets;

# Create an object table
CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images
WITH CONNECTION us.example_connection
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png']
);

# Summarize content of images in the object table
SELECT
  AI.AGG(
    STRUCT(OBJ.GET_ACCESS_URL(ref, 'r')),
    'What are the major categories of these images?'
  ) AS category_description
FROM
  `cymbal_pets.product_images`;

The result is similar to the following:

+-----------------------------------------------------------------------------+
|                         category_description                                |
+-----------------------------------------------------------------------------+
| The images primarily fall into the category of Pet Supplies. Within this    |
| main category, the most common types are Pet Food, Pet Accessories,         |
| Aquarium Supplies, Pet Care Products (non-food), Pet Toys, and Pet          |
| Habitats. There are also a few images in an "Other" category and one image  |
| of Craft Supplies.                                                          |
+-----------------------------------------------------------------------------+

Most frequent items

The following example uses the AI.AGG function to find the most referenced companies in a set of news articles:

SELECT
  AI.AGG(
    TO_JSON_STRING(t),
    'In the following tech news articles, what are the top 3 most referenced companies, and what are they famous for?'
  ) AS tech_news_companies
FROM (SELECT * FROM `bigquery-public-data.bbc_news.fulltext` LIMIT 30000) AS t
WHERE t.category = 'tech';

The result is similar to the following:

+-----------------------------------------------------------------------------+
|                          tech_news_companies                                |
+-----------------------------------------------------------------------------+
| Here are the top 3 most referenced companies across all the provided tech   |
| news, along with what they are famous for:                                  |
|                                                                             |
| 1.  **Microsoft** (referenced 64 times)                                     |
|     **Famous for:** Its dominant **Windows operating system**, the          |
|     **Xbox gaming console** , and a wide array of **software products**     |
|     (including Office, Internet Explorer, and security tools). Microsoft is |
|     consistently mentioned for its efforts in computer security, combating  |
|     piracy, and its significant presence in both consumer electronics       |
|     and enterprise solutions.                                               |
|                                                                             |
| 2.  **Apple** (referenced 42 times)                                         |
|     **Famous for:** Its innovative and design-led                           |
|     **consumer electronics**, particularly the highly successful            |
|     **iPod digital music player** and **Macintosh computers** (including    |
|     the Mac Mini and Powerbook). Apple is also known for its iTunes online  |
|     store, its secrecy regarding new products, and its legal battles to     |
|     protect intellectual property.                                          |
|                                                                             |
| 3.  **Sony** (referenced 30 times)                                          |
|     **Famous for:** Its prominent role in **consumer electronics**          |
|     and the **gaming industry**, especially with its                        |
|     **PlayStation series of consoles** (PS2, PS3, PSP). Sony is also        |
|     recognized for its involvement in DVD technology (Blu-ray Disc),        |
|     humanoid robots (Qrio), and its broader range of consumer electronics   |
|     products.                                                               |
+-----------------------------------------------------------------------------+

Text summarization

The following example uses the AI.AGG function to summarize blog topics that are grouped by author:

WITH sample_data AS (
  SELECT *
  FROM `bigquery-public-data.hacker_news.full` AS t
  WHERE t.text IS NOT NULL
    AND t.type = 'story'
    AND t.by IN UNNEST(['amichail', 'kids_book', 'whoishiring', 'michaelmayer'])
  LIMIT 30000
  )
SELECT
  t.by,
  AI.AGG(
    t.text,
    'These are the hacker news blog posts of a single user. Tell me what topics the user talks about the most. Keep it concise.'
    ) AS common_topic
FROM sample_data AS t
GROUP BY t.by;

The result is similar to the following:

+--------------+---------------------------------------------------------------+
|     by       |                          common_topic                         |
+--------------+---------------------------------------------------------------+
|   amichail   | The user talks most about Artificial Intelligence and         |
|              | Machine Learning, followed by general Technology and          |
|              | Software Development (especially the Apple ecosystem), and    |
|              | Gaming and Game Development. They also frequently discuss     |
|              | social dynamics, human behavior, and education.               |
+--------------+---------------------------------------------------------------+
| whoishiring  | This user primarily discusses job-related topics on Hacker    |
|              | News. They frequently provide guidelines for job postings     |
|              | (for companies), job seekers ("Who wants to be hired?"), and  |
|              | freelancers. Remote work is a significant sub-theme, and      |
|              | they show a specific interest in connecting refugees with     |
|              | employment opportunities. They also touch on general Hacker   |
|              | News community guidelines and job search resources.           |
+--------------+---------------------------------------------------------------+
| michaelmayer | The user primarily discusses mobile technology (especially    |
|              | Apple and Android), web design and development, social media  |
|              | (particularly Twitter and Facebook), giveaways, bundles, and  |
|              | inspiration and creativity.                                   |
+--------------+---------------------------------------------------------------+
|  kids_book   | The user most frequently discusses anti-bullying, often       |
|              | promoting the children's book "Butterfly Me" as a resource.   |
|              | They also frequently talk about children's books in general,  |
|              | particularly those by Evon Latrail, including "When Mommy     |
|              | Went to Heaven" which addresses death and grief for children. |
|              | Giveaways and promotions, especially for "Butterfly Me,"      |
|              | are also a recurring theme.                                   |
+--------------+---------------------------------------------------------------+

Locations

You can run the AI.AGG function in any BigQuery region as long as your endpoint location supports the Gemini model that you're using. If your query processing region doesn't have a supported Gemini endpoint, you must specify a full endpoint that is supported, such as the model's global endpoint. For more information, see Google model endpoint locations.

Manage inference costs

Inference using the Vertex AI Gemini model can be a relatively expensive operation. Due to the nature of query processing in BigQuery, the actual number of rows processed by the model might differ from what you expect, particularly when running complex queries, such as JOIN or ORDER BY ... LIMIT clauses. To strictly control the number of rows processed by your complex queries, we recommended that you write the results of your query to a separate table beforehand, and then perform the Gemini inference directly on that materialized table. For information about how to view inference charges that you incur in Vertex AI, see Track costs.

Because the AI.AGG function uses batching and aggregation to analyze data that exceeds the size of the standard Gemini context window, the total number of tokens that are sent to Gemini is greater than the number of tokens in the source data.

Quotas and limits

For quota and limit information, see Generative AI functions in the BigQuery quotas and limits reference.

Known issues

The AI.AGG function has the following known issues:

  • If a single input row has 10 or more images, it might be skipped.
  • Input rows with arrays of ObjectRefRuntime objects that call the OBJ.GET_ACCESS_URL function might be skipped.
  • If you're using Workforce Identity Federation without a specified Cloud resource connection and your query takes more than a few minutes to run, it might fail unexpectedly.
  • Queries that use Gemini 3.0 or 3.1 with connection-based authentication might receive an unauthenticated error. To resolve this issue, use end-user credentials.

What's next