The AI.CLASSIFY function

This document describes the AI.CLASSIFY function, which uses a Vertex AI Gemini model to classify inputs into categories that you provide. BigQuery automatically structures your input to improve the quality of the classification.

The following are common use cases:

  • Retail: Classify reviews by sentiment or classify products by categories.
  • Text analysis: Classify support tickets or emails by topic.
  • Image analysis: Classify an image by its style or contents.

Input

AI.CLASSIFY accepts STRING and STRUCT input values, which are described in the Arguments section.

This function passes your input to a Gemini model and incurs charges in Vertex AI each time it's called. For information about how to view these charges, see Track costs.

Syntax

AI.CLASSIFY(
  [ input => ] INPUT,
  [ categories => ] CATEGORIES
  [, examples => EXAMPLES ]
  [, connection_id => 'CONNECTION' ]
  [, endpoint => 'ENDPOINT' ]
  [, output_mode => 'OUTPUT_MODE' ]
  [, embeddings => EMBEDDINGS ]
  [, optimization_mode => 'OPTIMIZATION_MODE' ]
  [, max_error_ratio => MAX_ERROR_RATIO ]
)
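
As a minimal sketch of the syntax, the following hypothetical call classifies a string literal into one of two illustrative categories:

```sql
SELECT
  AI.CLASSIFY('apples', categories => ['fruit', 'vegetable']) AS category;
```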

Arguments

AI.CLASSIFY takes the following arguments.

  • INPUT: a STRING or STRUCT value that specifies the input to classify. The input must be the first argument that you specify. You can provide the value in the following ways:
    • Specify a STRING value. For example, 'apples'
    • Specify a STRUCT value that contains one or more fields. You can use the following types of fields within the STRUCT value:
      • STRING or ARRAY<STRING>: a string literal, an array of string literals, or the name of a STRING column.

        String literal: 'apples'

        String column name: my_string_column

      • ObjectRefRuntime or ARRAY<ObjectRefRuntime>: an ObjectRefRuntime value returned by the OBJ.GET_ACCESS_URL function. The OBJ.GET_ACCESS_URL function takes an ObjectRef value as input, which you can provide either by specifying the name of a column that contains ObjectRef values or by constructing an ObjectRef value.

        ObjectRefRuntime values must have the access_url.read_url and details.gcs_metadata.content_type elements of the JSON value populated.

        Your input can contain at most one video object.

        Function call with an ObjectRef column: OBJ.GET_ACCESS_URL(my_objectref_column, 'r')

        Function call with a constructed ObjectRef value: OBJ.GET_ACCESS_URL(OBJ.MAKE_REF('gs://image.jpg', 'myconnection'), 'r')

      The function combines STRUCT fields similarly to a CONCAT operation, concatenating the fields in their specified order. The same is true for the elements of any arrays within the struct. The following examples show how STRUCT prompt values are interpreted:

      • The STRUCT<STRING, STRING, STRING> value ('crisp', color_column, 'apples') is semantically equivalent to 'crisp color_column apples'.

      • The STRUCT<STRING, ObjectRefRuntime> value ('Classify the following city', OBJ.GET_ACCESS_URL(image_objectref_column, 'r')) is semantically equivalent to 'Classify the following city image'.
  • CATEGORIES: the categories by which to classify the input. You can specify categories with or without descriptions:

    • With descriptions: Use an ARRAY<STRUCT<STRING, STRING>> value where each struct contains the category name, followed by a description of the category. The array can only contain string literals. For example, you could use colors to classify sentiment:

      [('green', 'positive'), ('yellow', 'neutral'), ('red', 'negative')]

      You can optionally name the fields of the struct for your own readability, but the field names aren't used by the function:

        [STRUCT('green' AS label, 'positive' AS description),
         STRUCT('yellow' AS label, 'neutral' AS description),
         STRUCT('red' AS label, 'negative' AS description)]
      
    • Without descriptions: Use an ARRAY<STRING> value. The array can only contain string literals. This works well when your categories are self-explanatory. For example, you could use the following categories to classify sentiment:

      ['positive', 'neutral', 'negative']

    To handle input that doesn't closely match a category, consider including an 'Other' category.

    To use categories that come from a column of a table, you can define a variable based on that column and then use that variable as your categories argument.

  • EXAMPLES: an ARRAY<STRUCT<STRING, STRING>> value that contains representative examples of input strings and the output category that you expect for each. Examples help the model apply nuanced or subjective classification logic the way you intend. We recommend that you provide at most 5 examples.

    If you specify output_mode => 'multi', then your examples must have the type ARRAY<STRUCT<STRING, ARRAY<STRING>>>.

  • CONNECTION: a STRING value specifying the connection to use to communicate with the model, in the format [PROJECT_ID].LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

    If you don't specify a connection, then the query uses your end-user credentials.

    For information about configuring permissions, see Set permissions for BigQuery ML generative AI functions that call Vertex AI models.

  • ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the model. You can specify any generally available or preview Gemini model. If you specify only the model name, BigQuery ML automatically identifies and uses the model's full endpoint. If you don't specify an ENDPOINT value, BigQuery ML dynamically chooses a model that offers the best cost-to-quality tradeoff for your query. You can also specify the global endpoint:

    https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/GEMINI_ENDPOINT
    
  • OUTPUT_MODE: a STRING value that indicates whether a single input can be classified into multiple categories. Specifying an output mode changes the return type of the function to ARRAY<STRING>. The supported values are the following:

    • single: Each input is classified into exactly one category.
    • multi: Each input is classified into zero or more categories. In this case, the function returns an array that contains each relevant category, or an empty array if no category applies.
  • EMBEDDINGS: the embeddings to use for optimized mode (Preview). This argument is optional. If you don't specify this argument, then the query uses standard LLM inference for all rows unless the table has autonomous embedding generation enabled.

    This argument accepts the following data types:

    • ARRAY<FLOAT64>: use this for a single column reference.
    • ARRAY<STRUCT<STRING, ARRAY<FLOAT64>>>: use this to map multiple columns to their corresponding embeddings. For example: [STRUCT('title', title_embedding), STRUCT('body', body_embedding)].
    • ARRAY<STRUCT<ARRAY<STRING>, ARRAY<FLOAT64>>>: use this for advanced mapping scenarios.
  • OPTIMIZATION_MODE: a STRING value that specifies the optimization strategy to use. Supported values are as follows:

    • MINIMIZE_COST (default): uses a local, distilled model to process the majority of rows, reducing latency and cost. This mode requires input embeddings, and the input to the AI function must contain approximately 3,000 rows to provide enough data for model training. The output_mode => 'multi' option isn't supported in this mode.
    • MAXIMIZE_QUALITY: always uses the remote LLM for inference.
  • MAX_ERROR_RATIO: a FLOAT64 value between 0.0 and 1.0 that specifies the maximum acceptable ratio of row-level inference failures to rows processed by this function. If this ratio is exceeded, then the query fails and BigQuery returns an error message that describes the most frequent types of errors. For example, if the value is 0.3, then the query fails if more than 30% of the processed rows fail to return results. If you set max_error_ratio for multiple functions, then the query fails if the ratio is exceeded for any of them. The default value is 1.0; however, the query still fails if inference fails for every row. This argument isn't supported when optimization_mode is set to MINIMIZE_COST.
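
To sketch how the STRUCT input and named arguments fit together, the following hypothetical query combines a fixed instruction with a column value; the table name mydataset.my_reviews and the column review_text are illustrative:

```sql
SELECT
  AI.CLASSIFY(
    input => ('Classify the sentiment of this review: ', review_text),
    categories => ['positive', 'neutral', 'negative'],
    output_mode => 'single') AS sentiment
FROM mydataset.my_reviews;
```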

Output

If you don't specify an OUTPUT_MODE, then AI.CLASSIFY returns a STRING value containing the category that best fits the input.

If you specify an OUTPUT_MODE, then AI.CLASSIFY returns an ARRAY<STRING> value that contains all of the categories that the input is classified into. If OUTPUT_MODE is single, then the array always contains exactly one element. If OUTPUT_MODE is multi, then the array length ranges from zero to the number of categories.

If the call to Vertex AI is unsuccessful for any reason, such as exceeding quota or model unavailability, then the function returns NULL for that row. However, if the ratio of unsuccessful rows exceeds the value of max_error_ratio, then the entire query fails.
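
Because failed rows return NULL rather than stopping the query (up to the max_error_ratio threshold), you might want to filter them out downstream. A sketch, with illustrative table and column names:

```sql
SELECT title, category
FROM (
  SELECT
    title,
    AI.CLASSIFY(body, categories => ['tech', 'sport', 'other']) AS category
  FROM mydataset.articles
)
WHERE category IS NOT NULL;
```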

Examples

The following examples show how to use the AI.CLASSIFY function to classify text and images into predefined categories.

Classify text by topic

The following query categorizes BBC news articles into high-level categories:

SELECT
  title,
  body,
  AI.CLASSIFY(
    body,
    categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other']) AS category
FROM
  `bigquery-public-data.bbc_news.fulltext`
LIMIT 100;

The result is similar to the following:

+-------------------------------+------------------------------------+----------+
| title                         | body                               | category |
+-------------------------------+------------------------------------+----------+
| Anti-spam screensave scrapped | A contentious campaign to bump up  | tech     |
|                               | the bandwidth bills of spammers... |          |
| ...                           | ...                                | ...      |
+-------------------------------+------------------------------------+----------+

To extract your categories from a table instead of using an array of string literals directly in your query, you can use variables. Suppose you have a table called mydataset.categories with a string column called category that contains each of the categories from the previous example. You can rewrite the previous query using a variable in the following way:

DECLARE article_types ARRAY<STRING>
  DEFAULT (SELECT ARRAY_AGG(category) FROM mydataset.categories);

SELECT
  title,
  body,
  AI.CLASSIFY(
    body,
    categories => article_types) AS category
FROM
  `bigquery-public-data.bbc_news.fulltext`
LIMIT 100;

Classify text into multiple topics

The following query categorizes each news article into one or more high-level categories and provides two examples of categorization to the function:

WITH NewsArticles AS (
  SELECT
    'A major streaming platform announced a high-tech virtual reality broadcast for the upcoming championship game.' AS article_text
  UNION ALL
  SELECT
    'New legislation has been proposed to regulate the use of facial recognition technology in government buildings.' AS article_text
  UNION ALL
  SELECT
    'The superstar athlete announced a multi-million dollar movie deal and a new sports apparel venture.' AS article_text
)
SELECT
  article_text,
  AI.CLASSIFY(
    ('Main topics of this news article: ', article_text),
    categories => ['Politics', 'Finance', 'Technology', 'Sports', 'Entertainment'],
    output_mode => 'multi',
    examples => [
      ('The new stock market app is a hit with investors.', ['Finance', 'Technology']),
      ('The senator\'s speech on the economy was widely criticized.', ['Politics', 'Finance'])
    ]
  ) AS topics
FROM NewsArticles;

The result is similar to the following:

+----------------------+-------------------------------------+
| article_text         | topics                              |
+----------------------+-------------------------------------+
| New legislation...   | [Politics, Technology]              |
| The superstar...     | [Sports, Entertainment, Finance]    |
| A major streaming... | [Technology, Sports, Entertainment] |
+----------------------+-------------------------------------+

Classify text with optimized mode

The following query categorizes BBC news articles using optimized mode (Preview):

SELECT
  title,
  body,
  AI.CLASSIFY(
    body,
    categories => ['tech', 'sport', 'business', 'other'],
    embeddings => AI.EMBED(body, endpoint => 'text-embedding-005', task_type => 'CLASSIFICATION').result,
    optimization_mode => 'MINIMIZE_COST'
   ) AS category
FROM
  `bigquery-public-data.bbc_news.fulltext`;

For this example, embeddings are generated on-the-fly. In practice, we recommend that you materialize embeddings so that they can be reused. For more information, see Optimize AI function costs.
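
One way to materialize embeddings is to write them to a table once and reference that table in later queries; the table name mydataset.articles_with_embeddings is an illustrative assumption:

```sql
-- Materialize embeddings once so that later AI.CLASSIFY calls can reuse them.
CREATE OR REPLACE TABLE mydataset.articles_with_embeddings AS
SELECT
  title,
  body,
  AI.EMBED(
    body,
    endpoint => 'text-embedding-005',
    task_type => 'CLASSIFICATION').result AS body_embedding
FROM `bigquery-public-data.bbc_news.fulltext`;

SELECT
  title,
  AI.CLASSIFY(
    body,
    categories => ['tech', 'sport', 'business', 'other'],
    embeddings => body_embedding,
    optimization_mode => 'MINIMIZE_COST') AS category
FROM mydataset.articles_with_embeddings;
```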

Classify reviews by sentiment

The following query classifies movie reviews of The English Patient by sentiment according to a custom color scheme. For example, a review that is very positive is classified as 'green'.

SELECT
  AI.CLASSIFY(
    ('Classify the review by sentiment: ', review),
    categories =>
         [('green', 'The review is positive.'),
          ('yellow', 'The review is neutral.'),
          ('red', 'The review is negative.')]) AS ai_review_rating,
  reviewer_rating AS human_provided_rating,
  review
FROM
  `bigquery-public-data.imdb.reviews`
WHERE
  title = 'The English Patient';

Classify images by type

The following query creates an external table from images of pet products stored in a publicly available Cloud Storage bucket. Then, it classifies each image as a box, ball, bottle, stand, or other type of item.

-- Create a dataset
CREATE SCHEMA IF NOT EXISTS cymbal_pets;

-- Create an object table
CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images
WITH CONNECTION us.example_connection
OPTIONS (
 object_metadata = 'SIMPLE',
 uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png']
);

-- Classify images in the object table
SELECT
  signed_url,
  AI.CLASSIFY(
    STRUCT(OBJ.GET_ACCESS_URL(images.ref, 'r')),
    ['box', 'ball', 'bottle', 'stand', 'other'],
    endpoint => 'gemini-2.5-flash') AS category
FROM
  EXTERNAL_OBJECT_TRANSFORM(TABLE `cymbal_pets.product_images`,
                              ['SIGNED_URL']) AS images
LIMIT 10;

Handle inference errors

The following query classifies news articles but sets max_error_ratio to 0.05, meaning the query fails if more than 5% of rows return an error during inference:

SELECT
  title,
  body,
  AI.CLASSIFY(
    body,
    categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other'],
    max_error_ratio => 0.05) AS category
FROM
  `bigquery-public-data.bbc_news.fulltext`
LIMIT 100;

If the query exceeds the 0.05 error ratio, then it fails and returns an error message similar to the following:

Query failed because AI functions exceeded their allowed error ratio

Locations

You can run AI.CLASSIFY in all of the regions that support Gemini models, and also in the US and EU multi-regions.

Quotas

See Generative AI functions quotas and limits.

What's next