The AI.CLASSIFY function
This document describes the AI.CLASSIFY function, which uses a
Vertex AI Gemini model to classify
inputs into categories that you provide. BigQuery automatically
structures
your input to improve the quality of the classification.
The following are common use cases:
- Retail: Classify reviews by sentiment or classify products by categories.
- Text analysis: Classify support tickets or emails by topic.
- Image analysis: Classify an image by its style or contents.
Input
AI.CLASSIFY accepts the following types of input:
- Text data from standard tables.
- ObjectRefRuntime values that are generated by the OBJ.GET_ACCESS_URL function. You can use ObjectRef values from standard tables as input to the OBJ.GET_ACCESS_URL function.
This function passes your input to a Gemini model and incurs charges in Vertex AI each time it's called. For information about how to view these charges, see Track costs.
Syntax
AI.CLASSIFY(
  [ input => ] INPUT,
  [ categories => ] CATEGORIES
  [, examples => EXAMPLES ]
  [, connection_id => 'CONNECTION' ]
  [, endpoint => 'ENDPOINT' ]
  [, output_mode => 'OUTPUT_MODE' ]
  [, embeddings => EMBEDDINGS ]
  [, optimization_mode => 'OPTIMIZATION_MODE' ]
  [, max_error_ratio => MAX_ERROR_RATIO ]
)
Arguments
AI.CLASSIFY takes the following arguments.
- INPUT: a STRING or STRUCT value that specifies the input to classify. The
  input must be the first argument that you specify. You can provide the value
  in the following ways:
  - Specify a STRING value. For example, 'apples'.
  - Specify a STRUCT value that contains one or more fields. You can use the
    following types of fields within the STRUCT value:

    | Field type | Description | Examples |
    |---|---|---|
    | STRING or ARRAY<STRING> | A string literal, an array of string literals, or the name of a STRING column. | String literal: 'apples'. String column name: my_string_column. |
    | ObjectRefRuntime or ARRAY<ObjectRefRuntime> | An ObjectRefRuntime value returned by the OBJ.GET_ACCESS_URL function. The OBJ.GET_ACCESS_URL function takes an ObjectRef value as input, which you can provide by either specifying the name of a column that contains ObjectRef values, or by constructing an ObjectRef value. ObjectRefRuntime values must have the access_url.read_url and details.gcs_metadata.content_type elements of the JSON value populated. Your input can contain at most one video object. | Function call with an ObjectRef column: OBJ.GET_ACCESS_URL(my_objectref_column, 'r'). Function call with a constructed ObjectRef value: OBJ.GET_ACCESS_URL(OBJ.MAKE_REF('gs://image.jpg', 'myconnection'), 'r'). |

    The function combines STRUCT fields similarly to a CONCAT operation and
    concatenates the fields in their specified order. The same is true for the
    elements of any arrays used within the struct. The following table shows
    some examples of STRUCT prompt values and how they are interpreted:

    | Struct field types | Struct value | Semantic equivalent |
    |---|---|---|
    | STRUCT<STRING, STRING, STRING> | ('crisp', color_column, 'apples') | 'crisp color_column apples' |
    | STRUCT<STRING, ObjectRefRuntime> | ('Classify the following city', OBJ.GET_ACCESS_URL(image_objectref_column, 'r')) | 'Classify the following city image' |
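Putting the STRUCT semantics above together, a minimal sketch of a struct prompt might look like the following. The table and column names are hypothetical; the string literal and the column value are concatenated in order to form the prompt:

```sql
-- Hypothetical table and column names, shown only to illustrate
-- how STRUCT fields are concatenated before classification.
SELECT
  AI.CLASSIFY(
    ('Classify the fruit described in this review: ', review_text),
    categories => ['citrus', 'berry', 'stone fruit', 'other']) AS fruit_type
FROM mydataset.fruit_reviews;
```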
- CATEGORIES: the categories by which to classify the input. You can specify
  categories with or without descriptions:
  - With descriptions: Use an ARRAY<STRUCT<STRING, STRING>> value where each
    struct contains the category name, followed by a description of the
    category. The array can only contain string literals. For example, you
    could use colors to classify sentiment:

    [('green', 'positive'), ('yellow', 'neutral'), ('red', 'negative')]

    You can optionally name the fields of the struct for your own readability,
    but the field names aren't used by the function:

    [STRUCT('green' AS label, 'positive' AS description), STRUCT('yellow' AS label, 'neutral' AS description), STRUCT('red' AS label, 'negative' AS description)]

  - Without descriptions: Use an ARRAY<STRING> value. The array can only
    contain string literals. This works well when your categories are
    self-explanatory. For example, you could use the following categories to
    classify sentiment:

    ['positive', 'neutral', 'negative']

  To handle input that doesn't closely match a category, consider including an
  'Other' category.

  To use categories that come from a column of a table, you can define a
  variable based on that column and then use that variable as your categories
  argument.
- EXAMPLES: an ARRAY<STRUCT<STRING, STRING>> value that contains
  representative examples of input strings and the output category that you
  expect. You can provide examples to help the model understand your intended
  threshold for a condition with nuanced or subjective logic. We recommend
  that you provide at most 5 examples. If you specify output_mode => 'multi',
  then your examples must have the type ARRAY<STRUCT<STRING, ARRAY<STRING>>>.
- CONNECTION: a STRING value specifying the connection to use to communicate
  with the model, in the format [PROJECT_ID].LOCATION.CONNECTION_ID. For
  example, myproject.us.myconnection. If you don't specify a connection, then
  the query uses your end-user credentials.
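As a sketch, the following query passes an explicit connection instead of relying on end-user credentials. The project, region, and connection names are hypothetical placeholders:

```sql
-- Hypothetical connection ID in the format PROJECT_ID.LOCATION.CONNECTION_ID.
SELECT
  AI.CLASSIFY(
    'The package arrived two weeks late and the box was damaged.',
    categories => ['shipping', 'billing', 'product quality', 'other'],
    connection_id => 'myproject.us.myconnection') AS ticket_topic;
```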
For information about configuring permissions, see Set permissions for BigQuery ML generative AI functions that call Vertex AI models.
- ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for
  the model. You can specify any generally available or preview Gemini model.
  If you specify the model name, BigQuery ML automatically identifies and uses
  the full endpoint of the model. If you don't specify an ENDPOINT value,
  BigQuery ML dynamically chooses a model based on your query to provide the
  best cost-to-quality tradeoff for the task. You can also specify the global
  endpoint:
  https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/GEMINI_ENDPOINT
- OUTPUT_MODE: a STRING value that indicates whether a single input can be
  classified into multiple categories. Specifying an output mode changes the
  return type of the function to ARRAY<STRING>. The supported values are the
  following:
  - single: Each input is classified into exactly one category.
  - multi: Each input is classified into zero or more categories. In this
    case, the function returns an array that contains each relevant category,
    or an empty array if no category applies.
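To illustrate the effect of these two arguments together, the following sketch pins a specific model and requests single mode, so each row returns an array containing exactly one category. The endpoint value is one possible choice, not a recommendation:

```sql
-- With output_mode specified, the return type becomes ARRAY<STRING>;
-- in 'single' mode the array always holds exactly one category.
SELECT
  AI.CLASSIFY(
    'The quarterly earnings report exceeded analyst expectations.',
    categories => ['business', 'sport', 'tech'],
    output_mode => 'single',
    endpoint => 'gemini-2.5-flash') AS category_array;
```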
- EMBEDDINGS: the embeddings to use for optimized mode (Preview). This
  argument is optional. If you don't specify this argument, then the query
  uses standard LLM inference for all rows unless the table has autonomous
  embedding generation enabled. This argument accepts the following data
  types:
  - ARRAY<FLOAT64>: use this for a single column reference.
  - ARRAY<STRUCT<STRING, ARRAY<FLOAT64>>>: use this to map multiple columns
    to their corresponding embeddings. For example:
    [STRUCT('title', title_embedding), STRUCT('body', body_embedding)].
  - ARRAY<STRUCT<ARRAY<STRING>, ARRAY<FLOAT64>>>: use this for advanced
    mapping scenarios.
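For example, a sketch of the multi-column mapping form might look like the following, assuming a table with precomputed embedding columns named title_embedding and body_embedding (hypothetical names):

```sql
-- Hypothetical table with precomputed embedding columns; each struct maps
-- a logical field name to its ARRAY<FLOAT64> embedding.
SELECT
  AI.CLASSIFY(
    (title, ' ', body),
    categories => ['tech', 'sport', 'business', 'other'],
    embeddings => [STRUCT('title', title_embedding), STRUCT('body', body_embedding)],
    optimization_mode => 'MINIMIZE_COST') AS category
FROM mydataset.articles_with_embeddings;
```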
- OPTIMIZATION_MODE: a STRING value that specifies the optimization strategy
  to use. Supported values are as follows:
  - MINIMIZE_COST (default): uses a local, distilled model to process the
    majority of rows, reducing latency and cost. This mode requires input
    embeddings, and the input to the AI function must contain approximately
    3,000 rows to ensure enough data for model training. Note that
    output_mode => 'multi' isn't supported in this mode.
  - MAXIMIZE_QUALITY: always uses the remote LLM for inference.
- MAX_ERROR_RATIO: a FLOAT64 value between 0.0 and 1.0 that specifies the
  maximum acceptable ratio of row-level inference failures to rows processed
  by this function. If this value is exceeded, then the query fails and
  BigQuery returns an error message that describes the most frequent types of
  errors. For example, if the value is 0.3, then the query fails if more than
  30% of processed rows fail to return results. If max_error_ratio is set for
  multiple functions, the query fails if the ratio is exceeded for any
  function. The default value is 1.0; however, the query still fails if
  inference fails for every row. This argument isn't supported when
  optimization_mode is set to MINIMIZE_COST.
Output
If you don't specify an OUTPUT_MODE,
then AI.CLASSIFY returns a STRING value containing the category
that best fits the input.
If you specify an OUTPUT_MODE, then AI.CLASSIFY returns an
ARRAY<STRING> value that contains all categories that the input is classified
into. If OUTPUT_MODE is single, then the array always has length 1. If
OUTPUT_MODE is multi, then the array length is between 0 and the number
of categories.
If the call to Vertex AI is unsuccessful for any reason, such
as exceeding quota or model unavailability, then the function returns NULL for
that row. However, if the ratio of unsuccessful rows exceeds the value of
max_error_ratio, then the entire query fails.
Examples
The following examples show how to use the AI.CLASSIFY function to classify
text and images into predefined categories.
Classify text by topic
The following query categorizes BBC news articles into high-level categories:
SELECT
title,
body,
AI.CLASSIFY(
body,
categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other']) AS category
FROM
`bigquery-public-data.bbc_news.fulltext`
LIMIT 100;
The result is similar to the following:
+-------------------------------+------------------------------------+----------+
| title | body | category |
+-------------------------------+------------------------------------+----------+
| Anti-spam screensave scrapped | A contentious campaign to bump up | tech |
| | the bandwidth bills of spammers... | |
| ... | ... | ... |
+-------------------------------+------------------------------------+----------+
To extract your categories from a table instead of using an array of string
literals directly in your query, you can use variables. Suppose you have a table
called mydataset.categories with a string column
called category that contains each of the categories from the previous
example. You can rewrite the previous query using a variable in the following
way:
DECLARE article_types ARRAY<STRING>
DEFAULT (SELECT ARRAY_AGG(category) FROM mydataset.categories);
SELECT
title,
body,
AI.CLASSIFY(
body,
categories => article_types) AS category
FROM
`bigquery-public-data.bbc_news.fulltext`
LIMIT 100;
Classify text into multiple topics
The following query categorizes each news article into one or more high-level categories and provides two examples of categorization to the function:
WITH NewsArticles AS (
SELECT
'A major streaming platform announced a high-tech virtual reality broadcast for the upcoming championship game.' AS article_text
UNION ALL
SELECT
'New legislation has been proposed to regulate the use of facial recognition technology in government buildings.' AS article_text
UNION ALL
SELECT
'The superstar athlete announced a multi-million dollar movie deal and a new sports apparel venture.' AS article_text
)
SELECT
article_text,
AI.CLASSIFY(
('Main topics of this news article: ', article_text),
categories => ['Politics', 'Finance', 'Technology', 'Sports', 'Entertainment'],
output_mode => 'multi',
examples => [
('The new stock market app is a hit with investors.', ['Finance', 'Technology']),
('The senator\'s speech on the economy was widely criticized.', ['Politics', 'Finance'])
]
) AS topics
FROM NewsArticles;
The result is similar to the following:
+----------------------+-------------------------------------+
| article_text | topics |
+----------------------+-------------------------------------+
| New legislation... | [Politics, Technology] |
| The superstar... | [Sports, Entertainment, Finance] |
| A major streaming... | [Technology, Sports, Entertainment] |
+----------------------+-------------------------------------+
Classify text with optimized mode
The following query categorizes BBC news articles using optimized mode (Preview):
SELECT
title,
body,
AI.CLASSIFY(
body,
categories => ['tech', 'sport', 'business', 'other'],
embeddings => AI.EMBED(body, endpoint => 'text-embedding-005', task_type => 'CLASSIFICATION').result,
optimization_mode => 'MINIMIZE_COST'
) AS category
FROM
`bigquery-public-data.bbc_news.fulltext`;
For this example, embeddings are generated on-the-fly. In practice, we recommend that you materialize embeddings so that they can be reused. For more information, see Optimize AI function costs.
Classify reviews by sentiment
The following query classifies movie reviews of The English Patient by sentiment according to a custom color scheme. For example, a review that is very positive is classified as 'green'.
SELECT
AI.CLASSIFY(
('Classify the review by sentiment: ', review),
categories =>
[('green', 'The review is positive.'),
('yellow', 'The review is neutral.'),
('red', 'The review is negative.')]) AS ai_review_rating,
reviewer_rating AS human_provided_rating,
review,
FROM
`bigquery-public-data.imdb.reviews`
WHERE
title = 'The English Patient';
Classify images by type
The following query creates an external table from images of pet products stored in a publicly available Cloud Storage bucket. Then, it classifies each image as a box, ball, bottle, stand, or other type of item.
-- Create a dataset
CREATE SCHEMA IF NOT EXISTS cymbal_pets;
-- Create an object table
CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images
WITH CONNECTION us.example_connection
OPTIONS (
object_metadata = 'SIMPLE',
uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png']
);
-- Classify images in the object table
SELECT
signed_url,
AI.CLASSIFY(
STRUCT(OBJ.GET_ACCESS_URL(images.ref, 'r')),
['box', 'ball', 'bottle', 'stand', 'other'],
endpoint => 'gemini-2.5-flash') AS category
FROM
EXTERNAL_OBJECT_TRANSFORM(TABLE `cymbal_pets.product_images`,
['SIGNED_URL']) AS images
LIMIT 10;
Handle inference errors
The following query classifies news articles but sets max_error_ratio to
0.05, meaning the query fails if more than 5% of rows return an error
during inference:
SELECT
title,
body,
AI.CLASSIFY(
body,
categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other'],
max_error_ratio => 0.05) AS category
FROM
`bigquery-public-data.bbc_news.fulltext`
LIMIT 100;
If the query exceeds the 0.05 error ratio, it fails and returns an error message
similar to the following:
Query failed because AI functions exceeded their allowed error ratio
Locations
You can run AI.CLASSIFY in all of the
regions
that support Gemini models, and also in the US and EU
multi-regions.
Quotas
See Generative AI functions quotas and limits.
What's next
- For more information about using Vertex AI models to generate text and embeddings, see Generative AI overview.
- For more information about using Cloud AI APIs to perform AI tasks, see AI application overview.
- For more information about supported SQL statements and functions for generative AI models, see End-to-end user journeys for generative AI models.
- To use this function in a tutorial, see Perform semantic analysis with managed AI functions.