The AI.SIMILARITY function
This document describes the AI.SIMILARITY function, which lets you
find the cosine similarity
between two inputs. Values closer to 1 indicate more similar inputs and values
closer to 0 indicate less similar inputs.
```sql
# Returns 0.9166 because 'happy' and 'glad' are synonyms.
SELECT AI.SIMILARITY('happy', 'glad', endpoint => 'text-embedding-005');
```
Use cases include the following:
- Semantic search: Search for text or images based on a description, without having to match specific keywords.
- Recommendation: Return entities with attributes similar to a given entity.
- Classification: Return the class of entities whose attributes are similar to those of a given entity, as sketched after this list.
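For example, a lightweight classification pattern compares each input against a small set of candidate labels and keeps the best match. The following is a minimal sketch; the mydataset.tickets table, its ticket_text column, and the candidate labels are assumptions for illustration, not objects from this document.

```sql
-- Hedged sketch: score each ticket against candidate labels; the highest-scoring
-- label per ticket is the predicted class. Table, column, and labels are
-- illustrative assumptions.
SELECT
  ticket_text,
  label,
  AI.SIMILARITY(ticket_text, label, endpoint => 'text-embedding-005') AS similarity_score
FROM `mydataset.tickets`
CROSS JOIN UNNEST(['billing issue', 'technical problem', 'account access']) AS label
ORDER BY ticket_text, similarity_score DESC;
```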
The function works by creating embeddings for the inputs, and then
determining the cosine similarity between the embeddings.
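To make the comparison concrete, the following is a minimal sketch of cosine similarity computed directly in GoogleSQL over two placeholder vectors; in practice, AI.SIMILARITY generates the embedding vectors for you.

```sql
-- Hedged sketch of the underlying math: cosine similarity between two
-- placeholder vectors. Returns approximately 0.92 for these values.
WITH vectors AS (
  SELECT [0.1, 0.3, 0.5] AS a, [0.2, 0.1, 0.6] AS b
)
SELECT
  SUM(a[OFFSET(i)] * b[OFFSET(i)])
    / (SQRT(SUM(POW(a[OFFSET(i)], 2))) * SQRT(SUM(POW(b[OFFSET(i)], 2)))) AS cosine_similarity
FROM vectors, UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(a) - 1)) AS i;
```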
The AI.SIMILARITY function sends requests to a
stable Vertex AI embedding model.
The function incurs charges in
Vertex AI each time it's called.
Input
You can use the AI.SIMILARITY function with data from BigQuery
standard tables. AI.SIMILARITY
accepts the following types of input:
- Text data.
- Image data represented by ObjectRef values. (Preview)
- Image data represented by ObjectRefRuntime values. ObjectRefRuntime values are generated by the OBJ.GET_ACCESS_URL function. You can use ObjectRef values from standard tables as input to the OBJ.GET_ACCESS_URL function. (Preview)
When you analyze image data, the content must be in one of the supported image formats described in the Gemini API model mimeType parameter.
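As an illustration of the ObjectRefRuntime input described above, the following hedged sketch assumes an object table named mydataset.images with an ObjectRef column named ref; the table, dataset, and connection names are assumptions, not objects from this document.

```sql
-- Hedged sketch: pass an ObjectRefRuntime value generated by OBJ.GET_ACCESS_URL
-- as one of the inputs. Table, column, and connection names are assumptions.
SELECT
  uri,
  AI.SIMILARITY(
    'a dog playing in the snow',
    OBJ.GET_ACCESS_URL(ref, 'r'),
    endpoint => 'multimodalembedding@001',
    connection_id => 'us.myconnection') AS similarity_score
FROM mydataset.images
ORDER BY similarity_score DESC;
```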
Syntax
Text embedding
```
AI.SIMILARITY(
  content1 => 'CONTENT1',
  content2 => 'CONTENT2',
  endpoint => 'ENDPOINT'
  [, model_params => MODEL_PARAMS]
  [, connection_id => 'CONNECTION_ID']
)
```
Arguments
AI.SIMILARITY takes the following arguments:
- CONTENT1: a STRING value that provides the first value to compare. The value of CONTENT1 can be a string literal, the name of a table column, or the output of an expression that evaluates to a string.
- CONTENT2: a STRING value that provides the second value to compare. The value of CONTENT2 can be a string literal, the name of a table column, or the output of an expression that evaluates to a string.
- ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the text embedding model. If you specify the model name, such as 'text-embedding-005', rather than a URL, then BigQuery ML automatically identifies the model and uses the model's full endpoint.
- MODEL_PARAMS: a JSON literal that provides additional parameters to the model. You can use any of the parameters object fields. One of these fields, outputDimensionality, lets you specify the number of dimensions to use when generating embeddings. For example, if you specify 256 for the outputDimensionality field, then the model returns a 256-dimensional embedding for each input value.
- CONNECTION_ID: a STRING value specifying the connection to use to communicate with the model, in the format PROJECT_ID.LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

  For user-initiated queries, the CONNECTION argument is optional. When a user initiates a query, BigQuery ML uses the credentials of the user who submitted the query to run it. If your query job is expected to run for 48 hours or longer, you should use the CONNECTION argument to run the query using a service account.

  Replace the following:

  - PROJECT_ID: the project ID of the project that contains the connection.
  - LOCATION: the location used by the connection. The connection must be in the same region in which the query is run.
  - CONNECTION_ID: the connection ID, for example, myconnection. You can get this value by viewing the connection details in the Google Cloud console and copying the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.

  You need to grant the Vertex AI User role to the connection's service account in the project where you run the function.
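The following is a minimal sketch that shows the optional arguments together; the connection name and the 256-dimension setting are illustrative assumptions.

```sql
-- Hedged sketch: compare two string literals with a reduced embedding
-- dimensionality and an explicit connection. Names and values are assumptions.
SELECT AI.SIMILARITY(
  content1 => 'machine learning',
  content2 => 'artificial intelligence',
  endpoint => 'text-embedding-005',
  model_params => JSON '{"outputDimensionality": 256}',
  connection_id => 'myproject.us.myconnection') AS similarity_score;
```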
Multimodal embedding
```
AI.SIMILARITY(
  content1 => 'CONTENT1',
  content2 => 'CONTENT2',
  connection_id => 'CONNECTION_ID',
  endpoint => 'ENDPOINT'
  [, model_params => MODEL_PARAMS]
)
```
Arguments
AI.SIMILARITY takes the following arguments:
- CONTENT1: a STRING, ObjectRef, or ObjectRefRuntime value that provides the first value to compare.
  - For text embeddings, you can specify one of the following:
    - A string literal.
    - The name of a STRING column.
    - The output of an expression that evaluates to a string.
  - For image content embeddings, you can specify one of the following:
    - The name of an ObjectRef column.
    - An ObjectRef value generated by a combination of the OBJ.FETCH_METADATA and OBJ.MAKE_REF functions. For example, SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF("gs://mybucket/path/to/file.jpg", "us.connection1"));.
    - An ObjectRefRuntime value generated by the OBJ.GET_ACCESS_URL function.

  ObjectRef values must have the details.gcs_metadata.content_type element of the JSON value populated. ObjectRefRuntime values must have the access_url.read_url and details.gcs_metadata.content_type elements of the JSON value populated.

- CONTENT2: a STRING, ObjectRef, or ObjectRefRuntime value that provides the second value to compare.
  - For text embeddings, you can specify one of the following:
    - A string literal.
    - The name of a STRING column.
    - The output of an expression that evaluates to a string.
  - For image content embeddings, you can specify one of the following:
    - The name of an ObjectRef column.
    - An ObjectRef value generated by a combination of the OBJ.FETCH_METADATA and OBJ.MAKE_REF functions. For example, SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF("gs://mybucket/path/to/file.jpg", "us.connection1"));.
    - An ObjectRefRuntime value generated by the OBJ.GET_ACCESS_URL function.
- CONNECTION_ID: a STRING value specifying the connection to use to communicate with the model, in the format PROJECT_ID.LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

  Replace the following:

  - PROJECT_ID: the project ID of the project that contains the connection.
  - LOCATION: the location used by the connection. The connection must be in the same location as the dataset that contains the model.
  - CONNECTION_ID: the connection ID, for example, myconnection. You can get this value by viewing the connection details in the Google Cloud console and copying the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.

  You need to grant the Vertex AI User role to the connection's service account in the project where you run the function.
- ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the multimodal embedding model. If you specify the model name rather than a URL, BigQuery ML automatically identifies the model and uses the model's full endpoint.
- MODEL_PARAMS: a JSON literal that provides additional parameters to the model. Only the dimension field is supported. You can use the dimension field to specify the number of dimensions to use when generating embeddings. For example, if you specify 256 for the dimension field, then the model returns a 256-dimensional embedding for each input value.
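As an illustration, the following hedged sketch compares a text description with a single Cloud Storage image referenced ad hoc through OBJ.MAKE_REF and OBJ.FETCH_METADATA; the bucket path and connection names are assumptions for illustration.

```sql
-- Hedged sketch: compare text with one Cloud Storage image referenced ad hoc.
-- The bucket path and connection names are illustrative assumptions.
SELECT AI.SIMILARITY(
  'aquarium device',
  OBJ.FETCH_METADATA(OBJ.MAKE_REF('gs://mybucket/path/to/file.jpg', 'us.connection1')),
  endpoint => 'multimodalembedding@001',
  connection_id => 'us.connection1') AS similarity_score;
```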
Output
AI.SIMILARITY returns a FLOAT64 value that represents the cosine similarity
between the two inputs. If an error occurs, the function returns NULL.
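Because errors surface as NULL values rather than failing the query, it can be useful to filter them out when ranking results. The following is a minimal sketch using the public BBC news table that also appears in the examples below.

```sql
-- Hedged sketch: exclude rows where the function returned NULL due to an error.
SELECT title, similarity_score
FROM (
  SELECT
    title,
    AI.SIMILARITY('housing market downward trends', title,
                  endpoint => 'text-embedding-005') AS similarity_score
  FROM `bigquery-public-data.bbc_news.fulltext`
)
WHERE similarity_score IS NOT NULL
ORDER BY similarity_score DESC
LIMIT 5;
```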
Examples
The following examples demonstrate how to use the
AI.SIMILARITY function.
Compare text values
The following example queries publicly available BBC news articles and shows the top five articles most related to downward trends in the housing market:
```sql
SELECT
  title,
  body,
  AI.SIMILARITY(
    "housing market downward trends",
    body,
    endpoint => "text-embedding-005") AS similarity_score
FROM `bigquery-public-data.bbc_news.fulltext`
ORDER BY similarity_score DESC
LIMIT 5;
```
Compare text and ObjectRef values
The following query creates an external table from images of pet products stored in a publicly available Cloud Storage bucket. Then, it queries the table for images that are similar to the text "aquarium device." The results include images of products for maintaining water and air quality in an aquarium.
```sql
# Create a dataset
CREATE SCHEMA IF NOT EXISTS cymbal_pets;

# Create an object table
CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images
  WITH CONNECTION DEFAULT
  OPTIONS (
    object_metadata = 'SIMPLE',
    uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png']
  );

SELECT
  uri,
  STRING(OBJ.GET_ACCESS_URL(ref, 'r').access_urls.read_url) AS signed_url,
  AI.SIMILARITY(
    "aquarium device",
    ref,
    endpoint => 'multimodalembedding@001',
    connection_id => 'us.example_connection') AS similarity_score
FROM cymbal_pets.product_images
ORDER BY similarity_score DESC
LIMIT 3;
```
Related functions
The AI.SIMILARITY and
VECTOR_SEARCH
functions support overlapping use cases. In general, you should use
AI.SIMILARITY
when you want to perform a small number of comparisons and you haven't
precomputed any embeddings. You should use VECTOR_SEARCH when performance
is critical and you're working with a large number of embeddings. The following
table summarizes the differences:
| Feature | AI.SIMILARITY | VECTOR_SEARCH |
|---|---|---|
| Function type | A scalar function that takes a single value as input and returns a single value as output. | A table-valued function (TVF) that takes a table as input and returns a table as output. |
| Primary purpose | Computes the semantic similarity score between two specific content inputs. | Finds the top-K closest embeddings from a base table or query to a given embedding. |
| Input | Typically two STRING or ObjectRef columns or values. | A query, a base table, and columns to search. Typically operates on pre-computed embedding columns of type ARRAY<FLOAT64>. |
| Output | A single FLOAT64 value per row, representing the cosine similarity between the inputs. | A table containing the nearest neighbor rows from the base table, and the distance or similarity scores, depending on the options passed to the function. |
| Embedding | Always generates embeddings for both content inputs at runtime using Vertex AI embedding LLMs. | Uses pre-computed embeddings. |
| Indexing | Does not use vector indexes. Performs a direct comparison, including the cost of two embedding generations per call. | Designed to use a vector index for efficient approximate nearest neighbor (ANN) search over large datasets. Can also perform brute-force search without an index. |
| Common use case | Calculating similarity between pairs of items in a row, such as in a SELECT or WHERE clause. Useful to prototype queries while iterating on a design. | Useful for semantic search, recommendation systems, and retrieval augmented generation (RAG) to find the best matches from a large corpus. |
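To make the contrast concrete, the following hedged sketch shows the VECTOR_SEARCH side of the comparison; it assumes a table mydataset.articles with a pre-computed ARRAY<FLOAT64> column named embedding and a query table mydataset.search_queries, both of which are illustrative assumptions rather than objects from this document.

```sql
-- Hedged sketch: approximate nearest-neighbor lookup over pre-computed embeddings.
-- The table and column names are illustrative assumptions.
SELECT query.query_text, base.title, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.articles,      -- base table with an ARRAY<FLOAT64> embedding column
  'embedding',                   -- column to search
  (SELECT embedding, query_text FROM mydataset.search_queries),
  top_k => 5,
  distance_type => 'COSINE');
```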
Locations
You can run AI.SIMILARITY in all of the
regions
that support Gemini models, and also in the US and EU
multi-regions.
Quotas
See Vertex AI and Cloud AI service functions quotas and limits.
What's next
- For more information about using Vertex AI models to generate text and embeddings, see Generative AI overview.
- For more information about using Cloud AI APIs to perform AI tasks, see AI application overview.