The AI.SIMILARITY function

This document describes the AI.SIMILARITY function, which lets you calculate the cosine similarity between two inputs. Values closer to 1 indicate more similar inputs, and values closer to 0 indicate less similar inputs.

# Returns 0.9166 because 'happy' and 'glad' are synonyms.
SELECT AI.SIMILARITY('happy', 'glad', endpoint => 'text-embedding-005');

Use cases include the following:

  • Semantic search: Search for text or images based off a description, without having to match specific keywords.
  • Recommendation: Return entities with attributes similar to a given entity.
  • Classification: Return the class of entities whose attributes are similar to the given entity.

The function works by creating embeddings for the inputs, and then determining the cosine similarity between the embeddings. The AI.SIMILARITY function sends requests to a stable Vertex AI embedding model. The function incurs charges in Vertex AI each time it's called.
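
To make the comparison step concrete, the following sketch applies the cosine similarity formula directly to two small, hard-coded vectors using plain GoogleSQL arithmetic. This is only an illustration of the math that the function performs on its generated embeddings, not an example of calling AI.SIMILARITY.

# Illustrative only: cosine similarity of two hard-coded vectors.
# Because v2 is a scaled copy of v1, the query returns 1.0.
WITH pair AS (
  SELECT [1.0, 2.0, 3.0] AS v1, [2.0, 4.0, 6.0] AS v2
)
SELECT
  SUM(a * b) / (SQRT(SUM(a * a)) * SQRT(SUM(b * b))) AS cosine_similarity
FROM pair,
  UNNEST(v1) AS a WITH OFFSET i,
  UNNEST(v2) AS b WITH OFFSET j
WHERE i = j;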

Input

You can use the AI.SIMILARITY function with data from BigQuery standard tables. AI.SIMILARITY accepts the following types of input:

  • Text, provided as STRING values.
  • Images, provided as ObjectRef or ObjectRefRuntime values that reference objects in Cloud Storage.

When you analyze image data, the content must be in one of the supported image formats that are described in the Gemini API model mimeType parameter.

Syntax

Text embedding

AI.SIMILARITY(
  content1 => 'CONTENT1',
  content2 => 'CONTENT2',
  endpoint => 'ENDPOINT'
  [, model_params => MODEL_PARAMS]
  [, connection_id => 'CONNECTION_ID']
)

Arguments

AI.SIMILARITY takes the following arguments:

  • CONTENT1: a STRING value that provides the first value to compare. The value of CONTENT1 can be a string literal, the name of a table column, or the output of an expression that evaluates to a string.
  • CONTENT2: a STRING value that provides the second value to compare. The value of CONTENT2 can be a string literal, the name of a table column, or the output of an expression that evaluates to a string.
  • ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the text embedding model. If you specify the model name, such as 'text-embedding-005', rather than a URL, then BigQuery ML automatically identifies the model and uses the model's full endpoint.
  • MODEL_PARAMS: a JSON literal that provides additional parameters to the model. You can use any of the parameters object fields. One of these fields, outputDimensionality, lets you specify the number of dimensions to use when generating embeddings. For example, if you specify 256 for the outputDimensionality field, then the model returns a 256-dimensional embedding for each input value.
  • CONNECTION_ID: a STRING value specifying the connection to use to communicate with the model, in the format PROJECT_ID.LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

    For user-initiated queries, the connection_id argument is optional. If you omit it, BigQuery ML uses the credentials of the user who submitted the query to run it.

    If your query job is expected to run for 48 hours or longer, use the connection_id argument so that the query runs with a service account's credentials.

    Replace the following:

    • PROJECT_ID: the project ID of the project that contains the connection.
    • LOCATION: the location used by the connection. The connection must be in the same region in which the query is run.
    • CONNECTION_ID: the connection ID—for example, myconnection.

      You can get this value by viewing the connection details in the Google Cloud console and copying the value in the last section of the fully qualified connection ID that is shown in Connection ID. For example, projects/myproject/locations/connection_location/connections/myconnection.

    You need to grant the Vertex AI User role to the connection's service account in the project where you run the function.
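
The following hedged example shows the text embedding form with every argument specified. The connection name and the choice of 256 output dimensions are placeholder assumptions for illustration:

# A hedged example of the text embedding form. Replace the connection with
# your own; the 256-dimension setting is an illustrative assumption.
SELECT
  AI.SIMILARITY(
    content1 => 'commuter bicycle',
    content2 => 'city bike',
    endpoint => 'text-embedding-005',
    model_params => JSON '{"outputDimensionality": 256}',
    connection_id => 'myproject.us.myconnection') AS similarity_score;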

Multimodal embedding

AI.SIMILARITY(
  content1 => 'CONTENT1',
  content2 => 'CONTENT2',
  connection_id => 'CONNECTION_ID',
  endpoint => 'ENDPOINT'
  [, model_params => MODEL_PARAMS]
)

Arguments

AI.SIMILARITY takes the following arguments:

  • CONTENT1: a STRING, ObjectRef, or ObjectRefRuntime value that provides the first value to compare.
    • For text embeddings, you can specify one of the following:
      • A string literal.
      • The name of a STRING column.
      • The output of an expression that evaluates to a string.
    • For image content embeddings, you can specify one of the following:
      • The name of an ObjectRef column.
      • An ObjectRef value generated by a combination of the OBJ.FETCH_METADATA and OBJ.MAKE_REF functions. For example, SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF("gs://mybucket/path/to/file.jpg", "us.connection1"));.
      • An ObjectRefRuntime value generated by the OBJ.GET_ACCESS_URL function.

    ObjectRef values must have the details.gcs_metadata.content_type element of the JSON value populated.

    ObjectRefRuntime values must have the access_urls.read_url and details.gcs_metadata.content_type elements of the JSON value populated.

  • CONTENT2: a STRING, ObjectRef, or ObjectRefRuntime value that provides the second value to compare.
    • For text embeddings, you can specify one of the following:
      • A string literal.
      • The name of a STRING column.
      • The output of an expression that evaluates to a string.
    • For image content embeddings, you can specify one of the following:
      • The name of an ObjectRef column.
      • An ObjectRef value generated by a combination of the OBJ.FETCH_METADATA and OBJ.MAKE_REF functions. For example, SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF("gs://mybucket/path/to/file.jpg", "us.connection1"));.
      • An ObjectRefRuntime value generated by the OBJ.GET_ACCESS_URL function.
  • CONNECTION_ID: a STRING value specifying the connection to use to communicate with the model, in the format PROJECT_ID.LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

    Replace the following:

    • PROJECT_ID: the project ID of the project that contains the connection.
    • LOCATION: the location used by the connection. The connection must be in the same location as the dataset that contains the model.
    • CONNECTION_ID: the connection ID—for example, myconnection.

      You can get this value by viewing the connection details in the Google Cloud console and copying the value in the last section of the fully qualified connection ID that is shown in Connection ID. For example, projects/myproject/locations/connection_location/connections/myconnection.

    You need to grant the Vertex AI User role to the connection's service account in the project where you run the function.

  • ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the multimodal embedding model. If you specify the model name rather than a URL, BigQuery ML automatically identifies the model and uses the model's full endpoint.
  • MODEL_PARAMS: a JSON literal that provides additional parameters to the model. Only the dimension field is supported. You can use the dimension field to specify the number of dimensions to use when generating embeddings. For example, if you specify 256 for the dimension field, then the model returns a 256-dimensional embedding for each input value, as shown in the sketch after this list.
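
The following hedged sketch shows the multimodal form, comparing a text description to images in an object table. The mydataset.product_images table, its ref column, and the connection name are assumptions; the pattern mirrors the object table example later in this document:

# A hedged sketch of the multimodal form. The table, ref column, and
# connection names are illustrative assumptions.
SELECT
  uri,
  AI.SIMILARITY(
    content1 => 'red running shoes',
    content2 => ref,
    connection_id => 'myproject.us.myconnection',
    endpoint => 'multimodalembedding@001',
    model_params => JSON '{"dimension": 256}') AS similarity_score
FROM mydataset.product_images
ORDER BY similarity_score DESC;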

Output

AI.SIMILARITY returns a FLOAT64 value that represents the cosine similarity between the two inputs. If an error occurs, the function returns NULL.
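
Because errors surface as NULL rather than failing the query, you might want to normalize the score before ranking results. A minimal sketch, assuming a hypothetical mydataset.products table with a description column:

# A minimal sketch, assuming a hypothetical mydataset.products table.
# NULL scores (errors) are treated as zero similarity before ranking.
SELECT
  description,
  IFNULL(
    AI.SIMILARITY(
      description,
      'waterproof hiking boots',
      endpoint => 'text-embedding-005'),
    0.0) AS similarity_score
FROM mydataset.products
ORDER BY similarity_score DESC;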

Examples

The following examples demonstrate how to use the AI.SIMILARITY function.

Compare text values

The following example queries publicly available BBC news articles and shows the top five articles most related to downward trends in the housing market:

SELECT
  title,
  body,
  AI.SIMILARITY(
    "housing market downward trends", body, endpoint => "text-embedding-005")
    AS similarity_score
FROM `bigquery-public-data.bbc_news.fulltext`
ORDER BY similarity_score DESC
LIMIT 5;

Compare text and ObjectRef values

The following query creates an external table from images of pet products stored in a publicly available Cloud Storage bucket. Then, it queries the table for images that are similar to the text "aquarium device". The results include images of products for maintaining water and air quality in an aquarium.

# Create a dataset
CREATE SCHEMA IF NOT EXISTS cymbal_pets;

# Create an object table
CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images
WITH CONNECTION DEFAULT
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png']
);

SELECT
  uri,
  STRING(OBJ.GET_ACCESS_URL(ref,'r').access_urls.read_url) AS signed_url,
  AI.SIMILARITY(
    "aquarium device",
    ref,
    endpoint => 'multimodalembedding@001',
    connection_id => 'us.example_connection') AS similarity_score
FROM cymbal_pets.product_images
ORDER BY similarity_score DESC
LIMIT 3;

AI.SIMILARITY compared to VECTOR_SEARCH

The AI.SIMILARITY and VECTOR_SEARCH functions support overlapping use cases. In general, use AI.SIMILARITY when you want to perform a small number of comparisons and you haven't precomputed any embeddings. Use VECTOR_SEARCH when performance is critical and you're working with a large number of embeddings. The following list summarizes the differences, and a sketch of the VECTOR_SEARCH pattern follows the list:

  • Function type
    • AI.SIMILARITY: A scalar function that takes scalar values as input and returns a single value as output.
    • VECTOR_SEARCH: A table-valued function (TVF) that takes a table as input and returns a table as output.
  • Primary purpose
    • AI.SIMILARITY: Computes the semantic similarity score between two specific content inputs.
    • VECTOR_SEARCH: Finds the top-K closest embeddings from a base table or query to a given embedding.
  • Input
    • AI.SIMILARITY: Typically two STRING or ObjectRef columns or values.
    • VECTOR_SEARCH: A query, a base table, and columns to search. Typically operates on pre-computed embedding columns of type ARRAY<FLOAT64>.
  • Output
    • AI.SIMILARITY: A single FLOAT64 value per row, representing the cosine similarity between the inputs.
    • VECTOR_SEARCH: A table containing the nearest neighbor rows from the base table, and the distance or similarity scores, depending on the options passed to the function.
  • Embedding
    • AI.SIMILARITY: Always generates embeddings for both content inputs at runtime using Vertex AI embedding LLMs.
    • VECTOR_SEARCH: Uses pre-computed embeddings.
  • Indexing
    • AI.SIMILARITY: Doesn't use vector indexes. Performs a direct comparison, including the cost of two embedding generations per call.
    • VECTOR_SEARCH: Designed to use a vector index for efficient approximate nearest neighbor (ANN) search over large datasets. Can also perform brute-force search without an index.
  • Common use case
    • AI.SIMILARITY: Calculating similarity between pairs of items in a row, such as in a SELECT or WHERE clause. Useful for prototyping queries while iterating on a design.
    • VECTOR_SEARCH: Semantic search, recommendation systems, and retrieval augmented generation (RAG) that need to find the best matches from a large corpus.
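
For contrast, the following hedged sketch shows the VECTOR_SEARCH pattern. It assumes a hypothetical mydataset.articles table with a title column and a pre-computed ARRAY<FLOAT64> embedding column named embedding, plus a remote embedding model named mydataset.embedding_model; none of these objects come from this document.

# A hedged sketch of the VECTOR_SEARCH pattern. The table, column, and model
# names are illustrative assumptions. The query embeds the search text at
# runtime and then searches the pre-computed embeddings in the base table.
SELECT
  query.query_text,
  base.title,
  distance
FROM VECTOR_SEARCH(
  TABLE mydataset.articles,
  'embedding',
  (
    SELECT
      ml_generate_embedding_result AS embedding,
      content AS query_text
    FROM ML.GENERATE_EMBEDDING(
      MODEL mydataset.embedding_model,
      (SELECT 'housing market downward trends' AS content))
  ),
  query_column_to_search => 'embedding',
  top_k => 5,
  distance_type => 'COSINE');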

Locations

You can run AI.SIMILARITY in all of the regions that support Gemini models, and also in the US and EU multi-regions.

Quotas

See Vertex AI and Cloud AI service functions quotas and limits.

What's next