Autonomous embedding generation
This document describes how to use autonomous embedding generation for your data, which lets BigQuery maintain a column of embeddings on a table based on a source column. When you add or modify data in the source column, BigQuery automatically generates or updates the embedding column for that data by using a Vertex AI embedding model. This is helpful if you want to let BigQuery maintain your embeddings when your source data is updated regularly.
Embeddings are useful for modern generative AI applications such as Retrieval Augmented Generation (RAG), but they can be complex to create, manage, and query. You can use autonomous embedding generation to simplify the process of creating, maintaining, and querying embeddings for use in similarity searches and other generative AI applications.
For example, you can use queries similar to the following to create a table with autonomous embedding generation enabled, insert data, and then perform semantic search:
CREATE TABLE mydataset.products (
  name STRING,
  description STRING,
  description_embedding STRUCT<result ARRAY<FLOAT64>, status STRING>
    GENERATED ALWAYS AS (
      AI.EMBED(description, connection_id => 'us.example_connection',
        endpoint => 'text-embedding-005'))
    STORED OPTIONS( asynchronous = TRUE ));
# Values in the description_embedding column are automatically generated.
INSERT INTO mydataset.products (name, description) VALUES
("Super slingers"), ("An exciting board game for the whole family"), ...;
SELECT * FROM AI.SEARCH(TABLE mydataset.products, 'description', "A really fun toy");
Before you begin
To enable autonomous embedding generation on a table, you must have the necessary permissions and connection.
Required roles
To get the permissions that you need to enable autonomous embedding generation, ask your administrator to grant you the following IAM roles:
- To use a connection resource: BigQuery Connections User (roles/bigquery.connectionUser) on the connection
- To create a table: BigQuery Data Editor (roles/bigquery.dataEditor) on the table
- Grant the connection's service account the following role so that it can access models hosted in Vertex AI endpoints: Vertex AI User (roles/aiplatform.user) on the project that has the connection
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Create a connection and grant permission to a service account
To enable autonomous embedding generation on a table, you must
create a Cloud resource connection.
Then, grant
the Vertex AI User role
(roles/aiplatform.user) to the service account that was created when you
created the connection.
Create a table with autonomous embedding generation
You can use autonomous embedding generation to generate embeddings by using
the AI.EMBED function
in a
CREATE TABLE statement.
CREATE TABLE DATASET_ID.TABLE (
  [COLUMN, ...]
  STRING_COL STRING,
  EMBEDDING_COL_NAME STRUCT<result ARRAY<FLOAT64>, status STRING>
    GENERATED ALWAYS AS (
      AI.EMBED(
        STRING_COL,
        connection_id => CONNECTION_ID,
        endpoint => ENDPOINT)
    )
    STORED OPTIONS (asynchronous = TRUE)
);
Replace the following:
- DATASET_ID: The name of the dataset in which you want to create the table.
- TABLE: The name of the table on which to create autonomous embedding generation.
- COLUMN, ...: Any columns that your table should contain besides the column that you want to automatically embed.
- STRING_COL: The name of the STRING column that you want to automatically embed.
- EMBEDDING_COL_NAME: The name of the automatically generated embedding column.
- CONNECTION_ID: A STRING value that contains the name of a connection to use, such as my_project.us.example_connection. You must grant the Vertex AI User role to the connection's service account in the project in which you create the table.
- ENDPOINT: A STRING value that specifies a supported Vertex AI text embedding model endpoint to use for the text embedding model. The endpoint value that you specify must include the model version, for example text-embedding-005. If you specify the model name rather than a URL, BigQuery ML automatically identifies the model and uses the model's full endpoint.
The background embedding generation job starts shortly after your table is created, or after you update data in the source column.
To track the progress of the embedding generation, you can use a query similar to the following:
SELECT
  COUNT(*) AS total_num_rows,
  COUNTIF(description_embedding IS NOT NULL
    AND description_embedding.status = '') AS total_num_generated_embeddings
FROM
  PROJECT_ID.DATASET_ID.TABLE;
After you create the table with embeddings, you can
create a vector index
on the STRUCT column that contains the automatically generated embedding.
Example
Suppose you are a large retailer that sells many different products. You have a table of product names and descriptions and you want to help your customers find the products they're looking for. The following queries show you how to set up autonomous embedding generation to assist with semantic search of your product descriptions.
First, create a dataset:
CREATE SCHEMA mydataset;
Next, create a table with autonomous embedding generation enabled to hold your
product information. The automatically generated column is called
description_embedding and it's based on the description column.
# Create a table of products and descriptions with a generated embedding column.
CREATE TABLE mydataset.products (
  name STRING,
  description STRING,
  description_embedding STRUCT<result ARRAY<FLOAT64>, status STRING>
    GENERATED ALWAYS AS (AI.EMBED(
      description,
      connection_id => 'us.example_connection',
      endpoint => 'text-embedding-005'
    ))
    STORED OPTIONS( asynchronous = TRUE )
);
The following query inserts some product names and descriptions into the table.
You don't specify a value for description_embedding because it's generated
automatically.
# Insert product descriptions into the table.
# The description_embedding column is automatically updated.
INSERT INTO mydataset.products (name, description) VALUES
("Lounger chair", "A comfortable chair for relaxing in."),
("Super slingers", "An exciting board game for the whole family."),
("Encyclopedia set", "A collection of informational books.");
You can optionally create a vector index on the table to speed up searching.
A vector index requires more data than the three rows inserted above, so the following query assumes
that you have inserted additional data. Every time you insert data, the
description_embedding column is automatically updated.
CREATE VECTOR INDEX my_index
ON mydataset.products(description_embedding)
OPTIONS(index_type = 'IVF');
Finally, you can use the
AI.SEARCH function
to perform semantic search on your products for a fun toy:
# Search for products that are fun to play with.
SELECT base.name, base.description, distance
FROM AI.SEARCH(TABLE mydataset.products, 'description', "A really fun toy");
/*------------------+----------------------------------------------+----------------------+
| name | description | distance |
+------------------+----------------------------------------------+----------------------+
| Super slingers | An exciting board game for the whole family. | 0.80954913893618929 |
| Lounger chair | A comfortable chair for relaxing in. | 0.938933930620146 |
| Encyclopedia set | A collection of informational books. | 1.1119297739353384 |
+------------------+----------------------------------------------+----------------------*/
Get information about automatically generated embedding columns
To verify that a column is an automatically generated embedding column, query
the
INFORMATION_SCHEMA.COLUMNS view.
The following query shows you information about all of your automatically generated embedding columns:
SELECT *
FROM PROJECT_ID.DATASET_ID.INFORMATION_SCHEMA.COLUMNS
WHERE is_generated = 'ALWAYS';
The generation_expression field shows you the call to the AI.EMBED function
that is used to generate the embeddings on the column.
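For example, a query similar to the following returns the generated column and its generation expression. It assumes the mydataset.products table from the earlier example; substitute your own dataset and table names.
# Show the generated embedding column and the AI.EMBED call that populates it.
# Assumes the mydataset.products example table; adjust the names for your table.
SELECT column_name, generation_expression
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'products'
  AND is_generated = 'ALWAYS';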
Use your own reservation
By default, BigQuery uses on-demand slots to handle the
processing required to maintain the generated embedding column. To ensure
predictable and consistent performance, you can
optionally
create a reservation
and set the job_type to BACKGROUND. When a background reservation is
present, BigQuery uses it to maintain the generated embedding
column instead.
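For example, reservation DDL statements similar to the following create a background reservation and assign a project to it. This is only a sketch: the administration project, reservation name, region, slot capacity, edition, and assignee project are placeholder values that depend on your setup.
# Create a reservation for background jobs. The project, reservation name,
# region, slot capacity, and edition are placeholder values.
CREATE RESERVATION `ADMIN_PROJECT_ID.region-us.background_reservation`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 100);

# Assign background jobs from your project to the reservation.
CREATE ASSIGNMENT
  `ADMIN_PROJECT_ID.region-us.background_reservation.background_assignment`
OPTIONS (
  assignee = 'projects/PROJECT_ID',
  job_type = 'BACKGROUND');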
Troubleshooting
The generated embedding column contains two fields: result and status.
If an error occurs when BigQuery tries to generate an embedding
for a particular row in your table, then the result field is NULL and the
status field describes the error. For example, if the source column is NULL
then the result embedding is also NULL and the status is
NULL value is not supported for embedding generation.
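For example, a query similar to the following lists the rows whose embeddings couldn't be generated, along with the reported error. It assumes the mydataset.products table from the example in this document, where an empty status indicates success.
# List rows whose embedding generation failed, along with the error message.
# Assumes the mydataset.products example table; an empty status means success.
SELECT name, description, description_embedding.status AS embedding_error
FROM mydataset.products
WHERE description_embedding.result IS NULL
  AND description_embedding.status != '';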
A more severe error can stall embedding generation. In this case, you can
query the
INFORMATION_SCHEMA.JOBS view
for the background job and look at the information in the error_result field.
The job ID of a background embedding job is prefixed with gc_. For example,
the following query extracts all background jobs whose error result isn't
NULL:
SELECT *
FROM `region-REGION.INFORMATION_SCHEMA.JOBS` j
WHERE EXISTS (
    SELECT 1
    FROM UNNEST(j.referenced_tables) t
    WHERE
      j.project_id = 'PROJECT_ID'
      AND t.dataset_id = 'DATASET_ID'
      AND t.table_id = 'TABLE'
  )
  AND STARTS_WITH(job_id, 'gc')
  AND error_result IS NOT NULL
ORDER BY j.creation_time DESC;
Track costs
Autonomous embedding generation sends requests to Vertex AI, which can incur costs. To track the Vertex AI costs incurred by background embedding jobs, follow these steps:
- View your billing reports in Cloud Billing.
- Use filters to refine your results:
  - For services, select Vertex AI.
  - To see the charges for a specific job, filter by label. Set the key to bigquery_ml_job and the value to the job ID of the embedding job. Background embedding jobs all have a prefix of gc_. You can list these job IDs by using a query like the one that follows these steps.
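To find the job IDs of recent background embedding jobs to use in the label filter, you can query the INFORMATION_SCHEMA.JOBS view with a query similar to the following. Replace REGION with your dataset's region; the gc_ prefix filter is based on the job ID convention described in the Troubleshooting section.
# List recent background jobs whose IDs you can use in the bigquery_ml_job
# billing label filter. Replace REGION with your dataset's region.
SELECT job_id, creation_time
FROM `region-REGION.INFORMATION_SCHEMA.JOBS`
WHERE STARTS_WITH(job_id, 'gc_')
ORDER BY creation_time DESC;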
It can take up to 24 hours for some charges to appear in Cloud Billing.
Limitations
- Each table supports at most one automatically generated embedding column.
- There is no indication that a column is automatically generated when you view a table's schema using the Google Cloud console, the bq show command, or the ddl field of the INFORMATION_SCHEMA.TABLES view.
- You can't add generated embedding columns to an existing table by using ALTER TABLE ADD COLUMN.
- If you create a copy, clone, or snapshot of a table that has a generated embedding column, only the data is copied. The generation configuration doesn't apply to the new table, and updates to the source column of the new table won't result in new embeddings.
- If you restore a table that had autonomous embedding generation enabled from a snapshot, the embedding generation configuration isn't restored.
- You can create generated embedding columns only by using SQL. You can't use the bq mk or bq update commands to create generated embedding columns.
- The source column of the generated column must be a STRING column.
- After you create the generated embedding column, the following limitations apply:
  - You can't drop or rename the source column, but you can still drop or rename the generated embedding column. If you drop the embedding column, then you can drop or rename the source column.
  - You can't change the data type of the source column or generated embedding column.
- You can't specify default values for automatically generated embedding columns.
- You can't directly write to generated embedding columns by using these methods:
  - DML
  - Streaming writes
  - bq insert
  - bq copy -a
- Tables with generated embedding columns don't support any column-level security policies, such as policy tags.
- When you call a search function, such as VECTOR_SEARCH or AI.SEARCH, rows with missing embeddings in the base table are skipped during the search.
- You can't create a partitioned vector index on a table that has autonomous embedding generation enabled.
If you create a vector index on the automatically generated embedding column, then index training starts after at least 80% of the rows have generated embeddings. You can use the following query to check what percentage of embeddings on your table have been generated:
SELECT
  COUNTIF(description_embedding IS NOT NULL
    AND description_embedding.status = '') * 100.0 / COUNT(*) AS percent
FROM
  PROJECT_ID.DATASET_ID.TABLE;
What's next
- Learn more about creating and managing vector indexes.
- See the Introduction to vector search.