Manage vector indexes
This document describes how to create and manage vector indexes to accelerate your vector searches.
A vector index is a data structure designed to let the
VECTOR_SEARCH function
execute more efficiently, especially on large datasets.
When using an index, VECTOR_SEARCH uses
Approximate Nearest Neighbor (ANN)
algorithms to reduce query latency and computational cost. While ANN introduces
a degree of approximation, meaning that
recall
might not be 100%, the performance improvements typically offer an advantage
for most applications.
Roles and permissions
To create a vector index, you need the
bigquery.tables.createIndex IAM permission
on the table where you're creating the index. To drop a vector index, you need
the bigquery.tables.deleteIndex permission. Each of the following predefined
IAM roles includes the permissions that you need to work with
vector indexes:
- BigQuery Data Owner (roles/bigquery.dataOwner)
- BigQuery Data Editor (roles/bigquery.dataEditor)
Choose a vector index type
BigQuery offers two vector index types, IVF and
TreeAH, each supporting
different use cases. BigQuery supports batching for vector
search by processing multiple rows of the input data in the
VECTOR_SEARCH function.
For small query batches, IVF indexes are preferred. For large query batches,
TreeAH indexes, which are built with Google's
ScaNN algorithm,
are preferred.
IVF Index
IVF is an inverted file index, which uses a k-means algorithm to cluster the
vector data, and then partitions the vector data based on those clusters. The
VECTOR_SEARCH function
can use these partitions to reduce the amount of data it needs to read in order
to determine a result.
TreeAH Index
The TreeAH index type is named for its combination of a tree-like structure and its use of Asymmetric Hashing (AH), a core quantization technique from the underlying ScaNN algorithm. A TreeAH index works as follows:
- The base table is divided into smaller, more manageable shards.
- A clustering model is trained, with the number of clusters derived from
the leaf_node_embedding_countoption in thetree_ah_optionsargument of theCREATE VECTOR INDEXstatement.
- The vectors are compressed with product quantization, a technique that reduces their memory usage. The compressed vectors are then stored in the index tables instead of the original vectors, thus reducing vector index sizes.
- When the VECTOR_SEARCHfunction runs, a candidate list for each query vector is efficiently computed using asymmetric hashing, which is hardware-optimized for approximate distance calculations. These candidates are then re-scored and re-ranked using exact embeddings.
The TreeAH algorithm is optimized for batch queries that process hundreds or more query vectors. The use of product quantization can significantly reduce latency and cost, potentially by orders of magnitude compared to IVF. However, due to increased overhead, the IVF algorithm might be better when you have a smaller number of query vectors.
We suggest you try the TreeAH index type if your use case meets the following criteria:
- Your table contains 200 million rows or fewer. 
- You frequently execute large batch queries involving hundreds or more query vectors. 
For small batch queries with the TreeAH index type, VECTOR_SEARCH might
revert to
brute-force search.
When this occurs, a
IndexUnusedReason
is provided to explain why the vector index was not utilized.
Create an IVF vector index
To create an IVF vector index, use the
CREATE VECTOR INDEX
data definition language (DDL) statement:
- Go to the BigQuery page. 
- In the query editor, run the following SQL statement: - To create a IVF vector index: - CREATE [ OR REPLACE ] VECTOR INDEX [ IF NOT EXISTS ] INDEX_NAME ON DATASET_NAME.TABLE_NAME(COLUMN_NAME) STORING(STORED_COLUMN_NAME [, ...]) OPTIONS(index_type = 'IVF', distance_type = 'DISTANCE_TYPE', ivf_options = '{"num_lists":NUM_LISTS}') - Replace the following: - INDEX_NAME: the name of the vector index you're creating. Since the index is always created in the same project and dataset as the base table, there is no need to specify these in the name.
- DATASET_NAME: the name of the dataset that contains the table.
- TABLE_NAME: the name of the table that contains the column with embeddings data.
- COLUMN_NAME: the name of a column that contains the embeddings data. The column must have a type of- ARRAY<FLOAT64>. The column can't have any child fields. All elements in the array must be non-- NULL, and all values in the column must have the same array dimensions.
- STORED_COLUMN_NAME: the name of a top-level column in the table to store in the vector index. The column type can't be- RANGE. Stored columns are not used if the table has a row-level access policy or the column has a policy tag. For information about how to enable stored columns, see Store columns and pre-filter.
- DISTANCE_TYPE: specifies the default distance type to use when performing a vector search using this index. The supported values are- EUCLIDEAN,- COSINE, and- DOT_PRODUCT.- EUCLIDEANis the default.- The index creation itself always uses - EUCLIDEANdistance for training but the distance used in the- VECTOR_SEARCHfunction can be different.- If you specify a value for the - distance_typeargument of the- VECTOR_SEARCHfunction, that value is used instead of the- DISTANCE_TYPEvalue.
- NUM_LISTS: an- INT64value that specifies the number of lists that the IVF index clusters and then partitions your vector data into. This value must be 5,000 or less. During indexing, vectors are assigned to the list corresponding to their nearest cluster centroid. If you omit this argument, BigQuery determines a default value based on your data characteristics. The default value works well for most use cases.- NUM_LISTScontrols query tuning granularity. Higher values create more lists, so you can set the- fraction_lists_to_searchoption of the- VECTOR_SEARCHfunction to scan a smaller percentage of the index. For example, scanning 1% of 100 lists as opposed to scanning 10% of 10 lists. This enables finer control of the search speed and recall but slightly increases the indexing cost. Set this argument value based on how precisely you need to tune query scope.
 
The following example creates a vector index on the embedding column
of my_table:
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>); CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding) OPTIONS(index_type = 'IVF');
The following example creates a vector index on the embedding column
of my_table, and specifies the distance type to use and the IVF options:
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>); CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding) OPTIONS(index_type = 'IVF', distance_type = 'COSINE', ivf_options = '{"num_lists": 2500}')
Create a TreeAH vector index
To create a TreeAH vector index, use the
CREATE VECTOR INDEX
data definition language (DDL) statement:
- Go to the BigQuery page. 
- In the query editor, run the following SQL statement: - CREATE [ OR REPLACE ] VECTOR INDEX [ IF NOT EXISTS ] INDEX_NAME ON DATASET_NAME.TABLE_NAME(COLUMN_NAME) STORING(STORED_COLUMN_NAME [, ...]) OPTIONS(index_type = 'TREE_AH', distance_type = 'DISTANCE_TYPE', tree_ah_options = '{"leaf_node_embedding_count":LEAF_NODE_EMBEDDING_COUNT, "normalization_type":"NORMALIZATION_TYPE"}') - Replace the following: - INDEX_NAME: the name of the vector index that you are creating. Since the index is always created in the same project and dataset as the base table, there is no need to specify these in the name.
- DATASET_NAME: the name of the dataset that contains the table.
- TABLE_NAME: the name of the table that contains the column with embeddings data.
- COLUMN_NAME: the name of a column that contains the embeddings data. The column must have a type of- ARRAY<FLOAT64>. The column can't have any child fields. All elements in the array must be non-- NULL, and all values in the column must have the same array dimensions.
- STORED_COLUMN_NAME: the name of a top-level column in the table to store in the vector index. The column type can't be- RANGE. Stored columns are not used if the table has a row-level access policy or the column has a policy tag. For information about how to enable stored columns, see Store columns and pre-filter.
- DISTANCE_TYPE: an optional argument that specifies the default distance type to use when performing a vector search using this index. The supported values are- EUCLIDEAN,- COSINE, and- DOT_PRODUCT.- EUCLIDEANis the default.- The index creation itself always uses - EUCLIDEANdistance for training but the distance used in the- VECTOR_SEARCHfunction can be different.- If you specify a value for the - distance_typeargument of the- VECTOR_SEARCHfunction, that value is used instead of the- DISTANCE_TYPEvalue.
- LEAF_NODE_EMBEDDING_COUNT: an- INT64value greater than or equal to 500 that specifies the approximate number of vectors in each leaf node of the tree that the TreeAH algorithm creates. The TreeAH algorithm divides the whole data space into a number of lists, with each list containing approximately- LEAF_NODE_EMBEDDING_COUNTdata points. A lower value creates more lists with fewer data points, while a larger value creates fewer lists with more data points. The default is 1,000, which is appropriate for most datasets.
- NORMALIZATION_TYPE: a- STRINGvalue. The supported values are- NONEor- L2. The default is- NONE. Normalization happens before any processing, for both the base table data and the query data, but doesn't modify the embedding column- COLUMN_NAMEin- TABLE_NAME. Depending on the dataset, the embedding model, and the distance type used during- VECTOR_SEARCH, normalizing the embeddings might improve recall.
 
The following example creates a vector index on the embedding column
of my_table, and specifies the distance type to use and the TreeAH options:
CREATE TABLE my_dataset.my_table(id INT64, embedding ARRAY<FLOAT64>); CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding) OPTIONS (index_type = 'TREE_AH', distance_type = 'EUCLIDEAN', tree_ah_options = '{"normalization_type": "L2"}');
Filtering
The following sections explain how pre-filters and post-filters affect vector search results, and also how to pre-filter by using stored columns and partitions in the vector index.
Pre-filters and post-filters
In BigQuery VECTOR_SEARCH operations, both pre-filtering and
post-filtering serve to refine search results, by applying conditions based on
metadata columns associated with the vector embeddings. It is important to
understand their differences, implementation, and impact in order to optimize
query performance, cost, and accuracy.
Pre-filtering and post-filtering are defined as follows:
- Pre-filtering: Applies filter conditions before the approximate nearest neighbor (ANN) search performs distance calculations on candidate vectors. This narrows the pool of vectors that are considered during the search. Consequently, pre-filtering often results in faster query times and reduced computational cost, as the ANN search evaluates fewer potential candidates.
- Post-filtering: Applies filter conditions after the initial
top_knearest neighbors have been identified by the ANN search. This refines the final result set based on the specified criteria.
The placement of your WHERE clause determines whether a filter acts as a
pre-filter or a post-filter.
To create a pre-filter, the WHERE clause of the query must apply to the
base table argument of the VECTOR_SEARCH function.
The predicate must apply to a stored column, otherwise it effectively becomes
a post-filter.
The following example shows how to create a pre-filter:
-- Pre-filter on a stored column. The index speeds up the query. SELECT * FROM VECTOR_SEARCH( (SELECT * FROM my_dataset.my_table WHERE type = 'animal'), 'embedding', TABLE my_dataset.my_testdata); -- Filter on a column that isn't stored. The index is used to search the -- entire table, and then the results are post-filtered. You might see fewer -- than 5 matches returned for some embeddings. SELECT query.test_id, base.type, distance FROM VECTOR_SEARCH( (SELECT * FROM my_dataset.my_table WHERE id = 123), 'embedding', TABLE my_dataset.my_testdata, top_k => 5); -- Use pre-filters with brute force. The data is filtered and then searched -- with brute force for exact results. SELECT query.test_id, base.type, distance FROM VECTOR_SEARCH( (SELECT * FROM my_dataset.my_table WHERE id = 123), 'embedding', TABLE my_dataset.my_testdata, options => '{"use_brute_force":true}');
To create a post-filter, the WHERE clause of the query must be applied outside
of the VECTOR_SEARCH function, so that it filters the results returned by
the search.
The following example shows how to create a post-filter:
-- Use post-filters. The index is used, but the entire table is searched and -- the post-filtering might reduce the number of results. SELECT query.test_id, base.type, distance FROM VECTOR_SEARCH( TABLE my_dataset.my_table, 'embedding', TABLE my_dataset.my_testdata, top_k => 5) WHERE base.type = 'animal'; SELECT base.id, distance FROM VECTOR_SEARCH( TABLE mydataset.base_table, 'embedding', (SELECT embedding FROM mydataset.query_table), top_k => 10 ) WHERE type = 'document' AND year > 2022
When you use post-filtering, or when the base table filters you specify
reference non-stored columns and thus act as post-filters, the final result
set might contain fewer than top_k rows, potentially even zero rows,
if the predicate is selective. If you require a specific number of results
after filtering, consider specifying a larger top_k value or increasing the
fraction_lists_to_search value in the VECTOR_SEARCH call.
In some cases, especially if the pre-filter is very selective, pre-filtering
can also reduce the size of the result set. If this happens, try increasing the
fraction_lists_to_search value in the VECTOR_SEARCH call.
Pre-filter with stored columns
To further improve the efficiency of your vector index, you can specify columns
from your base table to store in your vector index. Using stored columns can
optimize queries that call the VECTOR_SEARCH function in the following ways:
- Instead of searching an entire table, you can call the - VECTOR_SEARCHfunction on a query statement that pre-filters the base table with a- WHEREclause. If your table has an index and you filter on only stored columns, then BigQuery optimizes the query by filtering the data before searching, and then using the index to search the smaller result set. If you filter on columns that aren't stored, then BigQuery applies the filter after the table is searched, or post-filters.
- The - VECTOR_SEARCHfunction outputs a struct called- basethat contains all columns from the base table. Without stored columns, a potentially expensive join is needed to retrieve the columns stored in- base. If you use an IVF index and your query only selects stored columns from- base, then BigQuery optimizes your query to eliminate that join. For TreeAH indexes, the join with the base table is not removed. Stored columns in TreeAH indexes are only used for pre-filtering purposes.
To store columns, list them in the STORING clause of the
CREATE VECTOR INDEX statement.
Storing columns increases the size of the vector index, so it's best
to store only the most frequently used or filtered columns.
The following example creates a vector index with stored columns, and then runs a vector search query that only selects stored columns:
-- Create a table that contains an embedding. CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>, type STRING, creation_time DATETIME, id INT64); -- Create a query table that contains an embedding. CREATE TABLE my_dataset.my_testdata(embedding ARRAY<FLOAT64>, test_id INT64); -- Create a vector index with stored columns. CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding) STORING (type, creation_time) OPTIONS (index_type = 'IVF'); -- Select only stored columns from a vector search to avoid an expensive join. SELECT query, base.type, distance FROM VECTOR_SEARCH( TABLE my_dataset.my_table, 'embedding' TABLE my_dataset.my_testdata);
Stored column limitations
- If the mode, type, or schema of a column is changed in the base table, and if it is a stored column in the vector index, then there can be a delay before that change is reflected in the vector index. Until the updates have been applied to the index, the vector search queries use the modified stored columns from the base table.
- If you select a column of type STRUCTfrom thequeryoutput of aVECTOR_SEARCHquery on a table that has an index with stored columns, then the whole query might fail.
Pre-filter with partitions
If the table that you are creating the vector index on is partitioned, you can choose to also partition the vector index. Partitioning the vector index has the following benefits:
- Partition pruning is applied to the vector indexes in addition to the table partitions. Partition pruning occurs when the vector search uses a qualifying filter on the value of the partitioning column. This allows BigQuery to scan the partitions that match the filter and skip the remaining partitions. Partition pruning can decrease I/O costs. For more information on partition pruning, see Query partitioned tables.
- The vector search is less likely to miss results if you pre-filter on the partitioning column.
You can only partition TreeAH vector indexes.
Partitioning a vector index is only recommended if you use pre-filtering to limit most of your vector searches to a few partitions.
To create a partitioned index, use the PARTITION BY clause of the
CREATE VECTOR INDEX statement. The
PARTITION BY clause that you specify in the CREATE VECTOR INDEX statement
must be the same as the PARTITION BY clause specified in the
CREATE TABLE statement
of the table that you are creating the vector index on, as shown in the
following example:
-- Create a date-partitioned table. CREATE TABLE my_dataset.my_table( embeddings ARRAYid INT64, date DATE, ) PARTITION BY date; -- Create a partitioned vector index on the table. CREATE VECTOR INDEX my_index ON my_dataset.my_table(embeddings) PARTITION BY date OPTIONS(index_type='TREE_AH', distance_type='COSINE'); 
If the table uses integer range or time-unit column partitioning, the
partitioning column is stored in the vector index, which increases storage cost.
If a table column is used in both the STORING and PARTITION BY clauses of
the CREATE VECTOR INDEX statement, the column is stored only once.
To use the vector index partition, filter on the partitioning column in the
base table subquery of the VECTOR_SEARCH statement. In the following example,
the samples.items table is partitioned by the produced_date column,
so the base table subquery in the VECTOR_SEARCH statement filters on the
produced_date column:
SELECT query.id, base.id, distance FROM VECTOR_SEARCH( (SELECT * FROM my_dataset.my_table WHERE produced_date = '2025-01-01'), 'embedding', TABLE samples.test, distance_type => 'COSINE', top_k => 10 );
Examples
Create a partitioned vector index on a datetime-partitioned table:
-- Create a datetime-partitioned table. CREATE TABLE my_dataset.my_table( id INT64, produced_date DATETIME, embeddings ARRAY) PARTITION BY produced_date; -- Create a partitioned vector index on the table. CREATE VECTOR INDEX index0 ON my_dataset.my_table(embeddings) PARTITION BY produced_date OPTIONS(index_type='TREE_AH', distance_type='COSINE'); 
Create a partitioned vector index on a timestamp-partitioned table:
-- Create a timestamp-partitioned table. CREATE TABLE my_dataset.my_table( id INT64, produced_time TIMESTAMP, embeddings ARRAY) PARTITION BY TIMESTAMP_TRUNC(produced_time, HOUR); -- Create a partitioned vector index on the table. CREATE VECTOR INDEX index0 ON my_dataset.my_table(embeddings) PARTITION BY TIMESTAMP_TRUNC(produced_time, HOUR) OPTIONS(index_type='TREE_AH', distance_type='COSINE'); 
Create a partitioned vector index on an integer range-partitioned table:
-- Create a integer range-partitioned table. CREATE TABLE my_dataset.my_table( id INT64, embeddings ARRAY) PARTITION BY RANGE_BUCKET(id, GENERATE_ARRAY(-100, 100, 20)); -- Create a partitioned vector index on the table. CREATE VECTOR INDEX index0 ON my_dataset.my_table(embeddings) PARTITION BY RANGE_BUCKET(id, GENERATE_ARRAY(-100, 100, 20)) OPTIONS(index_type='TREE_AH', distance_type='COSINE'); 
Create a partitioned vector index on an ingestion time-partitioned table:
-- Create a ingestion time-partitioned table. CREATE TABLE my_dataset.my_table( id INT64, embeddings ARRAY) PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, DAY); -- Create a partitioned vector index on the table. CREATE VECTOR INDEX index0 ON my_dataset.my_table(embeddings) PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, DAY) OPTIONS(index_type='TREE_AH', distance_type='COSINE'); 
Pre-filtering limitations
- You can't use logical views in your pre-filter.
- If your pre-filter contains a subquery, it might interfere with index usage.
Understanding when data is indexed
Vector indexes are fully managed by BigQuery and are automatically refreshed when the indexed table changes.
Indexing is asynchronous. There is a delay between adding new rows to the
base table and the new rows being reflected in the index. However, the
VECTOR_SEARCH function still takes all rows into account and doesn't miss
unindexed rows. The function searches using the index for indexed records, and
uses brute force search for the records that aren't yet indexed.
If you create a vector index on a table that is smaller than 10 MB, then the
vector index isn't populated. Similarly, if you delete data from an indexed
table and the table size falls below 10 MB, then the vector index is temporarily
disabled. In this case, vector search queries don't use the index and the
indexUnusedReasons code in the
vectorSearchStatistics
section of the Job resource is BASE_TABLE_TOO_SMALL. Without the index,
VECTOR_SEARCH automatically falls back to using brute force to find the
nearest neighbors of embeddings.
If you delete the indexed column in a table, or rename the table itself, the vector index is automatically deleted.
Monitoring the status of vector indexes
You can monitor the health of your vector indexes by querying
INFORMATION_SCHEMA views. The following views contain metadata on vector
indexes:
- The - INFORMATION_SCHEMA.VECTOR_INDEXESview has information about the vector indexes in a dataset.- After the - CREATE VECTOR INDEXstatement completes, the index must still be populated before you can use it. You can use the- last_refresh_timeand- coverage_percentagecolumns to verify the readiness of a vector index. If the vector index isn't ready, you can still use the- VECTOR_SEARCHfunction on a table, it just might run more slowly without the index.
- The - INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNSview has information about the vector-indexed columns for all tables in a dataset.
- The - INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONSview has information about the options used by the vector indexes in a dataset.
Vector index examples
The following example shows all active vector indexes on tables in the dataset
my_dataset, located in the project my_project. It includes their names, the
DDL statements used to create them, and their coverage percentage. If an
indexed base table is less than 10 MB, then its index is not populated, in
which case the coverage_percentage value is 0.
SELECT table_name, index_name, ddl, coverage_percentage FROM my_project.my_dataset.INFORMATION_SCHEMA.VECTOR_INDEXES WHERE index_status = 'ACTIVE';
The result is similar to the following:
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
| table_name | index_name | ddl                                                                                             | coverage_percentage |
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
| table1     | indexa     | CREATE VECTOR INDEX `indexa` ON `my_project.my_dataset.table1`(embeddings)                      | 100                 |
|            |            | OPTIONS (distance_type = 'EUCLIDEAN', index_type = 'IVF', ivf_options = '{"num_lists": 100}')   |                     |
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
| table2     | indexb     | CREATE VECTOR INDEX `indexb` ON `my_project.my_dataset.table2`(vectors)                         | 42                  |
|            |            | OPTIONS (distance_type = 'COSINE', index_type = 'IVF', ivf_options = '{"num_lists": 500}')      |                     |
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
| table3     | indexc     | CREATE VECTOR INDEX `indexc` ON `my_project.my_dataset.table3`(vectors)                         | 98                  |
|            |            | OPTIONS (distance_type = 'DOT_PRODUCT', index_type = 'TREE_AH',                                 |                     |
|            |            |          tree_ah_options = '{"leaf_node_embedding_count": 1000, "normalization_type": "NONE"}') |                     |
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
Vector index columns examples
The following query extracts information on columns that have vector indexes:
SELECT table_name, index_name, index_column_name, index_field_path FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS;
The result is similar to the following:
+------------+------------+-------------------+------------------+ | table_name | index_name | index_column_name | index_field_path | +------------+------------+-------------------+------------------+ | table1 | indexa | embeddings | embeddings | | table2 | indexb | vectors | vectors | | table3 | indexc | vectors | vectors | +------------+------------+-------------------+------------------+
Vector index options examples
The following query extracts information on vector index options:
SELECT table_name, index_name, option_name, option_type, option_value FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS;
The result is similar to the following:
+------------+------------+------------------+------------------+-------------------------------------------------------------------+
| table_name | index_name | option_name      | option_type      | option_value                                                      |
+------------+------------+------------------+------------------+-------------------------------------------------------------------+
| table1     | indexa     | index_type       | STRING           | IVF                                                               |
| table1     | indexa     | distance_type    | STRING           | EUCLIDEAN                                                         |
| table1     | indexa     | ivf_options      | STRING           | {"num_lists": 100}                                                |
| table2     | indexb     | index_type       | STRING           | IVF                                                               |
| table2     | indexb     | distance_type    | STRING           | COSINE                                                            |
| table2     | indexb     | ivf_options      | STRING           | {"num_lists": 500}                                                |
| table3     | indexc     | index_type       | STRING           | TREE_AH                                                           |
| table3     | indexc     | distance_type    | STRING           | DOT_PRODUCT                                                       |
| table3     | indexc     | tree_ah_options  | STRING           | {"leaf_node_embedding_count": 1000, "normalization_type": "NONE"} |
+------------+------------+------------------+------------------+-------------------------------------------------------------------+
Verifying vector index usage
Information on vector index usage is available in the job metadata of the job that ran the vector search query. You can view job metadata by using the Google Cloud console, the bq command-line tool, the BigQuery API, or the client libraries.
When you use the Google Cloud console, you can find vector index usage information in the Vector Index Usage Mode and Vector Index Unused Reasons fields.
When you use the bq tool or the BigQuery API, you can
find vector index usage information in the
VectorSearchStatistics
section of the Job resource.
The index usage mode indicates whether a vector index was used by providing one of the following values:
- UNUSED: No vector index was used.
- PARTIALLY_USED: Some- VECTOR_SEARCHfunctions in the query used vector indexes and some didn't.
- FULLY_USED: Every- VECTOR_SEARCHfunction in the query used a vector index.
When the index usage mode value is UNUSED or PARTIALLY_USED,
the index unused reasons indicate why vector indexes weren't used in the query.
For example, the following results returned by
bq show --format=prettyjson -j my_job_id shows that the index was not used
because the use_brute_force option was specified in the VECTOR_SEARCH
function:
"vectorSearchStatistics": {
  "indexUnusedReasons": [
    {
      "baseTable": {
        "datasetId": "my_dataset",
        "projectId": "my_project",
        "tableId": "my_table"
      },
      "code": "INDEX_SUPPRESSED_BY_FUNCTION_OPTION",
      "message": "No vector index was used for the base table `my_project:my_dataset.my_table` because use_brute_force option has been specified."
    }
  ],
  "indexUsageMode": "UNUSED"
}
Index management options
To create indexes and have BigQuery maintain them, you have two options:
- Use the default shared slot pool: When the data you plan to index is below your per-organization limit, you can use the free shared slot pool for index management.
- Use your own reservation: To achieve more predictable and consistent indexing progress on your larger production workloads, you can use your own reservations for index management.
Use shared slots
If you have not configured your project to use a dedicated reservation for indexing, index management is handled in the free, shared slot pool, subject to the following constraints.
If you add data to a table which causes the total size of indexed
tables to exceed your organization's limit,
BigQuery pauses index management
for that table. When this happens, the index_status field in the
INFORMATION_SCHEMA.VECTOR_INDEXES view
displays PENDING DISABLEMENT and the index is queued for deletion. While
the index is pending disablement, it is
still used in queries and you are charged for the index storage.
After the index is deleted, the index_status field shows
the index as TEMPORARILY DISABLED. In this state, queries don't use the index,
and you are not charged for index storage. In this case, the
IndexUnusedReason code
is BASE_TABLE_TOO_LARGE.
If you delete data from the table and the total size of indexed tables
falls below the per-organization limit, then index management is resumed. The
index_status field in the
INFORMATION_SCHEMA.VECTOR_INDEXES
view is ACTIVE, queries can use the index, and you are charged for the
index storage.
You can use the INFORMATION_SCHEMA.SEARCH_INDEXES_BY_ORGANIZATION view
to understand your current consumption towards the per-organization limit in a
given region, broken down by projects and tables.
BigQuery does not make guarantees about the available capacity of the shared pool or the throughput of indexing you see. For production applications, you might want to use dedicated slots for your index processing.
Use your own reservation
Instead of using the default shared slot pool, you can optionally designate your own reservation to index your tables. Using your own reservation ensures predictable and consistent performance of index-management jobs, such as creation, refresh, and background optimizations.
- There are no table size limits when an indexing job runs in your reservation.
- Using your own reservation gives you flexibility in your index management. If you need to create a very large index or make a major update to an indexed table, you can temporarily add more slots to the assignment.
To index the tables in a project with a designated reservation,
create a reservation
in the region where your tables are located. Then, assign the project to the
reservation with the job_type set to BACKGROUND:
SQL
Use the
CREATE ASSIGNMENT DDL statement.
- In the Google Cloud console, go to the BigQuery page. 
- In the query editor, enter the following statement: - CREATE ASSIGNMENT `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME.ASSIGNMENT_ID` OPTIONS ( assignee = 'projects/PROJECT_ID', job_type = 'BACKGROUND'); - Replace the following: - ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource
- LOCATION: the location of the reservation
- RESERVATION_NAME: the name of the reservation
- ASSIGNMENT_ID: the ID of the assignment- The ID must be unique to the project and location, start and end with a lowercase letter or a number, and contain only lowercase letters, numbers, and dashes. 
- PROJECT_ID: the ID of the project containing the tables to index. This project is assigned to the reservation.
 
- Click Run. 
For more information about how to run queries, see Run an interactive query.
bq
Use the bq mk command:
bq mk \
    --project_id=ADMIN_PROJECT_ID \
    --location=LOCATION \
    --reservation_assignment \
    --reservation_id=RESERVATION_NAME \
    --assignee_id=PROJECT_ID \
    --job_type=BACKGROUND \
    --assignee_type=PROJECT
Replace the following:
- ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource
- LOCATION: the location of the reservation
- RESERVATION_NAME: the name of the reservation
- PROJECT_ID: the ID of the project to assign to this reservation
View your indexing jobs
A new indexing job is created every time an index is created or updated on
a single table. To view information about the job, query the
INFORMATION_SCHEMA.JOBS* views. You
can filter for indexing jobs by
setting job_type IS NULL AND SEARCH(job_id, '`search_index`') in the WHERE
clause of your query. The following example lists the five most recent indexing
jobs in the project my_project:
SELECT * FROM region-us.INFORMATION_SCHEMA.JOBS WHERE project_id = 'my_project' AND job_type IS NULL AND SEARCH(job_id, '`search_index`') ORDER BY creation_time DESC LIMIT 5;
Choose your reservation size
To choose the right number of slots for your reservation, you should consider when index-management jobs are run, how many slots they use, and what your usage looks like over time. BigQuery triggers an index-management job in the following situations:
- You create an index on a table.
- Data is modified in an indexed table.
- The schema of a table changes and this affects which columns are indexed.
- Index data and metadata are periodically optimized or updated.
The number of slots you need for an index-management job on a table depends on the following factors:
- The size of the table
- The rate of data ingestion to the table
- The rate of DML statements applied to the table
- The acceptable delay for building and maintaining the index
- The complexity of the index, typically determined by attributes of the data, such as the number of duplicate terms
Monitor Usage and Progress
The best way to assess the number of slots you need to efficiently run your
index-management jobs is to monitor your slot utilization and adjust the
reservation size accordingly. The following query produces the daily slot usage
for index-management jobs. Only the past 30 days are included in the
region us-west1:
SELECT TIMESTAMP_TRUNC(job.creation_time, DAY) AS usage_date, -- Aggregate total_slots_ms used for index-management jobs in a day and divide -- by the number of milliseconds in a day. This value is most accurate for -- days with consistent slot usage. SAFE_DIVIDE(SUM(job.total_slot_ms), (1000 * 60 * 60 * 24)) AS average_daily_slot_usage FROM `region-us-west1`.INFORMATION_SCHEMA.JOBS job WHERE project_id = 'my_project' AND job_type IS NULL AND SEARCH(job_id, '`search_index`') GROUP BY usage_date ORDER BY usage_date DESC limit 30;
When there are insufficient slots to run index-management jobs, an index can become out of sync with its table and indexing jobs might fail. In this case, BigQuery rebuilds the index from scratch. To avoid having an out-of-sync index, ensure you have enough slots to support index updates from data ingestion and optimization. For more information on monitoring slot usage, see admin resource charts.
Rebuild a vector index
When table data changes significantly after a vector index is created, the vector index can become less efficient. When a vector index is less efficient, a vector search query that initially had high recall when using the index will have lower recall, because the data distribution shift in the base table isn't represented in the vector index.
If you want to improve recall without increasing search query latency, rebuild
the vector index. Alternatively, you can increase the value of the vector
search's fraction_lists_to_search option to improve recall, but this
typically makes the search query slower.
You can use the
VECTOR_INDEX.STATISTICS function
to calculate how much an indexed table's data has drifted between when a
vector index was created and the present. If table data has changed enough
to require a vector index rebuild, you can use the
ALTER VECTOR INDEX REBUILD statement
to rebuild the vector index.
Follow these steps to rebuild a vector index:
- Go to the BigQuery page. 
- In the query editor, run the following SQL statement to check the indexed table's data drift: - SELECT * FROM VECTOR_INDEX.STATISTICS(TABLE DATASET_NAME.TABLE_NAME); - Replace the following: - DATASET_NAME: the name of the dataset that contains the indexed table.
- TABLE_NAME: the name of the table that contains the vector index.
 - The function returns a - FLOAT64value in the range- [0,1). A lower value indicates less drift. Typically, a value of- 0.3or greater is considered a significant enough change to indicate that a vector index rebuild might be beneficial.
- If the - VECTOR_INDEX.STATISTICSfunction indicates that table data drift is significant, run the following SQL statement to rebuild the vector index:- ALTER VECTOR INDEX IF EXISTS INDEX_NAME ON DATASET_NAME.TABLE_NAME REBUILD; - Replace the following: - INDEX_NAME: the name of the vector index that you are rebuilding.
- DATASET_NAME: the name of the dataset that contains the indexed table.
- TABLE_NAME: the name of the table that contains the vector index.
 
Delete a vector index
When you no longer need a vector index or want to change which column is
indexed on a table, you can delete the index on that table by using the
DROP VECTOR INDEX DDL statement:
- Go to the BigQuery page. 
- In the query editor, run the following SQL statement: - DROP VECTOR INDEX INDEX_NAME ON DATASET_NAME.TABLE_NAME; - Replace the following: - INDEX_NAME: the name of the vector index that you are deleting.
- DATASET_NAME: the name of the dataset that contains the indexed table.
- TABLE_NAME: the name of the table that contains the vector index.
 
If an indexed table is deleted, its index is deleted automatically.
Export embeddings to Vertex AI Vector Search
To enable ultra-low latency online applications, use BigQuery integration with Vertex AI Vector Search to import your BigQuery embeddings into Vector Search and deploy low latency endpoints. For more information, see Import index data from BigQuery.
What's next
- For an overview of vector index use cases, pricing, and limitations, see the Introduction to vector search.
- Learn how to perform a vector search using the
VECTOR_SEARCHfunction.
- Learn more about the
CREATE VECTOR INDEXstatement.
- Try the Search embeddings with vector search tutorial.