Create and manage RUM index

This document shows you how to create the RUM extension and create indexes to optimize full-text search in AlloyDB for PostgreSQL. It provides examples for common use cases, including ranking, phrase searching, and sorting by timestamp.

Before you begin

To create the RUM extension, you must have the alloydbsuperuser database role.

The AlloyDB Admin (roles/alloydb.admin) IAM role grants full control over AlloyDB resources, but it doesn't grant the alloydbsuperuser database role. To create the extension, an administrator must explicitly grant you the alloydbsuperuser database role.

For more information about granting roles, see Add an IAM user or service account to a cluster.

Create the RUM extension

You must create the RUM extension once per database.

Connect to your AlloyDB database using psql or another client. For more information, see Connect to a cluster instance.
Run the following SQL command to create the extension:
```
CREATE EXTENSION IF NOT EXISTS rum;
```

Create a RUM index

To optimize full-text search queries, create a RUM index on your data. RUM offers several operator classes for different use cases.

Types of RUM operator classes

The following table summarizes the different RUM operator classes and their primary use cases.

Operator Class	Main Use Case	Limitations
`rum_tsvector_ops`	Standard full-text search with ranking and phrase search.	N/A
`rum_tsvector_hash_ops`	Smaller index and faster updates for full-text search.	Does not support prefix searching.
`rum_tsvector_addon_ops`	Full-text search sorted by another column.	N/A
`rum_anyarray_ops`	Searching within array columns.	N/A
`rum_<TYPE>_ops`	Indexing scalar types for distance-based queries.	N/A
`rum_tsvector_hash_addon_ops`	Hash-based full-text search sorted by another column.	Does not support prefix matching.
`rum_tsquery_ops`	Indexing stored `tsquery` values for reverse search.	N/A
`rum_anyarray_addon_ops`	Array search sorted by another column.	N/A

Index for basic full-text search

Use the rum_tsvector_ops operator class for standard text search that requires fast ranking and phrase search capabilities. This operator class stores the position of each lexeme in the index. The following example creates a table named documents with a content column.

Create a table named documents:

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  content TEXT NOT NULL,
  published_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

Populate the documents table with sample data:

INSERT INTO documents (title, content) VALUES
  ('Title', 'This search engine is working as intended');

Add a generated tsvector column to your table. This column automatically stores the processed text and improves query performance:

ALTER TABLE documents
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

Create the RUM index on the new search_vector column:

CREATE INDEX idx_docs_rum
ON documents
USING rum (search_vector rum_tsvector_ops);

Query the table using the index. The <=> operator computes the relevance score, or distance, between the document and the query directly from the index, enabling fast sorting:

SELECT title, content
FROM documents
WHERE search_vector @@ to_tsquery('english', 'search <-> engine')
ORDER BY search_vector <=> to_tsquery('english', 'search <-> engine');

Populate documents table with more data:

INSERT INTO documents (title, content) VALUES ('Title1', 'English is my primary language.');
INSERT INTO documents (title, content) VALUES ('Title2', 'Google has a great engineering culture');

Run a prefix search query. This finds documents containing words that start with eng, such as engineer or english:
```
SELECT title, content
FROM documents
WHERE search_vector @@ to_tsquery('english', 'eng:*');
```

Index for optimized hash search

Use the rum_tsvector_hash_ops operator class to reduce index size and improve update speeds. This class stores a hash of each lexeme instead of the full lexeme. This approach results in a smaller index but does not support prefix searching. The following example assumes that you have a table named documents with a search_vector column.

Create the RUM index using the hash operator class:

CREATE INDEX idx_docs_rum_hash
ON documents
USING rum (search_vector rum_tsvector_hash_ops);

Populate documents table with more data:

INSERT INTO documents (title, content) VALUES ('Title3', 'That person was driving incredibly fast, however the routing was not very efficient');

Run a standard match query:

SELECT * FROM documents WHERE search_vector @@ to_tsquery('english', 'fast & efficient');

Index for search sorted by timestamp

Use the rum_tsvector_addon_ops operator class to optimize queries that filter by text and sort by another field, such as a timestamp. This pattern stores the additional field's value directly in the index, which avoids a slow sort operation after the search. The following example assumes that you have a table named documents with a search_vector column and a published_at column.

Create an index that includes the published_at timestamp:

CREATE INDEX idx_docs_rum_timestamp
ON documents
USING rum (search_vector rum_tsvector_addon_ops, published_at)
WITH (attach = 'published_at', to = 'search_vector');

Run a query that finds documents containing the word engine and sorts them by publication date. The index handles both the search and the sort efficiently:
```
SELECT title, published_at
FROM documents
WHERE search_vector @@ to_tsquery('english', 'engine')
ORDER BY published_at DESC;
```

Index for array search

Use the rum_anyarray_ops operator class to index array columns, such as a list of tags. This lets you efficiently query for arrays that overlap (&&), contain (@>), or are contained by (<@) other arrays. The following example adds a tags column to the documents table.

Add a tags column and populate it with data:

ALTER TABLE documents 
ADD COLUMN tags TEXT[];

INSERT INTO documents (title, content, tags) VALUES ( 'Title4', 'Sample Text', ARRAY['ai', 'ml'] );

Create the RUM index on a TEXT[] column named tags:

CREATE INDEX idx_tags_rum
ON documents
USING rum (tags rum_anyarray_ops);

Run a query to find documents that have either ai or ml in their tags:
```
SELECT * FROM documents WHERE tags && '{"ai", "ml"}';
```

Index for scalar types

Use the rum_<TYPE>_ops operator classes to index columns that contain continuous values, such as integers, timestamps, or floating-point numbers. These operator classes let you use the <=> operator to efficiently calculate the distance between values. The following example assumes that you have a table named documents.

Add a generic integer column, such as rating, to the documents table:

ALTER TABLE documents
ADD COLUMN rating INT;

UPDATE documents 
SET rating = floor(random() * 5 + 1);

Create a RUM index on the rating column:

CREATE INDEX idx_rating_rum
ON documents
USING rum (rating rum_int4_ops);

Run a query to find documents with a rating closest to the value 5:

SELECT title, rating
FROM documents
ORDER BY rating <=> 5;

Index for optimized hash search sorted by timestamp

Use the rum_tsvector_hash_addon_ops operator class to combine the benefits of a hash index with the sorting capabilities of an addon index. This class stores a hash of each lexeme along with the value of an additional column. This configuration supports efficient sorting by the additional column but does not support prefix matching. The following example assumes that you have a table named documents with a search_vector column and a published_at timestamp column.

Create a RUM index that uses the hash operator class and includes the published_at timestamp:

CREATE INDEX idx_docs_rum_hash_timestamp
ON documents
USING rum (search_vector rum_tsvector_hash_addon_ops, published_at)
WITH (attach = 'published_at', to = 'search_vector');

Run a query that finds documents containing engine and sorts them by publication date:

SELECT title, published_at
FROM documents
WHERE search_vector @@ to_tsquery('english', 'engine')
ORDER BY published_at DESC;

Index for stored queries

Use the rum_tsquery_ops operator class to index tsquery values. This lets you perform "reverse search," identifying which stored queries match a given input document. The following example creates a table named queries.

Create a table to store queries:

CREATE TABLE queries (
query_text tsquery
);
INSERT INTO queries (query_text) VALUES (plainto_tsquery('AlloyDB is fast!'));

Create a RUM index on the query_text column:

CREATE INDEX idx_queries_rum
ON queries
USING rum (query_text rum_tsquery_ops);

Run a query to find stored queries that match a document:

SELECT *
FROM queries
WHERE to_tsvector('english', 'AlloyDB is fast') @@ query_text;

Index for array search sorted by timestamp

Use the rum_anyarray_addon_ops operator class to index array columns along with an additional column for sorting. The following example assumes that you have a table named documents with a tags column and a published_at timestamp column.

Create a RUM index on the tags column that includes the published_at timestamp:

CREATE INDEX idx_tags_rum_timestamp
ON documents
USING rum (tags rum_anyarray_addon_ops, published_at)
WITH (attach = 'published_at', to = 'tags');

Run a query to find documents with the ai tag, sorted by publication date:

SELECT title, published_at
FROM documents
WHERE tags @> '{"ai"}'
ORDER BY published_at DESC;

What's next

Learn about Full-text search.
Learn how to Run a hybrid vector similarity search.