Spanner search indexes
can accelerate pattern matching expressions such as
LIKE,
STARTS_WITH,
ENDS_WITH,
and regular expression matching predicate
REGEXP_CONTAINS.
This page describes how to create and configure a search index using
TOKENIZE_NGRAMS
to accelerate pattern matching predicates.
Configure an n-gram TOKENLIST for pattern-matching acceleration
To enable pattern-matching expressions acceleration, tokenize a lower-cased
STRING column with
TOKENIZE_NGRAMS
and store the STRING column using the STORING clause in
GoogleSQL, or INCLUDE clause in PostgreSQL.
GoogleSQL
CREATE TABLE Albums (
AlbumId INT64 NOT NULL,
AlbumTitle STRING(MAX),
AlbumTitle_Ngram_Tokens TOKENLIST AS (
TOKENIZE_NGRAMS(LOWER(AlbumTitle), ngram_size_min=>3, ngram_size_max=>4)) HIDDEN,
) PRIMARY KEY(AlbumId);
CREATE SEARCH INDEX AlbumsIndex
ON Albums(AlbumTitle_Ngram_Tokens) STORING (AlbumTitle);
PostgreSQL
CREATE TABLE albums (
albumid bigint NOT NULL,
album_title varchar,
album_title_ngrams_tokens spanner.tokenlist GENERATED ALWAYS AS (
spanner.tokenize_ngrams(
lower(album_title),
ngram_size_min => 3,
ngram_size_max => 4
)
) VIRTUAL HIDDEN,
PRIMARY KEY(albumid));
CREATE SEARCH INDEX albumsidx ON
albums(album_title_ngrams_tokens) INCLUDE (album_title);
Automatic acceleration of queries with pattern-matching predicates
The query optimizer might choose to accelerate the following queries using
AlbumsIndex with AlbumTitle_Ngram_Tokens. Optionally, the query can provide
@{force_index = AlbumsIndex} to force the optimizer to use AlbumsIndex.
GoogleSQL
In GoogleSQL, we accelerate LIKE,
STARTS_WITH,
ENDS_WITH,
and
REGEXP_CONTAINS.
LIKEpredicate:SELECT AlbumId FROM Albums @{FORCE_INDEX=AlbumsIndex} WHERE AlbumTitle LIKE "%999%";STARTS_WITHpredicate:SELECT AlbumId FROM Albums @{FORCE_INDEX=AlbumsIndex} WHERE STARTS_WITH(AlbumTitle, "apple")ENDS_WITHpredicate:SELECT AlbumId FROM Albums @{FORCE_INDEX=AlbumsIndex} WHERE ENDS_WITH(AlbumTitle, "apple")REGEXP_CONTAINSpredicate:SELECT AlbumId FROM Albums @{FORCE_INDEX=AlbumsIndex} WHERE REGEXP_CONTAINS(AlbumTitle, r"(good|great)[ ]+morning")
PostgreSQL
In PostgreSQL, we accelerate
LIKE and
STARTS_WITH.
LIKEpredicate:SELECT albumid FROM albums /*@ FORCE_INDEX = albumsidx */ WHERE album_title like '%999%';STARTS_WITHpredicate:SELECT albumid FROM albums /*@ FORCE_INDEX = albumsidx */ WHERE starts_with(album_title, 'apple')
Prerequisites on acceleration
For Spanner to enable this acceleration, the following rules must be met:
- The index must store the
STRINGcolumn using theSTORINGclause in GoogleSQL, orINCLUDEclause in PostgreSQL. This prevents costly back-joins to the base table during post-filtering, which is critical for performance when the search over-retrieves documents. - The
STRINGcolumn must be tokenized usingTOKENIZE_NGRAMS. - The tokenization must apply to
LOWER(column_name)rather thancolumn_name. - The
LIKEpattern,STARTS_WITHprefix,ENDS_WITHsuffix, orREGEXP_CONTAINSregular expression must be specified as a constant literal. Query parameters are not supported to avoid acceleration on patterns that are too short. - The
LIKEpattern,STARTS_WITHprefix,ENDS_WITHsuffix, orREGEXP_CONTAINSregular expression must contain enough text for at least one n-gram. For exampler".*"doesn't qualify because there's no sequence of characters to match. Similarly, if the ngram minimum size is set to 3, theLIKEpredicate"%ab%"doesn't qualify because"ab"(size 2) is too short.
What's next
- Learn about finding approximate matches with fuzzy search.