Accelerate pattern-matching expressions

Spanner search indexes can accelerate pattern matching expressions such as LIKE, STARTS_WITH, ENDS_WITH, and regular expression matching predicate REGEXP_CONTAINS. This page describes how to create and configure a search index using TOKENIZE_NGRAMS to accelerate pattern matching predicates.

Configure an n-gram TOKENLIST for pattern-matching acceleration

To enable pattern-matching expressions acceleration, tokenize a lower-cased STRING column with TOKENIZE_NGRAMS and store the STRING column using the STORING clause in GoogleSQL, or INCLUDE clause in PostgreSQL.

GoogleSQL

CREATE TABLE Albums (
AlbumId INT64 NOT NULL,
AlbumTitle STRING(MAX),
AlbumTitle_Ngram_Tokens TOKENLIST AS (
  TOKENIZE_NGRAMS(LOWER(AlbumTitle), ngram_size_min=>3, ngram_size_max=>4)) HIDDEN,
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex
ON Albums(AlbumTitle_Ngram_Tokens) STORING (AlbumTitle);

PostgreSQL

CREATE TABLE albums (
albumid bigint NOT NULL,
album_title varchar,
album_title_ngrams_tokens spanner.tokenlist GENERATED ALWAYS AS (
  spanner.tokenize_ngrams(
    lower(album_title),
    ngram_size_min => 3,
    ngram_size_max => 4
  )
) VIRTUAL HIDDEN,
PRIMARY KEY(albumid));

CREATE SEARCH INDEX albumsidx ON
albums(album_title_ngrams_tokens) INCLUDE (album_title);

Automatic acceleration of queries with pattern-matching predicates

The query optimizer might choose to accelerate the following queries using AlbumsIndex with AlbumTitle_Ngram_Tokens. Optionally, the query can provide @{force_index = AlbumsIndex} to force the optimizer to use AlbumsIndex.

GoogleSQL

In GoogleSQL, we accelerate LIKE, STARTS_WITH, ENDS_WITH, and REGEXP_CONTAINS.

  • LIKE predicate:

    SELECT AlbumId
    FROM Albums @{FORCE_INDEX=AlbumsIndex}
    WHERE AlbumTitle LIKE "%999%";
    
  • STARTS_WITH predicate:

    SELECT AlbumId
    FROM Albums @{FORCE_INDEX=AlbumsIndex}
    WHERE STARTS_WITH(AlbumTitle, "apple")
    
  • ENDS_WITH predicate:

    SELECT AlbumId
    FROM Albums @{FORCE_INDEX=AlbumsIndex}
    WHERE ENDS_WITH(AlbumTitle, "apple")
    
  • REGEXP_CONTAINS predicate:

    SELECT AlbumId
    FROM Albums @{FORCE_INDEX=AlbumsIndex}
    WHERE REGEXP_CONTAINS(AlbumTitle, r"(good|great)[ ]+morning")
    

PostgreSQL

In PostgreSQL, we accelerate LIKE and STARTS_WITH.

  • LIKE predicate:

    SELECT albumid
    FROM albums /*@ FORCE_INDEX = albumsidx */
    WHERE album_title like '%999%';
    
  • STARTS_WITH predicate:

    SELECT albumid
    FROM albums /*@ FORCE_INDEX = albumsidx */
    WHERE starts_with(album_title, 'apple')
    

Prerequisites on acceleration

For Spanner to enable this acceleration, the following rules must be met:

  • The index must store the STRING column using the STORING clause in GoogleSQL, or INCLUDE clause in PostgreSQL. This prevents costly back-joins to the base table during post-filtering, which is critical for performance when the search over-retrieves documents.
  • The STRING column must be tokenized using TOKENIZE_NGRAMS.
  • The tokenization must apply to LOWER(column_name) rather than column_name.
  • The LIKE pattern, STARTS_WITH prefix, ENDS_WITH suffix, or REGEXP_CONTAINS regular expression must be specified as a constant literal. Query parameters are not supported to avoid acceleration on patterns that are too short.
  • The LIKE pattern, STARTS_WITH prefix, ENDS_WITH suffix, or REGEXP_CONTAINS regular expression must contain enough text for at least one n-gram. For example r".*" doesn't qualify because there's no sequence of characters to match. Similarly, if the ngram minimum size is set to 3, the LIKE predicate "%ab%" doesn't qualify because "ab" (size 2) is too short.

What's next