Accelerate pattern-matching expressions

Spanner search indexes can accelerate pattern matching expressions such as LIKE , STARTS_WITH , ENDS_WITH , and regular expression matching predicate REGEXP_CONTAINS . This page describes how to create and configure a search index using TOKENIZE_NGRAMS to accelerate pattern matching predicates.

Configure an n-gram TOKENLIST for pattern-matching acceleration

To enable pattern-matching expressions acceleration, tokenize a lower-cased STRING column with TOKENIZE_NGRAMS and store the STRING column using the STORING clause in GoogleSQL, or INCLUDE clause in PostgreSQL.

GoogleSQL

  CREATE 
  
 TABLE 
  
 Albums 
  
 ( 
 AlbumId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
 AlbumTitle 
  
 STRING 
 ( 
 MAX 
 ), 
 AlbumTitle_Ngram_Tokens 
  
 TOKENLIST 
  
 AS 
  
 ( 
  
 TOKENIZE_NGRAMS 
 ( 
 LOWER 
 ( 
 AlbumTitle 
 ), 
  
 ngram_size_min 
 = 
> 3 
 , 
  
 ngram_size_max 
 = 
> 4 
 )) 
  
 HIDDEN 
 , 
 ) 
  
 PRIMARY 
  
 KEY 
 ( 
 AlbumId 
 ); 
 CREATE 
  
 SEARCH 
  
 INDEX 
  
 AlbumsIndex 
 ON 
  
 Albums 
 ( 
 AlbumTitle_Ngram_Tokens 
 ) 
  
 STORING 
  
 ( 
 AlbumTitle 
 ); 
 

PostgreSQL

  CREATE 
  
 TABLE 
  
 albums 
  
 ( 
 albumid 
  
 bigint 
  
 NOT 
  
 NULL 
 , 
 album_title 
  
 varchar 
 , 
 album_title_ngrams_tokens 
  
 spanner 
 . 
 tokenlist 
  
 GENERATED 
  
 ALWAYS 
  
 AS 
  
 ( 
  
 spanner 
 . 
 tokenize_ngrams 
 ( 
  
 lower 
 ( 
 album_title 
 ), 
  
 ngram_size_min 
  
 = 
>  
 3 
 , 
  
 ngram_size_max 
  
 = 
>  
 4 
  
 ) 
 ) 
  
 VIRTUAL 
  
 HIDDEN 
 , 
 PRIMARY 
  
 KEY 
 ( 
 albumid 
 )); 
 CREATE 
  
 SEARCH 
  
 INDEX 
  
 albumsidx 
  
 ON 
 albums 
 ( 
 album_title_ngrams_tokens 
 ) 
  
 INCLUDE 
  
 ( 
 album_title 
 ); 
 

Automatic acceleration of queries with pattern-matching predicates

The query optimizer might choose to accelerate the following queries using AlbumsIndex with AlbumTitle_Ngram_Tokens . Optionally, the query can provide @{force_index = AlbumsIndex} to force the optimizer to use AlbumsIndex .

GoogleSQL

In GoogleSQL, we accelerate LIKE , STARTS_WITH , ENDS_WITH , and REGEXP_CONTAINS .

  • LIKE predicate:

      SELECT 
      
     AlbumId 
     FROM 
      
     Albums 
      
     @ 
     { 
     FORCE_INDEX 
     = 
     AlbumsIndex 
     } 
     WHERE 
      
     AlbumTitle 
      
     LIKE 
      
     "%999%" 
     ; 
     
    
  • STARTS_WITH predicate:

      SELECT 
      
     AlbumId 
     FROM 
      
     Albums 
      
     @ 
     { 
     FORCE_INDEX 
     = 
     AlbumsIndex 
     } 
     WHERE 
      
     STARTS_WITH 
     ( 
     AlbumTitle 
     , 
      
     "apple" 
     ) 
     
    
  • ENDS_WITH predicate:

      SELECT 
      
     AlbumId 
     FROM 
      
     Albums 
      
     @ 
     { 
     FORCE_INDEX 
     = 
     AlbumsIndex 
     } 
     WHERE 
      
     ENDS_WITH 
     ( 
     AlbumTitle 
     , 
      
     "apple" 
     ) 
     
    
  • REGEXP_CONTAINS predicate:

      SELECT 
      
     AlbumId 
     FROM 
      
     Albums 
      
     @ 
     { 
     FORCE_INDEX 
     = 
     AlbumsIndex 
     } 
     WHERE 
      
     REGEXP_CONTAINS 
     ( 
     AlbumTitle 
     , 
      
     r 
     "(good|great)[ ]+morning" 
     ) 
     
    

PostgreSQL

In PostgreSQL, we accelerate LIKE and STARTS_WITH .

  • LIKE predicate:

      SELECT 
      
     albumid 
     FROM 
      
     albums 
      
     /*@ FORCE_INDEX = albumsidx */ 
     WHERE 
      
     album_title 
      
     like 
      
     '%999%' 
     ; 
     
    
  • STARTS_WITH predicate:

      SELECT 
      
     albumid 
     FROM 
      
     albums 
      
     /*@ FORCE_INDEX = albumsidx */ 
     WHERE 
      
     starts_with 
     ( 
     album_title 
     , 
      
     'apple' 
     ) 
     
    

Prerequisites on acceleration

For Spanner to enable this acceleration, the following rules must be met:

  • The index must store the STRING column using the STORING clause in GoogleSQL, or INCLUDE clause in PostgreSQL. This prevents costly back-joins to the base table during post-filtering, which is critical for performance when the search over-retrieves documents.
  • The STRING column must be tokenized using TOKENIZE_NGRAMS .
  • The tokenization must apply to LOWER(column_name) rather than column_name .
  • The LIKE pattern, STARTS_WITH prefix, ENDS_WITH suffix, or REGEXP_CONTAINS regular expression must be specified as a constant literal. Query parameters are not supported to avoid acceleration on patterns that are too short.
  • The LIKE pattern, STARTS_WITH prefix, ENDS_WITH suffix, or REGEXP_CONTAINS regular expression must contain enough text for at least one n-gram. For example r".*" doesn't qualify because there's no sequence of characters to match. Similarly, if the ngram minimum size is set to 3, the LIKE predicate "%ab%" doesn't qualify because "ab" (size 2) is too short.

What's next

Design a Mobile Site
View Site in Mobile | Classic
Share by: