Find approximate matches with fuzzy search

This page describes how to use a fuzzy search as part of a full-text search .

In addition to performing exact token searches using the SEARCH and SEARCH_SUBSTRING functions, Spanner also supports approximate (or fuzzy) searches. Fuzzy searches find matching documents despite small differences between the query and the document.

Spanner supports the following types of fuzzy search:

  • N-grams-based approximate search
  • Phonetic search using Soundex

N-grams-based fuzzy search relies on the same substring tokenization that a substring search requires. The configuration of the tokenizer is important as it affects search quality and performance. The following example shows how to create a query with misspelled or differently spelled words to find approximate matches in the search index.

Schema

GoogleSQL

  CREATE 
  
 TABLE 
  
 Albums 
  
 ( 
  
 AlbumId 
  
 STRING 
 ( 
 MAX 
 ) 
  
 NOT 
  
 NULL 
 , 
  
 AlbumTitle 
  
 STRING 
 ( 
 MAX 
 ), 
  
 AlbumTitle_Tokens 
  
 TOKENLIST 
  
 AS 
  
 ( 
  
 TOKENIZE_SUBSTRING 
 ( 
 AlbumTitle 
 , 
  
 ngram_size_min 
 = 
> 2 
 , 
  
 ngram_size_max 
 = 
> 3 
 , 
  
 relative_search_types 
 = 
> [ 
 "word_prefix" 
 , 
  
 "word_suffix" 
 ])) 
  
 HIDDEN 
 ) 
  
 PRIMARY 
  
 KEY 
 ( 
 AlbumId 
 ); 
 CREATE 
  
 SEARCH 
  
 INDEX 
  
 AlbumsIndex 
 ON 
  
 Albums 
 ( 
 AlbumTitle_Tokens 
 ) 
 STORING 
  
 ( 
 AlbumTitle 
 ); 
 

PostgreSQL

This example uses spanner.tokenize_substring .

  CREATE 
  
 TABLE 
  
 albums 
  
 ( 
  
 albumid 
  
 character 
  
 varying 
  
 NOT 
  
 NULL 
 , 
  
 albumtitle 
  
 character 
  
 varying 
 , 
  
 albumtitle_tokens 
  
 spanner 
 . 
 tokenlist 
  
 GENERATED 
  
 ALWAYS 
  
 AS 
  
 ( 
  
 spanner 
 . 
 tokenize_substring 
 ( 
 albumtitle 
 , 
  
 ngram_size_min 
 = 
> 2 
 , 
  
 ngram_size_max 
 = 
> 3 
 , 
  
 relative_search_types 
 = 
> '{word_prefix, word_suffix}' 
 :: 
 text 
 [])) 
  
 VIRTUAL 
  
 HIDDEN 
 , 
 PRIMARY 
  
 KEY 
 ( 
 albumid 
 )); 
 CREATE 
  
 SEARCH 
  
 INDEX 
  
 albumsindex 
 ON 
  
 albums 
 ( 
 albumtitle_tokens 
 ) 
 INCLUDE 
  
 ( 
 albumtitle 
 ); 
 

Query

The following query finds the albums with titles that are the closest to "Hatel Kaliphorn", such as "Hotel California".

GoogleSQL

  SELECT 
  
 AlbumId 
 FROM 
  
 Albums 
 WHERE 
  
 SEARCH_NGRAMS 
 ( 
 AlbumTitle_Tokens 
 , 
  
 "Hatel Kaliphorn" 
 ) 
 ORDER 
  
 BY 
  
 SCORE_NGRAMS 
 ( 
 AlbumTitle_Tokens 
 , 
  
 "Hatel Kaliphorn" 
 ) 
  
 DESC 
 LIMIT 
  
 10 
 

PostgreSQL

This examples uses spanner.score_ngrams and spanner.search_ngrams .

  SELECT 
  
 albumid 
 FROM 
  
 albums 
 WHERE 
  
 spanner 
 . 
 search_ngrams 
 ( 
 albumtitle_tokens 
 , 
  
 'Hatel Kaliphorn' 
 ) 
 ORDER 
  
 BY 
  
 spanner 
 . 
 score_ngrams 
 ( 
 albumtitle_tokens 
 , 
  
 'Hatel Kaliphorn' 
 ) 
  
 DESC 
 LIMIT 
  
 10 
 

Optimize performance and recall for an n-grams-based approximate search

The sample query in the previous section searches in two phases, using two different functions:

  1. SEARCH_NGRAMS finds all candidate albums that have shared n-grams with the search query. For example, three-character n-grams for "California" include [cal, ali, lif, ifo, for, orn, rni, nia] and for "Kaliphorn" include [kal, ali, lip, iph, pho, hor, orn] . The shared n-grams in these data sets are [ali, orn] . By default, SEARCH_NGRAMS matches all documents with at least two shared n-grams, therefore "Kaliphorn" matches "California".
  2. SCORE_NGRAMS ranks matches by similarity. The similarity of two strings is defined as a ratio of distinct shared n-grams to distinct non-shared n-grams:
$$ \frac{shared\_ngrams}{total\_ngrams_{index} + total\_ngrams_{query} - shared\_ngrams} $$

Usually the search query is the same across both the SEARCH_NGRAMS and SCORE_NGRAMS functions. The recommended way to do this is to use the argument with query parameters rather than with string literals, and specify the same query parameter in the SEARCH_NGRAMS and SCORE_NGRAMS functions.

Spanner has three configuration arguments that can be used with SEARCH_NGRAMS :

  • The minimum and maximum sizes for n-grams are specified with the TOKENIZE_SUBSTRING (/spanner/docs/reference/standard-sql/search_functions#tokenize_substring) or TOKENIZE_NGRAMS functions. We don't recommend one character n-grams because they could match a very large number of documents. On the other hand, long n-grams cause SEARCH_NGRAMS to miss short misspelled words.
  • The minimum number of n-grams that SEARCH_NGRAMS must match (set with the min_ngrams and min_ngrams_percent arguments in SEARCH_NGRAMS ). Higher numbers typically make the query faster, but reduce recall.

In order to achieve a good balance between performance and recall, you can configure these arguments to fit the specific query and workload.

We also recommend including an inner LIMIT to avoid creating very expensive queries when a combination of popular n-grams is encountered.

GoogleSQL

  SELECT 
  
 AlbumId 
 FROM 
  
 ( 
  
 SELECT 
  
 AlbumId 
 , 
  
 SCORE_NGRAMS 
 ( 
 AlbumTitle_Tokens 
 , 
  
 @ 
 p 
 ) 
  
 AS 
  
 score 
  
 FROM 
  
 Albums 
  
 WHERE 
  
 SEARCH_NGRAMS 
 ( 
 AlbumTitle_Tokens 
 , 
  
 @ 
 p 
 ) 
  
 LIMIT 
  
 10000 
  
 # 
  
 inner 
  
 limit 
 ) 
 ORDER 
  
 BY 
  
 score 
  
 DESC 
 LIMIT 
  
 10 
  
 # 
  
 outer 
  
 limit 
 

PostgreSQL

This example uses query parameter $1 which is bound to 'Hatel Kaliphorn'.

  SELECT 
  
 albumid 
 FROM 
  
 ( 
  
 SELECT 
  
 albumid 
 , 
  
 spanner 
 . 
 score_ngrams 
 ( 
 albumtitle_tokens 
 , 
  
 $ 
 1 
 ) 
  
 AS 
  
 score 
  
 FROM 
  
 albums 
  
 WHERE 
  
 spanner 
 . 
 search_ngrams 
 ( 
 albumtitle_tokens 
 , 
  
 $ 
 1 
 ) 
  
 LIMIT 
  
 10000 
  
 ) 
  
 AS 
  
 inner_query 
 ORDER 
  
 BY 
  
 inner_query 
 . 
 score 
  
 DESC 
 LIMIT 
  
 10 
 

N-grams-based fuzzy search versus enhanced query mode

Alongside n-grams-based fuzzy search, the enhanced query mode also handles some misspelled words. Thus, there is some overlap between the two features. The following table summarizes the differences:

n-grams-based fuzzy search Enhanced query mode
Cost
Requires a more expensive substring tokenization based on n-grams Requires a less expensive full-text tokenization
Search query types
Works well with short documents with a few words, such as with a person name, city name, or product name Works equally well with any size documents and any size search queries
Partial words search
Performs a substring search that allows for misspellings Only supports a search for entire words ( SEARCH_SUBSTRING doesn't support the enhance_query argument)
Misspelled words
Supports misspelled words in either index or query Only supports misspelled words in the query
Corrections
Finds any misspelled matches, even if the match isn't a real word Corrects misspellings for common, well-known words

Perform a phonetic search with Soundex

Spanner provides the SOUNDEX function for finding words that are spelled differently, but sound the same. For example, SOUNDEX("steven") , SOUNDEX("stephen") and SOUNDEX("stefan") are all "s315", while SOUNDEX("stella") is "s340". SOUNDEX is case sensitive and only works for Latin-based alphabets.

Phonetic search with SOUNDEX can be implemented with a generated column and a search index as shown in the following example:

GoogleSQL

  CREATE 
  
 TABLE 
  
 Singers 
  
 ( 
  
 SingerId 
  
 INT64 
 , 
  
 AlbumTitle 
  
 STRING 
 ( 
 MAX 
 ), 
  
 AlbumTitle_Tokens 
  
 TOKENLIST 
  
 AS 
  
 ( 
 TOKENIZE_FULLTEXT 
 ( 
 AlbumTitle 
 )) 
  
 HIDDEN 
 , 
  
 Name 
  
 STRING 
 ( 
 MAX 
 ), 
  
 NameSoundex 
  
 STRING 
 ( 
 MAX 
 ) 
  
 AS 
  
 ( 
 LOWER 
 ( 
 SOUNDEX 
 ( 
 Name 
 ))), 
  
 NameSoundex_Tokens 
  
 TOKENLIST 
  
 AS 
  
 ( 
 TOKEN 
 ( 
 NameSoundex 
 )) 
  
 HIDDEN 
 ) 
  
 PRIMARY 
  
 KEY 
 ( 
 SingerId 
 ); 
 CREATE 
  
 SEARCH 
  
 INDEX 
  
 SingersPhoneticIndex 
  
 ON 
  
 Singers 
 ( 
 AlbumTitle_Tokens 
 , 
  
 NameSoundex_Tokens 
 ); 
 

PostgreSQL

This example uses spanner.soundex .

  CREATE 
  
 TABLE 
  
 singers 
  
 ( 
  
 singerid 
  
 bigint 
 , 
  
 albumtitle 
  
 character 
  
 varying 
 , 
  
 albumtitle_tokens 
  
 spanner 
 . 
 tokenlist 
  
 GENERATED 
  
 ALWAYS 
  
 AS 
  
 ( 
 spanner 
 . 
 tokenize_fulltext 
 ( 
 albumtitle 
 )) 
  
 VIRTUAL 
  
 HIDDEN 
 , 
  
 name 
  
 character 
  
 varying 
 , 
  
 namesoundex 
  
 character 
  
 varying 
  
 GENERATED 
  
 ALWAYS 
  
 AS 
  
 ( 
 lower 
 ( 
 spanner 
 . 
 soundex 
 ( 
 name 
 ))) 
  
 VIRTUAL 
 , 
  
 namesoundex_tokens 
  
 spanner 
 . 
 tokenlist 
  
 GENERATED 
  
 ALWAYS 
  
 AS 
  
 ( 
 spanner 
 . 
 token 
 ( 
 lower 
 ( 
 spanner 
 . 
 soundex 
 ( 
 name 
 ))) 
  
 VIRTUAL 
  
 HIDDEN 
 , 
 PRIMARY 
  
 KEY 
 ( 
 singerid 
 )); 
 CREATE 
  
 SEARCH 
  
 INDEX 
  
 singersphoneticindex 
  
 ON 
  
 singers 
 ( 
 albumtitle_tokens 
 , 
  
 namesoundex_tokens 
 ); 
 

The following query matches "stefan" to "Steven" on SOUNDEX , along with AlbumTitle containing "cat":

GoogleSQL

  SELECT 
  
 SingerId 
 FROM 
  
 Singers 
 WHERE 
  
 NameSoundex 
  
 = 
  
 LOWER 
 ( 
 SOUNDEX 
 ( 
 "stefan" 
 )) 
  
 AND 
  
 SEARCH 
 ( 
 AlbumTitle_Tokens 
 , 
  
 "cat" 
 ) 
 

PostgreSQL

  SELECT 
  
 singerid 
 FROM 
  
 singers 
 WHERE 
  
 namesoundex 
  
 = 
  
 lower 
 ( 
 spanner 
 . 
 soundex 
 ( 
 'stefan' 
 )) 
  
 AND 
  
 spanner 
 . 
 search 
 ( 
 albumtitle_tokens 
 , 
  
 'cat' 
 ) 
 

What's next

Design a Mobile Site
View Site in Mobile | Classic
Share by: