This page describes how to use a fuzzy search as part of a full-text search .
In addition to performing exact token searches using the SEARCH
and SEARCH_SUBSTRING
functions, Spanner also supports approximate (or fuzzy) searches. Fuzzy
searches find matching documents despite small differences between the query and
the document.
Spanner supports the following types of fuzzy search:
- N-grams-based approximate search
- Phonetic search using Soundex
Use an n-grams-based approximate search
N-grams-based fuzzy search relies on the same substring tokenization that a substring search requires. The configuration of the tokenizer is important as it affects search quality and performance. The following example shows how to create a query with misspelled or differently spelled words to find approximate matches in the search index.
Schema
GoogleSQL
CREATE
TABLE
Albums
(
AlbumId
STRING
(
MAX
)
NOT
NULL
,
AlbumTitle
STRING
(
MAX
),
AlbumTitle_Tokens
TOKENLIST
AS
(
TOKENIZE_SUBSTRING
(
AlbumTitle
,
ngram_size_min
=
> 2
,
ngram_size_max
=
> 3
,
relative_search_types
=
> [
"word_prefix"
,
"word_suffix"
]))
HIDDEN
)
PRIMARY
KEY
(
AlbumId
);
CREATE
SEARCH
INDEX
AlbumsIndex
ON
Albums
(
AlbumTitle_Tokens
)
STORING
(
AlbumTitle
);
PostgreSQL
This example uses spanner.tokenize_substring
.
CREATE
TABLE
albums
(
albumid
character
varying
NOT
NULL
,
albumtitle
character
varying
,
albumtitle_tokens
spanner
.
tokenlist
GENERATED
ALWAYS
AS
(
spanner
.
tokenize_substring
(
albumtitle
,
ngram_size_min
=
> 2
,
ngram_size_max
=
> 3
,
relative_search_types
=
> '{word_prefix, word_suffix}'
::
text
[]))
VIRTUAL
HIDDEN
,
PRIMARY
KEY
(
albumid
));
CREATE
SEARCH
INDEX
albumsindex
ON
albums
(
albumtitle_tokens
)
INCLUDE
(
albumtitle
);
Query
The following query finds the albums with titles that are the closest to "Hatel Kaliphorn", such as "Hotel California".
GoogleSQL
SELECT
AlbumId
FROM
Albums
WHERE
SEARCH_NGRAMS
(
AlbumTitle_Tokens
,
"Hatel Kaliphorn"
)
ORDER
BY
SCORE_NGRAMS
(
AlbumTitle_Tokens
,
"Hatel Kaliphorn"
)
DESC
LIMIT
10
PostgreSQL
This examples uses spanner.score_ngrams
and spanner.search_ngrams
.
SELECT
albumid
FROM
albums
WHERE
spanner
.
search_ngrams
(
albumtitle_tokens
,
'Hatel Kaliphorn'
)
ORDER
BY
spanner
.
score_ngrams
(
albumtitle_tokens
,
'Hatel Kaliphorn'
)
DESC
LIMIT
10
Optimize performance and recall for an n-grams-based approximate search
The sample query in the previous section searches in two phases, using two different functions:
-
SEARCH_NGRAMS
finds all candidate albums that have shared n-grams with the search query. For example, three-character n-grams for "California" include[cal, ali, lif, ifo, for, orn, rni, nia]
and for "Kaliphorn" include[kal, ali, lip, iph, pho, hor, orn]
. The shared n-grams in these data sets are[ali, orn]
. By default,SEARCH_NGRAMS
matches all documents with at least two shared n-grams, therefore "Kaliphorn" matches "California". -
SCORE_NGRAMS
ranks matches by similarity. The similarity of two strings is defined as a ratio of distinct shared n-grams to distinct non-shared n-grams:
Usually the search query is the same across both the SEARCH_NGRAMS
and SCORE_NGRAMS
functions. The recommended way to do this is to use the argument
with query parameters
rather than with string literals, and specify the same query parameter in the SEARCH_NGRAMS
and SCORE_NGRAMS
functions.
Spanner has three configuration arguments that can be used with SEARCH_NGRAMS
:
- The minimum and maximum sizes for n-grams are specified with the
TOKENIZE_SUBSTRING
(/spanner/docs/reference/standard-sql/search_functions#tokenize_substring) orTOKENIZE_NGRAMS
functions. We don't recommend one character n-grams because they could match a very large number of documents. On the other hand, long n-grams causeSEARCH_NGRAMS
to miss short misspelled words. - The minimum number of n-grams that
SEARCH_NGRAMS
must match (set with themin_ngrams
andmin_ngrams_percent
arguments inSEARCH_NGRAMS
). Higher numbers typically make the query faster, but reduce recall.
In order to achieve a good balance between performance and recall, you can configure these arguments to fit the specific query and workload.
We also recommend including an inner LIMIT
to avoid creating very expensive
queries when a combination of popular n-grams is encountered.
GoogleSQL
SELECT
AlbumId
FROM
(
SELECT
AlbumId
,
SCORE_NGRAMS
(
AlbumTitle_Tokens
,
@
p
)
AS
score
FROM
Albums
WHERE
SEARCH_NGRAMS
(
AlbumTitle_Tokens
,
@
p
)
LIMIT
10000
#
inner
limit
)
ORDER
BY
score
DESC
LIMIT
10
#
outer
limit
PostgreSQL
This example uses query parameter $1
which is bound to 'Hatel Kaliphorn'.
SELECT
albumid
FROM
(
SELECT
albumid
,
spanner
.
score_ngrams
(
albumtitle_tokens
,
$
1
)
AS
score
FROM
albums
WHERE
spanner
.
search_ngrams
(
albumtitle_tokens
,
$
1
)
LIMIT
10000
)
AS
inner_query
ORDER
BY
inner_query
.
score
DESC
LIMIT
10
N-grams-based fuzzy search versus enhanced query mode
Alongside n-grams-based fuzzy search, the enhanced query mode also handles some misspelled words. Thus, there is some overlap between the two features. The following table summarizes the differences:
n-grams-based fuzzy search | Enhanced query mode | |
---|---|---|
Cost
|
Requires a more expensive substring tokenization based on n-grams | Requires a less expensive full-text tokenization |
Search query types
|
Works well with short documents with a few words, such as with a person name, city name, or product name | Works equally well with any size documents and any size search queries |
Partial words search
|
Performs a substring search that allows for misspellings | Only supports a search for entire words ( SEARCH_SUBSTRING
doesn't support the enhance_query
argument) |
Misspelled words
|
Supports misspelled words in either index or query | Only supports misspelled words in the query |
Corrections
|
Finds any misspelled matches, even if the match isn't a real word | Corrects misspellings for common, well-known words |
Perform a phonetic search with Soundex
Spanner provides the SOUNDEX
function for finding words that are spelled differently, but sound the same. For
example, SOUNDEX("steven")
, SOUNDEX("stephen")
and SOUNDEX("stefan")
are
all "s315", while SOUNDEX("stella")
is "s340". SOUNDEX
is case sensitive and
only works for Latin-based alphabets.
Phonetic search with SOUNDEX
can be implemented with a generated column and a
search index as shown in the following example:
GoogleSQL
CREATE
TABLE
Singers
(
SingerId
INT64
,
AlbumTitle
STRING
(
MAX
),
AlbumTitle_Tokens
TOKENLIST
AS
(
TOKENIZE_FULLTEXT
(
AlbumTitle
))
HIDDEN
,
Name
STRING
(
MAX
),
NameSoundex
STRING
(
MAX
)
AS
(
LOWER
(
SOUNDEX
(
Name
))),
NameSoundex_Tokens
TOKENLIST
AS
(
TOKEN
(
NameSoundex
))
HIDDEN
)
PRIMARY
KEY
(
SingerId
);
CREATE
SEARCH
INDEX
SingersPhoneticIndex
ON
Singers
(
AlbumTitle_Tokens
,
NameSoundex_Tokens
);
PostgreSQL
This example uses spanner.soundex
.
CREATE
TABLE
singers
(
singerid
bigint
,
albumtitle
character
varying
,
albumtitle_tokens
spanner
.
tokenlist
GENERATED
ALWAYS
AS
(
spanner
.
tokenize_fulltext
(
albumtitle
))
VIRTUAL
HIDDEN
,
name
character
varying
,
namesoundex
character
varying
GENERATED
ALWAYS
AS
(
lower
(
spanner
.
soundex
(
name
)))
VIRTUAL
,
namesoundex_tokens
spanner
.
tokenlist
GENERATED
ALWAYS
AS
(
spanner
.
token
(
lower
(
spanner
.
soundex
(
name
)))
VIRTUAL
HIDDEN
,
PRIMARY
KEY
(
singerid
));
CREATE
SEARCH
INDEX
singersphoneticindex
ON
singers
(
albumtitle_tokens
,
namesoundex_tokens
);
The following query matches "stefan" to "Steven" on SOUNDEX
, along with AlbumTitle
containing "cat":
GoogleSQL
SELECT
SingerId
FROM
Singers
WHERE
NameSoundex
=
LOWER
(
SOUNDEX
(
"stefan"
))
AND
SEARCH
(
AlbumTitle_Tokens
,
"cat"
)
PostgreSQL
SELECT
singerid
FROM
singers
WHERE
namesoundex
=
lower
(
spanner
.
soundex
(
'stefan'
))
AND
spanner
.
search
(
albumtitle_tokens
,
'cat'
)
What's next
- Learn about tokenization and Spanner tokenizers .
- Learn about search indexes .
- Learn about full-text search queries .