This page describes how to find approximate nearest neighbors (ANN) and query vector embeddings using the ANN distance functions.
When a dataset is small, you can use K-nearest neighbors (KNN) to find the exact k-nearest vectors. However, as your dataset grows, the latency and cost of a KNN search also increase. You can use ANN to find the approximate k-nearest neighbors with significantly reduced latency and cost.
In an ANN search, the k-returned vectors aren't the true top k-nearest neighbors because the ANN search calculates approximate distances and might not look at all the vectors in the dataset. Occasionally, a few vectors that aren't among the top k-nearest neighbors are returned. This is known as recall loss . How much recall loss is acceptable to you depends on the use case, but in most cases, losing a bit of recall in return for improved database performance is an acceptable tradeoff.
For more details about the approximate distance functions supported in Spanner, see the following GoogleSQL reference pages:
Query vector embeddings
Spanner accelerates approximate nearest neighbor (ANN) vector searches by using a vector index . You can use a vector index to query vector embeddings. To query vector embeddings, you must first create a vector index . You can then use any one of the three approximate distance functions to find the ANN.
Restrictions when using the approximate distance functions include the following:
- The approximate distance function must calculate the distance between an embedding column and a constant expression (for example, a parameter or a literal).
- The approximate distance function output must be used in a
ORDER BY
clause as the sole sort key, and aLIMIT
must be specified after theORDER BY
. - The query must explicitly filter out rows that aren't indexed. In most cases,
this means that the query must include a
WHERE <column_name> IS NOT NULL
clause that matches the vector index definition, unless the column is already marked asNOT NULL
in the table definition.
For a detailed list of limitations, see the approximate distance function reference page .
Examples
Consider a Documents
table that has a DocEmbedding
column of precomputed
text embeddings from the DocContents
bytes column, and a NullableDocEmbedding
column populated from other sources that might be null.
CREATE
TABLE
Documents
(
UserId
INT64
NOT
NULL
,
DocId
INT64
NOT
NULL
,
Author
STRING
(
1024
),
DocContents
BYTES
(
MAX
),
DocEmbedding
ARRAY<FLOAT32>
NOT
NULL
,
NullableDocEmbedding
ARRAY<FLOAT32>
,
WordCount
INT64
)
PRIMARY
KEY
(
UserId
,
DocId
);
To search for the nearest 100 vectors to [1.0, 2.0, 3.0]
:
SELECT
DocId
FROM
Documents
WHERE
WordCount
>
1000
ORDER
BY
APPROX_EUCLIDEAN_DISTANCE
(
ARRAY<FLOAT32>
[
1.0
,
2.0
,
3.0
]
,
DocEmbedding
,
options
=
>
JSON
'{"num_leaves_to_search": 10}'
)
LIMIT
100
If the embedding column is nullable:
SELECT
DocId
FROM
Documents
WHERE
NullableDocEmbedding
IS
NOT
NULL
AND
WordCount
>
1000
ORDER
BY
APPROX_EUCLIDEAN_DISTANCE
(
ARRAY<FLOAT32>
[
1.0
,
2.0
,
3.0
]
,
NullableDocEmbedding
,
options
=
>
JSON
'{"num_leaves_to_search": 10}'
)
LIMIT
100
What's next
-
Learn more about Spanner vector indexes .
-
Learn more about the GoogleSQL
APPROXIMATE_COSINE_DISTANCE()
,APPROXIMATE_EUCLIDEAN_DISTANCE()
,APPROXIMATE_DOT_PRODUCT()
functions. -
Learn more about the GoogleSQL
VECTOR INDEX
statements . -
Learn more about vector index best practices .
-
Try the Getting started with Spanner Vector Search for a step-by-step example of using ANN.