This page describes how to find approximate nearest neighbors (ANN) and query vector embeddings using the ANN distance functions.
When a dataset is small, you can use K-nearest neighbors (KNN) to find the exact k-nearest vectors. However, as your dataset grows, the latency and cost of a KNN search also increase. You can use ANN to find the approximate k-nearest neighbors with significantly reduced latency and cost.
In an ANN search, the k-returned vectors aren't the true top k-nearest neighbors because the ANN search calculates approximate distances and might not look at all the vectors in the dataset. Occasionally, a few vectors that aren't among the top k-nearest neighbors are returned. This is known as recall loss . How much recall loss is acceptable to you depends on the use case, but in most cases, losing a bit of recall in return for improved database performance is an acceptable tradeoff.
For more details about the approximate distance functions supported in Spanner, see the following reference pages for your database dialect:
- GoogleSQL
- PostgreSQL
Query vector embeddings
Spanner accelerates approximate nearest neighbor (ANN) vector searches by using a vector index . You can use a vector index to query vector embeddings. To query vector embeddings, you must first create a vector index . You can then use any one of the three approximate distance functions to find the ANN.
Restrictions when using the approximate distance functions include the following:
- The approximate distance function must calculate the distance between an embedding column and a constant expression (for example, a parameter or a literal).
- The approximate distance function output must be used in a
ORDER BYclause as the sole sort key, and aLIMITmust be specified after theORDER BY. - The query must explicitly filter out rows that aren't indexed. In most cases,
this means that the query must include a
WHERE <column_name> IS NOT NULLclause that matches the vector index definition, unless the column is already marked asNOT NULLin the table definition.
For a detailed list of limitations, see the approximate distance function reference page .
Examples
Consider a Documents
table that has a DocEmbedding
column of precomputed
text embeddings from the DocContents
bytes column, and a NullableDocEmbedding
column populated from other sources that might be null.
GoogleSQL
CREATE
TABLE
Documents
(
UserId
INT64
NOT
NULL
,
DocId
INT64
NOT
NULL
,
Author
STRING
(
1024
),
DocContents
BYTES
(
MAX
),
DocEmbedding
ARRAY<FLOAT32>
NOT
NULL
,
NullableDocEmbedding
ARRAY<FLOAT32>
,
WordCount
INT64
)
PRIMARY
KEY
(
UserId
,
DocId
);
PostgreSQL
CREATE
TABLE
documents
(
user_id
bigint
not
null
,
doc_id
bigint
not
null
,
author
varchar
(
1024
),
doc_contents
bytea
,
doc_embedding
float4
[]
not
null
,
nullable_doc_embedding
float4
[],
word_count
bigint
,
PRIMARY
KEY
(
user_id
,
doc_id
)
);
To search for the nearest 100 vectors to [1.0, 2.0, 3.0]
:
GoogleSQL
SELECT
DocId
FROM
Documents
WHERE
WordCount
>
1000
ORDER
BY
APPROX_EUCLIDEAN_DISTANCE
(
ARRAY<FLOAT32>
[
1.0
,
2.0
,
3.0
]
,
DocEmbedding
,
options
=
>
JSON
'{"num_leaves_to_search": 10}'
)
LIMIT
100
PostgreSQL
SELECT
doc_id
FROM
documents
WHERE
word_count
>
1000
ORDER
BY
spanner
.
approx_euclidean_distance
(
ARRAY
[
1.0
,
2.0
,
3.0
]
::
float4
[],
doc_embedding
,
options
=
> jsonb
'{"num_leaves_to_search": 10}'
)
LIMIT
100
;
To search for the nearest 100 vectors to an embedding that is generated by a SQL expression, use the following pattern. In this example, the query is finding the documents that are most similar to the embedding of UserId = 1 and DocId = 1:
GoogleSQL
WITH
emb
AS
(
SELECT
DocEmbedding
AS
value
FROM
Documents
WHERE
UserId
=
1
AND
DocId
=
1
LIMIT
1
)
SELECT
DocId
FROM
Documents
,
emb
ORDER
BY
APPROX_EUCLIDEAN_DISTANCE
(
emb
.
value
,
DocEmbedding
,
options
=
>
JSON
'{"num_leaves_to_search": 10}'
)
LIMIT
100
PostgreSQL
SELECT
documents
.
doc_id
FROM
documents
,
(
SELECT
doc_embedding
AS
value
FROM
documents
WHERE
user_id
=
1
AND
doc_id
=
1
LIMIT
1
)
vector
WHERE
documents
.
doc_embedding
IS
NOT
NULL
ORDER
BY
spanner
.
APPROX_EUCLIDEAN_DISTANCE
(
documents
.
doc_embedding
,
vector
.
value
,
options
=
> '{"num_leaves_to_search": 10}'
::
jsonb
)
LIMIT
100
If the embedding column is nullable:
GoogleSQL
SELECT
DocId
FROM
Documents
WHERE
NullableDocEmbedding
IS
NOT
NULL
AND
WordCount
>
1000
ORDER
BY
APPROX_EUCLIDEAN_DISTANCE
(
ARRAY<FLOAT32>
[
1.0
,
2.0
,
3.0
]
,
NullableDocEmbedding
,
options
=
>
JSON
'{"num_leaves_to_search": 10}'
)
LIMIT
100
PostgreSQL
SELECT
doc_id
FROM
documents
WHERE
nullable_doc_embedding
IS
NOT
NULL
AND
word_count
>
1000
ORDER
BY
spanner
.
approx_euclidean_distance
(
ARRAY
[
1.0
,
2.0
,
3.0
]
::
float4
[],
nullable_doc_embedding
,
options
=
> jsonb
'{"num_leaves_to_search": 10}'
)
LIMIT
100
;
What's next
-
Learn more about Spanner vector indexes .
-
Learn more about the approximate distance functions in GoogleSQL and PostgreSQL .
-
Learn more about index statements for GoogleSQL
VECTOR INDEXand PostgreSQLINDEX. -
Learn more about vector index best practices .
-
Try the Getting started with Spanner Vector Search for a step-by-step example of using ANN.

