This page describes how to perform a vector similarity search in Spanner by using the cosine distance, Euclidean distance, and dot product vector functions to find K-nearest neighbors. This information applies to both GoogleSQL-dialect databases and PostgreSQL-dialect databases. Before you read this page, it's important that you understand the following concepts:
- Euclidean distance : measures the shortest distance between two vectors.
- Cosine distance : measures the cosine of the angle between two vectors.
- Dot product
: calculates the
cosine of the angle multiplied by the product of corresponding vector
magnitudes. If you know that all the vector embeddings in your dataset are
normalized, then you can use
DOT_PRODUCT()
as a distance function. - K-nearest neighbors (KNN) : a supervised machine learning algorithm used to solve classification or regression problems.
You can use vector distance functions to perform K-nearest neighbors (KNN)
vector search for use cases like similarity search or retrieval-augmented
generation. Spanner supports the COSINE_DISTANCE()
, EUCLIDEAN_DISTANCE()
, and DOT_PRODUCT()
functions, which operate on vector
embeddings, allowing you to find the KNN of the input embedding.
For example, after you generate and save your operational Spanner data as vector embeddings , you can then provide these vector embeddings as an input parameter in your query to find the nearest vectors in N-dimensional space to search for semantically similar or related items.
All three distance functions take the arguments vector1
and vector2
, which
are of the type array<>
, and must consist of the same dimensions and have the
same length. For more details about these functions, see:
-
COSINE_DISTANCE()
in GoogleSQL -
EUCLIDEAN_DISTANCE()
in GoogleSQL -
DOT_PRODUCT()
in GoogleSQL - Mathematical functions in PostgreSQL
(
spanner.cosine_distance()
,spanner.euclidean_distance()
, andspanner.dot_product()
) - Choose among vector distance functions to measure vector embeddings similarity .
Examples
The following examples show KNN search, KNN search over partitioned data, and using a secondary index with KNN.
The examples all use EUCLIDEAN_DISTANCE()
. You can also use COSINE_DISTANCE()
. In addition, if all the vector embeddings in your dataset
are normalized, you can use DOT_PRODUCT()
as a distance function.
Example 1: KNN search
Consider a Documents
table that has a column ( DocEmbedding
) of precomputed
text embeddings from the DocContents
bytes column.
GoogleSQL
CREATE
TABLE
Documents
(
UserId
INT64
NOT
NULL
,
DocId
INT64
NOT
NULL
,
Author
STRING
(
1024
),
DocContents
BYTES
(
MAX
),
DocEmbedding
ARRAY<FLOAT32>
)
PRIMARY
KEY
(
UserId
,
DocId
);
PostgreSQL
CREATE
TABLE
Documents
(
UserId
bigint
NOT
NULL
,
DocId
bigint
NOT
NULL
,
Author
varchar
(
1024
),
DocContents
bytea
,
DocEmbedding
float4
[],
PRIMARY
KEY
(
UserId
,
DocId
)
);
Assuming that an input embedding for "baseball, but not professional baseball"
is the array [0.3, 0.3, 0.7, 0.7]
, you can find the top five nearest documents
that match, with the following query:
GoogleSQL
SELECT
DocId
,
DocEmbedding
FROM
Documents
ORDER
BY
EUCLIDEAN_DISTANCE
(
DocEmbedding
,
ARRAY<FLOAT32>
[
0.3
,
0.3
,
0.7
,
0.8
]
)
LIMIT
5
;
PostgreSQL
SELECT
DocId
,
DocEmbedding
FROM
Documents
ORDER
BY
spanner
.
euclidean_distance
(
DocEmbedding
,
'{0.3, 0.3, 0.7, 0.8}'
::
float4
[])
LIMIT
5
;
The expected results of this example:
Documents
+
---------------------------+-----------------+
|
DocId
|
DocEmbedding
|
+
---------------------------+-----------------+
|
24
|
[
8
,
...]
|
+
---------------------------+-----------------+
|
25
|
[
6
,
...]
|
+
---------------------------+-----------------+
|
26
|
[
3
.
2
,
...]
|
+
---------------------------+-----------------+
|
27
|
[
38
,
...]
|
+
---------------------------+-----------------+
|
14229
|
[
1
.
6
,
...]
|
+
---------------------------+-----------------+
Example 2: KNN search over partitioned data
The query in the previous example can be modified by adding conditions to the WHERE
clause to limit the vector search to a subset of your data. One common
application of this is to search over partitioned data, such as rows that belong
to a specific UserId
.
GoogleSQL
SELECT
UserId
,
DocId
,
DocEmbedding
FROM
Documents
WHERE
UserId
=
18
ORDER
BY
EUCLIDEAN_DISTANCE
(
DocEmbedding
,
ARRAY<FLOAT32>
[
0.3
,
0.3
,
0.7
,
0.8
]
)
LIMIT
5
;
PostgreSQL
SELECT
UserId
,
DocId
,
DocEmbedding
FROM
Documents
WHERE
UserId
=
18
ORDER
BY
spanner
.
euclidean_distance
(
DocEmbedding
,
'{0.3, 0.3, 0.7, 0.8}'
::
float4
[])
LIMIT
5
;
The expected results of this example:
Documents
+
-----------+-----------------+-----------------+
|
UserId
|
DocId
|
DocEmbedding
|
+
-----------+-----------------+-----------------+
|
18
|
234
|
[
12
,
...]
|
+
-----------+-----------------+-----------------+
|
18
|
12
|
[
1
.
6
,
...]
|
+
-----------+-----------------+-----------------+
|
18
|
321
|
[
22
,
...]
|
+
-----------+-----------------+-----------------+
|
18
|
432
|
[
3
,
...]
|
+
-----------+-----------------+-----------------+
Example 3: KNN search over secondary index ranges
If the WHERE
clause filter you're using isn't part of the table's primary key,
then you can create a secondary index to accelerate the operation with an index-only scan
.
GoogleSQL
CREATE
INDEX
DocsByAuthor
ON
Documents
(
Author
)
STORING
(
DocEmbedding
);
SELECT
Author
,
DocId
,
DocEmbedding
FROM
Documents
WHERE
Author
=
"Mark Twain"
ORDER
BY
EUCLIDEAN_DISTANCE
(
DocEmbedding
,
< embeddings
for
"book about the time traveling American"
> )
LIMIT
5
;
PostgreSQL
CREATE
INDEX
DocsByAuthor
ON
Documents
(
Author
)
INCLUDE
(
DocEmbedding
);
SELECT
Author
,
DocId
,
DocEmbedding
FROM
Documents
WHERE
Author
=
"Mark Twain"
ORDER
BY
spanner
.
euclidean_distance
(
DocEmbedding
,
< embeddings
for
"that book about the time traveling American"
> )
LIMIT
5
;
The expected results of this example:
Documents
+
------------+-----------------+-----------------+
|
Author
|
DocId
|
DocEmbedding
|
+
------------+-----------------+-----------------+
|
Mark
Twain
|
234
|
[
12
,
...]
|
+
------------+-----------------+-----------------+
|
Mark
Twain
|
12
|
[
1
.
6
,
...]
|
+
------------+-----------------+-----------------+
|
Mark
Twain
|
321
|
[
22
,
...]
|
+
------------+-----------------+-----------------+
|
Mark
Twain
|
432
|
[
3
,
...]
|
+
------------+-----------------+-----------------+
|
Mark
Twain
|
375
|
[
9
,
...]
|
+
------------+-----------------+-----------------+
What's next
-
Learn more about the GoogleSQL
COSINE_DISTANCE()
,EUCLIDEAN_DISTANCE()
,DOT_PRODUCT()
functions. -
Learn more about the PostgreSQL
spanner.cosine_distance()
,spanner.euclidean_distance()
,spanner.dot_product()
functions. -
Learn more about how to Choose among vector distance functions to measure vector embeddings similarity .