Find approximate nearest neighbors (ANN) and query vector embeddings

This page describes how to find approximate nearest neighbors (ANN) and query vector embeddings using the ANN distance functions.

When a dataset is small, you can use K-nearest neighbors (KNN) to find the exact k-nearest vectors. However, as your dataset grows, the latency and cost of a KNN search also increase. You can use ANN to find the approximate k-nearest neighbors with significantly reduced latency and cost.

In an ANN search, the k-returned vectors aren't the true top k-nearest neighbors because the ANN search calculates approximate distances and might not look at all the vectors in the dataset. Occasionally, a few vectors that aren't among the top k-nearest neighbors are returned. This is known as recall loss . How much recall loss is acceptable to you depends on the use case, but in most cases, losing a bit of recall in return for improved database performance is an acceptable tradeoff.

For more details about the approximate distance functions supported in Spanner, see the following reference pages for your database dialect:

GoogleSQL
PostgreSQL

Query vector embeddings

Spanner accelerates approximate nearest neighbor (ANN) vector searches by using a vector index . You can use a vector index to query vector embeddings. To query vector embeddings, you must first create a vector index . You can then use any one of the three approximate distance functions to find the ANN.

Restrictions when using the approximate distance functions include the following:

The approximate distance function must calculate the distance between an embedding column and a constant expression (for example, a parameter or a literal).
The approximate distance function output must be used in a ORDER BY clause as the sole sort key, and a LIMIT must be specified after the ORDER BY .
The query must explicitly filter out rows that aren't indexed. In most cases, this means that the query must include a WHERE <column_name> IS NOT NULL clause that matches the vector index definition, unless the column is already marked as NOT NULL in the table definition.

For a detailed list of limitations, see the approximate distance function reference page .

Examples

Consider a Documents table that has a DocEmbedding column of precomputed text embeddings from the DocContents bytes column, and a NullableDocEmbedding column populated from other sources that might be null.

GoogleSQL

  CREATE 
  
 TABLE 
  
 Documents 
  
 ( 
  
 UserId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
  
 DocId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
  
 Author 
  
 STRING 
 ( 
 1024 
 ), 
  
 DocContents 
  
 BYTES 
 ( 
 MAX 
 ), 
  
 DocEmbedding 
  
 ARRAY<FLOAT32> 
  
 NOT 
  
 NULL 
 , 
  
 NullableDocEmbedding 
  
 ARRAY<FLOAT32> 
 , 
  
 WordCount 
  
 INT64 
 ) 
  
 PRIMARY 
  
 KEY 
  
 ( 
 UserId 
 , 
  
 DocId 
 );

PostgreSQL

  CREATE 
  
 TABLE 
  
 documents 
  
 ( 
  
 user_id 
  
 bigint 
  
 not 
  
 null 
 , 
  
 doc_id 
  
 bigint 
  
 not 
  
 null 
 , 
  
 author 
  
 varchar 
 ( 
 1024 
 ), 
  
 doc_contents 
  
 bytea 
 , 
  
 doc_embedding 
  
 float4 
 [] 
  
 not 
  
 null 
 , 
  
 nullable_doc_embedding 
  
 float4 
 [], 
  
 word_count 
  
 bigint 
 , 
  
 PRIMARY 
  
 KEY 
  
 ( 
 user_id 
 , 
  
 doc_id 
 ) 
 );

To search for the nearest 100 vectors to [1.0, 2.0, 3.0] :

GoogleSQL

  SELECT 
  
 DocId 
 FROM 
  
 Documents 
 WHERE 
  
 WordCount 
 > 
 1000 
 ORDER 
  
 BY 
  
 APPROX_EUCLIDEAN_DISTANCE 
 ( 
  
 ARRAY<FLOAT32> 
 [ 
 1.0 
 , 
  
 2.0 
 , 
  
 3.0 
 ] 
 , 
  
 DocEmbedding 
 , 
  
 options 
  
 = 
>  
 JSON 
  
 '{"num_leaves_to_search": 10}' 
 ) 
 LIMIT 
  
 100

PostgreSQL

  SELECT 
  
 doc_id 
 FROM 
  
 documents 
 WHERE 
  
 word_count 
 > 
 1000 
 ORDER 
  
 BY 
  
 spanner 
 . 
 approx_euclidean_distance 
 ( 
  
 ARRAY 
 [ 
 1.0 
 , 
  
 2.0 
 , 
  
 3.0 
 ] 
 :: 
 float4 
 [], 
  
 doc_embedding 
 , 
  
 options 
 = 
> jsonb 
 '{"num_leaves_to_search": 10}' 
 ) 
 LIMIT 
  
 100 
 ;

To search for the nearest 100 vectors to an embedding that is generated by a SQL expression, use the following pattern. In this example, the query is finding the documents that are most similar to the embedding of UserId = 1 and DocId = 1:

GoogleSQL

  WITH 
  
 emb 
  
 AS 
  
 ( 
  
 SELECT 
  
 DocEmbedding 
  
 AS 
  
 value 
  
 FROM 
  
 Documents 
  
 WHERE 
  
 UserId 
  
 = 
  
 1 
  
 AND 
  
 DocId 
  
 = 
  
 1 
  
 LIMIT 
  
 1 
 ) 
 SELECT 
  
 DocId 
 FROM 
  
 Documents 
 , 
  
 emb 
 ORDER 
  
 BY 
  
 APPROX_EUCLIDEAN_DISTANCE 
 ( 
  
 emb 
 . 
 value 
 , 
  
 DocEmbedding 
 , 
  
 options 
  
 = 
>  
 JSON 
  
 '{"num_leaves_to_search": 10}' 
 ) 
 LIMIT 
  
 100

PostgreSQL

  SELECT 
  
 documents 
 . 
 doc_id 
 FROM 
  
 documents 
 , 
  
 ( 
  
 SELECT 
  
 doc_embedding 
  
 AS 
  
 value 
  
 FROM 
  
 documents 
  
 WHERE 
  
 user_id 
  
 = 
  
 1 
  
 AND 
  
 doc_id 
  
 = 
  
 1 
  
 LIMIT 
  
 1 
  
 ) 
  
 vector 
 WHERE 
  
 documents 
 . 
 doc_embedding 
  
 IS 
  
 NOT 
  
 NULL 
 ORDER 
  
 BY 
  
 spanner 
 . 
 APPROX_EUCLIDEAN_DISTANCE 
 ( 
 documents 
 . 
 doc_embedding 
 , 
  
 vector 
 . 
 value 
 , 
  
 options 
 = 
> '{"num_leaves_to_search": 10}' 
 :: 
 jsonb 
 ) 
 LIMIT 
  
 100

If the embedding column is nullable:

GoogleSQL

  SELECT 
  
 DocId 
 FROM 
  
 Documents 
 WHERE 
  
 NullableDocEmbedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 WordCount 
 > 
 1000 
 ORDER 
  
 BY 
  
 APPROX_EUCLIDEAN_DISTANCE 
 ( 
  
 ARRAY<FLOAT32> 
 [ 
 1.0 
 , 
  
 2.0 
 , 
  
 3.0 
 ] 
 , 
  
 NullableDocEmbedding 
 , 
  
 options 
  
 = 
>  
 JSON 
  
 '{"num_leaves_to_search": 10}' 
 ) 
 LIMIT 
  
 100

PostgreSQL

  SELECT 
  
 doc_id 
 FROM 
  
 documents 
 WHERE 
  
 nullable_doc_embedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 word_count 
 > 
 1000 
 ORDER 
  
 BY 
  
 spanner 
 . 
 approx_euclidean_distance 
 ( 
  
 ARRAY 
 [ 
 1.0 
 , 
  
 2.0 
 , 
  
 3.0 
 ] 
 :: 
 float4 
 [], 
  
 nullable_doc_embedding 
 , 
  
 options 
 = 
> jsonb 
 '{"num_leaves_to_search": 10}' 
 ) 
 LIMIT 
  
 100 
 ;

What's next

Learn more about Spanner vector indexes .
Learn more about the approximate distance functions in GoogleSQL and PostgreSQL .
Learn more about index statements for GoogleSQL VECTOR INDEX and PostgreSQL INDEX .
Learn more about vector index best practices .
Try the Getting started with Spanner Vector Search for a step-by-step example of using ANN.

Find approximate nearest neighbors (ANN) and query vector embeddings Stay organized with collections Save and categorize content based on your preferences.

Query vector embeddings

GoogleSQL

PostgreSQL

GoogleSQL

PostgreSQL

GoogleSQL

PostgreSQL

GoogleSQL

PostgreSQL

What's next

Find approximate nearest neighbors (ANN) and query vector embeddings