Find approximate nearest neighbors (ANN) and query vector embeddings

This page describes how to find approximate nearest neighbors (ANN) and query vector embeddings using the ANN distance functions.

When a dataset is small, you can use K-nearest neighbors (KNN) to find the exact k-nearest vectors. However, as your dataset grows, the latency and cost of a KNN search also increase. You can use ANN to find the approximate k-nearest neighbors with significantly reduced latency and cost.

In an ANN search, the k-returned vectors aren't the true top k-nearest neighbors because the ANN search calculates approximate distances and might not look at all the vectors in the dataset. Occasionally, a few vectors that aren't among the top k-nearest neighbors are returned. This is known as recall loss . How much recall loss is acceptable to you depends on the use case, but in most cases, losing a bit of recall in return for improved database performance is an acceptable tradeoff.

For more details about the approximate distance functions supported in Spanner, see the following GoogleSQL reference pages:

Query vector embeddings

Spanner accelerates approximate nearest neighbor (ANN) vector searches by using a vector index . You can use a vector index to query vector embeddings. To query vector embeddings, you must first create a vector index . You can then use any one of the three approximate distance functions to find the ANN.

Restrictions when using the approximate distance functions include the following:

  • The approximate distance function must calculate the distance between an embedding column and a constant expression (for example, a parameter or a literal).
  • The approximate distance function output must be used in a ORDER BY clause as the sole sort key, and a LIMIT must be specified after the ORDER BY .
  • The query must explicitly filter out rows that aren't indexed. In most cases, this means that the query must include a WHERE <column_name> IS NOT NULL clause that matches the vector index definition, unless the column is already marked as NOT NULL in the table definition.

For a detailed list of limitations, see the approximate distance function reference page .

Examples

Consider a Documents table that has a DocEmbedding column of precomputed text embeddings from the DocContents bytes column, and a NullableDocEmbedding column populated from other sources that might be null.

  CREATE 
  
 TABLE 
  
 Documents 
  
 ( 
 UserId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
 DocId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
 Author 
  
 STRING 
 ( 
 1024 
 ), 
 DocContents 
  
 BYTES 
 ( 
 MAX 
 ), 
 DocEmbedding 
  
 ARRAY<FLOAT32> 
  
 NOT 
  
 NULL 
 , 
 NullableDocEmbedding 
  
 ARRAY<FLOAT32> 
 , 
 WordCount 
  
 INT64 
 ) 
  
 PRIMARY 
  
 KEY 
  
 ( 
 UserId 
 , 
  
 DocId 
 ); 
 

To search for the nearest 100 vectors to [1.0, 2.0, 3.0] :

  SELECT 
  
 DocId 
 FROM 
  
 Documents 
 WHERE 
  
 WordCount 
 > 
 1000 
 ORDER 
  
 BY 
  
 APPROX_EUCLIDEAN_DISTANCE 
 ( 
  
 ARRAY<FLOAT32> 
 [ 
 1.0 
 , 
  
 2.0 
 , 
  
 3.0 
 ] 
 , 
  
 DocEmbedding 
 , 
  
 options 
  
 = 
>  
 JSON 
  
 '{"num_leaves_to_search": 10}' 
 ) 
 LIMIT 
  
 100 
 

If the embedding column is nullable:

  SELECT 
  
 DocId 
 FROM 
  
 Documents 
 WHERE 
  
 NullableDocEmbedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 WordCount 
 > 
 1000 
 ORDER 
  
 BY 
  
 APPROX_EUCLIDEAN_DISTANCE 
 ( 
  
 ARRAY<FLOAT32> 
 [ 
 1.0 
 , 
  
 2.0 
 , 
  
 3.0 
 ] 
 , 
  
 NullableDocEmbedding 
 , 
  
 options 
  
 = 
>  
 JSON 
  
 '{"num_leaves_to_search": 10}' 
 ) 
 LIMIT 
  
 100 
 

What's next

Design a Mobile Site
View Site in Mobile | Classic
Share by: