Create and manage vector indexes

This page explains how to create and manage Spanner vector indexes, which use approximate nearest neighbor (ANN) search and tree-based structures to accelerate vector similarity searches on your data.

Spanner accelerates approximate nearest neighbor (ANN) vector searches by using a specialized vector index. This index leverages Google Research's Scalable Nearest Neighbor (ScaNN) , a highly efficient nearest neighbor algorithm.

The vector index uses a tree-based structure to partition data and facilitate faster searches. Spanner offers both two-level and three-level tree configurations:

Two-level tree configuration: Leaf nodes ( num_leaves ) contain groups of closely related vectors along with their corresponding centroid. The root level consists of the centroids from all leaf nodes.
Three-level tree configuration: Similar in concept to a two-level tree, while introducing an additional branch layer ( num_branches ), from which leaf node centroids are further partitioned to form the root level ( num_leaves ).

Spanner picks an index for you. However, if you know that a specific index works best, then you can use the FORCE_INDEX hint to choose to use the most appropriate vector index for your use case.

For more information, see VECTOR INDEX statements for GoogleSQL and INDEX statements for PostgreSQL .

Limitations

You can't pre-split vector indexes. For more information, see Pre-splitting overview .

Create vector index

To optimize the recall and performance of a vector index, we recommend that you:

Create your vector index after most of the rows with embeddings are written to your database. You might also need to periodically rebuild the vector index after you insert new data. For more information, see Rebuild the vector index .
For GoogleSQL, use the STORING clause and for PostgreSQL, use the INCLUDE clause to store a copy of a column in the vector index. If a column value is stored in the vector index, then Spanner performs filtering at the index's leaf level to improve query performance. We recommend that you store a column if it's used in a filtering condition.
Use non-embedding key columns in the vector index. Key columns are similar to STORING or INCLUDE columns, but allow the query engine to perform filtering more efficiently during vector search. For more information, see Create vector index (GoogleSQL) or Index statements (PostgreSQL).

When you create your table, the embedding column must be an array of the FLOAT32 (GoogleSQL) or float4[] (PostgreSQL) data type (recommended), and have a vector length annotation ( vector_length=>N for GoogleSQL or VECTOR LENGTH N for PostgreSQL), indicating the dimension of the vectors.

The optimal vector length depends on your workload, dataset size, and available computational resources. Experiment with different dimensions to find the smallest size that maintains accuracy and performance for your application.

The following DDL statement creates a Documents table with an embedding column DocEmbedding with a vector length:

GoogleSQL

  CREATE 
  
 TABLE 
  
 Documents 
  
 ( 
  
 UserId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
  
 DocId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
  
 Author 
  
 STRING 
  
 ( 
 1024 
 ), 
  
 DocContents 
  
 Bytes 
 ( 
 MAX 
 ), 
  
 DocEmbedding 
  
 ARRAY<FLOAT32> 
 ( 
 vector_length 
 = 
> 128 
 ) 
  
 NOT 
  
 NULL 
 , 
  
 NullableDocEmbedding 
  
 ARRAY<FLOAT32> 
 ( 
 vector_length 
 = 
> 128 
 ), 
  
 WordCount 
  
 INT64 
 ) 
  
 PRIMARY 
  
 KEY 
  
 ( 
 DocId 
 );

PostgreSQL

  CREATE 
  
 TABLE 
  
 documents 
  
 ( 
  
 user_id 
  
 bigint 
  
 not 
  
 null 
 , 
  
 doc_id 
  
 bigint 
  
 not 
  
 null 
 , 
  
 author 
  
 varchar 
 ( 
 1024 
 ), 
  
 doc_contents 
  
 bytea 
 , 
  
 doc_embedding 
  
 float4 
 [] 
  
 VECTOR 
  
 LENGTH 
  
 128 
  
 not 
  
 null 
 , 
  
 nullable_doc_embedding 
  
 float4 
 [] 
  
 VECTOR 
  
 LENGTH 
  
 128 
 , 
  
 word_count 
  
 bigint 
 , 
  
 PRIMARY 
  
 KEY 
  
 ( 
 doc_id 
 ) 
 );

After you populate your Documents table, you can create a vector index with a two-level tree and 1000 leaf nodes on the Documents table with an embedding column DocEmbedding using the cosine distance:

GoogleSQL

  CREATE 
  
 VECTOR 
  
 INDEX 
  
 DocEmbeddingIndex 
  
 ON 
  
 Documents 
 ( 
 DocEmbedding 
 ) 
  
 STORING 
  
 ( 
 WordCount 
 ) 
  
 OPTIONS 
  
 ( 
 distance_type 
  
 = 
  
 'COSINE' 
 , 
  
 tree_depth 
  
 = 
  
 2 
 , 
  
 num_leaves 
  
 = 
  
 1000 
 );

PostgreSQL

  CREATE 
  
 INDEX 
  
 doc_embedding_index 
  
 ON 
  
 documents 
  
 USING 
  
 scann 
 ( 
 doc_embedding 
 ) 
  
 INCLUDE 
  
 ( 
 word_count 
 ) 
  
 WITH 
  
 ( 
 distance_type 
  
 = 
  
 'COSINE' 
 , 
  
 num_leaves 
  
 = 
  
 1000 
 ) 
  
 WHERE 
  
 doc_embedding 
  
 IS 
  
 NOT 
  
 NULL 
 ;

If your embedding column isn't marked as NOT NULL in the table definition, you must declare it with a WHERE COLUMN_NAME IS NOT NULL clause in the vector index definition, where COLUMN_NAME is the name of your embedding column. To create a vector index with a three-level tree and 1000000 leaf nodes on the nullable embedding column NullableDocEmbedding using the cosine distance:

GoogleSQL

  CREATE 
  
 VECTOR 
  
 INDEX 
  
 DocEmbeddingThreeLevelIndex 
  
 ON 
  
 Documents 
 ( 
 NullableDocEmbedding 
 ) 
  
 STORING 
  
 ( 
 WordCount 
 ) 
  
 WHERE 
  
 NullableDocEmbedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 OPTIONS 
  
 ( 
 distance_type 
  
 = 
  
 'COSINE' 
 , 
  
 tree_depth 
  
 = 
  
 3 
 , 
  
 num_branches 
 = 
 1000 
 , 
  
 num_leaves 
  
 = 
  
 1000000 
 );

PostgreSQL

  CREATE 
  
 INDEX 
  
 doc_embedding_index 
  
 ON 
  
 documents 
  
 USING 
  
 scann 
 ( 
 nullable_doc_embedding 
 ) 
  
 INCLUDE 
  
 ( 
 word_count 
 ) 
  
 WITH 
  
 ( 
 distance_type 
  
 = 
  
 'COSINE' 
 , 
  
 tree_depth 
  
 = 
  
 3 
 , 
  
 num_branches 
  
 = 
  
 1000 
 , 
  
 num_leaves 
  
 = 
  
 1000000 
 ) 
  
 WHERE 
  
 nullable_doc_embedding 
  
 IS 
  
 NOT 
  
 NULL 
 ;

Filter a vector index

You can also create a filtered vector index to find the most similar items in your database that match the filter condition. A filtered vector index selectively indexes rows that satisfy the specified filter conditions, improving search performance.

In the following example, the table Documents2 has a column called Category . In our vector search, we want to index the "Tech" category so we create a generated column that evaluates to NULL if the category condition isn't met.

GoogleSQL

  CREATE 
  
 TABLE 
  
 Documents2 
  
 ( 
  
 UserId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
  
 DocId 
  
 INT64 
  
 NOT 
  
 NULL 
 , 
  
 DocName 
  
 STRING 
  
 ( 
 1024 
 ), 
  
 Author 
  
 STRING 
  
 ( 
 1024 
 ), 
  
 DocContents 
  
 Bytes 
 ( 
 MAX 
 ), 
  
 Category 
  
 STRING 
 ( 
 MAX 
 ), 
  
 NullIfFiltered 
  
 BOOL 
  
 AS 
  
 ( 
 IF 
 ( 
 Category 
  
 = 
  
 'Tech' 
 , 
  
 TRUE 
 , 
  
 NULL 
 )) 
  
 HIDDEN 
 , 
  
 DocEmbedding 
  
 ARRAY<FLOAT32> 
 ( 
 vector_length 
 = 
> 128 
 ) 
 ) 
  
 PRIMARY 
  
 KEY 
  
 ( 
 DocId 
 );

PostgreSQL

  CREATE 
  
 TABLE 
  
 documents2 
  
 ( 
  
 user_id 
  
 bigint 
  
 not 
  
 null 
 , 
  
 doc_id 
  
 bigint 
  
 not 
  
 null 
 , 
  
 doc_name 
  
 varchar 
 ( 
 1024 
 ), 
  
 author 
  
 varchar 
 ( 
 1024 
 ), 
  
 doc_contents 
  
 bytea 
 , 
  
 category 
  
 varchar 
 , 
  
 null_if_filtered 
  
 boolean 
  
 GENERATED 
  
 ALWAYS 
  
 AS 
  
 ( 
 CASE 
  
 WHEN 
  
 category 
  
 = 
  
 'Tech' 
  
 THEN 
  
 true 
  
 END 
 ) 
  
 VIRTUAL 
  
 HIDDEN 
 , 
  
 doc_embedding 
  
 float4 
 [] 
  
 VECTOR 
  
 LENGTH 
  
 128 
 , 
  
 PRIMARY 
  
 KEY 
  
 ( 
 doc_id 
 ) 
 );

Then, we create a vector index with a filter. The TechDocEmbeddingIndex vector index only indexes documents in the "Tech" category.

GoogleSQL

  CREATE 
  
 VECTOR 
  
 INDEX 
  
 TechDocEmbeddingIndex 
  
 ON 
  
 Documents2 
 ( 
 DocEmbedding 
 ) 
  
 STORING 
 ( 
 NullIfFiltered 
 ) 
  
 WHERE 
  
 DocEmbedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 NullIfFiltered 
  
 IS 
  
 NOT 
  
 NULL 
  
 OPTIONS 
  
 (...);

PostgreSQL

  CREATE 
  
 INDEX 
  
 tech_doc_embedding_index 
  
 ON 
  
 documents2 
  
 USING 
  
 scann 
 ( 
 doc_embedding 
 ) 
  
 INCLUDE 
  
 ( 
 null_if_filtered 
 ) 
  
 WITH 
  
 ( 
 distance_type 
  
 = 
  
 'COSINE' 
 , 
  
 num_leaves 
  
 = 
  
 1000 
 ) 
  
 WHERE 
  
 doc_embedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 null_if_filtered 
  
 IS 
  
 NOT 
  
 NULL 
 ;

When Spanner runs the following query, which has filters that match the TechDocEmbeddingIndex , it automatically picks and is accelerated by TechDocEmbeddingIndex . The query only searches documents in the "Tech" category. You can also use the FORCE_INDEX hint ( @{FORCE_INDEX=TechDocEmbeddingIndex} for GoogleSQL or /*@ FORCE_INDEX = tech_doc_embedding_index */ for PostgreSQL) to force Spanner to use the index explicitly.

GoogleSQL

  SELECT 
  
 * 
 FROM 
  
 Documents2 
 WHERE 
  
 DocEmbedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 NullIfFiltered 
  
 IS 
  
 NOT 
  
 NULL 
 ORDER 
  
 BY 
  
 APPROX_ 
 (....) 
 LIMIT 
  
 10 
 ;

PostgreSQL

  SELECT 
  
 * 
 FROM 
  
 documents2 
 WHERE 
  
 doc_embedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 null_if_filtered 
  
 IS 
  
 NOT 
  
 NULL 
 ORDER 
  
 BY 
  
 spanner 
 . 
 approx_cosine_distance 
 ( 
 doc_embedding 
 , 
  
 ARRAY 
 [ 
 1.0 
 :: 
 float4 
 , 
  
 2.0 
 :: 
 float4 
 , 
  
 3.0 
 :: 
 float4 
 ]) 
 LIMIT 
  
 10 
 ;

To improve query performance, you can include non-embedding key columns in your vector index. This allows the query engine to more efficiently perform filtering during vector search.

In the index creation statement, you must list these additional key columns after the embedding column. For example, the following statement creates a vector index that includes the DocName and Author key columns for more efficient filtering:

GoogleSQL

  CREATE 
  
 VECTOR 
  
 INDEX 
  
 DocEmbeddingIndexWithKeys 
  
 ON 
  
 Documents2 
 ( 
 DocEmbedding 
 , 
  
 DocName 
 , 
  
 Author 
 ) 
  
 STORING 
 ( 
 NullIfFiltered 
 ) 
  
 WHERE 
  
 DocEmbedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 NullIfFiltered 
  
 IS 
  
 NOT 
  
 NULL 
  
 OPTIONS 
  
 (...);

PostgreSQL

  CREATE 
  
 INDEX 
  
 doc_embedding_index_with_keys 
  
 ON 
  
 documents2 
  
 USING 
  
 scann 
 ( 
 doc_embedding 
 , 
  
 doc_name 
 , 
  
 author 
 ) 
  
 INCLUDE 
  
 ( 
 null_if_filtered 
 ) 
  
 WITH 
  
 ( 
 distance_type 
  
 = 
  
 'COSINE' 
 , 
  
 num_leaves 
  
 = 
  
 1000 
 ) 
  
 WHERE 
  
 doc_embedding 
  
 IS 
  
 NOT 
  
 NULL 
  
 AND 
  
 null_if_filtered 
  
 IS 
  
 NOT 
  
 NULL 
 ;

What's next

Learn more about Spanner approximate nearest neighbors .
Learn more about the approximate distance functions in GoogleSQL and PostgreSQL .
Learn more about index statements for GoogleSQL VECTOR INDEX and PostgreSQL INDEX .
Learn more about vector index best practices .

Create and manage vector indexes Stay organized with collections Save and categorize content based on your preferences.

Limitations

Create vector index

GoogleSQL

PostgreSQL

GoogleSQL

PostgreSQL

GoogleSQL

PostgreSQL

Filter a vector index

GoogleSQL

PostgreSQL

GoogleSQL

PostgreSQL

GoogleSQL

PostgreSQL

GoogleSQL

PostgreSQL

What's next

Create and manage vector indexes