Perform similarity vector search in Bigtable by finding the K-nearest neighbors

Similarity vector search can help you identify similar concepts and contextual meaning in your Bigtable data, which means it can provide more relevant results when filtering for data stored within a specified key range. Example use cases include the following:

  • Semantic matching of messages for a particular user in inbox search.
  • Anomaly detection within a range of sensors.
  • Retrieving the most relevant documents within a set of known keys for retrieval augmented generation (RAG).
  • Personalization of search results to enhance a user's search experience by retrieving and ranking results based on their historical prompts and preferences that Bigtable stores.
  • Retrieval of similar conversation threads to find and display past conversations that are contextually similar to a user's current chat for a more personalized experience.
  • Prompt deduplication to identify identical or semantically similar prompts submitted by the same user and avoid redundant AI processing.

This page describes how to perform similarity vector search in Bigtable by using the cosine distance and Euclidean distance vector functions in GoogleSQL for Bigtable to find K-nearest neighbors. Before you read this page, it's important that you understand the following concepts:

Bigtable supports the COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() functions, which operate on vector embeddings, letting you find the KNN of the input embedding.

You can use the Vertex AI text embeddings APIs to generate and store your Bigtable data as vector embeddings. You can then provide these vector embeddings as an input parameter in your query to find the nearest vectors in N-dimensional space to search for semantically similar or related items.
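To make the idea of "nearest vectors" concrete, here is a minimal pure-Python sketch of a brute-force K-nearest-neighbor search. It is illustrative only and does not show how Bigtable evaluates queries. The helper names (cosine_distance, euclidean_distance, k_nearest) and the toy three-dimensional vectors are assumptions made for this example, and cosine distance is taken to be 1 minus the cosine similarity.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Return the straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(
    query: Vector,
    rows: Dict[str, Vector],
    k: int,
    distance: Callable[[Vector, Vector], float] = cosine_distance,
) -> List[Tuple[str, float]]:
    """Score every row against the query and keep the k smallest distances."""
    scored = [(key, distance(query, vector)) for key, vector in rows.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Toy data: row keys mapped to hypothetical 3-dimensional embeddings.
rows = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.0, 1.0, 0.0],
    "doc_2": [0.9, 0.1, 0.0],
}
nearest = k_nearest([1.0, 0.0, 0.0], rows, k=2)
print(nearest)
```

In this toy data set, doc_0 and doc_2 point in nearly the same direction as the query vector, so they rank ahead of doc_1. The SQL functions play the role of the distance argument, and ORDER BY with LIMIT plays the role of the sort and slice.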

Both distance functions take the arguments vector1 and vector2, which are of the type array<> and must have the same length (that is, the same number of dimensions). For more details about these functions, see the following:

The code on this page demonstrates how to create embeddings, store them in Bigtable, and then perform a KNN search.

The example on this page uses COSINE_DISTANCE() and the Bigtable client library for Python. However, you can also use EUCLIDEAN_DISTANCE() and any client library that supports GoogleSQL for Bigtable, such as the Bigtable client library for Java.

Before you begin

Complete the following before you try the code samples.

Required roles

To get the permissions that you need to read and write to Bigtable, ask your administrator to grant you the following IAM role:

Set up your environment

  1. Download and install the Bigtable client library for Python. To use GoogleSQL for Bigtable functions, you must use python-bigtable version 2.26.0 or later. Instructions, including how to set up authentication, are at Python hello world.

  2. If you don't have a Bigtable instance, follow the steps at Create an instance.

  3. Identify your resource IDs. When you run the code, replace the following placeholders with the IDs of your Google Cloud project, Bigtable instance, and table:

    • PROJECT_ID
    • INSTANCE_ID
    • TABLE_ID

Create a table to store the text, embeddings, and search phrase

Create a table with two column families.

Python

    from google.cloud import bigtable
    from google.cloud.bigtable import column_family

    client = bigtable.Client(project=PROJECT_ID, admin=True)
    instance = client.instance(INSTANCE_ID)
    table = instance.table(TABLE_ID)
    column_families = {
        "docs": column_family.MaxVersionsGCRule(2),
        "search_phrase": column_family.MaxVersionsGCRule(2),
    }

    if not table.exists():
        table.create(column_families=column_families)
    else:
        print("Table already exists")
 

Embed texts with a pre-trained, foundational model from Vertex

Generate the text and embeddings to store in Bigtable along with the associated keys. For additional documentation, see Get text embeddings or Get multimodal embeddings.

Python

    from typing import List, Optional
    from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
    from vertexai.generative_models import GenerativeModel

    #defines which LLM that we should use to generate the text
    model = GenerativeModel("gemini-1.5-pro-001")

    #First, use generative AI to create a list of 10 chunks for phrases
    #This can be replaced with a static list of text items or your own data
    chunks = []
    for i in range(10):
        response = model.generate_content(
            "Generate a paragraph between 10 and 20 words that is about either Bigtable or Generative AI"
        )
        chunks.append(response.text)
        print(response.text)

    #create embeddings for the chunks of text
    def embed_text(
        texts: List[str] = chunks,
        task: str = "RETRIEVAL_DOCUMENT",
        model_name: str = "text-embedding-004",
        dimensionality: Optional[int] = 128,
    ) -> List[List[float]]:
        """Embeds texts with a pre-trained, foundational model."""
        model = TextEmbeddingModel.from_pretrained(model_name)
        inputs = [TextEmbeddingInput(text, task) for text in texts]
        kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}
        embeddings = model.get_embeddings(inputs, **kwargs)
        return [embedding.values for embedding in embeddings]

    embeddings = embed_text()
    print("embeddings created for text phrases")
 

Define functions that let you convert into byte objects

Bigtable is optimized for key-value pairs and generally stores data as byte objects. For more information about designing your data model for Bigtable, see Schema design best practices.

You need to convert the embeddings that come back from Vertex AI, which are stored as a list of floating-point numbers in Python. Convert each element to big-endian IEEE 754 single-precision (4-byte) floating-point format, and then concatenate the results. The following function does this.

Python

    import struct

    def floats_to_bytes(float_list):
        """
        Convert a list of floats to a bytes object, where each float is represented
        by 4 big-endian bytes.

        Parameters:
        float_list (list of float): The list of floats to be converted.

        Returns:
        bytes: The resulting bytes object with concatenated 4-byte big-endian
        representations of the floats.
        """
        byte_array = bytearray()
        for value in float_list:
            packed_value = struct.pack('>f', value)
            byte_array.extend(packed_value)

        # Convert bytearray to bytes
        return bytes(byte_array)
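As a quick check on this encoding, the following sketch packs a list of floats and then unpacks it again. The bytes_to_floats helper is hypothetical and is not part of the original sample. It is included only to show that each value occupies 4 bytes and is stored as a 32-bit float, so values that are not exactly representable in 32 bits come back slightly rounded.

```python
import struct
from typing import List


def floats_to_bytes(float_list: List[float]) -> bytes:
    """Pack floats as concatenated 4-byte big-endian IEEE 754 values."""
    # A single pack call with a repeat count produces the same bytes as
    # packing each value with '>f' and concatenating the results.
    return struct.pack(f">{len(float_list)}f", *float_list)


def bytes_to_floats(data: bytes) -> List[float]:
    """Hypothetical inverse helper: unpack 4-byte big-endian floats."""
    if len(data) % 4 != 0:
        raise ValueError("byte length must be a multiple of 4")
    count = len(data) // 4
    return list(struct.unpack(f">{count}f", data))


# These values are exactly representable in 32 bits, so they round-trip.
original = [1.0, 0.5, -2.25]
encoded = floats_to_bytes(original)
decoded = bytes_to_floats(encoded)
print(len(encoded), decoded)
```

A 128-dimensional embedding encoded this way occupies 512 bytes per cell.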
 

Write the embeddings to Bigtable

Convert the embeddings to byte objects, create a mutation, and then write the data to Bigtable.

Python

    from google.cloud.bigtable.data import RowMutationEntry
    from google.cloud.bigtable.data import SetCell

    mutations = []
    embeddings = embed_text()
    for i, embedding in enumerate(embeddings):
        print(embedding)

        #convert each embedding into a byte object
        vector = floats_to_bytes(embedding)

        #set the row key which will be used to pull the range of documents (ex. doc type or user id)
        row_key = f"doc_{i}"

        row = table.direct_row(row_key)

        #set the column for the embedding based on the byte object format of the embedding
        row.set_cell("docs", "embedding", vector)
        #store the text associated with vector in the same key
        row.set_cell("docs", "text", chunks[i])

        mutations.append(row)

    #write the rows to Bigtable
    table.mutate_rows(mutations)
 

The vectors are stored as binary-encoded data that can be read from Bigtable using a conversion function from the BYTES type to ARRAY<FLOAT32>.

Here is the SQL query:

    SELECT _key, TO_VECTOR32(data['embedding']) AS embedding
    FROM table
    WHERE _key LIKE 'store123%';
 

In Python, you can use the GoogleSQL COSINE_DISTANCE function to find the similarity between your text embeddings and the search phrases that you give it. Since this computation can take time to process, use the Python client library's asynchronous data client to execute the SQL query.

Python

    from google.cloud.bigtable.data import BigtableDataClientAsync

    #first embed the search phrase
    search_embedding = embed_text(texts=["Apache HBase"])

    query = """
            select _key, docs['text'] as description
            FROM knn_intro
            ORDER BY COSINE_DISTANCE(TO_VECTOR32(docs['embedding']), {search_embedding})
            LIMIT 1;
            """

    async def execute_query():
        async with BigtableDataClientAsync(project=PROJECT_ID) as client:
            local_query = query
            async for row in await client.execute_query(
                query.format(search_embedding=search_embedding[0]), INSTANCE_ID
            ):
                return (row["_key"], row["description"])

    await execute_query()
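The introduction mentions restricting a search to a range of row keys. One way this might be combined with distance ordering is sketched below. Note that build_knn_query is a hypothetical helper, and the table name, column names, and key prefix are placeholders taken from the samples on this page. Whether this exact combination of a LIKE filter and ORDER BY suits your schema is an assumption to verify against the GoogleSQL for Bigtable reference. The sketch only builds the query string and does not contact Bigtable.

```python
from typing import List

ALLOWED_DISTANCE_FUNCTIONS = ("COSINE_DISTANCE", "EUCLIDEAN_DISTANCE")


def build_knn_query(
    table: str,
    key_prefix: str,
    embedding: List[float],
    k: int,
    distance_fn: str = "COSINE_DISTANCE",
) -> str:
    """Build a KNN query string limited to rows whose key starts with key_prefix."""
    if distance_fn not in ALLOWED_DISTANCE_FUNCTIONS:
        raise ValueError(f"unsupported distance function: {distance_fn}")
    if k < 1:
        raise ValueError("k must be at least 1")
    if "'" in key_prefix:
        # Keep the sketch simple: reject input that would break the SQL literal.
        raise ValueError("key_prefix must not contain single quotes")

    # Render the embedding as an array literal such as [0.5, 1.0].
    vector_literal = "[" + ", ".join(repr(float(v)) for v in embedding) + "]"
    return (
        f"SELECT _key, docs['text'] AS description\n"
        f"FROM {table}\n"
        f"WHERE _key LIKE '{key_prefix}%'\n"
        f"ORDER BY {distance_fn}(TO_VECTOR32(docs['embedding']), {vector_literal})\n"
        f"LIMIT {k};"
    )


knn_query = build_knn_query("knn_intro", "store123", [0.5, 1.0], k=3)
print(knn_query)
```

The resulting string could be passed to execute_query in place of the formatted query in the preceding sample. Filtering on a key prefix first keeps the distance computation limited to the rows in that range.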
 

The query returns the row key and text of the stored paragraph whose embedding is closest to the search phrase. In this example, that is a generated text description of Bigtable.

What's next
