Perform similarity vector search in Bigtable by finding the K-nearest neighbors
Similarity vector search can help you identify similar concepts and contextual meaning in your Bigtable data, which means it can provide more relevant results when filtering for data stored within a specified key range. Example use cases include the following:
- Semantic matching of messages for a particular user in inbox search.
- Anomaly detection within a range of sensors.
- Retrieving the most relevant documents within a set of known keys for retrieval augmented generation (RAG).
- Personalization of search results to enhance a user's search experience by retrieving and ranking results based on their historical prompts and preferences that Bigtable stores.
- Retrieval of similar conversation threads to find and display past conversations that are contextually similar to a user's current chat for a more personalized experience.
- Prompt deduplication to identify identical or semantically similar prompts submitted by the same user and avoid redundant AI processing.
This page describes how to perform similarity vector search in Bigtable by using the cosine distance and Euclidean distance vector functions in GoogleSQL for Bigtable to find K-nearest neighbors. Before you read this page, it's important that you understand the following concepts:
- Euclidean distance: measures the shortest (straight-line) distance between two vectors.
- Cosine distance: measures the angle between two vectors, computed as one minus the cosine of that angle.
- K-nearest neighbors (KNN): a supervised machine learning algorithm used to solve classification or regression problems.
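As a rough illustration of the two distance measures, the following sketch computes them in plain Python. The helper names are hypothetical and are not part of any Bigtable library, and cosine distance is taken to be one minus the cosine similarity.

```python
import math
from typing import Sequence


def euclidean_distance(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def cosine_distance(v1: Sequence[float], v2: Sequence[float]) -> float:
    """One minus the cosine of the angle between two nonzero vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / (norm1 * norm2)


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))     # 1.0
```

Vectors that point in the same direction have a cosine distance near 0 regardless of their magnitude, whereas Euclidean distance also reflects differences in magnitude.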
Bigtable supports the COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() functions, which operate on vector embeddings, letting you find the KNN of the input embedding.
You can use the Vertex AI text embeddings APIs to generate and store your Bigtable data as vector embeddings. You can then provide these vector embeddings as an input parameter in your query to find the nearest vectors in N-dimensional space to search for semantically similar or related items.
Both distance functions take the arguments vector1 and vector2, which are of the type array<> and must have the same number of dimensions, that is, the same length. For more details about these functions, see the COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() entries in the GoogleSQL for Bigtable reference documentation.
The code on this page demonstrates how to create embeddings, store them in Bigtable, and then perform a KNN search.
The example on this page uses COSINE_DISTANCE() and the Bigtable client library for Python. However, you can also use EUCLIDEAN_DISTANCE() and any client library that supports GoogleSQL for Bigtable, such as the Bigtable client library for Java.
Before you begin
Complete the following before you try the code samples.
Required roles
To get the permissions that you need to read and write to Bigtable, ask your administrator to grant you the following IAM role:
- Bigtable User (roles/bigtable.user) on the Bigtable instance that you want to send requests to
Set up your environment
- Download and install the Bigtable client library for Python. To use GoogleSQL for Bigtable functions, you must use python-bigtable version 2.26.0 or later. Instructions, including how to set up authentication, are at Python hello world.
- If you don't have a Bigtable instance, follow the steps at Create an instance.
- Identify your resource IDs. When you run the code, replace the following placeholders with the IDs of your Google Cloud project, Bigtable instance, and table:
  - PROJECT_ID
  - INSTANCE_ID
  - TABLE_ID
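One way to supply these placeholders is to define them once as constants before you run the samples. The following is a minimal sketch; the environment variable names and the fallback values are hypothetical, and any configuration source works equally well.

```python
import os

# Hypothetical environment variable names and fallback values.
# Replace them with the IDs of your own project, instance, and table.
PROJECT_ID = os.environ.get("BIGTABLE_PROJECT_ID", "my-project")
INSTANCE_ID = os.environ.get("BIGTABLE_INSTANCE_ID", "my-instance")
TABLE_ID = os.environ.get("BIGTABLE_TABLE_ID", "my-table")

print(PROJECT_ID, INSTANCE_ID, TABLE_ID)
```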
Create a table to store the text, embeddings, and search phrase
Create a table with two column families.
Python
from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project=PROJECT_ID, admin=True)
instance = client.instance(INSTANCE_ID)
table = instance.table(TABLE_ID)

# Keep at most two versions of each cell in both column families.
column_families = {
    "docs": column_family.MaxVersionsGCRule(2),
    "search_phrase": column_family.MaxVersionsGCRule(2),
}

if not table.exists():
    table.create(column_families=column_families)
else:
    print("Table already exists")
Embed texts with a pre-trained, foundational model from Vertex
Generate the text and embeddings to store in Bigtable along with the associated keys. For additional documentation, see Get text embeddings or Get multimodal embeddings .
Python
from typing import List, Optional

from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# Defines which LLM to use to generate the text.
model = GenerativeModel("gemini-1.5-pro-001")

# First, use generative AI to create a list of 10 chunks of text.
# This can be replaced with a static list of text items or your own data.
chunks = []
for i in range(10):
    response = model.generate_content(
        "Generate a paragraph between 10 and 20 words that is about either "
        "Bigtable or Generative AI"
    )
    chunks.append(response.text)
    print(response.text)


# Create embeddings for the chunks of text.
def embed_text(
    texts: List[str] = chunks,
    task: str = "RETRIEVAL_DOCUMENT",
    model_name: str = "text-embedding-004",
    dimensionality: Optional[int] = 128,
) -> List[List[float]]:
    """Embeds texts with a pre-trained, foundational model."""
    model = TextEmbeddingModel.from_pretrained(model_name)
    inputs = [TextEmbeddingInput(text, task) for text in texts]
    kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}
    embeddings = model.get_embeddings(inputs, **kwargs)
    return [embedding.values for embedding in embeddings]


embeddings = embed_text()
print("embeddings created for text phrases")
Define functions that let you convert into byte objects
Bigtable is optimized for key-value pairs and generally stores data as byte objects. For more information about designing your data model for Bigtable, see Schema design best practices .
The embeddings that Vertex AI returns are Python lists of floating-point numbers, so you need to convert them before you store them. Convert each element to big-endian IEEE 754 single-precision (4-byte) floating-point format, and then concatenate the results. The following function does this.
Python
import struct


def floats_to_bytes(float_list):
    """
    Convert a list of floats to a bytes object, where each float is represented
    by 4 big-endian bytes.

    Parameters:
        float_list (list of float): The list of floats to be converted.

    Returns:
        bytes: The resulting bytes object with concatenated 4-byte big-endian
        representations of the floats.
    """
    byte_array = bytearray()
    for value in float_list:
        packed_value = struct.pack('>f', value)
        byte_array.extend(packed_value)
    # Convert bytearray to bytes
    return bytes(byte_array)
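To check the encoding, it can help to decode the bytes back into floats. The following sketch uses a hypothetical helper, bytes_to_floats, which is not part of the Bigtable client library, and it encodes a small list inline with the same '>f' format so that it runs on its own.

```python
import struct


def bytes_to_floats(data: bytes) -> list:
    """Decode concatenated 4-byte big-endian floats back into a list."""
    if len(data) % 4 != 0:
        raise ValueError("byte length must be a multiple of 4")
    count = len(data) // 4
    return list(struct.unpack(f">{count}f", data))


# Encode the same way as floats_to_bytes: one '>f' pack per value.
original = [0.5, -1.25, 3.0]
encoded = b"".join(struct.pack(">f", value) for value in original)

print(len(encoded))              # 12
print(bytes_to_floats(encoded))  # [0.5, -1.25, 3.0]
```

The values in this example are exactly representable in 32 bits. In general, Python floats are 64-bit, so packing them as 4-byte floats rounds them, and a round trip returns values that are close to the originals rather than identical.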
Write the embeddings to Bigtable
Convert the embeddings to byte objects, create a mutation, and then write the data to Bigtable.
Python
from google.cloud.bigtable.data import RowMutationEntry
from google.cloud.bigtable.data import SetCell

mutations = []
embeddings = embed_text()
for i, embedding in enumerate(embeddings):
    print(embedding)

    # Convert each embedding into a byte object.
    vector = floats_to_bytes(embedding)

    # Set the row key, which is used to pull the range of documents
    # (for example, a doc type or user ID).
    row_key = f"doc_{i}"
    row = table.direct_row(row_key)

    # Set the column for the embedding, using the byte object format of the embedding.
    row.set_cell("docs", "embedding", vector)

    # Store the text associated with the vector in the same row.
    row.set_cell("docs", "text", chunks[i])

    mutations.append(row)

# Write the rows to Bigtable.
table.mutate_rows(mutations)
Perform a KNN search using GoogleSQL for Bigtable
The vectors are stored as binary-encoded data that can be read from Bigtable using a conversion function from the BYTES type to ARRAY<FLOAT32>.
Here is the SQL query:
SELECT _key, TO_VECTOR32(data['embedding']) AS embedding
FROM table
WHERE _key LIKE 'store123%';
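The query above restricts the search to row keys that start with store123. A filter like this is only possible if the grouping value sits at the front of each row key. The following sketch shows one way to build such keys; the helper name, the '#' separator, and the store IDs are illustrative assumptions, not a layout that Bigtable requires.

```python
def make_row_key(group_id: str, doc_index: int) -> str:
    """Build a row key whose prefix identifies the group to search within."""
    # Hypothetical layout: <group>#doc_<index>, for example store123#doc_0.
    return f"{group_id}#doc_{doc_index}"


keys = [make_row_key("store123", i) for i in range(3)]
keys.append(make_row_key("store456", 0))

# Emulates the effect of: WHERE _key LIKE 'store123%'
matching = [key for key in keys if key.startswith("store123")]
print(matching)  # ['store123#doc_0', 'store123#doc_1', 'store123#doc_2']
```

With a layout like this, including the separator in the pattern (for example, 'store123#%') avoids also matching a longer ID such as store1234.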
In Python, you can use the GoogleSQL COSINE_DISTANCE function to find the similarity between your text embeddings and the search phrases that you give it. Since this computation can take time to process, use the Python client library's asynchronous data client to execute the SQL query.
Python
from google.cloud.bigtable.data import BigtableDataClientAsync

# First, embed the search phrase.
search_embedding = embed_text(texts=["Apache HBase"])

# The query reads from a table named knn_intro; replace it with your table ID.
query = """
SELECT _key, docs['text'] AS description
FROM knn_intro
ORDER BY COSINE_DISTANCE(TO_VECTOR32(docs['embedding']), {search_embedding})
LIMIT 1;
"""


async def execute_query():
    async with BigtableDataClientAsync(project=PROJECT_ID) as client:
        # The embedding is formatted into the query as an array literal.
        async for row in await client.execute_query(
            query.format(search_embedding=search_embedding[0]), INSTANCE_ID
        ):
            # LIMIT 1 returns a single row, so return the first result.
            return (row["_key"], row["description"])


# Top-level await works in a notebook. In a script, use
# asyncio.run(execute_query()) instead.
await execute_query()
The query returns the row key and text of the stored chunk that is closest to the search phrase. In this example, the response is a generated text description that describes Bigtable.
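To see what the ORDER BY and LIMIT clauses in the query compute, the following sketch performs the same ranking locally in plain Python on toy data. The function names and the two-dimensional vectors are hypothetical, and this brute-force version only illustrates the idea; it is not how Bigtable executes the query.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(v1: Sequence[float], v2: Sequence[float]) -> float:
    """One minus the cosine similarity of two nonzero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norms


def k_nearest(query: Sequence[float], rows: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the keys of the k rows closest to the query vector."""
    # Similar in spirit to ORDER BY COSINE_DISTANCE(...) LIMIT k.
    ranked = sorted(rows, key=lambda key: cosine_distance(rows[key], query))
    return ranked[:k]


# Toy 2-dimensional embeddings; real embeddings have many more dimensions.
rows = {
    "doc_0": [1.0, 0.0],
    "doc_1": [0.0, 1.0],
    "doc_2": [0.8, 0.6],
}
print(k_nearest([0.9, 0.1], rows, k=2))  # ['doc_0', 'doc_2']
```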