Search embeddings with vector search

This tutorial shows you how to perform a similarity search on embeddings stored in BigQuery tables by using the VECTOR_SEARCH function and optionally a vector index .

When you use VECTOR_SEARCH with a vector index, VECTOR_SEARCH uses the Approximate Nearest Neighbor method to improve vector search performance, with the trade-off of reducing recall and so returning more approximate results. Without a vector index, VECTOR_SEARCH uses brute force search to measure distance for every record.

Required permissions

To run this tutorial, you need the following Identity and Access Management (IAM) permissions:

To create a dataset, you need the bigquery.datasets.create permission.
To create a table, you need the following permissions:
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.jobs.create
To create a vector index, you need the bigquery.tables.createIndex permission on the table where you're creating the index.
To drop a vector index, you need the bigquery.tables.deleteIndex permission on the table where you're dropping the index.

Each of the following predefined IAM roles includes the permissions that you need to work with vector indexes:

BigQuery Data Owner ( roles/bigquery.dataOwner )
BigQuery Data Editor ( roles/bigquery.dataEditor )

Costs

The VECTOR_SEARCH function uses BigQuery compute pricing . You are charged for similarity search, using on-demand or editions pricing.

On-demand: You are charged for the amount of bytes scanned in the base table, the index, and the search query.
Editions pricing: You are charged for the slots required to complete the job within your reservation edition. Larger, more complex similarity calculations incur more charges.

Note: Using an index isn't supported in Standard editions .

For more information, see BigQuery pricing .

Before you begin

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Note : If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Go to project selector
Verify that billing is enabled for your Google Cloud project .
Enable the BigQuery API.

Enable the API

Create a dataset

Create a BigQuery dataset:

In the Google Cloud console, go to the BigQuery page.

Go to the BigQuery page
In the Explorerpane, click your project name.
Click View actions > Create dataset.
On the Create datasetpage, do the following:
- For Dataset ID, enter vector_search .
- For Location type, select Multi-region, and then select US (multiple regions in United States).
  
  The public datasets are stored in the US multi-region . For simplicity, store your dataset in the same location.
- Leave the remaining default settings as they are, and click Create dataset.

Create test tables

Create the patents table that contains patents embeddings, based on a subset of the Google Patents public dataset:

 CREATE 
  
 TABLE 
  
 vector_search 
 . 
 patents 
  
 AS 
 SELECT 
  
 * 
  
 FROM 
  
 `patents-public-data.google_patents_research.publications` 
 WHERE 
  
 ARRAY_LENGTH 
 ( 
 embedding_v1 
 ) 
  
 > 
  
 0 
  
 AND 
  
 publication_number 
  
 NOT 
  
 IN 
  
 ( 
 'KR-20180122872-A' 
 ) 
 LIMIT 
  
 1000000 
 ;

Create the patents2 table that contains a patent embedding to find nearest neighbors for:

 CREATE 
  
 TABLE 
  
 vector_search 
 . 
 patents2 
  
 AS 
 SELECT 
  
 * 
  
 FROM 
  
 `patents-public-data.google_patents_research.publications` 
 WHERE 
  
 publication_number 
  
 = 
  
 'KR-20180122872-A' 
 ;

Create a vector index

Create the my_index vector index on the embeddings_v1 column of the patents table:

 CREATE 
  
 OR 
  
 REPLACE 
  
 VECTOR 
  
 INDEX 
  
 my_index 
  
 ON 
  
 vector_search 
 . 
 patents 
 ( 
 embedding_v1 
 ) 
 STORING 
 ( 
 publication_number 
 , 
  
 title 
 ) 
 OPTIONS 
 ( 
 distance_type 
 = 
 'COSINE' 
 , 
  
 index_type 
 = 
 'IVF' 
 );

Wait several minutes for the vector index to be created, then run the following query and confirm that the coverage_percentage value is 100 :
```
 SELECT 
  
 * 
  
 FROM 
  
 vector_search 
 . 
 INFORMATION_SCHEMA 
 . 
 VECTOR_INDEXES 
 ; 
```

Use the `VECTOR_SEARCH` function with an index

After the vector index is created and populated, use the VECTOR_SEARCH function to find the nearest neighbor for the embedding in the embedding_v1 column in the patents2 table. This query uses the vector index in the search, so VECTOR_SEARCH uses an Approximate Nearest Neighbor method to find the embedding's nearest neighbor.

Use the VECTOR_SEARCH function with an index:

 SELECT 
  
 query 
 . 
 publication_number 
  
 AS 
  
 query_publication_number 
 , 
  
 query 
 . 
 title 
  
 AS 
  
 query_title 
 , 
  
 base 
 . 
 publication_number 
  
 AS 
  
 base_publication_number 
 , 
  
 base 
 . 
 title 
  
 AS 
  
 base_title 
 , 
  
 distance 
 FROM 
  
 VECTOR_SEARCH 
 ( 
  
 TABLE 
  
 vector_search 
 . 
 patents 
 , 
  
 'embedding_v1' 
 , 
  
 TABLE 
  
 vector_search 
 . 
 patents2 
 , 
  
 top_k 
  
 => 
  
 5 
 , 
  
 distance_type 
  
 => 
  
 'COSINE' 
 , 
  
 options 
  
 => 
  
 '{"fraction_lists_to_search": 0.005}' 
 );

The results look similar to the following:

+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| query_publication_number |                         query_title                         | base_publication_number |                                                        base_title                                                        |      distance       |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-106599080-B          | A kind of rapid generation for keeping away big vast transfer figure based on GIS                                        | 0.14471956347590609 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-114118544-A          | Urban waterlogging detection method and device                                                                           | 0.17472108931171348 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-20200048143-A        | Method and system for mornitoring dry stream using unmanned aerial vehicle                                               | 0.17561990745619782 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-101721695-B1         | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same         | 0.17696129365559843 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-109000731-B          | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642917 |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+

Use the `VECTOR_SEARCH` function with brute force

Use the VECTOR_SEARCH function to find the nearest neighbor for the embedding in the embedding_v1 column in the patents2 table. This query doesn't use the vector index in the search, so VECTOR_SEARCH finds the embedding's exact nearest neighbor.

 SELECT 
  
 query 
 . 
 publication_number 
  
 AS 
  
 query_publication_number 
 , 
  
 query 
 . 
 title 
  
 AS 
  
 query_title 
 , 
  
 base 
 . 
 publication_number 
  
 AS 
  
 base_publication_number 
 , 
  
 base 
 . 
 title 
  
 AS 
  
 base_title 
 , 
  
 distance 
 FROM 
  
 VECTOR_SEARCH 
 ( 
  
 TABLE 
  
 vector_search 
 . 
 patents 
 , 
  
 'embedding_v1' 
 , 
  
 TABLE 
  
 vector_search 
 . 
 patents2 
 , 
  
 top_k 
  
 => 
  
 5 
 , 
  
 distance_type 
  
 => 
  
 'COSINE' 
 , 
  
 options 
  
 => 
  
 '{"use_brute_force":true}' 
 );

The results look similar to the following:

+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| query_publication_number |                         query_title                         | base_publication_number |                                                        base_title                                                        |      distance       |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-106599080-B          | A kind of rapid generation for keeping away big vast transfer figure based on GIS                                        |  0.1447195634759062 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-114118544-A          | Urban waterlogging detection method and device                                                                           |  0.1747210893117136 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-20200048143-A        | Method and system for mornitoring dry stream using unmanned aerial vehicle                                               | 0.17561990745619782 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-101721695-B1         | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same         | 0.17696129365559843 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-109000731-B          | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642928 |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+

Evaluate recall

When you perform a vector search with an index, it returns approximate results, with the trade-off of reducing recall . You can compute recall by comparing the results returned by vector search with an index and by vector search with brute force. In this dataset, the publication_number value uniquely identifies a patent, so it is used for comparison.

 WITH 
  
 approx_results 
  
 AS 
  
 ( 
  
 SELECT 
  
 query 
 . 
 publication_number 
  
 AS 
  
 query_publication_number 
 , 
  
 base 
 . 
 publication_number 
  
 AS 
  
 base_publication_number 
  
 FROM 
  
 VECTOR_SEARCH 
 ( 
  
 TABLE 
  
 vector_search 
 . 
 patents 
 , 
  
 'embedding_v1' 
 , 
  
 TABLE 
  
 vector_search 
 . 
 patents2 
 , 
  
 top_k 
  
 => 
  
 5 
 , 
  
 distance_type 
  
 => 
  
 'COSINE' 
 , 
  
 options 
  
 => 
  
 '{"fraction_lists_to_search": 0.005}' 
 ) 
 ), 
  
 exact_results 
  
 AS 
  
 ( 
  
 SELECT 
  
 query 
 . 
 publication_number 
  
 AS 
  
 query_publication_number 
 , 
  
 base 
 . 
 publication_number 
  
 AS 
  
 base_publication_number 
  
 FROM 
  
 VECTOR_SEARCH 
 ( 
  
 TABLE 
  
 vector_search 
 . 
 patents 
 , 
  
 'embedding_v1' 
 , 
  
 TABLE 
  
 vector_search 
 . 
 patents2 
 , 
  
 top_k 
  
 => 
  
 5 
 , 
  
 distance_type 
  
 => 
  
 'COSINE' 
 , 
  
 options 
  
 => 
  
 '{"use_brute_force":true}' 
 ) 
 ) 
 SELECT 
  
 a 
 . 
 query_publication_number 
 , 
  
 SUM 
 ( 
 CASE 
  
 WHEN 
  
 a 
 . 
 base_publication_number 
  
 = 
  
 e 
 . 
 base_publication_number 
  
 THEN 
  
 1 
  
 ELSE 
  
 0 
  
 END 
 ) 
  
 / 
  
 5 
  
 AS 
  
 recall 
 FROM 
  
 exact_results 
  
 e 
  
 LEFT 
  
 JOIN 
  
 approx_results 
  
 a 
  
 ON 
  
 e 
 . 
 query_publication_number 
  
 = 
  
 a 
 . 
 query_publication_number 
 GROUP 
  
 BY 
  
 a 
 . 
 query_publication_number

If the recall is lower than you would like, you can increase the fraction_lists_to_search value, with the downside of potentially higher latency and resource usage. To tune your vector search, you can try multiple runs of VECTOR_SEARCH with different argument values, save the results to tables, and then compare the results.

Clean up

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete .
In the dialog, type the project ID, and then click Shut down to delete the project.

Search embeddings with vector search

Required permissions

Costs

Before you begin

Create a dataset

Create test tables

Create a vector index

Use the VECTOR_SEARCH function with an index

Use the VECTOR_SEARCH function with brute force

Evaluate recall

Clean up

Use the `VECTOR_SEARCH` function with an index

Use the `VECTOR_SEARCH` function with brute force