Bigtable to Vertex AI Vector Search template

The Bigtable to Vertex AI Vector Search files on Cloud Storage template creates a batch pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in JSON format. Use this template to export vector embeddings for Vertex AI Vector Search.
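
Each record in the exported files is intended to follow the Vertex AI Vector Search JSON input format, typically one JSON object per line. The record below is only an illustrative sketch with made-up values; see the Vector Search input data format documentation for the authoritative schema.

  {"id": "doc-42", "embedding": [0.12, -0.53, 0.91], "crowding_tag": "group-1", "restricts": [{"namespace": "color", "allow": ["red"]}]}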

Pipeline requirements

  • The Bigtable table must exist.
  • The output Cloud Storage bucket must exist before you run the pipeline.
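
Both resources must be created ahead of time. If you still need them, the following commands are one way to create the table and bucket (the instance, table, column family, bucket name, and location here are hypothetical; adjust them to your environment):

  # Create a table with one column family in an existing Bigtable instance.
  gcloud bigtable instances tables create my-table \
      --instance=my-instance \
      --column-families=cf

  # Create the Cloud Storage bucket that receives the JSON output.
  gcloud storage buckets create gs://my-output-bucket --location=us-central1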

Template parameters

Required parameters

  • bigtableProjectId: The ID for the Google Cloud project that contains the Bigtable instance that you want to read data from.
  • bigtableInstanceId: The ID of the Bigtable instance that contains the table.
  • bigtableTableId: The ID of the Bigtable table to read from.
  • outputDirectory: The Cloud Storage path where the output JSON files are stored. For example, gs://your-bucket/your-path/.
  • idColumn: The fully qualified column name where the ID is stored, in the format cf:col or _key.
  • embeddingColumn: The fully qualified column name where the embeddings are stored, in the format cf:col or _key. See the example after this list.
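
For example, the following values use the Bigtable row key as the ID and read embeddings from a hypothetical column family cf with the column qualifier embedding:

  idColumn=_key
  embeddingColumn=cf:embedding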

Optional parameters

  • filenamePrefix: The prefix of the JSON filename. For example: table1-. If no value is provided, defaults to part.
  • crowdingTagColumn: The fully qualified column name where the crowding tag is stored, in the format cf:col or _key.
  • embeddingByteSize: The byte size of each entry in the embeddings array. For float, use the value 4. For double, use the value 8. Defaults to 4.
  • allowRestrictsMappings: The comma-separated, fully qualified column names for the columns to use as the allow restricts, with their aliases, in the format cf:col->alias. See the example after this list.
  • denyRestrictsMappings: The comma-separated, fully qualified column names for the columns to use as the deny restricts, with their aliases, in the format cf:col->alias.
  • intNumericRestrictsMappings: The comma-separated, fully qualified column names of the columns to use as integer numeric_restricts, with their aliases, in the format cf:col->alias.
  • floatNumericRestrictsMappings: The comma-separated, fully qualified column names of the columns to use as float (4 bytes) numeric_restricts, with their aliases, in the format cf:col->alias.
  • doubleNumericRestrictsMappings: The comma-separated, fully qualified column names of the columns to use as double (8 bytes) numeric_restricts, with their aliases, in the format cf:col->alias.
  • bigtableAppProfileId: The ID of the Cloud Bigtable app profile to be used for the export. Defaults to: default.
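
As an illustration of the mapping format, the following hypothetical values map two allow-restrict columns and one integer numeric_restricts column to aliases (the column family cf and the qualifiers are invented for this example):

  allowRestrictsMappings=cf:color->color,cf:size->size
  intNumericRestrictsMappings=cf:year->year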

Run the template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Cloud Bigtable to Vector Embeddings template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud CLI

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/Cloud_Bigtable_to_Vector_Embeddings \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --parameters \
       bigtableProjectId=BIGTABLE_PROJECT_ID,\
       bigtableInstanceId=BIGTABLE_INSTANCE_ID,\
       bigtableTableId=BIGTABLE_TABLE_ID,\
       filenamePrefix=FILENAME_PREFIX,\
       idColumn=ID_COLUMN,\
       embeddingColumn=EMBEDDING_COLUMN

Replace the following:

  • JOB_NAME : a unique job name of your choice
  • VERSION : the version of the template that you want to use

    You can use the following values:

      • latest to use the latest version of the template
      • a version name, to pin the job to a specific released version of the template
  • REGION_NAME : the region where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID : the project ID
  • BIGTABLE_INSTANCE_ID : the instance ID
  • BIGTABLE_TABLE_ID : the table ID
  • FILENAME_PREFIX : the JSON file prefix
  • ID_COLUMN : the ID column
  • EMBEDDING_COLUMN : the embeddings column
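
For example, a fully substituted command might look like the following, where the project, instance, table, and column names are hypothetical:

gcloud dataflow jobs run bigtable-to-vector-embeddings-job \
    --gcs-location=gs://dataflow-templates-us-central1/latest/Cloud_Bigtable_to_Vector_Embeddings \
    --project=my-project \
    --region=us-central1 \
    --parameters \
       bigtableProjectId=my-project,\
       bigtableInstanceId=my-instance,\
       bigtableTableId=my-table,\
       filenamePrefix=vectors-,\
       idColumn=_key,\
       embeddingColumn=cf:embedding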

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/Cloud_Bigtable_to_Vector_Embeddings
{
   "jobName": "JOB_NAME",
   "parameters": {
     "bigtableProjectId": "BIGTABLE_PROJECT_ID",
     "bigtableInstanceId": "BIGTABLE_INSTANCE_ID",
     "bigtableTableId": "BIGTABLE_TABLE_ID",
     "filenamePrefix": "FILENAME_PREFIX",
     "idColumn": "ID_COLUMN",
     "embeddingColumn": "EMBEDDING_COLUMN"
   },
   "environment": {
      "maxWorkers": "10"
   }
}

Replace the following:

  • PROJECT_ID : the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME : a unique job name of your choice
  • VERSION : the version of the template that you want to use

    You can use the following values:

      • latest to use the latest version of the template
      • a version name, to pin the job to a specific released version of the template
  • LOCATION : the region where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID : the project ID
  • BIGTABLE_INSTANCE_ID : the instance ID
  • BIGTABLE_TABLE_ID : the table ID
  • FILENAME_PREFIX : the JSON file prefix
  • ID_COLUMN : the ID column
  • EMBEDDING_COLUMN : the embeddings column
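
One way to send this request is with curl and an access token from the gcloud CLI. The sketch below assumes the JSON body above is saved as request.json and uses the hypothetical values my-project and us-central1 for the project and region:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/templates:launch?gcsPath=gs://dataflow-templates-us-central1/latest/Cloud_Bigtable_to_Vector_Embeddings"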

What's next
