Embedding Ingestion and Vector Search with Apache Beam and BigQuery

Introduction

This Colab demonstrates how to use the Apache Beam RAG package to generate embeddings, ingest them into BigQuery, and perform vector similarity search .

The notebook is divided into two main parts:

Basic Example: Using the default schema for simple vector search
Advanced Example: Using a custom schema and metadata filtering

Example: Product Catalog

We'll work with a sample e-commerce dataset representing a product catalog. Each product has:

Structured fields: id , name , category , price , etc.
Detailed text descriptions:Longer text describing the product's features.
Additional metadata: brand , features , dimensions , etc.

Setup and Prerequisites

This example requires:

A Google Cloud project with BigQuery enabled
Apache Beam 2.64.0 or later

Install Packages and Dependencies

First, let's install the Python packages required for the embedding and ingestion pipeline:

  # Apache Beam with GCP support 
 
 pip  
install  
apache_beam [ 
interactive,gcp ] 
> = 
 2 
.64.0  
--quiet 
  # Huggingface sentence-transformers for embedding models 
 
 pip  
install  
sentence-transformers  
--quiet

Authenticate to Google Cloud

To connect to BigQuery, we authenticate with Google Cloud.

  PROJECT_ID 
 = 
 "" 
 # @param {type:"string"} 
 # Authentication and project setup 
 from 
  
 google.colab 
  
 import 
 auth 
 auth 
 . 
 authenticate_user 
 ( 
 project_id 
 = 
 PROJECT_ID 
 )

Create BigQuery Dataset

Let's set up a BigQuery dataset and table to store our embeddings:

  DATASET_ID 
 = 
 "" 
 # @param {type:"string"} 
 TEMP_GCS_LOCATION 
 = 
 "gs://" 
 # @param {type:"string"}

  from 
  
 google.cloud 
  
 import 
  bigquery 
 
 # Create BigQuery client 
 client 
 = 
  bigquery 
 
 . 
  Client 
 
 ( 
 project 
 = 
 PROJECT_ID 
 ) 
 # Create dataset 
 dataset_ref 
 = 
 client 
 . 
  dataset 
 
 ( 
 DATASET_ID 
 ) 
 try 
 : 
 client 
 . 
  get_dataset 
 
 ( 
 dataset_ref 
 ) 
 print 
 ( 
 f 
 "Dataset 
 { 
 DATASET_ID 
 } 
 already exists" 
 ) 
 except 
 Exception 
 : 
 dataset 
 = 
  bigquery 
 
 . 
  Dataset 
 
 ( 
 dataset_ref 
 ) 
 dataset 
 . 
 location 
 = 
 "US" 
 dataset 
 = 
 client 
 . 
  create_dataset 
 
 ( 
 dataset 
 ) 
 print 
 ( 
 f 
 "Created dataset 
 { 
 DATASET_ID 
 } 
 " 
 )

Importing Pipeline Components

We import the following for configuring our embedding ingestion pipeline:

Chunk , the structured that represents embeddable content with metadata
BigQueryVectorWriterConfig for configuring write behavior

  # Embedding-specific imports 
 from 
  
 apache_beam.ml.rag.ingestion.bigquery 
  
 import 
 BigQueryVectorWriterConfig 
 , 
 SchemaConfig 
 from 
  
 apache_beam.ml.rag.ingestion.base 
  
 import 
 VectorDatabaseWriteTransform 
 from 
  
 apache_beam.ml.rag.types 
  
 import 
 Chunk 
 , 
 Content 
 from 
  
 apache_beam.ml.rag.embeddings.huggingface 
  
 import 
 HuggingfaceTextEmbeddings 
 from 
  
 apache_beam.ml.rag.enrichment.bigquery_vector_search 
  
 import 
 ( 
 BigQueryVectorSearchParameters 
 , 
 BigQueryVectorSearchEnrichmentHandler 
 ) 
 # Apache Beam core 
 import 
  
 apache_beam 
  
 as 
  
 beam 
 from 
  
 apache_beam.options 
  
 import 
 pipeline_options 
 from 
  
 apache_beam.options.pipeline_options 
  
 import 
 PipelineOptions 
 from 
  
 apache_beam.ml.transforms.base 
  
 import 
 MLTransform 
 from 
  
 apache_beam.transforms.enrichment 
  
 import 
 Enrichment

Define helper functions

To run streaming examples we define helper functions to

Create a PubSub topic
Publish messages to a PubSub topic in a background thread

  # Set up PubSub topic 
 from 
  
 google.api_core.exceptions 
  
 import 
 AlreadyExists 
 from 
  
 google.cloud 
  
 import 
 pubsub_v1 
 import 
  
 threading 
 import 
  
 time 
 import 
  
 datetime 
 import 
  
 json 
 def 
  
 create_pubsub_topic 
 ( 
 project_id 
 , 
 topic 
 ): 
 publisher 
 = 
 pubsub_v1 
 . 
  PublisherClient 
 
 () 
 topic_path 
 = 
 publisher 
 . 
 topic_path 
 ( 
 project_id 
 , 
 topic 
 ) 
 try 
 : 
 topic 
 = 
 publisher 
 . 
 create_topic 
 ( 
 request 
 = 
 { 
 "name" 
 : 
 topic_path 
 }) 
 print 
 ( 
 f 
 "Created topic: 
 { 
 topic 
 . 
 name 
 } 
 " 
 ) 
 except 
 AlreadyExists 
 : 
 print 
 ( 
 f 
 "Topic 
 { 
 topic_path 
 } 
 already exists." 
 ) 
 return 
 topic_path 
 def 
  
 publisher_function 
 ( 
 project_id 
 , 
 topic 
 , 
 sample_data 
 ): 
  
 """Function that publishes sample queries to a PubSub topic. 
 This function runs in a separate thread and continuously publishes 
 messages to simulate real-time user queries. 
 """ 
 publisher 
 = 
 pubsub_v1 
 . 
  PublisherClient 
 
 () 
 topic_path 
 = 
 publisher 
 . 
 topic_path 
 ( 
 project_id 
 , 
 topic 
 ) 
 time 
 . 
 sleep 
 ( 
 15 
 ) 
 for 
 message 
 in 
 sample_data 
 : 
 # Convert to JSON and publish 
 data 
 = 
 json 
 . 
 dumps 
 ( 
 message 
 ) 
 . 
 encode 
 ( 
 'utf-8' 
 ) 
 try 
 : 
  publish 
 
er . 
  publish 
 
 ( 
 topic_path 
 , 
 data 
 ) 
 except 
 Exception 
 : 
 pass 
 # Silently continue on error 
 # Wait 7 seconds before next message 
 time 
 . 
 sleep 
 ( 
 7 
 )

Quick start: Embedding Generation and Ingestion with Default Schema

Create Sample Product Catalog Data

First, we create a sample product catalog with descriptions to be embedded

Create product catalog data with rich descriptions for semantic search

  PRODUCTS_DATA 
 = 
 [ 
 { 
 "id" 
 : 
 "laptop-001" 
 , 
 "name" 
 : 
 "UltraBook Pro X15" 
 , 
 "description" 
 : 
 "Powerful ultralight laptop featuring a 15-inch 4K OLED display, 12th Gen Intel i9 processor, 32GB RAM, and 1TB SSD. Perfect for creative professionals, developers, and power users who need exceptional performance in a slim form factor. Includes Thunderbolt 4 ports, all-day battery life, and advanced cooling system." 
 , 
 "category" 
 : 
 "Electronics" 
 , 
 "subcategory" 
 : 
 "Laptops" 
 , 
 "price" 
 : 
 1899.99 
 , 
 "brand" 
 : 
 "TechMaster" 
 , 
 "features" 
 : 
 [ 
 "4K OLED Display" 
 , 
 "Intel i9" 
 , 
 "32GB RAM" 
 , 
 "1TB SSD" 
 , 
 "Thunderbolt 4" 
 ], 
 "weight" 
 : 
 "3.5 lbs" 
 , 
 "dimensions" 
 : 
 "14.1 x 9.3 x 0.6 inches" 
 }, 
 { 
 "id" 
 : 
 "phone-001" 
 , 
 "name" 
 : 
 "Galaxy Ultra S23" 
 , 
 "description" 
 : 
 "Flagship smartphone with a stunning 6.8-inch Dynamic AMOLED display, 200MP camera system, and 5nm processor. Features 8K video recording, 5G connectivity, and all-day battery life. Water and dust resistant with IP68 rating. Perfect for photography enthusiasts, mobile gamers, and professionals who need reliable performance." 
 , 
 "category" 
 : 
 "Electronics" 
 , 
 "subcategory" 
 : 
 "Smartphones" 
 , 
 "price" 
 : 
 1199.99 
 , 
 "brand" 
 : 
 "Samsung" 
 , 
 "features" 
 : 
 [ 
 "200MP Camera" 
 , 
 "6.8-inch AMOLED" 
 , 
 "5G" 
 , 
 "IP68 Water Resistant" 
 , 
 "8K Video" 
 ], 
 "weight" 
 : 
 "8.2 oz" 
 , 
 "dimensions" 
 : 
 "6.4 x 3.1 x 0.35 inches" 
 }, 
 { 
 "id" 
 : 
 "headphones-001" 
 , 
 "name" 
 : 
 "SoundSphere Pro" 
 , 
 "description" 
 : 
 "Premium wireless noise-cancelling headphones with spatial audio technology and adaptive EQ. Features 40 hours of battery life, memory foam ear cushions, and voice assistant integration. Seamlessly switch between devices with multi-point Bluetooth connectivity. Ideal for audiophiles, frequent travelers, and professionals working in noisy environments." 
 , 
 "category" 
 : 
 "Electronics" 
 , 
 "subcategory" 
 : 
 "Audio" 
 , 
 "price" 
 : 
 349.99 
 , 
 "brand" 
 : 
 "AudioTech" 
 , 
 "features" 
 : 
 [ 
 "Active Noise Cancellation" 
 , 
 "Spatial Audio" 
 , 
 "40hr Battery" 
 , 
 "Bluetooth 5.2" 
 , 
 "Voice Assistant" 
 ], 
 "weight" 
 : 
 "9.8 oz" 
 , 
 "dimensions" 
 : 
 "7.5 x 6.8 x 3.2 inches" 
 }, 
 { 
 "id" 
 : 
 "coffee-001" 
 , 
 "name" 
 : 
 "BrewMaster 5000" 
 , 
 "description" 
 : 
 "Smart coffee maker with precision temperature control, customizable brewing profiles, and app connectivity. Schedule brewing times, adjust strength, and receive maintenance alerts from your smartphone. Features a built-in grinder, 12-cup capacity, and thermal carafe to keep coffee hot for hours. Perfect for coffee enthusiasts and busy professionals." 
 , 
 "category" 
 : 
 "Home & Kitchen" 
 , 
 "subcategory" 
 : 
 "Appliances" 
 , 
 "price" 
 : 
 199.99 
 , 
 "brand" 
 : 
 "HomeBarista" 
 , 
 "features" 
 : 
 [ 
 "Smart App Control" 
 , 
 "Built-in Grinder" 
 , 
 "Thermal Carafe" 
 , 
 "Customizable Brewing" 
 , 
 "12-cup Capacity" 
 ], 
 "weight" 
 : 
 "12.5 lbs" 
 , 
 "dimensions" 
 : 
 "10.5 x 8.2 x 14.3 inches" 
 }, 
 { 
 "id" 
 : 
 "chair-001" 
 , 
 "name" 
 : 
 "ErgoFlex Executive Chair" 
 , 
 "description" 
 : 
 "Ergonomic office chair with dynamic lumbar support, adjustable armrests, and breathable mesh back. Features 5-point adjustability, premium cushioning, and smooth-rolling casters. Designed to reduce back pain and improve posture during long work sessions. Ideal for home offices, professionals, and anyone who sits for extended periods." 
 , 
 "category" 
 : 
 "Furniture" 
 , 
 "subcategory" 
 : 
 "Office Furniture" 
 , 
 "price" 
 : 
 329.99 
 , 
 "brand" 
 : 
 "ComfortDesign" 
 , 
 "features" 
 : 
 [ 
 "Lumbar Support" 
 , 
 "Adjustable Armrests" 
 , 
 "Mesh Back" 
 , 
 "5-point Adjustment" 
 , 
 "Premium Cushioning" 
 ], 
 "weight" 
 : 
 "45 lbs" 
 , 
 "dimensions" 
 : 
 "28 x 25 x 45 inches" 
 }, 
 { 
 "id" 
 : 
 "sneakers-001" 
 , 
 "name" 
 : 
 "CloudStep Running Shoes" 
 , 
 "description" 
 : 
 "Lightweight performance running shoes with responsive cushioning, breathable knit upper, and carbon fiber plate for energy return. Features adaptive arch support, reflective elements for visibility, and durable rubber outsole. Designed for marathon runners, daily joggers, and fitness enthusiasts seeking comfort and performance." 
 , 
 "category" 
 : 
 "Apparel" 
 , 
 "subcategory" 
 : 
 "Footwear" 
 , 
 "price" 
 : 
 159.99 
 , 
 "brand" 
 : 
 "AthleteElite" 
 , 
 "features" 
 : 
 [ 
 "Responsive Cushioning" 
 , 
 "Carbon Fiber Plate" 
 , 
 "Breathable Knit" 
 , 
 "Adaptive Support" 
 , 
 "Reflective Elements" 
 ], 
 "weight" 
 : 
 "8.7 oz" 
 , 
 "dimensions" 
 : 
 "12 x 4.5 x 5 inches" 
 }, 
 { 
 "id" 
 : 
 "blender-001" 
 , 
 "name" 
 : 
 "NutriBlend Pro" 
 , 
 "description" 
 : 
 "High-performance blender with 1200W motor, variable speed control, and pre-programmed settings for smoothies, soups, and nut butters. Features stainless steel blades, 64oz BPA-free container, and pulse function. Includes personal blending cups for on-the-go nutrition. Perfect for health-conscious individuals, busy families, and culinary enthusiasts." 
 , 
 "category" 
 : 
 "Home & Kitchen" 
 , 
 "subcategory" 
 : 
 "Appliances" 
 , 
 "price" 
 : 
 149.99 
 , 
 "brand" 
 : 
 "KitchenPro" 
 , 
 "features" 
 : 
 [ 
 "1200W Motor" 
 , 
 "Variable Speed" 
 , 
 "Pre-programmed Settings" 
 , 
 "64oz Container" 
 , 
 "Personal Cups" 
 ], 
 "weight" 
 : 
 "11.8 lbs" 
 , 
 "dimensions" 
 : 
 "8.5 x 9.5 x 17.5 inches" 
 }, 
 { 
 "id" 
 : 
 "camera-001" 
 , 
 "name" 
 : 
 "ProShot X7 Mirrorless Camera" 
 , 
 "description" 
 : 
 "Professional mirrorless camera with 45MP full-frame sensor, 8K video recording, and advanced autofocus system with subject recognition. Features in-body stabilization, weather sealing, and dual card slots. Includes a versatile 24-105mm lens. Ideal for professional photographers, videographers, and serious enthusiasts seeking exceptional image quality." 
 , 
 "category" 
 : 
 "Electronics" 
 , 
 "subcategory" 
 : 
 "Cameras" 
 , 
 "price" 
 : 
 2499.99 
 , 
 "brand" 
 : 
 "OptiView" 
 , 
 "features" 
 : 
 [ 
 "45MP Sensor" 
 , 
 "8K Video" 
 , 
 "Advanced Autofocus" 
 , 
 "In-body Stabilization" 
 , 
 "Weather Sealed" 
 ], 
 "weight" 
 : 
 "1.6 lbs (body only)" 
 , 
 "dimensions" 
 : 
 "5.4 x 3.8 x 3.2 inches" 
 }, 
 { 
 "id" 
 : 
 "watch-001" 
 , 
 "name" 
 : 
 "FitTrack Ultra Smartwatch" 
 , 
 "description" 
 : 
 "Advanced fitness smartwatch with continuous health monitoring, GPS tracking, and 25-day battery life. Features ECG, blood oxygen monitoring, sleep analysis, and 30+ sport modes. Water-resistant to 50m with a durable sapphire crystal display. Perfect for athletes, fitness enthusiasts, and health-conscious individuals tracking wellness metrics." 
 , 
 "category" 
 : 
 "Electronics" 
 , 
 "subcategory" 
 : 
 "Wearables" 
 , 
 "price" 
 : 
 299.99 
 , 
 "brand" 
 : 
 "FitTech" 
 , 
 "features" 
 : 
 [ 
 "ECG Monitor" 
 , 
 "GPS Tracking" 
 , 
 "25-day Battery" 
 , 
 "Blood Oxygen" 
 , 
 "30+ Sport Modes" 
 ], 
 "weight" 
 : 
 "1.6 oz" 
 , 
 "dimensions" 
 : 
 "1.7 x 1.7 x 0.5 inches" 
 }, 
 { 
 "id" 
 : 
 "backpack-001" 
 , 
 "name" 
 : 
 "Voyager Pro Travel Backpack" 
 , 
 "description" 
 : 
 "Premium travel backpack with anti-theft features, expandable capacity, and dedicated laptop compartment. Features water-resistant materials, hidden pockets, and ergonomic design with padded straps. Includes USB charging port and luggage pass-through. Ideal for business travelers, digital nomads, and adventure seekers needing secure, organized storage." 
 , 
 "category" 
 : 
 "Travel" 
 , 
 "subcategory" 
 : 
 "Luggage" 
 , 
 "price" 
 : 
 129.99 
 , 
 "brand" 
 : 
 "TrekGear" 
 , 
 "features" 
 : 
 [ 
 "Anti-theft Design" 
 , 
 "Expandable" 
 , 
 "Laptop Compartment" 
 , 
 "Water Resistant" 
 , 
 "USB Charging Port" 
 ], 
 "weight" 
 : 
 "2.8 lbs" 
 , 
 "dimensions" 
 : 
 "20 x 12 x 8 inches" 
 } 
 ] 
 print 
 ( 
 f 
 "Created product catalog with 
 { 
 len 
 ( 
 PRODUCTS_DATA 
 ) 
 } 
 products" 
 )

Database Column	Chunk Field	Description
id	chunk.id	Unique identifier
embedding	chunk.embedding.dense_embedding	Vector representation
content	chunk.content.text	Text that was embedded
metadata	chunk.metadata	Additional data as RECORD

Embedding Ingestion and Vector Search with Apache Beam and BigQuery Stay organized with collections Save and categorize content based on your preferences.

Introduction

Example: Product Catalog

Setup and Prerequisites

Install Packages and Dependencies

Authenticate to Google Cloud

Create BigQuery Dataset

Importing Pipeline Components

Define helper functions

Quick start: Embedding Generation and Ingestion with Default Schema

Create Sample Product Catalog Data

Create product catalog data with rich descriptions for semantic search

Create BigQuery Table

Define Pipeline components

Map products to Chunks

Generate embeddings with HuggingFace

Write to BigQuery

Assemble and Run Pipeline

Verify Embeddings

Quick start: Vector Search

Define Sample Queries

Setup PubSub Steaming Source

Define pipeline components

Process PubSub messages

Configure embedding model

Configure vector search

Log the enriched query

Run the Basic Search Pipeline

Advanced: Embedding Generation and Ingestion with Custom Schema

Create Product Dataset with Multiple Items per Category

Create BigQuery Table with Custom Schema

Define Pipeline components

Convert product dictionary

Generate embeddings with HuggingFace

Configure BigQuery Vector Writer

Assemble and Run pipeline

Verify Custom Schema Embeddings

Advanced: Vector Search with Metadata Filter

Sample Queries with Filter Requirements

Create PubSub Topic

Define Pipeline components

Process PubSub messages

Configure embedding model

Configure Vector Search with Metadata Filter

Log the enriched query

Run Vector Search with metadata filter Pipeline

Whats next?

Embedding Ingestion and Vector Search with Apache Beam and BigQuery