This tutorial describes how to migrate data from a third-party vector database
to AlloyDB for PostgreSQL using a LangChain VectorStore.
This tutorial assumes that the data in the third-party vector database was created
using a LangChain VectorStore integration. If you put information into one of
the following databases without using LangChain, you might need to edit the
scripts provided below to match your data's schema.
The following vector databases are supported:
Pinecone
Weaviate
Chroma
Qdrant
Milvus
To generate a cost estimate based on your projected usage,
use the pricing calculator.
New Google Cloud users might be eligible for a free trial.
When you finish the tasks that are described in this document, you can avoid
continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
Make sure that you have one of the following LangChain third-party
database vector stores:
Pinecone
Weaviate
Chroma
Qdrant
Milvus
In the Confirm project step, click Next to confirm the name of the project you are going to make changes to.
In the Enable APIs step, click Enable to enable the following:
AlloyDB API
Compute Engine API
Service Networking API
Required roles
To get the permissions that you need to complete the tasks in this tutorial,
you must have the following Identity and Access Management (IAM) roles, which allow for
table creation and data insertion:
Owner (roles/owner) or Editor (roles/editor)
If the user is not an owner or editor, the following IAM roles
and PostgreSQL privileges are required:
import weaviate

# For a locally running weaviate instance, use `weaviate.connect_to_local()`
weaviate_client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_cluster_url,
    auth_credentials=weaviate.auth.AuthApiKey(weaviate_api_key),
)
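The Pinecone snippets that follow read from a pinecone_index handle that isn't constructed in this section. A minimal sketch of creating one, assuming the Pinecone Python SDK (v3 or later) and that pinecone_api_key and pinecone_index_name were defined during setup:

from pinecone import Pinecone

# pinecone_api_key and pinecone_index_name are assumed to be defined earlier.
pinecone_client = Pinecone(api_key=pinecone_api_key)
pinecone_index = pinecone_client.Index(pinecone_index_name)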
Pinecone
results = pinecone_index.list_paginated(
    prefix="", namespace=pinecone_namespace, limit=pinecone_batch_size
)
ids = [v.id for v in results.vectors]
if ids:  # Prevents yielding an empty list.
    yield ids

# Check BOTH pagination and pagination.next
while results.pagination is not None and results.pagination.get("next") is not None:
    pagination_token = results.pagination.get("next")
    results = pinecone_index.list_paginated(
        prefix="",
        pagination_token=pagination_token,
        namespace=pinecone_namespace,
        limit=pinecone_batch_size,
    )

    # Extract and yield the next batch of IDs
    ids = [v.id for v in results.vectors]
    if ids:  # Prevents yielding an empty list.
        yield ids
Then, fetch records by ID from the Pinecone index:
import uuid

# Iterate through the IDs and download their contents
for ids in id_iterator:
    all_data = pinecone_index.fetch(ids=ids, namespace=pinecone_namespace)
    ids = []
    embeddings = []
    contents = []
    metadatas = []

    # Process each vector in the current batch
    for doc in all_data["vectors"].values():
        # You might need to update this data translation logic according to one or more of your field names
        if pinecone_id_column_name in doc:
            # pinecone_id_column_name stores the unique identifier for the content
            ids.append(doc[pinecone_id_column_name])
        else:
            # Generate a uuid if pinecone_id_column_name is missing in source
            ids.append(str(uuid.uuid4()))

        # values is the vector embedding of the content
        embeddings.append(doc["values"])

        # Check if pinecone_content_column_name exists in metadata before accessing
        if pinecone_content_column_name in doc.metadata:
            # pinecone_content_column_name stores the content which was encoded
            contents.append(str(doc.metadata[pinecone_content_column_name]))
            # Remove pinecone_content_column_name after processing
            del doc.metadata[pinecone_content_column_name]
        else:
            # Handle the missing pinecone_content_column_name field appropriately
            contents.append("")

        # metadata is the additional context
        metadatas.append(doc["metadata"])

    # Yield the current batch of results
    yield ids, contents, embeddings, metadatas
Weaviate
from typing import Any

# Iterate through the IDs and download their contents
weaviate_collection = weaviate_client.collections.get(weaviate_collection_name)
ids: list[str] = []
content: list[Any] = []
embeddings: list[list[float]] = []
metadatas: list[Any] = []

for item in weaviate_collection.iterator(include_vector=True):
    # You might need to update this data translation logic according to one or more of your field names
    # uuid is the unique identifier for the content
    ids.append(str(item.uuid))
    # weaviate_text_key is the content which was encoded
    content.append(item.properties[weaviate_text_key])
    # vector is the vector embedding of the content
    embeddings.append(item.vector["default"])  # type: ignore
    del item.properties[weaviate_text_key]  # type: ignore
    # properties is the additional context
    metadatas.append(item.properties)

    if len(ids) >= weaviate_batch_size:
        # Yield the current batch of results
        yield ids, content, embeddings, metadatas
        # Reset lists to start a new batch
        ids = []
        content = []
        embeddings = []
        metadatas = []

# Yield any remaining items in the final partial batch
if ids:
    yield ids, content, embeddings, metadatas
Chroma
# Iterate through the IDs and download their contents
offset = 0
while True:
    # You might need to update this data translation logic according to one or more of your field names
    # documents is the content which was encoded
    # embeddings is the vector embedding of the content
    # metadatas is the additional context
    docs = chromadb_client.get(
        include=["metadatas", "documents", "embeddings"],
        limit=chromadb_batch_size,
        offset=offset,
    )

    if len(docs["documents"]) == 0:
        break

    # ids is the unique identifier for the content
    yield docs["ids"], docs["documents"], docs["embeddings"].tolist(), docs["metadatas"]

    offset += chromadb_batch_size
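The loop above calls get() on chromadb_client, so that name must refer to a Chroma collection handle rather than a bare client. A minimal sketch of obtaining one, assuming a persistent local store at a hypothetical chromadb_path and a collection named chromadb_collection_name:

import chromadb

# chromadb_path and chromadb_collection_name are placeholders; substitute your own values.
chromadb_persistent_client = chromadb.PersistentClient(path=chromadb_path)
chromadb_client = chromadb_persistent_client.get_collection(name=chromadb_collection_name)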
Qdrant
from typing import Any, List

# Iterate through the IDs and download their contents
offset = None
while True:
    docs, offset = qdrant_client.scroll(
        collection_name=qdrant_collection_name,
        with_vectors=True,
        limit=qdrant_batch_size,
        offset=offset,
        with_payload=True,
    )

    ids: List[str] = []
    contents: List[Any] = []
    embeddings: List[List[float]] = []
    metadatas: List[Any] = []

    for doc in docs:
        if doc.payload and doc.vector:
            # You might need to update this data translation logic according to one or more of your field names
            # id is the unique identifier for the content
            ids.append(str(doc.id))
            # page_content is the content which was encoded
            contents.append(doc.payload["page_content"])
            # vector is the vector embedding of the content
            embeddings.append(doc.vector)  # type: ignore
            # metadata is the additional context
            metadatas.append(doc.payload["metadata"])

    yield ids, contents, embeddings, metadatas

    if not offset:
        break
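For completeness, a minimal sketch of the qdrant_client used above, assuming a remote Qdrant deployment; qdrant_url and qdrant_api_key are placeholders for your own deployment details. For a local file-based store, QdrantClient(path=...) works as well.

from qdrant_client import QdrantClient

# qdrant_url and qdrant_api_key are placeholders for your deployment details.
qdrant_client = QdrantClient(url=qdrant_url, api_key=qdrant_api_key)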
Milvus
# Iterate through the IDs and download their contents
iterator = milvus_client.query_iterator(
    collection_name=milvus_collection_name,
    filter='pk >= "0"',
    output_fields=["pk", "text", "vector", "idv"],
    batch_size=milvus_batch_size,
)

while True:
    ids = []
    content = []
    embeddings = []
    metadatas = []
    page = iterator.next()

    if len(page) == 0:
        iterator.close()
        break

    for i in range(len(page)):
        # You might need to update this data translation logic according to one or more of your field names
        doc = page[i]
        # pk is the unique identifier for the content
        ids.append(doc["pk"])
        # text is the content which was encoded
        content.append(doc["text"])
        # vector is the vector embedding of the content
        embeddings.append(doc["vector"])
        del doc["pk"]
        del doc["text"]
        del doc["vector"]
        # doc is the additional context
        metadatas.append(doc)

    yield ids, content, embeddings, metadatas
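Similarly, a minimal sketch of the milvus_client used above, assuming pymilvus 2.4 or later (which provides MilvusClient and query_iterator) and a hypothetical milvus_uri endpoint:

from pymilvus import MilvusClient

# milvus_uri is a placeholder for your Milvus or Zilliz Cloud endpoint.
milvus_client = MilvusClient(uri=milvus_uri)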
Initialize the AlloyDB table
Define the embedding service.
The VectorStore interface requires an embedding service. This workflow doesn't
generate new embeddings, so the FakeEmbeddings class is used to avoid any
costs.
Pinecone
# The VectorStore interface requires an embedding service. This workflow does not
# generate new embeddings, therefore FakeEmbeddings class is used to avoid any costs.
from langchain_core.embeddings import FakeEmbeddings

embeddings_service = FakeEmbeddings(size=vector_size)
Weaviate
# The VectorStore interface requires an embedding service. This workflow does not
# generate new embeddings, therefore FakeEmbeddings class is used to avoid any costs.
from langchain_core.embeddings import FakeEmbeddings

embeddings_service = FakeEmbeddings(size=vector_size)
Chroma
# The VectorStore interface requires an embedding service. This workflow does not
# generate new embeddings, therefore FakeEmbeddings class is used to avoid any costs.
from langchain_core.embeddings import FakeEmbeddings

embeddings_service = FakeEmbeddings(size=vector_size)
Qdrant
# The VectorStore interface requires an embedding service. This workflow does not
# generate new embeddings, therefore FakeEmbeddings class is used to avoid any costs.
from langchain_core.embeddings import FakeEmbeddings

embeddings_service = FakeEmbeddings(size=vector_size)
Milvus
# The VectorStore interface requires an embedding service. This workflow does not
# generate new embeddings, therefore FakeEmbeddings class is used to avoid any costs.
from langchain_core.embeddings import FakeEmbeddings

embeddings_service = FakeEmbeddings(size=vector_size)
Connect to the AlloyDB for PostgreSQL database by creating an AlloyDBEngine:
from langchain_google_alloydb_pg import AlloyDBEngine, IPTypes

alloydb_engine = await AlloyDBEngine.afrom_instance(
    project_id=project_id,
    region=region,
    cluster=cluster,
    instance=instance,
    database=db_name,
    user=db_user,
    password=db_pwd,
    ip_type=IPTypes.PUBLIC,  # Optionally use IPTypes.PRIVATE
)
Create a table to copy data into, if it doesn't already exist.
Pinecone
from langchain_google_alloydb_pg import Column

await alloydb_engine.ainit_vectorstore_table(
    table_name=alloydb_table,
    vector_size=vector_size,
    # Customize the ID column types if not using the UUID data type
    # id_column=Column("langchain_id", "TEXT"),  # Default is Column("langchain_id", "UUID")
    # overwrite_existing=True,  # Drop the old table and create a new vector store table
)
Weaviate
await alloydb_engine.ainit_vectorstore_table(
    table_name=alloydb_table,
    vector_size=vector_size,
    # Customize the ID column types with `id_column` if not using the UUID data type
)
Chroma
await alloydb_engine.ainit_vectorstore_table(
    table_name=alloydb_table,
    vector_size=vector_size,
    # Customize the ID column types with `id_column` if not using the UUID data type
)
Qdrant
await alloydb_engine.ainit_vectorstore_table(
    table_name=alloydb_table,
    vector_size=vector_size,
    # Customize the ID column types with `id_column` if not using the UUID data type
)
Milvus
await alloydb_engine.ainit_vectorstore_table(
    table_name=alloydb_table,
    vector_size=vector_size,
    # Customize the ID column types with `id_column` if not using the UUID data type
)
Initialize a vector store object
This code adds additional vector embedding metadata to the langchain_metadata column in JSON format.
To make filtering more efficient, organize this metadata into separate columns.
For more information, see Create a custom Vector Store.
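As an illustration, a hedged sketch of promoting one metadata field to its own filterable column at table-creation time, using the metadata_columns parameter of ainit_vectorstore_table; the "author" field name here is a hypothetical example, not something your source data necessarily contains:

from langchain_google_alloydb_pg import Column

await alloydb_engine.ainit_vectorstore_table(
    table_name=alloydb_table,
    vector_size=vector_size,
    # Hypothetical example: store the "author" metadata field in its own
    # filterable column instead of packing it into langchain_metadata.
    # If the table already exists, pass overwrite_existing=True to recreate it.
    metadata_columns=[Column("author", "TEXT")],
)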
To initialize a vector store object, run the following command:
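A minimal sketch, reusing the alloydb_engine, embeddings_service, and alloydb_table values defined in the earlier steps:

from langchain_google_alloydb_pg import AlloyDBVectorStore

vector_store = await AlloyDBVectorStore.create(
    engine=alloydb_engine,
    embedding_service=embeddings_service,
    table_name=alloydb_table,
)

With the vector store in place, you can copy each exported batch into AlloyDB. In this hedged sketch, batch_iterator stands in for whichever export generator you built above (Pinecone, Weaviate, Chroma, Qdrant, or Milvus), and aadd_embeddings writes the precomputed embeddings without re-encoding them:

# batch_iterator is a placeholder for the generator defined in the export section.
for ids, contents, embeddings, metadatas in batch_iterator:
    await vector_store.aadd_embeddings(
        texts=contents,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids,
    )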
To avoid incurring charges to your Google Cloud account for the resources used in this
tutorial, either delete the project that contains the resources, or keep the project and
delete the individual resources.
In the Google Cloud console, go to the Clusters page.