Managed migration is an automated feature that helps you migrate data from a
self-managed Hive Metastore to a Dataproc Metastore service, without
any sizable downtime (otherwise known as a flag day).
Managed Migration Architecture
The following diagram provides the high-level architecture of a managed
migration.
Managed migration flow
To complete a managed migration, your service runs through two migration
processes: start migration and complete migration.
You can cancel a migration at any time with the cancel migration process.
There are also operational commands that aren't required to complete a
migration, such as list migrations or delete migrations.
As your service moves through this process, it also moves between various
migration states and migration phases. These states and phases represent the
processes that are occurring in the background. For example, the MIGRATING
state indicates that your service is actively transferring data from your
Cloud SQL database to Dataproc Metastore.
Start migration
Dataproc Metastore establishes a connection with your
private IP Cloud SQL instance. After the connection is made,
Dataproc Metastore uses the Cloud SQL instance as its
Hive Metastore (HMS) backend database. The Cloud SQL instance also remains
the source of truth for your data during the migration. Metadata reads and
writes still occur in Cloud SQL while the migration is active.
A change data capture (CDC) pipeline is started. This pipeline keeps the
Cloud SQL instance in your project and Spanner in the
Dataproc Metastore managed project in sync. This means that
all changes to the HMS database in the Cloud SQL instance are captured
through Datastream and written to the
Dataproc Metastore Spanner database.
Once the start migration process is successful, you can start routing
data workloads to Dataproc Metastore. At this point, Cloud SQL is
still the source of truth for your data.
Complete migration
After you finish moving your workloads to Dataproc Metastore, you
can complete the migration. When a complete migration process is called,
the following occurs:
Dataproc Metastore transitions into a read-only mode until the complete migration process finishes.
The CDC stream transfers all in-flight data to Dataproc Metastore.
Dataproc Metastore connects to Spanner and disconnects
from Cloud SQL. Dataproc Metastore now acts as the source of
truth for your HMS data.
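The start and complete steps above can be sketched as a small in-memory model. This is only an illustration of the described flow, not the Dataproc Metastore API: the class `MigrationModel`, its fields, and its methods are hypothetical names, and plain dictionaries stand in for the Cloud SQL and Spanner databases.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MigrationModel:
    """Toy model of an active managed migration (hypothetical, not a real API)."""
    cloud_sql: dict = field(default_factory=dict)     # stand-in for the source database
    spanner: dict = field(default_factory=dict)       # stand-in for the target database
    cdc_queue: deque = field(default_factory=deque)   # captured changes not yet applied
    backend: str = "cloud_sql"                        # current source of truth
    read_only: bool = False

    def write(self, key, value):
        if self.read_only:
            raise RuntimeError("service is read-only while the migration completes")
        if self.backend == "cloud_sql":
            self.cloud_sql[key] = value
            self.cdc_queue.append((key, value))  # change is captured for replication
        else:
            self.spanner[key] = value

    def drain_cdc(self):
        """Apply every captured change to the target, oldest first."""
        while self.cdc_queue:
            key, value = self.cdc_queue.popleft()
            self.spanner[key] = value

    def complete_migration(self):
        self.read_only = True      # 1. reject writes during cutover
        self.drain_cdc()           # 2. transfer in-flight changes
        self.backend = "spanner"   # 3. switch the source of truth
        self.read_only = False


model = MigrationModel()
model.write("db1.table_a", "v1")   # lands in Cloud SQL, queued for replication
model.write("db1.table_a", "v2")
model.complete_migration()
model.write("db1.table_b", "v1")   # lands in Spanner after cutover
print(model.backend, model.spanner)
```

The read-only window in the sketch exists for the same reason as in the text: no new change can arrive while the last captured changes are applied, so the target is complete when it takes over.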
Proxy and pipeline considerations
Proxies
Dataproc Metastore uses a Cloud SQL Auth proxy chained to a SOCKS5 proxy to
connect to your private IP Cloud SQL instance.
The SOCKS5 proxy servers are exposed through a service attachment, as shown
in the previous architecture diagram.
Each migration requires a dedicated NAT subnet. This is because
a NAT subnet can't have more than one service attachment.
To avoid cross-region latency issues, provide subnets (for example,
proxy_subnet and nat_subnet) that are in the same region as your
Cloud SQL instance to host the SOCKS5 proxy.
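The two subnet rules above can be checked before you start a migration. The following is a minimal sketch under stated assumptions: `MigrationPlan`, its fields, and `check_plans` are hypothetical names for a planning record you maintain yourself, and the subnet and region values are made up. Nothing here calls a Google Cloud API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationPlan:
    """Hypothetical planning record for one managed migration."""
    name: str
    cloud_sql_region: str
    proxy_subnet_region: str
    nat_subnet: str
    nat_subnet_region: str


def check_plans(plans):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    nat_use = Counter(p.nat_subnet for p in plans)
    for p in plans:
        # A NAT subnet can back only one service attachment, so it can't be shared.
        if nat_use[p.nat_subnet] > 1:
            problems.append(f"{p.name}: NAT subnet {p.nat_subnet} is shared")
        # Subnets outside the Cloud SQL region add cross-region latency.
        for label, region in (("proxy", p.proxy_subnet_region),
                              ("NAT", p.nat_subnet_region)):
            if region != p.cloud_sql_region:
                problems.append(
                    f"{p.name}: {label} subnet is in {region}, "
                    f"Cloud SQL is in {p.cloud_sql_region}"
                )
    return problems


plans = [
    MigrationPlan("m1", "us-central1", "us-central1", "nat-a", "us-central1"),
    MigrationPlan("m2", "us-central1", "us-east1", "nat-a", "us-central1"),
]
problems = check_plans(plans)
for problem in problems:
    print(problem)
```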
Change data capture pipeline
The change data capture pipeline uses VPC peering to establish a connection
between Datastream and the private IP Cloud SQL instance.
For each migration, a new private connection is created and a new
peering connection is established.
The VPC network hosting the Cloud SQL instance has as many
peering connections as there are active migrations. Make sure that your
VPC network has the capacity to host all of the necessary peering connections.
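Because each active migration adds one peering connection, the capacity check is simple arithmetic. The sketch below assumes you already know how many peering connections the VPC network has and what its limit is. The function name `peering_headroom` and the numbers in the example are placeholders, not real quota values.

```python
def peering_headroom(existing_peerings, planned_migrations, peering_limit):
    """Peering connections left after every planned migration adds one.

    A negative result means the VPC network can't host all planned migrations.
    """
    return peering_limit - (existing_peerings + planned_migrations)


# Placeholder numbers: 22 peerings in use, 4 migrations planned, limit of 25.
headroom = peering_headroom(existing_peerings=22, planned_migrations=4, peering_limit=25)
if headroom < 0:
    print(f"short by {-headroom} peering connection(s)")
else:
    print(f"{headroom} peering connection(s) to spare")
```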