On this page, you'll find answers to frequently asked questions (FAQ) about using Datastream. The FAQ covers:
- General questions about Datastream and change data capture (CDC)
- General source behavior and limitations
- MySQL source behavior and limitations
- Oracle source behavior and limitations
- PostgreSQL source behavior and limitations (including AlloyDB for PostgreSQL)
- SQL Server source behavior and limitations
- Salesforce source behavior and limitations
- BigQuery destination behavior
- Cloud Storage destination behavior
- Common use cases for using Datastream
- How Datastream integrates with Google Cloud data services
- Security and connectivity
- Monitoring Datastream
- Pricing
- Additional information about Datastream
General questions about Datastream and change data capture (CDC)
Datastream is a serverless and easy-to-use change data capture (CDC) and replication service. Datastream enables seamless replication of data from relational database sources such as AlloyDB for PostgreSQL, PostgreSQL, MySQL, SQL Server and Oracle, as well as from Salesforce and MongoDB, directly into BigQuery. Datastream offers the scale, speed, security, and simplicity that modern businesses need. With a serverless, autoscaling architecture, Datastream lets you set up an ELT (Extract, Load, and Transform) pipeline for low-latency data replication enabling near real-time insights.
Datastream also has the ability to synchronize data streams across heterogeneous databases and applications reliably and with minimal latency. You can integrate the service with Dataflow templates to replicate databases into Cloud SQL or Spanner for database synchronization, or take advantage of the event stream directly from Cloud Storage to realize event-driven architectures.
Datastream streams data from a source to a destination using two methods:
- CDC: Capturing ongoing changes from the source in real time.
- Backfill: Providing a historical snapshot of data that exists in a table.
CDC is an approach to data integration that lets you integrate and analyze data faster, using fewer system resources. It's a method for pulling only the latest changes (updates, inserts, or deletes) from a data source, often by reading the log of changes that the source keeps for its own internal transactional integrity.
CDC is a highly efficient mechanism for limiting the impact on the source when loading new data into operational data stores and data warehouses, and it eliminates the need for bulk-load updating and inconvenient batch windows by enabling incremental loading or near real-time streaming of data changes into a data destination.
CDC can be used in many use cases that derive value from constant access to data changes as they happen, such as analytics, synchronizing data across geographically distributed systems, and event-driven architectures.
In addition to ongoing changes, Datastream also uses backfill to pull all existing data from a source, and then stream the data into a destination. As a result, the destination is "backfilled" with all historical data from the source. There are two types of backfill:
- Incremental: The default type of backfill for tables. During incremental backfill, Datastream fetches data from ranges of rows in multiple batches, and streams the data to a destination batch by batch.
- Full dump: During full dump backfill, Datastream pulls all the data at once, and streams it into a destination.
For information about backfill constraints, see the known limitations pages for the respective source types:
- MySQL
- Oracle
- PostgreSQL (including AlloyDB for PostgreSQL)
- SQL Server
- Salesforce (Preview)
- MongoDB (Preview)
You can monitor the backfill status per object:
- In the Google Cloud console, go to the Streams page.
- Click the stream that you want to monitor.
- Click the Objects tab.
The CDC status is effectively the stream status. If the CDC operation fails, then the entire stream fails.
Additional object-level metrics are available in Cloud Monitoring. For more information, see the Cloud Monitoring documentation .
A change stream is a sequence of events that Datastream emits to a destination for the purposes of downstream consumption. The result of a change stream being written to Cloud Storage is a set of files that contain change log events from a specific table over a period of time. The events represent inserts, updates, and deletes to that table, and the events can be consumed and processed downstream of Cloud Storage by services like Dataflow for use cases like event-driven architectures.
Replication in the Datastream context means a 1:1, up-to-date representation of the source table in the destination. For example, changes to an Oracle table are replicated continuously to a BigQuery table, and the BigQuery table is kept up to date using the changes streamed from the Oracle table. Replication leverages change streams by consuming each change event and using it to update the destination. Datastream enables direct replication into BigQuery, and supports additional destinations like Cloud SQL and Spanner through integration with Dataflow, by leveraging templates that pull the change stream events from Cloud Storage and update the destination tables accordingly.
CDC is a highly efficient mechanism for limiting the impact on the source when new data is loaded into destination data stores and data warehouses. CDC also eliminates the need for bulk-load updating and inconvenient batch windows by enabling incremental loading or near real-time streaming of data changes into a destination.
In addition, Datastream minimizes the impact of an initial backfill by limiting the number of simultaneous backfill tasks, and offering you the control to decide which objects to backfill, and when to backfill them.
To recreate a stream in a different region or project:
- Create a stream in a new region or project with the same configuration as the existing stream, but don't select the Backfill historical data checkbox.
- Start the stream that you created.
- After the stream that you created has a status of RUNNING, pause the existing stream.
- Optionally, modify the new stream by selecting the Backfill historical data checkbox. Existing data in tables added to the stream in the future will be streamed from the source into the destination.
When a transaction is rolled back, the database reflects this in its log files as reverse operations. For example, a rolled-back INSERT operation will have a corresponding DELETE operation. Datastream reads these operations from the log files.

To view a listing of the regions where Datastream is available, see IP allowlists and regions.
General source behavior and limitations
Datastream supports streaming data from Oracle, MySQL, PostgreSQL, AlloyDB for PostgreSQL, SQL Server, Salesforce (Preview), and MongoDB (Preview) sources, both cloud-hosted and self-managed. For information about source-specific versions, see the following pages:
- MySQL source
- Oracle source
- PostgreSQL source (including AlloyDB for PostgreSQL)
- SQL Server source
- Salesforce source (Preview)
- MongoDB source (Preview)
For MySQL, Datastream processes the MySQL binary log to extract change events.
For Oracle, Datastream uses LogMiner and supplemental logging settings to extract data from Oracle's redo logs.
For PostgreSQL and AlloyDB for PostgreSQL, Datastream relies on PostgreSQL's logical decoding feature. Logical decoding exposes all changes committed to the database and allows consuming and processing these changes.
For SQL Server, Datastream tracks data manipulation language (DML) changes using transaction logs .
There are general limitations, and limitations that apply to specific database sources.
General limitations include:
- Event size: Maximum 100 MB row size limit for Cloud Storage destination and 20 MB row size limit for BigQuery.
- Some data definition language (DDL) operations aren't supported during replication, including:
- Dropping a column from the middle of a table. This may cause a data discrepancy because values are associated with the wrong column.
- Changing the data type of a column. This may cause a data discrepancy because data isn't mapped properly to the correct Datastream unified type, and the data may get corrupted.
- Cascading deletes are ignored.
- Table truncation is ignored.
For source-specific limitations, see the following pages:
- MySQL limitations
- Oracle limitations
- PostgreSQL limitations (including AlloyDB for PostgreSQL and AlloyDB Omni)
- SQL Server limitations
- Salesforce limitations
- MongoDB limitations
When writing to a schemaless destination, such as Cloud Storage, Datastream simplifies downstream processing of data across sources by normalizing data types across all sources. Datastream takes the original source data type (for example, a MySQL or PostgreSQL NUMERIC type or an Oracle NUMBER type), and normalizes it into a Datastream unified type.

Unified types represent a lossless superset of all possible source types, and normalization means that data from different sources can be processed with minimum effort and queried downstream in a source-agnostic way.
You can create and start a stream even if some validation checks fail, by using the force=true flag and skipping the validation checks. Note that skipping validations doesn't guarantee that the stream will be able to run, and it may still fail if the database is missing the necessary configuration or permissions.

You can fix a stream with the Failed status by addressing the errors reported by Datastream. For more information, see Troubleshoot a stream.

You can also recover a stream with the Failed permanently status. For more information, see Recover a stream.
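As an illustration of the force=true flag, the following is a minimal sketch of creating a stream through the Datastream REST API while skipping validation; PROJECT_ID, LOCATION, STREAM_ID, and stream.json are placeholders:

```shell
# Sketch only; PROJECT_ID, LOCATION, STREAM_ID, and stream.json are placeholders.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @stream.json \
  "https://datastream.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/streams?streamId=STREAM_ID&force=true"
```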
No. Datastream doesn't need to lock the tables.
MySQL source behavior and limitations
Yes, Datastream supports read replica instances for Cloud SQL for MySQL versions 5.7 and 8.0.
You can enable binary logging on these instances. To do so, execute the following gcloud CLI command:
gcloud sql instances patch INSTANCE_NAME --enable-bin-log
For more information, see Binary logging on read replicas .
You must grant the following permissions to your Datastream user to replicate data from a MySQL database:
- SELECT
- REPLICATION SLAVE
- REPLICATION CLIENT
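For example, a minimal sketch of granting these privileges to a hypothetical user named datastream_user (you can scope the SELECT grant more narrowly, as described later in this section):

```sql
-- Sketch only; datastream_user is a hypothetical user name.
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'datastream_user'@'%';
```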
No, Datastream does not support compressing transaction events with binary log transaction compression. Disable binary log transaction compression to use the change data capture functionality for MySQL in Datastream.
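If you manage the MySQL server yourself, a minimal sketch of disabling binary log transaction compression on MySQL 8.0 looks like the following; on managed services such as Cloud SQL, you would instead change the equivalent database flag, if available:

```sql
-- Sketch only; affects transactions started after the change.
SET GLOBAL binlog_transaction_compression = OFF;
```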
What if your organization doesn't allow granting Datastream the SELECT permission on all its tables and databases? Can you still use Datastream?

You can grant the SELECT permission to the particular tables that you want to replicate or to the entire schema in a database. To do so, run one of the following commands:

GRANT SELECT ON DATABASE_NAME.TABLE_NAME TO 'USER_NAME'@'%';

GRANT SELECT ON DATABASE_NAME.* TO 'USER_NAME'@'%';
Replace the following:
- DATABASE_NAME: The name of your MySQL database.
- TABLE_NAME: The name of the table to which you're granting the SELECT permission.
- USER_NAME: The name of the user to whom you're granting the SELECT permission.
If you want to restrict the permissions to a particular hostname, replace the % character with the specific IP address or address range.
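For example, a sketch that limits the grant to a specific address pattern; 192.0.2.% is a placeholder range:

```sql
-- Sketch only; 192.0.2.% is a placeholder address range.
GRANT SELECT ON DATABASE_NAME.* TO 'USER_NAME'@'192.0.2.%';
```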
Oracle source behavior and limitations
Datastream supports encryption of data in-transit based on Oracle Net Services. Datastream runs in REQUESTED mode.
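Because Datastream connects in REQUESTED mode, whether traffic is actually encrypted depends on the server-side Oracle Net settings. The following is a minimal sketch of the relevant sqlnet.ora parameters; the values shown are illustrative assumptions, not required settings:

```
# Sketch only; illustrative values, not required settings.
SQLNET.ENCRYPTION_SERVER = REQUESTED
SQLNET.ENCRYPTION_TYPES_SERVER = (AES256)
```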
Yes, Datastream supports the TCPS protocol for SSL/TLS-encrypted connections to Oracle sources.
What if your organization doesn't allow granting the GRANT SELECT ANY TABLE permission? Can you still use Datastream?

If you can't grant the GRANT SELECT ANY TABLE permission, then you can grant SELECT permissions to the following list of tables:
- ALL_COL_COMMENTS
- ALL_CONS_COLUMNS
- ALL_CONSTRAINTS
- ALL_DB_LINKS
- ALL_EXTERNAL_TABLES
- ALL_IND_COLUMNS
- ALL_INDEXES
- ALL_LOG_GROUPS
- ALL_MVIEWS
- ALL_OBJECTS
- ALL_PART_TABLES
- ALL_SEQUENCES
- ALL_SOURCE
- ALL_SYNONYMS
- ALL_TAB_COLS
- ALL_TAB_COLUMNS
- ALL_TAB_COMMENTS
- ALL_TABLES
- ALL_TRIGGERS
- ALL_TRIGGER_COLS
- ALL_TYPES
- ALL_USERS
- ALL_VIEWS
- DATABASE_PROPERTIES
- DBA_ROLE_PRIVS
- DUAL
- PRODUCT_COMPONENT_VERSION
- ROLE_SYS_PRIVS
- USER_ROLE_PRIVS
- USER_TAB_PRIVS
To stream changes using Datastream, you also need read access to all tables included in the stream.
To start the stream, bypass the validation using the API. For more information, see Start a stream.
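A minimal sketch of such grants, assuming a hypothetical Datastream user named DATASTREAM_USER and a hypothetical replicated table HR.EMPLOYEES:

```sql
-- Sketch only; DATASTREAM_USER and HR.EMPLOYEES are hypothetical names.
GRANT SELECT ON ALL_TABLES TO DATASTREAM_USER;
GRANT SELECT ON DBA_ROLE_PRIVS TO DATASTREAM_USER;
-- Repeat for the remaining dictionary views listed above, then grant read
-- access to each table included in the stream:
GRANT SELECT ON HR.EMPLOYEES TO DATASTREAM_USER;
```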
PostgreSQL source behavior and limitations
Question | Answer |
---|---|
How does Datastream get data out of PostgreSQL? | Datastream uses PostgreSQL's logical decoding feature to extract the change events from PostgreSQL |
Does Datastream support replication from AlloyDB for PostgreSQL or AlloyDB Omni sources? | Yes. Datastream supports both of these sources. For information about configuring your source AlloyDB for PostgreSQL database, see Configure an AlloyDB for PostgreSQL database . For information about configuring your source AlloyDB Omni database, see Configure a self-managed PostgreSQL database . |
Does Datastream support replication from a PostgreSQL read replica instance? | No, Datastream doesn't support replication from read replica instances for PostgreSQL. |
Does Datastream support Row-Level Security (RLS)? | Datastream doesn't support replication of tables with RLS enabled. However, you can bypass this limitation by using the BYPASSRLS clause: ALTER USER USER_NAME BYPASSRLS; Replace USER_NAME with the name of the user that Datastream uses to connect to the database, and for whom you want to bypass RLS policies. |
Does Datastream support concurrent CDC operations when replicating from a PostgreSQL source? | No. PostgreSQL replication relies on the concept of replication slots, and replication slots don't support concurrent connections. There can only be one CDC task reading from the replication slot at a time. For example, if you delete a large number of records from the source database, the replication slot might get overloaded with delete events. Any subsequent changes to the source are delayed until the delete events already in the replication slot queue are processed. For information about the suggested approach when replicating large transaction data, see Diagnose issues . |
Does Datastream support Secure Sockets Layer (SSL) and Transport Layer Security (TLS) encryption? | Yes, Datastream supports SSL/TLS-encrypted connections. |
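To illustrate the logical decoding dependency, here's a minimal sketch of the kind of publication and logical replication slot a PostgreSQL source typically needs; the publication name, slot name, and table are hypothetical:

```sql
-- Sketch only; publication name, slot name, and table are hypothetical.
CREATE PUBLICATION datastream_publication FOR TABLE public.orders;
SELECT pg_create_logical_replication_slot('datastream_slot', 'pgoutput');
```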
SQL Server source behavior and limitations
You must grant the following roles and permissions to your Datastream user to replicate data from a SQL Server database:
For the change tables CDC method:
- db_owner
- db_denydatawriter

Additionally, for the transaction logs CDC method:

- SELECT permissions on the database and for the sys.fn_dblog function
- SELECT permissions on the dbo.sysjobs table
- VIEW SERVER STATE
For detailed information about configuring your source, see the respective configuration pages for your database type.
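As an example of granting the change tables method roles, here's a minimal T-SQL sketch, assuming a hypothetical database named your_database and a hypothetical database user named datastream_user:

```sql
-- Sketch only; your_database and datastream_user are hypothetical names.
USE your_database;
ALTER ROLE db_owner ADD MEMBER datastream_user;
ALTER ROLE db_denydatawriter ADD MEMBER datastream_user;
```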
You can select one of the two available CDC methods, based on your source SQL Server database configuration:
- Transaction logs: select this method to process changes directly from the database logs. This method provides the best performance and is more efficient, but requires additional configuration steps which may not be supported by your database.
- Change tables: select this method to process changes from dedicated change tables. This method is easier to configure and has fewer limitations, but supports lower throughput and increases the load on your database.
For more information, see Overview of SQL Server as a source .
Salesforce source behavior and limitations
Question | Answer |
---|---|
What's a Salesforce org? | A Salesforce organization, also referred to as an org , is your dedicated deployment of Salesforce, similar to a database instance. An org has a defined set of licensed users and stores all customer data and applications. |
What are Salesforce objects, records and fields? | An object is a class entity, similar to a table in relational database nomenclature. Objects can be standard or custom: standard objects are included in Salesforce by default, while custom objects are created by users to store data specific to their use case. A record is a specific instance of an object, similar to a row in relational databases. A field is a specific attribute of an object, similar to a column in relational databases. |
BigQuery destination behavior
Datastream supports two write modes for BigQuery:
- In the Append-only mode, data is appended in BigQuery as a stream of changes, with a separate row for each INSERT, UPDATE-INSERT, UPDATE-DELETE, and DELETE event.
- In the Merge mode (default), when Datastream creates a table in BigQuery, it sets the primary keys on the table according to the primary keys defined in the source. BigQuery processes the events and applies changes to the underlying BigQuery table based on the source table primary keys. If the source table doesn't have a primary key, it's treated as if the Append-only mode was used.
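The write mode is chosen when you configure the stream's BigQuery destination. As a rough sketch only, selecting Append-only mode in a stream definition might look like the following; the field names are assumptions about the Datastream API's destination configuration, so verify them against the API reference:

```yaml
# Sketch only; field names are assumptions — check the Datastream API reference.
bigqueryDestinationConfig:
  appendOnly: {}   # omit and use "merge: {}" for the default Merge mode
```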
When Datastream replicates to BigQuery, the source schema name and table name are concatenated into a single table name, separated by an underscore (_). Can you change this behavior so that the tables in BigQuery follow the same structure as they have on the source database?

No. The destination table name combines the source schema name and table name, separated by an underscore (_), as per the table naming convention in BigQuery.

Columns that no longer exist on the source are populated with NULL values in the destination. You can then manually delete these columns in BigQuery.

Cloud Storage destination behavior
Each event contains several metadata fields that uniquely identify the row in the log files, and let you sort the events. These fields include:
For Oracle sources:
- rs_id (record set ID), which is actually composed of three values (for example, 0x0073c9.000a4e4c.01d0). The rs_id uniquely identifies the record within the redo log.
- ssn (SQL sequence number), which is used when the record in the redo log is too long. Because of this, Oracle LogMiner splits the record into multiple rows.
For MySQL sources:
- log_file, from which Datastream pulls events in CDC replication.
- log_position, which identifies the offset of the record in the MySQL binary log.
For PostgreSQL sources:
- source_timestamp, which is the timestamp when the record changed on the source. If the source_timestamp values for the events you're sorting are identical, then compare lsn values.
- lsn (log sequence number), which identifies the offset of the record in the WAL log file.
For more information about metadata fields, see Source-specific metadata .
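As a rough illustration of the sorting rule for PostgreSQL sources, the following sketch orders events that were loaded from Cloud Storage into a staging table; the table and column names are hypothetical:

```sql
-- Sketch only; staging.change_events and its columns are hypothetical.
SELECT *
FROM staging.change_events
ORDER BY source_timestamp, lsn;
```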
When a row is updated on the source, Datastream can generate two events: an UPDATE-DELETE and an UPDATE-INSERT. The UPDATE-DELETE event represents the data before the update, and the UPDATE-INSERT event represents the data after the update. For more information about source-specific metadata, see Events and streams.

Use cases
Datastream is a CDC and replication service, which means it's flexible across various use cases that can benefit from access to continuously streaming change data. The most common use cases for Datastream are:
- Analytics: By directly replicating data into BigQuery, customers can access up-to-date operational data in BigQuery. Customers can use this continuously updated data in BigQuery to build up-to-date dashboards over their data. This can be used, for example, for monitoring systems and for deriving up-to-date insights about the state of their business.
- Database replication and synchronization scenarios: By integrating Datastream with Dataflow templates for loading data into Cloud SQL or Spanner, you can get up-to-date replication of your source data in these databases. You can use this continuously updated data in the destination databases for a low downtime database migration from the source to the destination, or for hybrid-cloud configurations , where the source and destination reside in different hosting environments.
- Event-driven architectures: Modern microservices-based architectures rely on central hubs of data that are updated with events continuously from across your organization to be event-driven. By continuously writing event data into destinations such as BigQuery and Cloud Storage, you can build event-driven architectures that are based on consumption of event data from these destinations.
Integrations
Datastream complements and enhances the Google Cloud data suite by providing CDC data replication from sources to various Google Cloud services. By integrating seamlessly with these services, Datastream fits into the larger Google Cloud ecosystem.
Datastream integrates with the following data services:
- BigQuery: Datastream uses the BigQuery Write API to integrate with BigQuery directly. Datastream writes the change events directly to the BigQuery dataset where the changes are merged continuously (UPSERT) with the existing data.
- Dataflow: Datastream can integrate with Dataflow using Dataflow templates. These templates read data from Cloud Storage and load that data into BigQuery, Cloud SQL for PostgreSQL, or Spanner. The purpose of these templates is to maintain up-to-date replicated source tables in the destinations. The templates are available in the Dataflow UI, and are built for Datastream-generated files for out-of-the-box processing.
- Cloud Storage: Datastream integrates with Cloud Storage by writing to it as a change streaming destination.
- Cloud SQL and Spanner: By using Datastream and Dataflow templates, you can maintain up-to-date replicated tables in these databases.
Security and connectivity
There are three types of connectivity methods that you can configure:
- IP allowlist: This method provides you with public connectivity by allowlisting Datastream's regional IP addresses on the source. When you create your stream, Datastream displays these IP addresses in the UI.
- Forward SSH tunnel: This method provides you with secure connectivity over public networks by leveraging a customer-configured SSH bastion in addition to allowlisting regional IP addresses.
- Private connectivity over VPC peering: Use this method to connect to your Google Cloud-hosted databases over Google Cloud's internal network, or take advantage of an existing VPN or Interconnect connection by establishing VPC peering between Datastream's private network and your organization's Google Cloud VPC.
Datastream lets you specify which data elements (schemas, tables, and columns) of your source you want to stream into a destination, and which elements you want to exclude from being streamed.
Database logs may contain change data from elements that were excluded in your stream definition. Because you can't filter these elements at the source, Datastream will read, but ignore, any data associated with the elements.
Do you need to grant the roles/compute.networkAdmin role to the Datastream service account to create a private connectivity configuration on a shared VPC?

The networkAdmin role is required only to create the VPC peering. After the peering is established, you no longer need the role. If your organization doesn't permit granting the networkAdmin role to the Datastream service account, then create a custom role with the following specific permissions:

Select IP ranges dynamically:
- compute.routes.get
- compute.routes.list
- compute.subnetworks.get
- compute.subnetworks.list
Create peered networks:
- compute.globalOperations.get
- compute.networks.addPeering
- compute.networks.removePeering
- compute.networks.get
Reserve IP addresses:
- compute.globalAddresses.get
- compute.globalAddresses.create
- compute.globalAddresses.createInternal
- compute.globalAddresses.delete
- compute.globalAddresses.deleteInternal
- compute.networks.use
- compute.networks.listPeeringRoutes
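For example, a minimal sketch of creating such a custom role with the gcloud CLI; the role ID datastreamPrivateConnectivity and PROJECT_ID are placeholders:

```shell
# Sketch only; datastreamPrivateConnectivity and PROJECT_ID are placeholders.
gcloud iam roles create datastreamPrivateConnectivity \
  --project=PROJECT_ID \
  --title="Datastream private connectivity" \
  --permissions=compute.routes.get,compute.routes.list,compute.subnetworks.get,compute.subnetworks.list,compute.globalOperations.get,compute.networks.addPeering,compute.networks.removePeering,compute.networks.get,compute.globalAddresses.get,compute.globalAddresses.create,compute.globalAddresses.createInternal,compute.globalAddresses.delete,compute.globalAddresses.deleteInternal,compute.networks.use,compute.networks.listPeeringRoutes
```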
Monitor Datastream
Question | Answer |
---|---|
How do I know when all of my historical data has been copied to the destination? | Datastream provides information about its current status in the log files. A log entry is created to indicate when a table is done backfilling. |
Latency occasionally goes up and then gradually decreases over time. Is this to be expected? | Datastream scales up automatically when event throughput generated by the source increases. However, to protect both the source database as well as Datastream, there are limits to how many concurrent connections and processes Datastream can open at any time. Temporary spikes in latency are expected when there's a significant increase in event throughput, and are expected to decrease gradually as the backlog of events is processed. |
Pricing
Question | Answer |
---|---|
How is Datastream priced? | Datastream is priced based on the volume (GB) of data processed from your source into a destination. To learn more about pricing information for Datastream, see Pricing . |
How do you calculate the size of the data? | Billing is calculated based on the size of the data that's processed by Datastream. Datastream charges only for data that's streamed into the destination. |
If you use Datastream with BigQuery or Dataflow, then what do you pay for? | Each service is priced and charged for separately. |
Additional information
Question | Answer |
---|---|
What if I have additional questions or issues with using Datastream? | Google's support team can provide support if you're having issues with using Datastream. In addition, the troubleshooting guide handles common issues that you may face while using Datastream. |
What's next
- Find out more about Datastream .
- Learn about key concepts and features of Datastream.
- Learn about sources that Datastream supports.
- Learn about destinations that Datastream supports.