Events and streams

The data hierarchy in Datastream is:

  • A stream, which is comprised of a data source and a destination.
  • An object, which is a portion of a stream, such as a table from a specific database.
  • An event, which is a single change generated by a specific object, such as a database insert.

Streams, objects, and events have data and metadata associated with them. This data and metadata can be used for different purposes.

About events

Each event consists of three types of data:

  • Event data: This represents the change to the data itself from the object originating from the stream source. Every event contains the entirety of the row that changed.
  • Generic metadata: This metadata appears on every event generated by Datastream which is used for actions, such as removing duplicate data in the destination.
  • Source-specific metadata: This metadata appears on every event generated by a specific stream source. This metadata varies by source.

Event data

Event data is the payload of every change from a given object originating from a stream source.

Events are in either the Avro or JSON format.

When working with the Avro format, for each column, the event contains the column index and value. Using the column index, the column name and unified type can be retrieved from the schema in the Avro header.

When working with the JSON format, for each column, the event contains the column name and value.

Event metadata can be used to collect information about the event's origin, as well as to remove duplicate data in the destination and order events by the downstream consumer.

The following tables list and describe the fields and data types for generic and source-specific event metadata.

This metadata is consistent across streams of all types.

Field
Avro type
JSON type
Description
stream_name
string
string
The unique stream name as defined at creation time.
read_method
string
string

Indicates if the data was read from the source using a change data capture (CDC) method, as part of historical backfill, or as part of a supplementation task that's created when a transaction is rolled back during CDC replication.

Possible values include:

  • oracle-cdc-logminer
  • oracle-backfill
  • oracle-supplementation
  • mysql-cdc-binlog
  • mysql-backfill-incremental
  • mysql-backfill-fulldump
  • postgres-cdc-wal
  • postgresql-backfill
  • salesforce-cdc
  • salesforce-backfill
object
string
string
The name used to group different types of events, typically the name of the table or object in the source.
schema_key
string
string
The unique identifier for the unified schema of the event.
uuid
string
string
A unique identifier for the event generated by Datastream.
read_timestamp
string
The timestamp (UTC) when the record was read by Datastream (the epoch timestamp in milliseconds).
source_timestamp
string
The timestamp (UTC) when the record changed on the source (the epoch timestamp in milliseconds).
sort_keys
{"type": "array", "items": ["string", "long"]}
array
An array of values that can be used to sort events in the order in which they occurred.

This metadata is associated with CDC and backfill events from a source database. To view this metadata, select a source from the drop-down menu that follows.

Source
Field
Avro type
JSON type
Description
MySQL
log_file
string
string
The log file that Datastream pulls events from in CDC replication.
MySQL
log_position
long
long
The log position (offset) in the MySQL binary log.
MySQL
primary_keys
string array
string array
The list of (one or more) column names that make up the tables primary key. If the table doesn't have a primary key, then this field is empty.
MySQL
is_deleted
boolean
boolean
  • A true value indicates that the row was deleted in the source.
  • A false value signifies that the row wasn't deleted.
MySQL
database
string
string
The database associated with the event.
MySQL
table
string
string
The table associated with the event.
MySQL
change_type
string
string

The type of change ( INSERT , UPDATE-INSERT , UPDATE-DELETE , and DELETE ) that the event represents.

Oracle
log_file
string
string
The log file that Datastream pulls events from in CDC replication.
Oracle
scn
long
long
The log position (offset) in the Oracle transaction log.
Oracle
row_id
string
string
Oracle's row_id .
Oracle
is_deleted
boolean
boolean
  • A true value indicates that the row was deleted in the source.
  • A false value signifies that the row wasn't deleted.
Oracle
database
string
string
The database associated with the event.
Oracle
schema
string
string
The schema associated with the table from the event.
Oracle
table
string
string
The table associated with the event.
Oracle
change_type
string
string

The type of change ( INSERT , UPDATE-INSERT , UPDATE-DELETE , and DELETE ) that the event represents.

Oracle
tx_id
string
string
The transaction ID to which the event belongs.
Oracle
rs_id
string
string
The record set ID. The coupling of rs_id and ssn uniquely identifies a row in V$LOGMNR_CONTENTS . rs_id uniquely identifies the redo record that generated the row.
Oracle
ssn
long
long
A SQL sequence number. This number is used with rs_id and uniquely identifies a row in V$LOGMNR_CONTENTS .
PostgreSQL
schema
string
string
The schema associated with the table from the event.
PostgreSQL
table
string
string
The table associated with the event.
PostgreSQL
is_deleted
boolean
boolean
  • A true value indicates that the row was deleted in the source.
  • A false value signifies that the row wasn't deleted.
PostgreSQL
change_type
string
string
The type of change ( INSERT , UPDATE , DELETE ) that the event represents.
PostgreSQL
tx_id
string
string
The transaction ID to which the event belongs.
PostgreSQL
lsn
string
string
The log sequence number for the current entry.
PostgreSQL
primary_keys
string array
string array
The list of (one or more) column names that make up the tables primary key. If the table doesn't have a primary key, then this field is empty.
SQL Server
table
string
string
The table associated with the event.
SQL Server
database
long
long
The database associated with the event.
SQL Server
schema
string array
string array
The schema associated with the table from the event.
SQL Server
is_deleted
boolean
boolean
  • A true value indicates that the row was deleted in the source.
  • A false value signifies that the row wasn't deleted.
SQL Server
lsn
string
string
The log sequence number for the event.
SQL Server
tx_id
string
string
The transaction ID to which the event belongs.
SQL Server
physical_location
integer array
integer array
The physical location of the log record described by three integers: file ID, page ID and slot ID of the record.
SQL Server
replication_index
string array
string array
The list of column names of an index that can uniquely identify a row in the table.
SQL Server
change_type
string
string

The type of change ( INSERT , UPDATE , DELETE ) that the event represents.

Salesforce
object_name
string
string

The name of the Salesforce object associated with the event.

Salesforce
domain
string
string

The name of the domain associated with the event.

Salesforce
is_deleted
boolean
boolean
  • A true value indicates that the row was deleted in the source.
  • A false value signifies that the row wasn't deleted.
Salesforce
change_type
string
string

The type of change ( INSERT , UPDATE , DELETE ) that the event represents.

Salesforce
primary_keys
string array
string array
The list of column names that make up the tables primary key. If the table doesn't have a primary key, then this field is empty.
MongoDB
database
string
string
The database associated with the event.
MongoDB
collection
string
string
The collection associated with the event. Collections are analogous to tables in relational databases.
MongoDB
change_type
string
string
The type of change ( CREATE , UPDATE , and DELETE ) that the event represents.
MongoDB
is_deleted
boolean
boolean
  • A true value indicates that the row was deleted in the source.
  • A false value signifies that the row wasn't deleted.
MongoDB
primary_keys
string array
string array
The _id field which acts as the primary key for each document in a collection.

Example of an event flow

This flow illustrates the events generated by three consecutive operations: INSERT , UPDATE , and DELETE , on a single row in a SAMPLE table for a source database.

TIME THIS_IS_MY_PK (int) FIELD1 (nchar nullable) FIELD2 (nchar non-null)>
0
1231535353 foo TLV
1
1231535353 NULL TLV

INSERT (T0)

The message payload consists of the entirety of the new row.

  { 
  
 "stream_name" 
 : 
  
 "projects/myProj/locations/myLoc/streams/Oracle-to-Source" 
 , 
  
 "read_method" 
 : 
  
 "oracle-cdc-logminer" 
 , 
  
 "object" 
 : 
  
 "SAMPLE.TBL" 
 , 
  
 "uuid" 
 : 
  
 "d7989206-380f-0e81-8056-240501101100" 
 , 
  
 "read_timestamp" 
 : 
  
 "2019-11-07T07:37:16.808Z" 
 , 
  
 "source_timestamp" 
 : 
  
 "2019-11-07T02:15:39" 
 , 
  
  
 "source_metadata" 
 : 
  
 { 
  
 "log_file" 
 : 
  
 "" 
  
 "scn" 
 : 
  
 15869116216871 
 , 
  
 "row_id" 
 : 
  
 "AAAPwRAALAAMzMBABD" 
 , 
  
 "is_deleted" 
 : 
  
 false 
 , 
  
 "database" 
 : 
  
 "DB1" 
 , 
  
 "schema" 
 : 
  
 "ROOT" 
 , 
  
 "table" 
 : 
  
 "SAMPLE" 
  
 "change_type" 
 : 
  
 "INSERT" 
 , 
  
 "tx_id" 
 : 
  
  
 "rs_id" 
 : 
  
 "0x0073c9.000a4e4c.01d0" 
 , 
  
 "ssn" 
 : 
  
 67 
 , 
  
 }, 
  
 "payload" 
 : 
  
 { 
  
 "THIS_IS_MY_PK" 
 : 
  
 "1231535353" 
 , 
  
 "FIELD1" 
 : 
  
 "foo" 
 , 
  
 "FIELD2" 
 : 
  
 "TLV" 
 , 
  
 } 
 } 
 

UPDATE (T1)

The message payload consists of the entirety of the new row. It doesn't include previous values.

  { 
  
 "stream_name" 
 : 
  
 "projects/myProj/locations/myLoc/streams/Oracle-to-Source" 
 , 
  
 "read_method" 
 : 
  
 "oracle-cdc-logminer" 
 , 
  
 "object" 
 : 
  
 "SAMPLE.TBL" 
 , 
  
 "uuid" 
 : 
  
 "e6067366-1efc-0a10-a084-0d8701101101" 
 , 
  
 "read_timestamp" 
 : 
  
 "2019-11-07T07:37:18.808Z" 
 , 
  
 "source_timestamp" 
 : 
  
 "2019-11-07T02:17:39" 
 , 
  
  
 "source_metadata" 
 : 
  
 { 
  
 "log_file" 
 : 
  
  
 "scn" 
 : 
  
 15869150473224 
 , 
  
 "row_id" 
 : 
  
 "AAAGYPAATAAPIC5AAB" 
 , 
  
 "is_deleted" 
 : 
  
 false 
 , 
  
 "database" 
 : 
  
 "schema" 
 : 
  
 "ROOT" 
 , 
  
 "table" 
 : 
  
 "SAMPLE" 
  
 "change_type" 
 : 
  
 "UPDATE" 
 , 
  
 "tx_id" 
 : 
  
 "rs_id" 
 : 
  
 "0x006cf4.00056b26.0010" 
 , 
  
 "ssn" 
 : 
  
 0 
 , 
  
 }, 
  
 "payload" 
 : 
  
 { 
  
 "THIS_IS_MY_PK" 
 : 
  
 "1231535353" 
 , 
  
 "FIELD1" 
 : 
  
 null 
 , 
  
 "FIELD2" 
 : 
  
 "TLV" 
 , 
  
 } 
 } 
 

DELETE (T2)

The message payload consists of the entirety of the new row.

  { 
  
 "stream_name" 
 : 
  
 "projects/myProj/locations/myLoc/streams/Oracle-to-Source" 
 , 
  
 "read_method" 
 : 
  
 "oracle-cdc-logminer" 
 , 
  
 "object" 
 : 
  
 "SAMPLE.TBL" 
 , 
  
 "uuid" 
 : 
  
 "c504f4bc-0ffc-4a1a-84df-6aba382fa651" 
 , 
  
 "read_timestamp" 
 : 
  
 "2019-11-07T07:37:20.808Z" 
 , 
  
 "source_timestamp" 
 : 
  
 "2019-11-07T02:19:39" 
 , 
  
 "source_metadata" 
 : 
  
 { 
  
 "log_file" 
 : 
  
  
 "scn" 
 : 
  
 158691504732555 
 , 
  
 "row_id" 
 : 
  
 "AAAGYPAATAAPIC5AAC" 
 , 
  
 "is_deleted" 
 : 
  
 true 
 , 
  
 "database" 
 : 
  
 "schema" 
 : 
  
 "ROOT" 
 , 
  
 "table" 
 : 
  
 "SAMPLE" 
  
 "change_type" 
 : 
  
 "DELETE" 
 , 
  
 "tx_id" 
 : 
  
 "rs_id" 
 : 
  
 "0x006cf4.00056b26.0011" 
 , 
  
 "ssn" 
 : 
  
 0 
 , 
  
 }, 
  
 "payload" 
 : 
  
 { 
  
 "THIS_IS_MY_PK" 
 : 
  
 "1231535353" 
 , 
  
 "FIELD1" 
 : 
  
 null 
 , 
  
 "FIELD2" 
 : 
  
 "TLV" 
 , 
  
 } 
 } 
 

Ordering and consistency

This section covers how Datastream handles ordering and consistency.

Ordering

Datastream doesn't guarantee ordering, but each event contains the full row of data and the timestamp of when the data was written to the source. In BigQuery, out-of-order events are merged in the correct sequence automatically. BigQuery uses the event metadata and an internal change sequence number (CSN) to apply the events to the table in the correct order. In Cloud Storage, events from the same time can span more than one file.

Events that are generated out of order happen by design when events are backfilled for the initial backfill of data that's created when the stream is initiated.

Ordering can be inferred on a source-by-source basis.

Source Description
MySQL

Events that are part of the initial backfill have the read_method field starting with mysql-backfill . There's no implication to the order in which events are received within the backfill because they can be consumed in any order.

Events that are part of the ongoing replication have the read_method field set to mysql-cdc-binlog .

The order can be inferred by the combination of the log_file field and the log_position field that's offset from the log file. This combination provides a unique, incrementally increasing number that identifies the order of operation in the database.

Oracle

Events that are part of the initial backfill have the read_method field starting with oracle-backfill . There's no implication to the order in which events are received within the backfill because they can be consumed in any order.

Events that are part of the ongoing replication have the read_method field set to oracle-cdc-logminer .

The order can be inferred by the combination of the rs_id (the record set ID) field and the ssn (a SQL sequence number) field. This combination provides a unique, incrementally increasing number that identifies the order of operation in the database.

PostgreSQL

Events that are part of the initial backfill have the read_method field starting with postgresql-backfill . There's no implication to the order in which events are received within the backfill because they can be consumed in any order.

Events that are part of the ongoing replication have the read_method field set to postgres-cdc-wal .

The order can be inferred by the combination of the source_timestamp field and the lsn (log sequence number) field. This combination provides a unique, incrementally increasing number that identifies the order of operation in the database.

SQL Server

Events that are part of the initial backfill have the read_method field starting with sqlserver-backfill . There's no implication to the order in which events are received within the backfill because they can be consumed in any order.

Events that are part of the ongoing replication have the read_method field set to sqlserver-cdc .

The order can be inferred by the combination of the source_timestamp field and the lsn (log sequence number) field. This combination provides a unique, incrementally increasing number that identifies the order of operation in the database.

Salesforce ( Preview )

The order can be determined by using the source_timestamp of the record as the sort key. The timestamp in Salesforce has one second resolution, but there can't be two change events for the same record happening at the same second.

MongoDB ( Preview )

The order can be determined by using the ts field in the operation log or the clusterTime field in the change stream for the record. The fields are unique for every record.

Consistency

Datastream ensures that the data from the source database is delivered to the destination at least once. No event is missed, but there's a possibility of duplicate events in the stream. The window for duplicate events should be on the order of minutes, and the universally unique identifier (UUID) of the event in the event metadata can be used to detect duplicates.

When database log files contain uncommitted transactions, if any transactions are rolled back, then the database reflects this in the log files as "reverse" data manipulation language (DML) operations. For example, a rolled-back INSERT operation will have a corresponding DELETE operation. Datastream reads these operations from the log files.

About streams

Every stream has metadata that describes both the stream and the source from which it pulls data. This metadata includes information such as the stream name, the source and destination connection profiles.

To view the complete definition of the Stream object, see the API Reference documentation.

Stream state and status

A stream can be in one of the following states:

  • Not started
  • Starting
  • Running
  • Draining
  • Paused
  • Failed
  • Failed permanently

You can use the logs to find additional status information, such as the tables backfilling or number of rows processed. You can also use the FetchStreamErrors API to retrieve errors.

The discover API returns objects which represent the structure of the objects defined in the data source or destination that's represented by the connection profile. Each object has metadata on the object itself, as well as for every field of data that it pulls. This metadata is available using the discover API .

What's next

Design a Mobile Site
View Site in Mobile | Classic
Share by: