The data hierarchy in Datastream is:
- A stream, which consists of a data source and a destination.
- An object, which is a portion of a stream, such as a table from a specific database.
- An event, which is a single change generated by a specific object, such as a database insert.
Streams, objects, and events have data and metadata associated with them. This data and metadata can be used for different purposes.
About events
Each event consists of three types of data:
- Event data: This represents the change to the data itself from the object originating from the stream source. Every event contains the entirety of the row that changed.
- Generic metadata: This metadata appears on every event that Datastream generates. It's used for actions such as removing duplicate data in the destination.
- Source-specific metadata: This metadata appears on every event generated by a specific stream source. This metadata varies by source.
Event data
Event data is the payload of every change from a given object originating from a stream source.
Events are in either the Avro or JSON format.
When working with the Avro format, for each column, the event contains the column index and value. Using the column index, the column name and unified type can be retrieved from the schema in the Avro header.
When working with the JSON format, for each column, the event contains the column name and value.
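As a minimal sketch (with made-up values), a consumer reading a JSON-format event can iterate the payload directly by column name:

```python
import json

# A hypothetical JSON-format event, reduced to its payload. The column names
# and values are illustrative, not real Datastream output.
event_json = """
{
  "payload": {
    "THIS_IS_MY_PK": "1231535353",
    "FIELD1": "foo",
    "FIELD2": "TLV"
  }
}
"""

event = json.loads(event_json)

# In the JSON format, each column is keyed by its name, so the payload can be
# read directly as a dictionary. (In the Avro format, the consumer would
# instead resolve each column index against the schema in the Avro header.)
columns = {name: value for name, value in event["payload"].items()}
```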
Event metadata can be used to collect information about the event's origin, to remove duplicate data in the destination, and to order events in the downstream consumer.
The following tables list and describe the fields and data types for generic and source-specific event metadata.
Generic metadata
This metadata is consistent across streams of all types.
| Field | Description |
|---|---|
| `stream_name` | The name of the stream that generated the event. |
| `read_method` | Indicates whether the data was read from the source using a change data capture (CDC) method, as part of historical backfill, or as part of a supplementation task that's created when a transaction is rolled back during CDC replication. Possible values include: `oracle-cdc-logminer`, `oracle-backfill`, `oracle-supplementation`, `mysql-cdc-binlog`, `mysql-backfill-incremental`, `mysql-backfill-fulldump`, `postgres-cdc-wal`, `postgresql-backfill`, `salesforce-cdc`, and `salesforce-backfill`. |
| `object` | The name of the source object to which the event belongs. |
| `schema_key` | A unique identifier for the unified schema of the event. |
| `uuid` | A unique identifier for the event, which can be used to detect duplicate events in the destination. |
| `read_timestamp` | The timestamp when the record was read by Datastream. |
| `source_timestamp` | The timestamp when the record changed on the source. |
| `sort_keys` | An array of values that can be used to sort events in the order in which they occurred. Avro schema: `{"type": "array", "items": ["string", "long"]}`. |
Source-specific metadata
This metadata is associated with CDC and backfill events from a source database. The following tables list the metadata fields for each supported source.
MySQL

| Field | Description |
|---|---|
| `log_file` | The binary log file from which the event was read. |
| `log_position` | The position of the event in the binary log file. |
| `primary_keys` | The list of column names that make up the table's primary key. |
| `is_deleted` | A `true` value indicates that the row was deleted in the source. A `false` value signifies that the row wasn't deleted. |
| `database` | The database associated with the event. |
| `table` | The table associated with the event. |
| `change_type` | The type of change (`INSERT`, `UPDATE-INSERT`, `UPDATE-DELETE`, or `DELETE`) that the event represents. |
Oracle

| Field | Description |
|---|---|
| `log_file` | The log file from which the event was read. |
| `scn` | The system change number (SCN) of the change in the log file. |
| `is_deleted` | A `true` value indicates that the row was deleted in the source. A `false` value signifies that the row wasn't deleted. |
| `database` | The database associated with the event. |
| `schema` | The schema associated with the event. |
| `table` | The table associated with the event. |
| `change_type` | The type of change (`INSERT`, `UPDATE-INSERT`, `UPDATE-DELETE`, or `DELETE`) that the event represents. |
| `tx_id` | The transaction to which the event belongs. |
| `rs_id` | The record set ID. The combination of `rs_id` and `ssn` uniquely identifies a row in `V$LOGMNR_CONTENTS`. `rs_id` uniquely identifies the redo record that generated the row. |
| `ssn` | The SQL sequence number. This number is used with `rs_id` and uniquely identifies a row in `V$LOGMNR_CONTENTS`. |

PostgreSQL

| Field | Description |
|---|---|
| `schema` | The schema associated with the event. |
| `table` | The table associated with the event. |
| `is_deleted` | A `true` value indicates that the row was deleted in the source. A `false` value signifies that the row wasn't deleted. |
| `change_type` | The type of change (`INSERT`, `UPDATE`, or `DELETE`) that the event represents. |
| `tx_id` | The transaction to which the event belongs. |
| `lsn` | The log sequence number (LSN) of the event. |
| `primary_keys` | The list of column names that make up the table's primary key. |
SQL Server

| Field | Description |
|---|---|
| `table` | The table associated with the event. |
| `database` | The database associated with the event. |
| `schema` | The schema associated with the event. |
| `is_deleted` | A `true` value indicates that the row was deleted in the source. A `false` value signifies that the row wasn't deleted. |
| `lsn` | The log sequence number (LSN) of the event. |
| `tx_id` | The transaction to which the event belongs. |
| `physical_location` | The physical location of the record in the source database. |
| `replication_index` | The index of the record in the replication log. |
| `change_type` | The type of change (`INSERT`, `UPDATE`, or `DELETE`) that the event represents. |
Salesforce

| Field | Description |
|---|---|
| `object_name` | The name of the Salesforce object associated with the event. |
| `domain` | The name of the domain associated with the event. |
| `is_deleted` | A `true` value indicates that the row was deleted in the source. A `false` value signifies that the row wasn't deleted. |
| `change_type` | The type of change (`INSERT`, `UPDATE`, or `DELETE`) that the event represents. |
| `primary_keys` | The list of field names that act as the primary key for the object. |
MongoDB

| Field | Description |
|---|---|
| `database` | The database associated with the event. |
| `collection` | The collection associated with the event. |
| `change_type` | The type of change (`CREATE`, `UPDATE`, or `DELETE`) that the event represents. |
| `is_deleted` | A `true` value indicates that the row was deleted in the source. A `false` value signifies that the row wasn't deleted. |
| `primary_keys` | The `_id` field, which acts as the primary key for each document in a collection. |

Example of an event flow
This flow illustrates the events generated by three consecutive operations (`INSERT`, `UPDATE`, and `DELETE`) on a single row in a `SAMPLE` table in a source database.

| TIME | THIS_IS_MY_PK (int) | FIELD1 (nchar nullable) | FIELD2 (nchar non-null) |
|---|---|---|---|
| 0 | 1231535353 | foo | TLV |
| 1 | 1231535353 | NULL | TLV |
INSERT (T0)
The message payload consists of the entirety of the new row.
```json
{
  "stream_name": "projects/myProj/locations/myLoc/streams/Oracle-to-Source",
  "read_method": "oracle-cdc-logminer",
  "object": "SAMPLE.TBL",
  "uuid": "d7989206-380f-0e81-8056-240501101100",
  "read_timestamp": "2019-11-07T07:37:16.808Z",
  "source_timestamp": "2019-11-07T02:15:39",
  "source_metadata": {
    "log_file": "",
    "scn": 15869116216871,
    "row_id": "AAAPwRAALAAMzMBABD",
    "is_deleted": false,
    "database": "DB1",
    "schema": "ROOT",
    "table": "SAMPLE",
    "change_type": "INSERT",
    "tx_id": "",
    "rs_id": "0x0073c9.000a4e4c.01d0",
    "ssn": 67
  },
  "payload": {
    "THIS_IS_MY_PK": "1231535353",
    "FIELD1": "foo",
    "FIELD2": "TLV"
  }
}
```
UPDATE (T1)
The message payload consists of the entirety of the new row. It doesn't include previous values.
```json
{
  "stream_name": "projects/myProj/locations/myLoc/streams/Oracle-to-Source",
  "read_method": "oracle-cdc-logminer",
  "object": "SAMPLE.TBL",
  "uuid": "e6067366-1efc-0a10-a084-0d8701101101",
  "read_timestamp": "2019-11-07T07:37:18.808Z",
  "source_timestamp": "2019-11-07T02:17:39",
  "source_metadata": {
    "log_file": "",
    "scn": 15869150473224,
    "row_id": "AAAGYPAATAAPIC5AAB",
    "is_deleted": false,
    "database": "DB1",
    "schema": "ROOT",
    "table": "SAMPLE",
    "change_type": "UPDATE",
    "tx_id": "",
    "rs_id": "0x006cf4.00056b26.0010",
    "ssn": 0
  },
  "payload": {
    "THIS_IS_MY_PK": "1231535353",
    "FIELD1": null,
    "FIELD2": "TLV"
  }
}
```
DELETE (T2)
The message payload consists of the entirety of the deleted row.
```json
{
  "stream_name": "projects/myProj/locations/myLoc/streams/Oracle-to-Source",
  "read_method": "oracle-cdc-logminer",
  "object": "SAMPLE.TBL",
  "uuid": "c504f4bc-0ffc-4a1a-84df-6aba382fa651",
  "read_timestamp": "2019-11-07T07:37:20.808Z",
  "source_timestamp": "2019-11-07T02:19:39",
  "source_metadata": {
    "log_file": "",
    "scn": 158691504732555,
    "row_id": "AAAGYPAATAAPIC5AAC",
    "is_deleted": true,
    "database": "DB1",
    "schema": "ROOT",
    "table": "SAMPLE",
    "change_type": "DELETE",
    "tx_id": "",
    "rs_id": "0x006cf4.00056b26.0011",
    "ssn": 0
  },
  "payload": {
    "THIS_IS_MY_PK": "1231535353",
    "FIELD1": null,
    "FIELD2": "TLV"
  }
}
```
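The effect of this flow on a downstream copy of the table can be sketched as follows. This is not Datastream code; `apply_event` is a hypothetical helper, and the events are reduced to the fields the sketch needs:

```python
# Because every Datastream event carries the entire row, applying an event to
# a local replica is a whole-row overwrite, and a DELETE removes the row; no
# column-level merge is needed.
def apply_event(replica, event):
    key = event["payload"]["THIS_IS_MY_PK"]       # primary key of the example table
    if event["source_metadata"]["change_type"] == "DELETE":
        replica.pop(key, None)                    # drop the deleted row
    else:                                         # INSERT or UPDATE: overwrite
        replica[key] = dict(event["payload"])
    return replica

# The three example events (T0, T1, T2), stripped down for illustration.
events = [
    {"source_metadata": {"change_type": "INSERT"},
     "payload": {"THIS_IS_MY_PK": "1231535353", "FIELD1": "foo", "FIELD2": "TLV"}},
    {"source_metadata": {"change_type": "UPDATE"},
     "payload": {"THIS_IS_MY_PK": "1231535353", "FIELD1": None, "FIELD2": "TLV"}},
    {"source_metadata": {"change_type": "DELETE"},
     "payload": {"THIS_IS_MY_PK": "1231535353", "FIELD1": None, "FIELD2": "TLV"}},
]

replica = {}
for e in events:
    apply_event(replica, e)
# After applying INSERT, UPDATE, and DELETE in order, the row is gone.
```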
Ordering and consistency
This section covers how Datastream handles ordering and consistency.
Ordering
Datastream doesn't guarantee ordering, but each event contains the full row of data and the timestamp of when the data was written to the source. In BigQuery, out-of-order events are merged in the correct sequence automatically. BigQuery uses the event metadata and an internal change sequence number (CSN) to apply the events to the table in the correct order. In Cloud Storage, events from the same time can span more than one file.
Out-of-order events occur by design, for example during the initial backfill of data that takes place when the stream is started.
Ordering can be inferred on a source-by-source basis.
| Source | Description |
|---|---|
| MySQL | Events that are part of the initial backfill have the `mysql-backfill-incremental` or `mysql-backfill-fulldump` read method. Events that are part of the ongoing replication have the `mysql-cdc-binlog` read method. The order can be inferred by the combination of the `log_file` and `log_position` metadata fields. |
| Oracle | Events that are part of the initial backfill have the `oracle-backfill` read method. Events that are part of the ongoing replication have the `oracle-cdc-logminer` read method. The order can be inferred by the combination of the `rs_id` and `ssn` metadata fields. |
| PostgreSQL | Events that are part of the initial backfill have the `postgresql-backfill` read method. Events that are part of the ongoing replication have the `postgres-cdc-wal` read method. The order can be inferred by the `lsn` metadata field. |
| SQL Server | Events that are part of the initial backfill have the backfill read method. Events that are part of the ongoing replication have the CDC read method. The order can be inferred by the combination of the `lsn` and `physical_location` metadata fields. |
| Salesforce (Preview) | The order can be determined by using the `sort_keys` metadata field. |
| MongoDB (Preview) | The order can be determined by using the `sort_keys` metadata field. |
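As a hedged sketch, a downstream consumer could order a batch of MySQL events by their binary-log coordinates (`log_file`, `log_position`). The event values here are made up for illustration; in a real pipeline they would come from the destination:

```python
# Hypothetical MySQL events, reduced to the metadata fields used for ordering.
events = [
    {"source_metadata": {"log_file": "mysql-bin.000002", "log_position": 120}},
    {"source_metadata": {"log_file": "mysql-bin.000001", "log_position": 4570}},
    {"source_metadata": {"log_file": "mysql-bin.000001", "log_position": 310}},
]

# Sort first by log file, then by position within the file.
ordered = sorted(
    events,
    key=lambda e: (e["source_metadata"]["log_file"],
                   e["source_metadata"]["log_position"]),
)
```

The same pattern applies to the other sources, substituting the ordering fields from the table above.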
Consistency
Datastream ensures that the data from the source database is delivered to the destination at least once. No event is missed, but there's a possibility of duplicate events in the stream. The window for duplicate events should be on the order of minutes, and the universally unique identifier (UUID) of the event in the event metadata can be used to detect duplicates.
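A minimal sketch of deduplicating at-least-once delivery using the event UUID (the in-memory set is for illustration only; a real pipeline would use durable state sized to the duplicate window):

```python
# Track UUIDs of events that have already been processed.
seen_uuids = set()

def is_new_event(event):
    """Return True the first time a given event UUID is observed."""
    uuid = event["uuid"]
    if uuid in seen_uuids:
        return False          # duplicate delivery: skip it
    seen_uuids.add(uuid)
    return True               # first delivery: process it

first = {"uuid": "d7989206-380f-0e81-8056-240501101100"}
assert is_new_event(first) is True    # first delivery is processed
assert is_new_event(first) is False   # redelivery is detected as a duplicate
```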
When database log files contain uncommitted transactions, if any transactions are rolled back, then the database reflects this in the log files as "reverse" data manipulation language (DML) operations. For example, a rolled-back `INSERT` operation has a corresponding `DELETE` operation. Datastream reads these operations from the log files.
About streams
Every stream has metadata that describes both the stream and the source from which it pulls data. This metadata includes information such as the stream name and the source and destination connection profiles.
To view the complete definition of the Stream object, see the API Reference documentation.
Stream state and status
A stream can be in one of the following states:
- Not started
- Starting
- Running
- Draining
- Paused
- Failed
- Failed permanently
You can use the logs to find additional status information, such as which tables are being backfilled or the number of rows processed. You can also use the `FetchStreamErrors` API to retrieve errors.
Object metadata available using the discover API
The discover API returns objects that represent the structure of the objects defined in the data source or destination represented by the connection profile. Each object has metadata about the object itself, as well as about every field of data that it pulls.
What's next
- To learn more about streams, see Stream lifecycle .
- To learn how to create a stream, see Create a stream .