Dataflow managed I/O for Apache Iceberg

Managed I/O supports the following capabilities for Apache Iceberg:

Catalogs
  • Hadoop
  • Hive
  • REST-based catalogs
  • BigQuery metastore (requires Apache Beam SDK 2.62.0 or later if not using Runner v2)
Read capabilities
Batch read
Write capabilities

For BigQuery tables for Apache Iceberg , use the BigQueryIO connector with BigQuery Storage API. The table must already exist; dynamic table creation is not supported.

Requirements

The following SDKs support managed I/O for Apache Iceberg:

  • Apache Beam SDK for Java version 2.58.0 or later
  • Apache Beam SDK for Python version 2.61.0 or later

Configuration

Managed I/O for Apache Iceberg supports the following configuration parameters:

ICEBERG Read

Configuration Type Description
table
str Identifier of the Iceberg table.
catalog_name
str Name of the catalog containing the table.
catalog_properties
map[ str , str ] Properties used to set up the Iceberg catalog.
config_properties
map[ str , str ] Properties passed to the Hadoop Configuration.
drop
list[ str ] A subset of column names to exclude from reading. If null or empty, all columns will be read.
filter
str SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html
keep
list[ str ] A subset of column names to read exclusively. If null or empty, all columns will be read.

ICEBERG Write

Configuration
Type
Description
table
str
A fully-qualified table identifier. You may also provide a template to write to multiple dynamic destinations, for example: `dataset.my_{col1}_{col2.nested}_table`.
autosharding
boolean
Enables dynamic sharding to automatically adjust the number of parallel writers based on data volume. It handles data skew by further sub-dividing partitions into multiple shards to prevent bottlenecks during high-throughput writes. Only available with 'hash' distribution mode.
catalog_name
str
Name of the catalog containing the table.
catalog_properties
map[ str , str ]
Properties used to set up the Iceberg catalog.
config_properties
map[ str , str ]
Properties passed to the Hadoop Configuration.
direct_write_byte_limit
int32
For a streaming pipeline, sets the limit for lifting bundles into the direct write path.
distribution_mode
str
Defines distribution of write data. Supported distributions: - none: don't shuffle rows (default) - hash: shuffle rows by partition key before writing data
drop
list[ str ]
A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'.
keep
list[ str ]
A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'.
only
str
The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'.
partition_fields
list[ str ]
Fields used to create a partition spec that is applied when tables are created. For a field 'foo', the available partition transforms are:
  • foo
  • truncate(foo, N)
  • bucket(foo, N)
  • hour(foo)
  • day(foo)
  • month(foo)
  • year(foo)
  • void(foo)

For more information on partition transforms, please visit https://iceberg.apache.org/spec/#partition-transforms .

sort_fields
list[ str ]
Fields used to set the table's sort order, applied when the table is created. Each entry has the form <term> [asc|desc] [nulls first|nulls last] , where <term> is a field name or one of the partition transforms (e.g. bucket(col, 4) , day(ts) ). Direction defaults to ascending; null order defaults to nulls-first for ascending and nulls-last for descending. Note: this sets the table's declared sort order as metadata; it does not cause Beam to physically sort records before writing. For more information on sort orders, please visit https://iceberg.apache.org/spec/#sort-orders .
table_properties
map[ str , str ]
Iceberg table properties to be set on the table when it is created. For more information on table properties, please visit https://iceberg.apache.org/docs/latest/configuration/#table-properties .
triggering_frequency_seconds
int32
For a streaming pipeline, sets the frequency at which snapshots are produced.

What's next

For more information and code examples, see the following topics:

Create a Mobile Website
View Site in Mobile | Classic
Share by: