Dataflow managed I/O for Apache Iceberg

Managed I/O supports the following capabilities for Apache Iceberg:

Catalogs

Hadoop
Hive
REST-based catalogs
BigQuery metastore (requires Apache Beam SDK 2.62.0 or later if not using Runner v2)

Read capabilities

Batch read

Write capabilities

Batch write
Streaming write
Dynamic destinations
Dynamic table creation

For BigQuery tables for Apache Iceberg , use the BigQueryIO connector with BigQuery Storage API. The table must already exist; dynamic table creation is not supported.

Requirements

The following SDKs support managed I/O for Apache Iceberg:

Apache Beam SDK for Java version 2.58.0 or later
Apache Beam SDK for Python version 2.61.0 or later

Configuration

Managed I/O for Apache Iceberg supports the following configuration parameters:

`ICEBERG` Read

Configuration	Type	Description
table	`str`	Identifier of the Iceberg table.
catalog_name	`str`	Name of the catalog containing the table.
catalog_properties	`map[ str , str ]`	Properties used to set up the Iceberg catalog.
config_properties	`map[ str , str ]`	Properties passed to the Hadoop Configuration.
drop	`list[ str ]`	A subset of column names to exclude from reading. If null or empty, all columns will be read.
filter	`str`	SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html
keep	`list[ str ]`	A subset of column names to read exclusively. If null or empty, all columns will be read.

`ICEBERG` Write

Configuration

Type

Description

table

str

A fully-qualified table identifier. You may also provide a template to write to multiple dynamic destinations, for example: `dataset.my_{col1}_{col2.nested}_table`.

autosharding

boolean

Enables dynamic sharding to automatically adjust the number of parallel writers based on data volume. It handles data skew by further sub-dividing partitions into multiple shards to prevent bottlenecks during high-throughput writes. Only available with 'hash' distribution mode.

catalog_name

str

Name of the catalog containing the table.

catalog_properties

map[ str 
, str 
]

Properties used to set up the Iceberg catalog.

config_properties

map[ str 
, str 
]

Properties passed to the Hadoop Configuration.

direct_write_byte_limit

int32

For a streaming pipeline, sets the limit for lifting bundles into the direct write path.

distribution_mode

str

Defines distribution of write data. Supported distributions: - none: don't shuffle rows (default) - hash: shuffle rows by partition key before writing data

drop

list[ str 
]

A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'.

keep

list[ str 
]

A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'.

only

str

The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'.

partition_fields

list[ str 
]

Fields used to create a partition spec that is applied when tables are created. For a field 'foo', the available partition transforms are:

foo
truncate(foo, N)
bucket(foo, N)
hour(foo)
day(foo)
month(foo)
year(foo)
void(foo)

For more information on partition transforms, please visit https://iceberg.apache.org/spec/#partition-transforms .

sort_fields

list[ str 
]

Fields used to set the table's sort order, applied when the table is created. Each entry has the form <term> [asc|desc] [nulls first|nulls last] , where <term> is a field name or one of the partition transforms (e.g. bucket(col, 4) , day(ts) ). Direction defaults to ascending; null order defaults to nulls-first for ascending and nulls-last for descending. Note: this sets the table's declared sort order as metadata; it does not cause Beam to physically sort records before writing. For more information on sort orders, please visit https://iceberg.apache.org/spec/#sort-orders .

table_properties

map[ str 
, str 
]

Iceberg table properties to be set on the table when it is created. For more information on table properties, please visit https://iceberg.apache.org/docs/latest/configuration/#table-properties .

triggering_frequency_seconds

int32

For a streaming pipeline, sets the frequency at which snapshots are produced.

What's next

For more information and code examples, see the following topics:

Dataflow managed I/O for Apache Iceberg Stay organized with collections Save and categorize content based on your preferences.

Requirements

Configuration

ICEBERG Read

ICEBERG Write

What's next

Dataflow managed I/O for Apache Iceberg

`ICEBERG` Read

`ICEBERG` Write