Dataflow managed I/O for Apache Iceberg

Managed I/O supports the following capabilities for Apache Iceberg:

Catalogs	Hadoop Hive REST-based catalogs BigQuery metastore (requires Apache Beam SDK 2.62.0 or later if not using the Portable Runner)
Read capabilities	Batch read
Write capabilities	Batch write Streaming write Dynamic destinations Dynamic table creation

For BigQuery tables for Apache Iceberg, use the BigQueryIO connector with BigQuery Storage API. The table must already exist; dynamic table creation is not supported.

Requirements

The following SDKs support managed I/O for Apache Iceberg:

Apache Beam SDK for Java version 2.58.0 or later
Apache Beam SDK for Python version 2.61.0 or later

Configuration

Managed I/O for Apache Iceberg supports the following configuration parameters:

`ICEBERG` Read

Configuration	Type	Description
table	`str`	Identifier of the Iceberg table.
catalog_name	`str`	Name of the catalog containing the table.
catalog_properties	`map[str, str]`	Properties used to set up the Iceberg catalog.
config_properties	`map[str, str]`	Properties passed to the Hadoop Configuration.
drop	`list[str]`	A subset of column names to exclude from reading. If null or empty, all columns will be read.
filter	`str`	SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html
keep	`list[str]`	A subset of column names to read exclusively. If null or empty, all columns will be read.

`ICEBERG` Write

Configuration	Type	Description
table	`str`	A fully-qualified table identifier. You may also provide a template to write to multiple dynamic destinations, for example: `dataset.my_{col1}_{col2.nested}_table`.
autosharding	`boolean`	Enables dynamic sharding to automatically adjust the number of parallel writers based on data volume. It handles data skew by further sub-dividing partitions into multiple shards to prevent bottlenecks during high-throughput writes. Only available with 'hash' distribution mode.
catalog_name	`str`	Name of the catalog containing the table.
catalog_properties	`map[str, str]`	Properties used to set up the Iceberg catalog.
config_properties	`map[str, str]`	Properties passed to the Hadoop Configuration.
direct_write_byte_limit	`int32`	For a streaming pipeline, sets the limit for lifting bundles into the direct write path.
distribution_mode	`str`	Defines distribution of write data. Supported distributions: - none: don't shuffle rows (default) - hash: shuffle rows by partition key before writing data
drop	`list[str]`	A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'.
keep	`list[str]`	A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'.
only	`str`	The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'.
partition_fields	`list[str]`	Fields used to create a partition spec that is applied when tables are created. For a field 'foo', the available partition transforms are: `foo` `truncate(foo, N)` `bucket(foo, N)` `hour(foo)` `day(foo)` `month(foo)` `year(foo)` `void(foo)` For more information on partition transforms, please visit https://iceberg.apache.org/spec/#partition-transforms.
sort_fields	`list[str]`	Fields used to set the table's sort order, applied when the table is created. Each entry has the form `<term> [asc\|desc] [nulls first\|nulls last]`, where `<term>` is a field name or one of the partition transforms (e.g. `bucket(col, 4)`, `day(ts)`). Direction defaults to ascending; null order defaults to nulls-first for ascending and nulls-last for descending. Note: this sets the table's declared sort order as metadata; it does not cause Beam to physically sort records before writing. For more information on sort orders, please visit https://iceberg.apache.org/spec/#sort-orders.
table_properties	`map[str, str]`	Iceberg table properties to be set on the table when it is created. For more information on table properties, please visit https://iceberg.apache.org/docs/latest/configuration/#table-properties.
triggering_frequency_seconds	`int32`	For a streaming pipeline, sets the frequency at which snapshots are produced.

What's next

For more information and code examples, see the following topics:

Dataflow managed I/O for Apache Iceberg Stay organized with collections Save and categorize content based on your preferences.