Managed I/O supports the following capabilities for Apache Iceberg:
| Catalogs |
|
|---|---|
| Read capabilities | Batch read |
| Write capabilities |
|
For BigQuery tables for Apache Iceberg,
use the
BigQueryIO connector
with BigQuery Storage API. The table must already exist; dynamic table creation is
not supported.
Requirements
The following SDKs support managed I/O for Apache Iceberg:
- Apache Beam SDK for Java version 2.58.0 or later
- Apache Beam SDK for Python version 2.61.0 or later
Configuration
Managed I/O for Apache Iceberg supports the following configuration parameters:
ICEBERG Read
| Configuration | Type | Description |
|---|---|---|
| table |
str
|
Identifier of the Iceberg table. |
| catalog_name |
str
|
Name of the catalog containing the table. |
| catalog_properties |
map[str, str]
|
Properties used to set up the Iceberg catalog. |
| config_properties |
map[str, str]
|
Properties passed to the Hadoop Configuration. |
| drop |
list[str]
|
A subset of column names to exclude from reading. If null or empty, all columns will be read. |
| filter |
str
|
SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html |
| keep |
list[str]
|
A subset of column names to read exclusively. If null or empty, all columns will be read. |
ICEBERG Write
| Configuration | Type | Description |
|---|---|---|
| table |
str
|
A fully-qualified table identifier. You may also provide a template to write to multiple dynamic destinations, for example: `dataset.my_{col1}_{col2.nested}_table`. |
| autosharding |
boolean
|
Enables dynamic sharding to automatically adjust the number of parallel writers based on data volume. It handles data skew by further sub-dividing partitions into multiple shards to prevent bottlenecks during high-throughput writes. Only available with 'hash' distribution mode. |
| catalog_name |
str
|
Name of the catalog containing the table. |
| catalog_properties |
map[str, str]
|
Properties used to set up the Iceberg catalog. |
| config_properties |
map[str, str]
|
Properties passed to the Hadoop Configuration. |
| direct_write_byte_limit |
int32
|
For a streaming pipeline, sets the limit for lifting bundles into the direct write path. |
| distribution_mode |
str
|
Defines distribution of write data. Supported distributions: - none: don't shuffle rows (default) - hash: shuffle rows by partition key before writing data |
| drop |
list[str]
|
A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'. |
| keep |
list[str]
|
A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'. |
| only |
str
|
The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'. |
| partition_fields |
list[str]
|
Fields used to create a partition spec that is applied when tables are created. For a field 'foo', the available partition transforms are:
For more information on partition transforms, please visit https://iceberg.apache.org/spec/#partition-transforms. |
| sort_fields |
list[str]
|
Fields used to set the table's sort order, applied when the table is created. Each entry has the form <term> [asc|desc] [nulls first|nulls last], where <term> is a field name or one of the partition transforms (e.g. bucket(col, 4), day(ts)). Direction defaults to ascending; null order defaults to nulls-first for ascending and nulls-last for descending. Note: this sets the table's declared sort order as metadata; it does not cause Beam to physically sort records before writing.
For more information on sort orders, please visit https://iceberg.apache.org/spec/#sort-orders.
|
| table_properties |
map[str, str]
|
Iceberg table properties to be set on the table when it is created. For more information on table properties, please visit https://iceberg.apache.org/docs/latest/configuration/#table-properties. |
| triggering_frequency_seconds |
int32
|
For a streaming pipeline, sets the frequency at which snapshots are produced. |
What's next
For more information and code examples, see the following topics: