Lakehouse for Apache Iceberg is a managed data lakehouse platform on Google Cloud. Within this platform, the Lakehouse runtime catalog functions as the fully managed, serverless metastore service. It provides a single source of truth for your data lakehouse, which lets multiple engines—including Apache Spark, Apache Flink, Apache Hive, and BigQuery—share tables and metadata without copying files.
To connect query engines to the metastore, you configure a client by using a specific catalog type, such as the Apache Iceberg REST catalog endpoint. This endpoint manages table metadata and uses a storage location warehouse backed by a Cloud Storage bucket to store your underlying metadata and data files. For more information about choosing a catalog type, see Catalog types and endpoint configuration.
The Lakehouse runtime catalog supports storage access delegation, also known as credential vending, which improves security by removing the need for direct Cloud Storage bucket access. It also integrates with Knowledge Catalog for unified governance, lineage, and data quality.
Key capabilities
As a key component of Lakehouse for Apache Iceberg, the Lakehouse runtime catalog provides several advantages for data management and analysis, including a serverless architecture, engine interoperability with open APIs, a unified user experience, and high-performance analytics, streaming, and AI when you use it with BigQuery. For more information about these benefits, see What is Lakehouse?
Supported engines
The Lakehouse runtime catalog is compatible with several query engines including (but not limited to) Apache Spark, Apache Flink, Apache Hive, and Trino. The following table provides links to documentation for each engine:
| Engine | Documentation |
|---|---|
| Apache Spark | Quickstart: Use with Spark and the Iceberg REST catalog endpoint |
| Apache Hive | Use with Spark and the Hive catalog |
| Apache Flink | Use with Apache Flink |
| Trino | Use with Trino |
Catalog types and endpoint configuration
When configuring client engines to connect to the Lakehouse runtime catalog metastore, you select a specific catalog endpoint, such as the Apache Iceberg REST catalog endpoint or the Apache Hive endpoint. The best option depends on your use case, as shown in the following table:
| Use case | Recommendation |
|---|---|
| New Lakehouse runtime catalog users that want their open source engine to access data in Cloud Storage and need interoperability with other engines, including BigQuery and AlloyDB for PostgreSQL. | Use the Apache Iceberg REST catalog endpoint. |
| Users running Apache Hive or Spark workloads that depend on the Hive Metastore interface and want a fully managed metastore service. | Use the Apache Hive catalog endpoint. |
| Existing Lakehouse runtime catalog users that have current tables created with the custom Apache Iceberg catalog for BigQuery endpoint. | Continue using the custom Apache Iceberg catalog for BigQuery endpoint, but use the Apache Iceberg REST catalog for new workflows. Tables created with the custom Apache Iceberg catalog for BigQuery endpoint are visible with the Apache Iceberg REST catalog endpoint through BigQuery catalog federation. |
How the Lakehouse architecture integrates with Google Cloud services
To understand how Lakehouse manages your data, see how the Lakehouse for Apache Iceberg architecture integrates with Google Cloud services. Apache Iceberg does not store data in monolithic tables. Instead, it uses a layered architecture of metadata files to organize data files into a cohesive table structure with ACID transaction support.
The following diagram illustrates how compute engines like Managed Service for Apache Spark use the Lakehouse runtime catalog to manage table metadata to read and write underlying Parquet data files directly in Cloud Storage.
When you use Lakehouse for Apache Iceberg, the technical architecture consists of three distinct layers:
Catalog layer:
- Core Iceberg concept: The catalog stores the current state of the table by maintaining a pointer to the latest metadata file. This layer facilitates ACID compliance and transaction isolation to ensure that concurrent writes don't interfere with each other.
- Lakehouse implementation: The Lakehouse runtime catalog acts as the top-level regional metastore service. Within this service, you create individual catalogs to manage your data hierarchy. Client query engines connect to these catalogs by using specific endpoint catalog types, such as the Apache Iceberg REST catalog endpoint. The metastore manages transaction commits, credential vending for storage access delegation, and pointer management across your catalogs.
Metadata layer:
- Core Iceberg concept: This layer tracks the table structure,
snapshots, and file locations by using a hierarchy of three file types:
- Metadata files: Store the table's schema, partition specification, and a log of snapshot pointers.
- Manifest lists: Represent a single snapshot of the table by grouping a collection of manifest files.
- Manifest files: Track data at the individual file level, storing file paths, partition information, and column-level statistics, for example, row counts and minimum and maximum values, which are used for query optimization and partition pruning.
- Lakehouse implementation: Within a catalog container,
you organize your data into logical namespaces (similar to datasets)
and tables. For each table, the
Lakehouse runtime catalog generates and manages the
underlying Iceberg metadata hierarchy, starting with a root
metadata.jsonfile that points to the manifest lists and manifest files. The Lakehouse runtime catalog persists these files directly in your designated warehouse storage location.
- Core Iceberg concept: This layer tracks the table structure,
snapshots, and file locations by using a hierarchy of three file types:
Data layer:
- Core Iceberg concept: This component is the underlying storage where the actual raw data records reside, typically in optimized columnar or row-based open file formats such as Parquet, ORC, or Avro.
- Lakehouse implementation: When you configure a
Cloud Storage bucket (
gs://) as your warehouse storage location, the physical data files referenced by your tables are stored securely within your bucket. The Lakehouse runtime catalog manages access through storage access delegation (credential vending), vending short-lived access tokens directly to client engines. This lets engines read and write data files securely without requiring broad, direct IAM permissions on the underlying bucket.
Lakehouse runtime catalog limitations
The following limitations apply to tables in the Lakehouse runtime catalog:
Table management
- You can't create or modify tables with the Apache Iceberg REST catalog endpoint using BigQuery data definition language (DDL) or data manipulation language (DML) statements. You can modify these tables using the BigQuery API (with the bq command-line tool or client libraries), but doing so risks making changes that are incompatible with the external engine.
- Tables in the Lakehouse runtime catalog don't support
renaming operations or the
ALTER TABLE ... RENAME TOSpark SQL statement. - Tables in the Lakehouse runtime catalog don't support clustering.
- Tables in the Lakehouse runtime catalog don't support flexible column names.
- The Lakehouse runtime catalog doesn't support Apache Iceberg views.
Querying
- Query performance for tables in the Lakehouse runtime catalog from the BigQuery engine might be slow compared to querying data in standard BigQuery tables. In general, query speed should be equivalent to reading data from Cloud Storage.
- A BigQuery dry run of a query that uses a table in the Lakehouse runtime catalog might report a lower bound of 0 bytes of data, even if rows are returned. This result occurs because the amount of data that is processed from the table can't be determined until the full query is run. Running the query incurs a cost for processing this data.
- You can't reference a table in the Lakehouse runtime catalog in a wildcard table query.
API and metadata
- You can't use the
tabledata.listmethod to retrieve data from tables in the Lakehouse runtime catalog. Instead, you can save query results to a BigQuery table, and then use thetabledata.listmethod on that table. - The display of table storage statistics for tables in the Lakehouse runtime catalog isn't supported.
Quotas and limits
- Tables in the Lakehouse runtime catalog in BigQuery are subject to the same quotas and limits as standard tables.
Differences with BigLake metastore (classic)
The core differences between the Lakehouse runtime catalog and BigLake metastore (classic) include the following:
- The Lakehouse runtime catalog supports a direct integration with open source engines like Spark, which helps reduce redundancy when you store metadata and run jobs. Tables in the Lakehouse runtime catalog are directly accessible from multiple open source engines and BigQuery.
- The Lakehouse runtime catalog supports the Apache Iceberg REST catalog endpoint, while BigLake metastore (classic) does not.
What's next
- Understand the Apache Iceberg REST catalog endpoint.