The technical architecture of Lakehouse for Apache Iceberg supports interoperability between engines by centralizing metadata management in a single catalog and by routing each query through a defined sequence of metadata, authorization, and data-access steps.
Architecture
Google Cloud's Lakehouse consists of the following technical components:
- Storage: Cloud Storage and BigQuery storage serve as the storage layer, with Apache Iceberg as the recommended open table format for high-performance, interoperable storage in Cloud Storage.
- Catalog: The Lakehouse runtime catalog provides a single source of truth for managing metadata. It centralizes metadata discovery across multiple engines using various compatibility options, such as the Apache Iceberg REST catalog endpoint. Registering a table in the catalog automatically registers an entry in the business metadata knowledge catalog.
- Query engine: BigQuery and open-source engines, including Apache Spark, Apache Flink, and Trino, interoperate by connecting to the Lakehouse runtime catalog. Compute engines like Managed Service for Apache Spark run open-source Apache Spark with execution optimizations, which keeps workloads portable and avoids vendor lock-in.
- Governance: Knowledge Catalog provides centralized security, lineage, and governance policies across your entire lakehouse.
- Data writing and analytics tools: Integrated engines and tools provide multiple paths for data ingestion and analysis, giving data scientists and analysts consistent access to the same data.
Resource hierarchy
Google Cloud's Lakehouse organizes data using a hierarchy that aligns with Apache Iceberg standards and standard database concepts. This structure allows the Lakehouse runtime catalog to map logical identities to physical storage paths. To interact with this resource hierarchy and connect your query engines to the catalog, you use specific endpoints, as described below.
- Lakehouse runtime catalog: The top-level regional service resource in Google Cloud that hosts your metadata. To connect query engines to this service and manage underlying catalogs, you configure client applications using a specific catalog endpoint, such as the Apache Iceberg REST catalog endpoint.
- Catalog: A logical container within the runtime catalog service. In the Project.Catalog.Namespace.Table (P.C.N.T) naming structure, this represents the specific catalog instance you are querying.
- Namespace: A logical grouping of tables within a catalog. For users familiar with BigQuery, a namespace is functionally similar to a dataset.
- Table: The specific entity pointing to data in Cloud Storage. The table metadata contains the schema, partitioning information, and a pointer to the current table state through an Apache Iceberg metadata.json file.
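To make the hierarchy concrete, the following minimal Python sketch splits a dotted P.C.N.T name into its four levels. The class name, function name, and sample identifier are hypothetical and chosen only for illustration, and the sketch assumes that no part of the name contains a dot (quoting and escaping are out of scope).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableIdentifier:
    """A four-part Project.Catalog.Namespace.Table (P.C.N.T) name."""
    project: str
    catalog: str
    namespace: str
    table: str

    def __str__(self) -> str:
        return ".".join((self.project, self.catalog, self.namespace, self.table))


def parse_table_identifier(name: str) -> TableIdentifier:
    """Split a dotted P.C.N.T string into its four levels.

    Assumption: no level contains a dot, so a plain split is enough.
    """
    parts = name.split(".")
    if len(parts) != 4 or any(not part for part in parts):
        raise ValueError(f"expected project.catalog.namespace.table, got {name!r}")
    return TableIdentifier(*parts)


# Hypothetical identifier used only for illustration.
ident = parse_table_identifier("my-project.sales_catalog.orders_ns.daily_orders")
print(ident.catalog, ident.namespace, ident.table)
```

Because namespaces are a single level in this hierarchy, a name with more or fewer than four parts is rejected rather than guessed at.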
Supported endpoints
The Lakehouse runtime catalog provides several endpoints to connect your data across Cloud Storage and BigQuery.
- Apache Iceberg REST catalog endpoint: Provides a standard REST interface for wide compatibility with open-source engines like Apache Spark, Apache Flink, and Trino. This is the recommended interface for new workloads and offers full read and write interoperability.
- Custom Apache Iceberg catalog for BigQuery endpoint: Enables engines to interoperate directly with the BigQuery catalog. This interface is used primarily for Apache Iceberg tables managed by BigQuery and existing workloads transitioning to the Lakehouse architecture.
- Apache Hive catalog endpoint: Provides compatibility for open-source workloads that depend on the Apache Hive metastore (HMS) interface. This lets you run Apache Hive or Spark workloads against a fully managed metastore service on Google Cloud.
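As an illustration of how an engine is pointed at an Iceberg REST catalog, the following sketch assembles the generic Apache Iceberg catalog properties that Apache Spark reads. The property keys are the standard Iceberg Spark settings; the function name, catalog name, endpoint URI, and warehouse value are placeholders rather than real Lakehouse values, and authentication settings are deliberately left out. Take the actual endpoint and required properties from the Lakehouse runtime catalog documentation.

```python
def iceberg_rest_spark_conf(catalog_name: str, rest_uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties for a generic Apache Iceberg REST catalog.

    Only the standard Iceberg keys are set; authentication and any
    service-specific properties are intentionally omitted.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        f"{prefix}.warehouse": warehouse,
    }


# Placeholder values: replace them with the endpoint and warehouse for your catalog.
conf = iceberg_rest_spark_conf(
    catalog_name="lakehouse",
    rest_uri="https://example.invalid/iceberg/rest",
    warehouse="gs://example-bucket",
)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed lines can be passed to `spark-submit`, or the same dictionary can be applied to a `SparkSession` builder one key at a time.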
Lakehouse runtime catalog
Within the resource hierarchy, the Lakehouse runtime catalog serves as the top-level regional metadata service in Google Cloud. It acts as the root container that hosts your individual catalog instances, centralizing metadata discovery across disparate query engines.
For a deeper dive into the metastore service, including key capabilities, supported engines, endpoint configuration, and limitations, see About the Lakehouse runtime catalog.
Catalog
A catalog is a logical metastore container backed by a single
Cloud Storage warehouse bucket. In the
Project.Catalog.Namespace.Table (P.C.N.T) naming structure, the catalog
represents the unique metastore instance that connects your open table metadata
with query engines.
Key characteristics of catalogs include the following:
- Storage association: Each catalog is associated with exactly one Cloud Storage bucket.
- Regional replication: A catalog's region automatically matches the underlying Cloud Storage bucket's region.
- Access delegation: Administrators can enable credential vending on the catalog to delegate access, allowing short-lived, downscoped credentials to be auto-generated instead of granting users direct bucket permissions.
Namespace
A namespace is a logical grouping of tables within a catalog, functioning similarly to a database, schema, or a BigQuery dataset. It provides a structure to organize and manage access controls for tables.
Key characteristics of namespaces include the following:
- Regionality: When you create a namespace, it automatically uses the same region as its parent catalog.
- Nesting limitations: Nested namespaces (sub-namespaces) are not supported.
- Security boundaries: You can grant IAM roles at the namespace level to manage access to all tables contained within it.
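Apache Iceberg itself models a namespace as a sequence of levels, so a client written against the generic Iceberg API can accidentally ask for a nested namespace. The following sketch shows one way to reject such requests before they reach the catalog. The function name is hypothetical, and treating a dot inside a name as nesting is an assumption made for this example.

```python
def single_level_namespace(levels) -> str:
    """Return the namespace name, rejecting nested (multi-level) namespaces."""
    levels = tuple(levels)
    if len(levels) != 1:
        raise ValueError(f"nested namespaces are not supported: {levels!r}")
    name = levels[0]
    # Assumption: a dotted name would be read as nested, so reject it as well.
    if not name or "." in name:
        raise ValueError(f"invalid namespace name: {name!r}")
    return name


print(single_level_namespace(["orders_ns"]))
```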
Tables
When building with Google Cloud's Lakehouse, you can choose from the following table types:
Supported by the Lakehouse runtime catalog (recommended)
- Apache Iceberg tables: Apache Iceberg tables created from open-source engines and stored in Cloud Storage. These offer open compatibility and management through the Lakehouse runtime catalog REST endpoint.
Supported by BigQuery
- Apache Iceberg tables: Apache Iceberg tables created and managed by BigQuery. The metadata for these tables is stored in the BigQuery catalog, while table data and physical metadata are stored in Cloud Storage.
- Native tables: Tables fully managed by BigQuery that can be connected to the Lakehouse runtime catalog to enable interoperability with open-source engines.
- External tables: Tables outside of the Lakehouse runtime catalog where data and metadata are self-managed. These support delegated access through connections for data stored in Cloud Storage, Amazon S3, or Azure Blob Storage.
For a detailed comparison of these options, see Table overview.
Query processing sequence
When you submit a query against a table in Google Cloud's Lakehouse, the request follows a specific path to enforce policies and retrieve metadata before any data is processed.
- Submission: You submit a SQL query to a compatible engine such as Apache Spark, Trino, or BigQuery.
- Metadata request: The engine requests table metadata from the Lakehouse runtime catalog to identify the table and its metadata location.
- Authorization: If supported by the endpoint you are using, the catalog validates the request against Identity and Access Management (IAM) and fine-grained security policies.
- Metadata response: The catalog returns the metadata. If credential vending is enabled, it also provides a short-lived token to help with secure storage access.
- Data retrieval: The engine uses the metadata and optional token to read data files directly from Cloud Storage.
- Execution: The engine processes the data and returns the results.
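The sequence above can be sketched as a small in-memory simulation. Everything here is a stand-in: `ToyCatalog`, `run_query`, the principals, and the storage dictionary are hypothetical and do not correspond to a real client library. The sketch simplifies data retrieval by keying rows directly on the metadata location, whereas a real engine follows the metadata to the individual data files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetadataResponse:
    metadata_location: str
    token: Optional[str]  # set only when credential vending is enabled


class ToyCatalog:
    """In-memory stand-in for a catalog; not a real client."""

    def __init__(self, tables, readers, credential_vending=False):
        self.tables = tables      # table name -> metadata location
        self.readers = readers    # table name -> principals allowed to read
        self.credential_vending = credential_vending

    def load_table(self, principal: str, table: str) -> MetadataResponse:
        # Metadata request: identify the table and its metadata location.
        if table not in self.tables:
            raise KeyError(table)
        # Authorization: validate the caller before returning anything.
        if principal not in self.readers.get(table, set()):
            raise PermissionError(f"{principal} may not read {table}")
        # Metadata response: attach a short-lived token if vending is enabled.
        token = f"short-lived-token-for-{table}" if self.credential_vending else None
        return MetadataResponse(self.tables[table], token)


def run_query(catalog: ToyCatalog, storage: dict, principal: str, table: str) -> int:
    """Submission through execution for a toy 'SELECT COUNT(*)' query."""
    response = catalog.load_table(principal, table)
    # Data retrieval: read directly from storage using the returned location.
    rows = storage[response.metadata_location]
    # Execution: process the data and return the result.
    return len(rows)


location = "gs://example-bucket/orders/metadata.json"
catalog = ToyCatalog(
    tables={"orders": location},
    readers={"orders": {"analyst@example.com"}},
    credential_vending=True,
)
storage = {location: [{"id": 1}, {"id": 2}, {"id": 3}]}
print(run_query(catalog, storage, "analyst@example.com", "orders"))
```

An unauthorized principal fails at the authorization step with `PermissionError`, before any storage read is attempted, which mirrors the ordering of the steps listed above.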
Best practices
When architecting and operating a data lakehouse on Google Cloud, consider the following best practices:
- Adopt a medallion architecture: Structure your data warehouse into progressive logical layers (bronze for raw ingestion, silver for cleansed and conformed data, and gold for curated business-level aggregations). Use BigQuery for the gold consumption layer to maximize query performance and concurrency.
- Use session templates for interactive workloads: For exploratory analytics and notebook authoring, use session templates to standardize environment configurations across development teams and reduce repetitive setup.
- Assign custom batch identifiers: When submitting non-interactive serverless Apache Spark batch workloads, assign custom batch and job names. This improves observability, making it easier to filter and track job executions within Cloud Logging and the Google Cloud console.
- Enable diagnostic logging: For complex data engineering pipelines, enable diagnostic bundles and ensure driver and executor logs are retained to simplify troubleshooting and supportability.
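For the custom batch identifier practice, a small helper can keep names consistent across teams. The following sketch builds an identifier from a team name, a pipeline name, and a timestamp. The function name and naming scheme are hypothetical, and the constraint it enforces (lowercase letters, digits, and hyphens, at most 63 characters) is an assumption; check it against the batch ID rules of the service you submit to.

```python
import re
from datetime import datetime, timezone

MAX_LEN = 63  # Assumed upper bound on identifier length.


def make_batch_id(team: str, pipeline: str, when: datetime) -> str:
    """Build a sortable, filterable batch identifier such as team-pipeline-YYYYMMDD-HHMMSS."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    # Lowercase, then collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", f"{team}-{pipeline}".lower()).strip("-")
    # Leave room for the separator and the timestamp suffix.
    slug = slug[: MAX_LEN - len(stamp) - 1].rstrip("-")
    if not slug:
        slug = "batch"
    return f"{slug}-{stamp}"


batch_id = make_batch_id(
    "Data Eng", "orders_daily", datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
)
print(batch_id)  # data-eng-orders-daily-20240102-030405
```

Putting the timestamp last keeps all runs of one pipeline adjacent when identifiers are sorted or filtered by prefix.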
What's next
- For a deeper dive into the metastore service, see About the Lakehouse runtime catalog.
- Use the Lakehouse runtime catalog with Apache Spark, BigQuery and the Apache Iceberg REST catalog endpoint.