How Google Cloud Lakehouse works

This page describes the technical architecture of Google Cloud Lakehouse, provides details on how queries are handled, and explains how the Lakehouse runtime catalog supports interoperability between engines.

Architecture

A Google Cloud Lakehouse is built from the following technical components:

  • Storage: Cloud Storage and BigQuery storage serve as the storage layer, with Apache Iceberg as the recommended open table format for high-performance, interoperable storage in Cloud Storage.

  • Catalog: The Lakehouse runtime catalog provides a single source of truth for managing metadata. It centralizes metadata discovery across multiple engines using various compatibility options, including an Apache Iceberg REST catalog endpoint, an Apache Hive endpoint, and catalog federation.

  • Query engine: BigQuery and open-source engines—including Apache Spark, Apache Flink, and Trino—interoperate seamlessly by connecting to the Lakehouse runtime catalog.

  • Governance: Knowledge Catalog provides centralized security, lineage, and governance policies across your entire lakehouse.

  • Data writing and analytics tools: Integrated engines and tools provide multiple paths for data ingestion and analysis, ensuring consistent data access for both data scientists and analysts.
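To illustrate how the catalog and query-engine components fit together, an engine that speaks the Apache Iceberg REST protocol is typically pointed at a catalog through a small set of connection properties. The following is a minimal sketch only; the endpoint URI, warehouse name, and property keys shown are placeholders, not documented values for this service.

```python
# Hedged sketch: the general shape of Iceberg REST catalog connection
# properties that engines such as Apache Spark, Apache Flink, Trino, or
# PyIceberg accept. All values below are hypothetical placeholders.
catalog_properties = {
    "type": "rest",                                # Iceberg REST catalog protocol
    "uri": "https://example-endpoint/iceberg/v1",  # hypothetical runtime catalog endpoint
    "warehouse": "my-catalog",                     # hypothetical catalog instance name
}

# An engine would consume these properties at session or connector
# configuration time; the exact mechanism differs per engine.
for key, value in catalog_properties.items():
    print(f"{key}={value}")
```

Because every listed engine connects through the same catalog endpoint, metadata written by one engine is immediately visible to the others.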

Resource hierarchy

Google Cloud Lakehouse organizes data using a hierarchy that aligns with Apache Iceberg standards and standard database concepts. This structure allows the Lakehouse runtime catalog to map logical identities to physical storage paths.

  1. Lakehouse runtime catalog: The top-level regional service resource in Google Cloud that hosts your metadata.
  2. Catalog: A logical container within the runtime catalog service. In a fully qualified P.C.N.T table name, this is the catalog component — the specific catalog instance you are querying.
  3. Namespace: A logical grouping of tables within a catalog. For users familiar with BigQuery, a namespace is functionally similar to a dataset.
  4. Table: The specific entity pointing to data in Cloud Storage. The table metadata contains the schema, partitioning information, and a pointer to the current table state through an Apache Iceberg metadata.json file.
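The four levels above map directly onto the components of a fully qualified, dot-separated table identifier. The following sketch shows that mapping with an in-memory parser; the identifier in the example is hypothetical, and the real service's identifier syntax may differ.

```python
# Minimal sketch: splitting a fully qualified table identifier into the
# four hierarchy levels described above. The example name is hypothetical.
from typing import NamedTuple

class TableIdentifier(NamedTuple):
    runtime_catalog: str  # top-level regional service resource hosting metadata
    catalog: str          # logical container within the runtime catalog
    namespace: str        # logical grouping of tables (similar to a dataset)
    table: str            # entity pointing to data files in Cloud Storage

def parse_identifier(qualified_name: str) -> TableIdentifier:
    parts = qualified_name.split(".")
    if len(parts) != 4:
        raise ValueError("expected runtime_catalog.catalog.namespace.table")
    return TableIdentifier(*parts)

ident = parse_identifier("my-runtime.sales_catalog.orders_ns.daily_orders")
print(ident.namespace)  # orders_ns
```

The catalog uses exactly this kind of logical identity to resolve a table down to its physical `metadata.json` location in Cloud Storage.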

Query processing sequence

When you submit a query to a Google Cloud Lakehouse table, the request follows a specific path to enforce policies and retrieve metadata before data is processed.

  1. Submission: You submit a SQL query to a compatible engine such as Apache Spark, Trino, or BigQuery.
  2. Metadata request: The engine requests table metadata from the Lakehouse runtime catalog to identify the table and its metadata location.
  3. Authorization: The catalog validates the request against Identity and Access Management (IAM) and fine-grained security policies.
  4. Metadata response: The catalog returns the metadata. If credential vending is enabled, it also returns a short-lived token that grants the engine scoped access to the table's storage location.
  5. Data retrieval: The engine uses the metadata and optional token to read data files directly from Cloud Storage.
  6. Execution: The engine processes the data and returns the results.
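The six-step sequence can be sketched with in-memory stand-ins for the catalog, the authorization check, and storage. This is purely illustrative: the names, the token format, and the authorization model below are hypothetical simplifications, not the service's actual API.

```python
# Illustrative model of the query path above. A dict stands in for the
# runtime catalog's table registry, and another for Cloud Storage objects.
import secrets

TABLES = {
    "orders": {
        "metadata_location": "gs://bucket/orders/metadata.json",
        "allowed_principals": {"analyst@example.com"},  # stand-in for IAM policy
    },
}
FILES = {"gs://bucket/orders/metadata.json": [{"order_id": 1}, {"order_id": 2}]}

def request_metadata(table: str, principal: str, vend_credentials: bool = True):
    entry = TABLES[table]                              # step 2: locate table metadata
    if principal not in entry["allowed_principals"]:   # step 3: authorization check
        raise PermissionError(principal)
    token = secrets.token_hex(8) if vend_credentials else None  # step 4: vended token
    return entry["metadata_location"], token

def run_query(table: str, principal: str) -> int:
    location, token = request_metadata(table, principal)  # steps 2-4
    rows = FILES[location]                                # step 5: read from storage
    return len(rows)                                      # step 6: execute and return

print(run_query("orders", "analyst@example.com"))  # 2
```

The key property the sketch preserves is that the engine never touches storage until the catalog has authorized the request and returned the metadata location.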

Lakehouse runtime catalog

The Lakehouse runtime catalog is a fully managed and serverless metadata service for Google Cloud Lakehouse. It provides a single source of truth for metadata across disparate systems and is accessible from BigQuery and various open-source data processing engines. This centralizes discovery and removes the need to synchronize metadata between different repositories.

The Lakehouse runtime catalog integrates with Knowledge Catalog to provide unified, fine-grained access controls across all supported engines. This integration enables full data governance, including data lineage, quality monitoring, and discoverability.

Table types

When building with Google Cloud Lakehouse, you can choose how to manage and format your tables:

Recommended

  • Lakehouse Iceberg REST catalog tables: Apache Iceberg tables created from open-source engines and stored in Cloud Storage. These offer open compatibility and management through the Lakehouse runtime catalog REST endpoint.

BigQuery table types

  • Apache Iceberg tables: Apache Iceberg tables created and managed by BigQuery. The metadata for these tables is stored in the BigQuery catalog, while table data and physical metadata are stored in Cloud Storage.
  • Native tables: Tables fully managed by BigQuery that can be connected to the Lakehouse runtime catalog to enable interoperability with open-source engines.
  • External tables: Tables outside of the Lakehouse runtime catalog where data and metadata are self-managed. These support delegated access through connections for data stored in Cloud Storage, Amazon S3, or Azure Blob Storage.
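As a quick summary of the options above, the table types differ mainly in which system manages the metadata and where the data lives. The sketch below restates the bullet points as data; it adds no information beyond the descriptions above.

```python
# Summary of the table types described above: which catalog manages the
# metadata and where the data files reside. Derived from the text only.
TABLE_TYPES = {
    "lakehouse_iceberg_rest": {"metadata": "Lakehouse runtime catalog",
                               "data": "Cloud Storage"},
    "bigquery_iceberg":       {"metadata": "BigQuery catalog",
                               "data": "Cloud Storage"},
    "bigquery_native":        {"metadata": "BigQuery",
                               "data": "BigQuery storage"},
    "external":               {"metadata": "self-managed",
                               "data": "Cloud Storage, Amazon S3, or Azure Blob Storage"},
}

for name, props in TABLE_TYPES.items():
    print(f"{name}: metadata={props['metadata']}, data={props['data']}")
```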

For a detailed comparison of these options, see Table overview.

What's next