How BigLake works

This page describes the technical architecture of BigLake, explains how queries are processed, and shows how BigLake metastore supports interoperability between engines.

Architecture

A data lakehouse built with BigLake consists of the following components:

  • Storage: Cloud Storage and BigQuery storage serve as the storage layer, with Apache Iceberg as the recommended open table format for Cloud Storage.
  • Metastore: BigLake metastore provides a single source of truth for managing metadata across multiple engines.
  • Query engine: BigQuery, Apache Spark, Apache Flink, Trino, and other open-source engines are compatible with BigLake.
  • Governance: Dataplex Universal Catalog provides centralized security and governance policies.
  • Data writing and analytics tools: Engines and tools integrated with BigLake provide multiple paths for data ingestion and analysis.

Resource hierarchy

BigLake organizes data using the standard Apache Iceberg hierarchy. This structure maps logical database concepts to physical storage paths.

  1. Metastore service: The top-level regional resource in Google Cloud.
  2. Catalog: A container for grouping databases. Catalogs correspond to projects.
  3. Namespace: A logical grouping of tables. In BigQuery, this maps to a dataset.
  4. Table: The specific entity pointing to data in Cloud Storage. Table metadata contains information such as the table schema, partitioning information, custom properties, and a pointer to the current table state through a metadata.json file.
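The four-level hierarchy above can be modeled as nested containers. The following Python sketch is illustrative only; the class and field names are hypothetical and are not the BigLake metastore API:

```python
from dataclasses import dataclass, field

# Illustrative model of the BigLake resource hierarchy.
# These classes are a sketch, not the actual BigLake metastore API.

@dataclass
class Table:
    name: str
    # Pointer to the current table state, e.g. a metadata.json file in Cloud Storage
    metadata_json_uri: str

@dataclass
class Namespace:
    # A logical grouping of tables; in BigQuery this maps to a dataset
    name: str
    tables: dict = field(default_factory=dict)

@dataclass
class Catalog:
    # A container for grouping databases (namespaces)
    name: str
    namespaces: dict = field(default_factory=dict)

@dataclass
class Metastore:
    # Top-level regional resource
    region: str
    catalogs: dict = field(default_factory=dict)

def resolve(ms: Metastore, catalog: str, namespace: str, table: str) -> Table:
    """Walk the hierarchy from metastore down to a table."""
    return ms.catalogs[catalog].namespaces[namespace].tables[table]
```

Resolving a table is then a walk from the regional metastore through catalog and namespace to the table entity, which holds the pointer to its metadata.json file.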

Query processing sequence

When you submit a query to a BigLake table, the request follows a specific path to enforce policies before data is read.

  1. Request: You submit a SQL query to an engine (for example, Spark).
  2. Metadata lookup: The engine sends a request to the BigLake metastore to resolve the table.
  3. Authentication and policy: The metastore authenticates you and checks permissions.
  4. Response: The metastore returns the table metadata and, if credential vending is enabled, a storage token to the engine.
  5. Read: The engine reads the data files directly from Cloud Storage, using the vended token for authorization when one was returned.
  6. Compute: The engine processes the data and returns the results.
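The sequence above can be sketched as a small simulation. Everything here (function names, the table structure, the token format) is hypothetical and stands in for the real BigLake metastore and storage calls:

```python
# Illustrative simulation of the BigLake query path.
# Function and field names are hypothetical, not real BigLake APIs.

def metastore_resolve(principal, table, credential_vending=False):
    """Steps 2-4: resolve the table, authenticate, optionally vend a token."""
    if principal not in table["readers"]:            # step 3: policy check
        raise PermissionError(f"{principal} may not read {table['name']}")
    response = {"metadata": table["metadata"]}
    if credential_vending:                           # step 4: token only when vending is on
        response["storage_token"] = f"token-for-{table['name']}"
    return response

def read_files(metadata, token=None):
    """Step 5: read data files directly from storage (simulated)."""
    return metadata["files"]

def run_query(principal, table, credential_vending=False):
    """Steps 1 and 6: submit the query, then compute over the files read."""
    response = metastore_resolve(principal, table, credential_vending)
    files = read_files(response["metadata"], response.get("storage_token"))
    return len(files)    # stand-in for actual query results
```

The point of the simulation is the ordering: permissions are checked at the metastore before any file is read, so an unauthorized principal fails at step 3 and never reaches storage.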

BigLake metastore

BigLake metastore is a fully managed, serverless metastore for your lakehouse on Google Cloud. It provides a single source of truth for metadata from multiple sources and is accessible from BigQuery and various open data processing engines, which removes the need to synchronize metadata between separate repositories.

BigLake metastore integrates with Dataplex Universal Catalog, which provides unified, fine-grained access controls across all supported engines and supports end-to-end governance with lineage, data quality, and discoverability.

Table types

When building a lakehouse on BigLake, you have several choices for the format and management of your tables:

  • BigLake Iceberg tables: Iceberg tables created from open-source engines and stored in Cloud Storage.
  • BigLake Iceberg tables in BigQuery: Iceberg tables created from BigQuery. The metadata for these tables is stored in the BigQuery catalog and can only be accessed through BigQuery catalog federation, while table data and physical metadata are stored in Cloud Storage.
  • Standard BigQuery tables: Tables fully managed by BigQuery that can be connected to BigLake metastore.
  • External tables: Tables outside of BigLake metastore where data and metadata are self-managed.

For a detailed comparison of these options, see the Table overview.

What's next