Key concepts

This document defines the key terms and concepts for BigLake.

This page is not an exhaustive list of features; instead, it's a general reference for the terms and concepts used throughout the BigLake documentation.

Core concepts

The following concepts form the foundation of the BigLake architecture.

Data lakehouse

A data lakehouse is a data architecture that combines the cost-efficiency and flexibility of a data lake with the data management and performance structures of a data warehouse. BigLake enables a lakehouse architecture by letting you keep data in open formats on Cloud Storage while using BigQuery features such as fine-grained security and high-performance querying.

Open interoperability

Open interoperability is the ability for multiple analytical and transactional systems—such as BigQuery, Spark, and Flink—to operate on a single copy of data in open formats such as Apache Iceberg. This eliminates the need for data duplication and ensures a consistent view of data across disparate tools.

BigLake metastore

BigLake metastore is a centralized, serverless metadata service that acts as the single source of truth for your lakehouse. It lets multiple engines, such as Spark, Flink, and BigQuery, discover and query the same tables simultaneously.

Catalog types

The BigLake metastore offers two types of catalogs for managing your metadata. Your choice of catalog is a fundamental decision that affects how you interact with your data.

Iceberg REST catalog

This is a catalog based on the Apache Iceberg REST catalog specification. It provides interoperability between open source engines and BigQuery, and supports features such as credential vending and disaster recovery.
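
Because the catalog follows the Iceberg REST specification, any engine with an Iceberg REST client can attach to it through standard Iceberg catalog properties. The fragment below is an illustrative sketch of such Spark settings; the catalog name, endpoint URI, and warehouse path are placeholders, not confirmed BigLake values.

```
# Illustrative Spark properties for an Iceberg REST catalog.
# Catalog name, URI, and warehouse are placeholder assumptions.
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=rest
spark.sql.catalog.my_catalog.uri=https://example-rest-catalog.endpoint/v1
spark.sql.catalog.my_catalog.warehouse=gs://my-bucket/warehouse
```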

Custom Iceberg catalog for BigQuery

This is an integration that uses BigQuery directly as the backing metastore.

Table formats

BigLake supports several table formats, depending on the engine used to manage the data.

BigLake Iceberg tables in BigQuery

These are Iceberg tables that you create from BigQuery and store in Cloud Storage. BigQuery handles all data layout and optimization. While these tables can be read by multiple engines, BigQuery is the only engine that can directly write to them.

BigLake Iceberg tables

These are Iceberg tables created from open source engines and stored in Cloud Storage. The BigLake metastore serves as the central catalog. The open source engine that created the table is the only engine that can write to it.

Standard BigQuery tables

These tables are managed by BigQuery and store data in BigQuery storage. You can connect these tables to BigLake metastore.

External tables

External tables reside outside of BigLake metastore. The data and metadata are self-managed in a third-party catalog. BigQuery can only read from these tables.

Table features

BigLake provides several features that simplify data management and improve query performance for Iceberg tables.

Table evolution

BigLake supports Iceberg table evolution, which lets you change a table's schema or partition spec over time without rewriting the table data or recreating the table.
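
Evolution avoids rewrites because each schema change appends a new schema version to the table's metadata; existing data files stay untouched, and reads project old rows through the current schema, filling missing columns with nulls. A toy sketch of that idea (the structures here are illustrative, not Iceberg's actual metadata model):

```python
# Each data file was written under some schema version; reads project
# every row through the latest schema, filling absent columns with None.
schemas = [["id", "amount"]]                      # schema version 0
data_files = [{"schema_v": 0, "rows": [{"id": 1, "amount": 9.5}]}]

# Schema evolution: add a column without touching existing files.
schemas.append(schemas[-1] + ["region"])          # schema version 1
data_files.append({"schema_v": 1,
                   "rows": [{"id": 2, "amount": 3.0, "region": "us"}]})

def read_all():
    """Read every row, projected through the latest schema."""
    latest = schemas[-1]
    return [{col: row.get(col) for col in latest}
            for f in data_files for row in f["rows"]]

print(read_all())
# [{'id': 1, 'amount': 9.5, 'region': None},
#  {'id': 2, 'amount': 3.0, 'region': 'us'}]
```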

Time travel

Time travel lets you query a table's data as it existed at a specific point in time or at a specific snapshot ID. This is useful for auditing, reproducing experiments, or restoring data after an accidental deletion.
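
Conceptually, resolving a timestamp means finding the newest table snapshot committed at or before that moment. A minimal sketch of that lookup (the snapshot list and names are illustrative, not BigLake's API):

```python
from bisect import bisect_right

# (commit_time, snapshot_id) pairs, sorted by commit time.
snapshots = [
    (100, "snap-a"),
    (200, "snap-b"),
    (300, "snap-c"),
]

def snapshot_as_of(ts):
    """Return the snapshot ID visible at timestamp ts, or None if the
    table did not yet exist."""
    times = [t for t, _ in snapshots]
    i = bisect_right(times, ts)          # count of snapshots with time <= ts
    return snapshots[i - 1][1] if i else None

print(snapshot_as_of(250))  # "snap-b": newest snapshot at or before ts=250
print(snapshot_as_of(50))   # None: no snapshot existed yet
```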

Metadata caching

Metadata caching is a feature that accelerates query performance for BigLake external tables. It stores a copy of the table's metadata in BigQuery storage, reducing the need to read metadata files from Cloud Storage during query execution.
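
The behavior can be sketched as a staleness-bounded cache: reads are served from the cached copy while it is younger than a configured staleness interval, and refreshed from storage otherwise. The class below is a toy model of that trade-off, not BigQuery's implementation:

```python
import time

class MetadataCache:
    """Toy staleness-bounded cache (illustrative only)."""

    def __init__(self, fetch, max_staleness_s):
        self.fetch = fetch                # reads metadata from object storage
        self.max_staleness_s = max_staleness_s
        self.cached = None
        self.cached_at = float("-inf")

    def get(self, now=None):
        now = time.time() if now is None else now
        if now - self.cached_at > self.max_staleness_s:
            self.cached = self.fetch()    # slow path: read from storage
            self.cached_at = now
        return self.cached                # fast path: serve cached copy

storage_reads = []
cache = MetadataCache(
    fetch=lambda: storage_reads.append(1) or {"files": ["f1"]},
    max_staleness_s=60)
cache.get(now=0)     # cold: reads from storage
cache.get(now=30)    # fresh enough: served from cache
cache.get(now=100)   # stale: refreshed from storage
print(len(storage_reads))  # 2
```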

Automatic table maintenance

Automatic table maintenance simplifies lakehouse management by automating tasks such as compaction and garbage collection for managed tables. This ensures optimal query performance and storage efficiency without manual intervention.

Interoperability concepts

Interoperability provides data access across Google Cloud and open source systems.

Catalog federation

Catalog federation is a feature of the Iceberg REST catalog that lets the catalog manage and query tables that are visible to BigQuery, including tables created with the custom Iceberg catalog for BigQuery.

P.C.N.T naming structure

The P.C.N.T naming structure is the four-part convention used to uniquely identify and query tables in BigLake metastore from BigQuery. It stands for Project.Catalog.Namespace.Table:

  • Project: The Google Cloud project ID
  • Catalog: The name of the BigLake metastore catalog
  • Namespace: The logical grouping for tables (similar to a dataset)
  • Table: The name of the data table
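
For illustration, the helper below assembles the four parts into one backtick-quoted identifier for use in a query; the project, catalog, namespace, and table values are placeholders:

```python
def pcnt_name(project, catalog, namespace, table):
    """Build a four-part P.C.N.T identifier, backtick-quoted for SQL use."""
    parts = (project, catalog, namespace, table)
    if not all(parts):
        raise ValueError("all four parts are required")
    return "`" + ".".join(parts) + "`"

# Placeholder values; substitute your own project, catalog, namespace, table.
name = pcnt_name("my-project", "my_catalog", "sales_ns", "orders")
query = f"SELECT * FROM {name} LIMIT 10"
print(name)  # `my-project.my_catalog.sales_ns.orders`
```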

Security concepts

Security features provide mechanisms for access management and data protection.

Connections

A connection is a BigQuery resource that stores credentials for accessing external data. In BigLake, connections delegate access to Cloud Storage by letting the connection's service account access the storage bucket on your behalf.

Credential vending

Credential vending is a security mechanism that helps tighten access control when using the Iceberg REST catalog. When enabled, BigLake generates short-lived, scoped-down credentials that grant access only to the specific file paths required for a query, rather than handing broad bucket-level access to the query engine. This helps prevent users from bypassing table-level security policies to read raw files directly.
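
The idea can be sketched as follows: instead of a bucket-wide credential, the catalog mints a short-lived token scoped to the exact file paths a query needs, and reads are authorized against that scope and expiry. The token structure and function names below are purely illustrative:

```python
import time
import secrets

def vend_credential(query_paths, ttl_s=900):
    """Mint a short-lived, path-scoped token (illustrative structure only)."""
    return {
        "token": secrets.token_urlsafe(16),
        "allowed_paths": frozenset(query_paths),
        "expires_at": time.time() + ttl_s,
    }

def authorize(cred, path, now=None):
    """Allow a read only if the path is in scope and the token is unexpired."""
    now = time.time() if now is None else now
    return path in cred["allowed_paths"] and now < cred["expires_at"]

cred = vend_credential({"gs://bucket/table/data/part-000.parquet"})
print(authorize(cred, "gs://bucket/table/data/part-000.parquet"))  # True
print(authorize(cred, "gs://bucket/other/secret.parquet"))         # False
```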

Unified governance

Unified governance lets you define and enforce security and data management policies centrally through integration with Dataplex Universal Catalog.

Reliability concepts

Reliability features provide data resilience and catalog availability.

Cross-region replication

Cross-region replication replicates metadata across multiple regions to ensure catalog availability during regional outages.

Failover

Failover is the process of switching between primary and secondary regions during a regional outage to maintain catalog operations.