Key concepts

This document defines the key terms and concepts for Google Cloud Lakehouse.

This page is not an exhaustive list of features; rather, it is a general reference for the terms and concepts used throughout the Google Cloud Lakehouse documentation.

Core Concepts

The following concepts form the foundation of the Google Cloud Lakehouse architecture.

Data lakehouse

A data lakehouse brings together the cost savings and flexibility of a data lake with the data management and performance of a data warehouse. It lets you store data in open formats on Cloud Storage and use BigQuery features, such as precise security controls and fast queries.

Open interoperability

Open interoperability is the ability for multiple analytical and transactional systems—such as BigQuery, Apache Spark, and Apache Flink—to operate on a single copy of data in open formats such as Apache Iceberg. This eliminates the need for data duplication and ensures a consistent view of data across disparate tools.

Lakehouse runtime catalog

The Lakehouse runtime catalog is a centralized, serverless metadata service that acts as the single source of truth for Google Cloud Lakehouse. It lets multiple engines, such as Apache Spark, Apache Flink, and BigQuery, discover and query the same tables simultaneously.

Catalog Types

The Lakehouse runtime catalog offers different types of catalogs for managing your metadata.

Apache Iceberg REST catalog endpoint

This is a catalog that implements the Apache Iceberg REST catalog specification. It provides interoperability between open source engines and BigQuery, and supports features such as credential vending and disaster recovery.

Custom Apache Iceberg catalog for BigQuery

This is an integration that uses the BigQuery catalog directly as the backing metadata service for managed Apache Iceberg tables.

Table Formats

Google Cloud Lakehouse supports several table formats, depending on the engine used to manage the data.

Lakehouse Iceberg REST catalog tables

These are Apache Iceberg tables created from open source engines and stored in Cloud Storage. The Lakehouse runtime catalog serves as the central catalog. The open source engine that created the table is the only engine that can write to it.

BigQuery tables

These are tables that you create and manage with BigQuery. The following table types fall into this category.

Apache Iceberg tables

These are Apache Iceberg tables that you create from BigQuery and store in Cloud Storage. BigQuery handles all data layout and optimization. While these tables can be read by multiple engines, BigQuery is the only engine that can directly write to them.
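As a sketch of how such a table might be created, the following DDL follows BigQuery's syntax for Apache Iceberg tables in Cloud Storage; the project, dataset, connection, and bucket names are all hypothetical placeholders:

```sql
-- Create a BigQuery-managed Apache Iceberg table whose data lives in
-- Cloud Storage. All identifiers below are placeholders.
CREATE TABLE `my-project.my_dataset.events` (
  event_id STRING,
  event_ts TIMESTAMP
)
WITH CONNECTION `my-project.us.my_connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://my-bucket/events'
);
```

BigQuery then owns the table's data layout and optimization; other engines can read the Iceberg metadata but write through BigQuery only.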

Native tables

These tables are managed by BigQuery and store data in BigQuery storage. You can connect these tables to the Lakehouse runtime catalog.

External tables

External tables reside outside of the Lakehouse runtime catalog. You manage the data and metadata yourself, either in a third-party catalog or directly in external storage such as Cloud Storage, Amazon S3, or Azure Blob Storage. BigQuery can only read from these tables.

Table Features

Table evolution

Google Cloud Lakehouse supports Apache Iceberg table evolution, which lets you change a table's schema or partition spec over time without rewriting the table data or recreating the table.
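For example, Iceberg-compatible engines accept schema and partition-spec changes as metadata-only DDL. A sketch in Spark SQL, with hypothetical catalog, namespace, table, and column names:

```sql
-- Add a column: a metadata-only change, no data files are rewritten.
ALTER TABLE my_catalog.my_namespace.events
  ADD COLUMN user_email STRING;

-- Evolve the partition spec: existing files keep their old layout,
-- while new writes use the new spec.
ALTER TABLE my_catalog.my_namespace.events
  ADD PARTITION FIELD days(event_ts);
```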

Time travel

Time travel lets you query a table's data as it existed at a specific point in time or snapshot ID. This is useful for auditing, reproducing experiments, or restoring data after an accidental deletion.
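As an illustration, BigQuery exposes time travel through the `FOR SYSTEM_TIME AS OF` clause; the table name below is a placeholder:

```sql
-- Query the table as it existed one hour ago.
SELECT *
FROM `my-project.my_dataset.events`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```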

Metadata caching

Metadata caching is a feature that accelerates query performance for external tables. It stores a copy of the table's metadata in BigQuery storage, reducing the need to read metadata files from Cloud Storage during query execution.
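Metadata caching is typically configured per table through DDL options. A sketch with hypothetical names, using the `metadata_cache_mode` and `max_staleness` options from BigQuery's external-table DDL:

```sql
-- External table with cached metadata: BigQuery may serve metadata up to
-- 4 hours stale instead of re-listing files in Cloud Storage per query.
CREATE EXTERNAL TABLE `my-project.my_dataset.raw_events`
WITH CONNECTION `my-project.us.my_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/raw/*.parquet'],
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 4 HOUR
);
```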

Google Cloud Lakehouse table management

Google Cloud Lakehouse table management simplifies lakehouse maintenance by automating tasks such as compaction and garbage collection for managed tables. This helps maintain query performance and storage efficiency without manual intervention.

Interoperability Concepts

Lakehouse runtime catalog federation

Catalog federation is a feature that lets the Lakehouse runtime catalog surface tables from foreign catalogs, such as AWS Glue or Unity Catalog, so that they are visible to and queryable from BigQuery.

P.C.N.T naming structure

The P.C.N.T naming structure is the four-part convention used to uniquely identify and query tables in the Lakehouse runtime catalog from BigQuery. It stands for Project.Catalog.Namespace.Table:

  • Project: The Google Cloud project ID.
  • Catalog: The name of the Lakehouse runtime catalog.
  • Namespace: The logical grouping for tables (similar to a dataset).
  • Table: The name of the data table.
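Illustratively, a BigQuery query against a Lakehouse runtime catalog table spells out the four parts in order; every name here is hypothetical:

```sql
-- Project: my-project | Catalog: my_catalog
-- Namespace: sales    | Table: orders
SELECT order_id, total
FROM `my-project.my_catalog.sales.orders`
LIMIT 10;
```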

Security Concepts

Connections

A connection is a BigQuery resource that stores credentials for accessing external data. In Google Cloud Lakehouse, connections delegate access to Cloud Storage by letting the connection's service account access the storage bucket on your behalf.
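A sketch of how a connection is referenced when defining an external table; the connection and bucket names are placeholders:

```sql
-- The connection's service account, not the querying user,
-- reads the Cloud Storage bucket.
CREATE EXTERNAL TABLE `my-project.my_dataset.logs`
WITH CONNECTION `my-project.us.my_connection`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/logs/*.csv']
);
```

This indirection means you grant the bucket's IAM permissions once, to the connection's service account, rather than to every user who queries the table.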

Credential vending

Credential vending is a security mechanism that helps tighten access control when using the Lakehouse runtime catalog. When enabled, the service generates short-lived, scoped-down credentials designed to grant access only to the specific file paths required for a query.

Unified governance

Unified governance lets you define and enforce security and data management policies centrally through integration with Knowledge Catalog.

Reliability Concepts

Cross-region replication

Cross-region replication replicates metadata across multiple regions to ensure catalog availability during regional outages.

Failover

Failover is the process of switching between primary and secondary regions during a regional outage to maintain catalog operations.