What is Google Cloud Lakehouse?

Google Cloud Lakehouse is a high-performance storage engine designed for building open data lakehouses. By integrating the Apache Iceberg open table format with fully managed, enterprise-grade storage on Google Cloud, it provides a unified interface for advanced analytics and AI.

By decoupling storage from compute, Google Cloud Lakehouse ensures seamless interoperability across analytical and transactional systems. This architecture allows multiple engines—including Apache Spark, Apache Flink, Apache Hive, Trino, and BigQuery—to access a single source of truth, eliminating data duplication and ensuring consistent insights.

Key benefits

  • Serverless architecture: Google Cloud Lakehouse eliminates the need for server or cluster management, reducing operational overhead and automatically scaling based on demand.
  • Unified data management and governance: Integration with Knowledge Catalog ensures central definition and enforcement of governance policies across multiple engines, and enables semantic search, data lineage, and quality checks.
  • Storage extensions: Google Cloud Lakehouse extends Cloud Storage management capabilities to include features such as Autoclass tiering and Customer-managed encryption keys (CMEK).
  • Fully managed experience: When integrated with BigQuery, Google Cloud Lakehouse uses high-throughput streaming and real-time metadata management to provide a fully managed streaming, analytics, and AI experience.
  • High availability and disaster recovery: Google Cloud Lakehouse offers options for cross-region replication and disaster recovery (Preview) to support high availability of your data.

Use cases

  • Open lakehouse: Use Cloud Storage as the storage layer, with Google Cloud Lakehouse providing the management and governance interface for Apache Iceberg data.
  • Analytical and transactional integration: Access analytical Apache Iceberg tables directly within AlloyDB for PostgreSQL (Preview) to combine analytical data with transactional workloads.
  • Unified access: Let different engines (Apache Spark, Apache Flink, BigQuery) interact with the same Apache Iceberg tables with consistent metadata.

Catalog interfaces

The Lakehouse runtime catalog is a single metadata service that provides several interfaces (endpoints) to connect your data across Cloud Storage and BigQuery. For more information, see How Google Cloud Lakehouse works.

  • Apache Iceberg REST catalog endpoint: Provides a standard REST interface for wide compatibility with open-source engines like Apache Spark, Apache Flink, and Trino. This is the recommended interface for new workloads and offers full read/write interoperability.

  • Custom Apache Iceberg catalog for BigQuery endpoint: Enables engines to interoperate directly with the BigQuery catalog. This interface is used primarily for BigQuery managed Apache Iceberg tables and existing workloads transitioning to the Google Cloud Lakehouse architecture.
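As a sketch of how an open-source engine connects to a REST catalog endpoint, the following Spark configuration uses Apache Iceberg's standard REST catalog properties. The catalog name (`lakehouse`), endpoint URI, and warehouse value are placeholders, not actual service values; consult the service's connection documentation for the real endpoint and authentication settings.

```
# spark-defaults.conf (sketch) -- register an Iceberg REST catalog named "lakehouse"
spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.type=rest
spark.sql.catalog.lakehouse.uri=https://REST_CATALOG_ENDPOINT/iceberg/v1
spark.sql.catalog.lakehouse.warehouse=WAREHOUSE_IDENTIFIER
```

With this configuration in place, Spark SQL can address tables through the catalog prefix (for example, `SELECT * FROM lakehouse.my_namespace.my_table`), and the same metadata is visible to any other engine pointed at the same endpoint.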

Interfaces and tools

You can interact with Google Cloud Lakehouse resources using the following tools:

  • Google Cloud console: Use the console to create catalogs, view catalog properties, view audit logs, and configure permissions.
  • BigQuery SQL: Use standard SQL DDL (Data Definition Language) to create and manage Apache Iceberg tables and external tables integrated with the Lakehouse runtime catalog.
  • Open source engines: Use engines such as Apache Spark, Apache Flink, and Apache Hive with the Lakehouse runtime catalog to read and write data.
  • Lakehouse runtime catalog API: Use the Apache Iceberg REST catalog endpoint to interact with the service using tools that are compatible with the open Apache Iceberg REST specification.
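To illustrate the BigQuery SQL path, the following DDL shows the general shape of creating a managed Apache Iceberg table in BigQuery. The dataset, connection, and bucket names are placeholders; the exact options available depend on your configuration.

```
-- Sketch: create a BigQuery-managed Apache Iceberg table.
-- my_dataset, my_project.us.my_connection, and my-bucket are placeholders.
CREATE TABLE my_dataset.orders (
  order_id INT64,
  customer_id INT64,
  order_date DATE
)
WITH CONNECTION `my_project.us.my_connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://my-bucket/orders'
);
```

Once created, the table's data and metadata live in Cloud Storage in the open Iceberg format, so the same table can be read by the open-source engines listed above.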

What's next