Introduction to BigLake metastore

BigLake metastore is a unified, managed, serverless, and scalable metastore that connects lakehouse data stored in Google Cloud to multiple runtimes, including open source engines (like Apache Spark) and BigQuery. It provides the foundation that you need to build an open, managed, and high-performance lakehouse with automated data management and built-in governance using key open source table formats, such as Apache Iceberg.

BigLake metastore provides a single source of truth for metadata from multiple sources, removing the need to copy and synchronize data and metadata between different analytical systems and repositories with customized tools. It also supports storage access delegation models, such as credential vending, which eliminates the need for catalog users to have direct access to Cloud Storage buckets.

For workflows that use BigLake Iceberg tables in BigQuery, BigLake metastore is also supported with Dataplex Universal Catalog, which provides unified and fine-grained access controls across all supported engines and enables end-to-end governance that includes comprehensive lineage, data quality, and discoverability capabilities.

BigLake metastore can be configured in one of two ways: with the Iceberg REST catalog or with the custom Iceberg catalog for BigQuery. The best option depends on your use case:

  • New BigLake metastore users who want their open source engine to access data in Cloud Storage, and who need interoperability with other engines (including BigQuery and AlloyDB), should use the Iceberg REST catalog.
  • Existing BigLake metastore users who already have tables in the custom Iceberg catalog for BigQuery should continue using the custom Iceberg catalog for those tables, but use the Iceberg REST catalog for new workflows. Tables created with the custom Iceberg catalog for BigQuery are visible through the Iceberg REST catalog via federation.
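As an illustration of the Iceberg REST catalog option, a self-managed Spark session might be pointed at BigLake metastore with configuration along the following lines. This is a minimal sketch: the endpoint URI, the catalog name `blms`, the header property, and the project and warehouse values are assumptions for illustration, so verify the exact property names and endpoint against the current BigLake documentation.

```python
# Hypothetical sketch of connecting Spark to BigLake metastore's Iceberg
# REST catalog. All names and URIs below are illustrative placeholders.
def build_spark_session(project_id: str, warehouse: str):
    # Import inside the function so this sketch can be defined without
    # PySpark installed; a real job would import at module level.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("biglake-iceberg-rest")
        # Register an Iceberg catalog named "blms" backed by a REST catalog.
        .config("spark.sql.catalog.blms", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.blms.type", "rest")
        # Assumed BigLake Iceberg REST endpoint; check the current docs.
        .config("spark.sql.catalog.blms.uri",
                "https://biglake.googleapis.com/iceberg/v1/restcatalog")
        # The Cloud Storage bucket you own, for example "gs://my-bucket".
        .config("spark.sql.catalog.blms.warehouse", warehouse)
        # Assumed property for the billing/quota project header.
        .config("spark.sql.catalog.blms.header.x-goog-user-project", project_id)
        .getOrCreate()
    )
```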

Key capabilities

BigLake metastore offers several major advantages for data management and analysis:

  • Serverless architecture. BigLake metastore provides a serverless architecture, eliminating the need to manage servers or clusters. This reduces operational overhead, simplifies deployment, and allows for automatic scaling based on demand.
  • Engine interoperability with open APIs. BigLake metastore interoperates with open source and third-party engines, providing direct table access from open source engines (such as Spark and Flink) and BigQuery, so you can query open-format tables without additional connection steps. This streamlines your analytics workflow and reduces the need for complex data movement or ETL processes.
  • Unified user experience. BigLake metastore provides a unified workflow across open source engines and BigQuery. For example, you can configure Spark environments, whether self-hosted, hosted by Dataproc, or running in a BigQuery notebook, through the same Iceberg REST catalog.
  • High-performance analytics, streaming, and AI with BigQuery. BigLake metastore lets you store Iceberg data in your own Cloud Storage buckets and leverage the highly scalable, real-time metadata management capabilities of BigQuery. This architecture equips you with the openness and data ownership of Cloud Storage, as well as the fully managed streaming, analytics, and AI capabilities of BigQuery.
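To make the interoperability concrete, the following sketch shows an Iceberg table created from Spark and then read from BigQuery. The catalog name `blms`, the namespace and table names, and the assumption that the namespace surfaces as a BigQuery dataset are all illustrative, not confirmed by this document.

```python
# Hypothetical sketch: write an Iceberg table from Spark through the
# Iceberg REST catalog, then query the same table from BigQuery.
# The catalog name "blms" and the table names are placeholders.
def create_table_from_spark(spark):
    # Create a namespace and an Iceberg table in BigLake metastore.
    spark.sql("CREATE NAMESPACE IF NOT EXISTS blms.sales")
    spark.sql(
        "CREATE TABLE IF NOT EXISTS blms.sales.orders "
        "(order_id BIGINT, amount DOUBLE) USING iceberg"
    )

# Assuming the namespace is visible as a dataset, BigQuery could then
# query the table directly, with no extra connection steps:
BQ_QUERY = "SELECT order_id, amount FROM sales.orders LIMIT 10"
```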

Differences from BigLake metastore (classic)

BigLake metastore is the recommended metastore on Google Cloud, while BigLake metastore (classic) is considered a legacy feature.

The core differences between BigLake metastore and BigLake metastore (classic) include the following:

  • BigLake metastore supports a direct integration with open source engines like Spark, which helps reduce redundancy when you store metadata and run jobs. Tables in BigLake metastore are directly accessible from multiple open source engines and BigQuery.
  • BigLake metastore supports the Iceberg REST catalog, while BigLake metastore (classic) does not.

BigLake metastore limitations

The following limitations apply to tables in BigLake metastore:

  • You can't create or modify BigLake Iceberg tables with BigQuery data definition language (DDL) or data manipulation language (DML) statements. You can modify BigLake Iceberg tables using the BigQuery API (with the bq command-line tool or client libraries), but doing so risks making changes that are incompatible with the external engine.
  • BigLake metastore tables don't support renaming operations or the ALTER TABLE ... RENAME TO Spark SQL statement.
  • BigLake metastore tables in BigQuery are subject to the same quotas and limits as standard tables.
  • Query performance for BigLake metastore tables from the BigQuery engine might be slower than for standard BigQuery tables. In general, query speed is comparable to reading the data directly from Cloud Storage.
  • A BigQuery dry run of a query that uses a BigLake metastore table might report a lower bound of 0 bytes of data, even if rows are returned. This result occurs because the amount of data that is processed from the table can't be determined until the full query is run. Running the query incurs a cost for processing this data.
  • You can't reference a BigLake metastore table in a wildcard table query.
  • You can't use the tabledata.list method to retrieve data from BigLake metastore tables. Instead, you can save query results to a BigQuery table, and then use the tabledata.list method on that table.
  • BigLake metastore tables don't support clustering.
  • BigLake metastore tables don't support flexible column names.
  • Table storage statistics aren't displayed for BigLake metastore tables.
  • BigLake metastore doesn't support Iceberg views.
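The tabledata.list workaround described in the list above can be sketched with the BigQuery Python client, whose list_rows method wraps tabledata.list. This is a sketch under assumptions: the table IDs are placeholders, and you need appropriate credentials and an existing destination dataset.

```python
# Hypothetical sketch of the tabledata.list workaround: materialize query
# results into a standard BigQuery table, then page through that table.
def read_rows_via_destination(sql: str, destination_table_id: str):
    # Import inside the function so this sketch can be defined without
    # the google-cloud-bigquery package installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    # Save the query results to a standard table, for example
    # "my-project.my_dataset.my_results" (placeholder ID).
    job_config = bigquery.QueryJobConfig(destination=destination_table_id)
    client.query(sql, job_config=job_config).result()
    # list_rows calls tabledata.list under the hood, which works on the
    # standard destination table even though it would fail on the
    # BigLake metastore table itself.
    return list(client.list_rows(destination_table_id))
```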

What's next