Use Cross-cloud Lakehouse

This page shows you how to query remote data and manage permissions after you have set up a Cross-cloud Lakehouse in Google Cloud Lakehouse.

Before you begin

Before you can query your data, you must complete the following:

  1. Set up a Cross-cloud Lakehouse.
  2. Ensure that you have data in a dataset within your remote catalog, for example, Databricks Unity Catalog.

Required roles

To get the permissions that you need to query federated data, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Query data

After you set up federation, you can query your remote data using standard SQL in BigQuery or Spark in Managed Service for Apache Spark. Google Cloud Lakehouse handles the metadata translation and secure data access, which lets you treat remote Apache Iceberg tables as if they were local to your Google Cloud environment.

Query from BigQuery

To query federated Apache Iceberg tables, use standard BigQuery SQL. The table path follows a 4-part structure: project.federated_catalog.namespace.table.

SELECT
  user_id,
  action,
  COUNT(*) AS total_actions
FROM `PROJECT_ID.FEDERATED_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME`
WHERE event_date >= '2026-04-01'
GROUP BY 1, 2;

Replace the following:

  • PROJECT_ID: your Google Cloud project ID.
  • FEDERATED_CATALOG_NAME: the name of the federated catalog.
  • NAMESPACE_NAME: the namespace within the catalog.
  • TABLE_NAME: the name of the table.
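If you build queries from configuration, you can assemble the four-part table path programmatically before passing the SQL to a BigQuery client. The helper below is an illustrative sketch only; the function name and the conservative identifier check are assumptions, not part of any Google Cloud API:

```python
import re

# Illustrative helper (not a Google Cloud API): builds the backtick-quoted
# four-part path used to reference federated Apache Iceberg tables.
_IDENT = re.compile(r"^[A-Za-z0-9_-]+$")  # conservative identifier check (assumption)

def federated_table_path(project_id, catalog, namespace, table):
    parts = (project_id, catalog, namespace, table)
    for part in parts:
        if not _IDENT.match(part):
            raise ValueError(f"invalid identifier: {part!r}")
    return "`{}.{}.{}.{}`".format(*parts)

# Example with hypothetical names:
query = f"""
SELECT user_id, action, COUNT(*) AS total_actions
FROM {federated_table_path('my-project', 'my_catalog', 'analytics', 'events')}
WHERE event_date >= '2026-04-01'
GROUP BY 1, 2
"""
```

You can then pass the resulting string to the BigQuery client of your choice, for example `google.cloud.bigquery.Client().query(query)`.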

Query from Managed Service for Apache Spark

You can read Cross-cloud Lakehouse data directly from Managed Service for Apache Spark clusters without managing separate AWS credentials or S3 connectors. Spark connects natively to Lakehouse using the standard Apache Iceberg REST catalog interface, instead of using the BigQuery Storage Read API.

When you configure your Managed Service for Apache Spark job with the Lakehouse REST endpoint, Lakehouse automatically provides temporary, scoped S3 credentials to Spark through the X-Iceberg-Access-Delegation: vended-credentials header. This lets Spark read the underlying data securely while adhering to your catalog's configuration.
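To make the credential-vending handshake concrete, the sketch below shows the headers an Iceberg REST client would attach when calling the catalog endpoint. This is an illustration only: the helper name is hypothetical, and how you obtain the access token is assumed, not shown:

```python
# Illustrative sketch (not a Google Cloud API): the headers an Iceberg REST
# client sends so that the catalog vends temporary, scoped storage
# credentials in its responses instead of requiring local S3 keys.
def vending_headers(project_id, access_token):
    return {
        # Standard OAuth bearer token; acquisition is assumed here.
        "Authorization": f"Bearer {access_token}",
        # Requests credential vending from the catalog.
        "X-Iceberg-Access-Delegation": "vended-credentials",
        # Attributes the request to your project for billing and quota.
        "x-goog-user-project": project_id,
    }

headers = vending_headers("my-project", "<ACCESS_TOKEN>")
```

Spark sets these same headers for you through the `header.*` catalog properties shown in the configuration step that follows.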

To configure your Managed Service for Apache Spark job with the Lakehouse REST endpoint, do the following:

  1. Configure the Spark job.

    Pass the Apache Iceberg REST catalog properties when you submit your Spark job. This maps your federated Lakehouse catalog to a local Spark catalog name (for example, my_catalog):

    gcloud dataproc jobs submit pyspark PYSPARK_SCRIPT_PATH \
      --cluster="CLUSTER_NAME" \
      --region="REGION" \
      --properties="spark.sql.catalog.SPARK_CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,\
    spark.sql.catalog.SPARK_CATALOG_NAME.type=rest,\
    spark.sql.catalog.SPARK_CATALOG_NAME.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog,\
    spark.sql.catalog.SPARK_CATALOG_NAME.warehouse=bl://projects/PROJECT_ID/catalogs/FEDERATED_CATALOG_NAME,\
    spark.sql.catalog.SPARK_CATALOG_NAME.io-impl=org.apache.iceberg.aws.s3.S3FileIO,\
    spark.sql.catalog.SPARK_CATALOG_NAME.header.X-Iceberg-Access-Delegation=vended-credentials,\
    spark.sql.catalog.SPARK_CATALOG_NAME.rest-metrics-reporting-enabled=false,\
    spark.sql.catalog.SPARK_CATALOG_NAME.header.x-goog-user-project=PROJECT_ID"
  2. Query the data in PySpark.

    This command uses Spark to read directly from the Lakehouse catalog, then runs a standard transformation.

    df = spark.table("SPARK_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME")
    df.filter(df.action == "purchase").show()

    Replace the following:

    • PYSPARK_SCRIPT_PATH: the path to your PySpark script.
    • CLUSTER_NAME: the name of your Managed Service for Apache Spark cluster.
    • REGION: the Google Cloud region. For example, us-east4.
    • PROJECT_ID: your Google Cloud project ID.
    • SPARK_CATALOG_NAME: the name you want to use for the Spark catalog.
    • FEDERATED_CATALOG_NAME: the name of the federated catalog.
    • NAMESPACE_NAME: the namespace within the catalog.
    • TABLE_NAME: the name of the table.
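If you submit jobs programmatically rather than through gcloud, you may want to build the properties string in code. The helper below is a hedged sketch that mirrors the property keys from the command above; the function name and parameter names are illustrative assumptions:

```python
# Illustrative helper (not a Google Cloud API): builds the Spark properties
# value that maps a federated Lakehouse catalog to a local Spark catalog
# name, mirroring the gcloud --properties flag above.
def iceberg_rest_properties(spark_catalog, project_id, federated_catalog):
    prefix = f"spark.sql.catalog.{spark_catalog}"
    props = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        f"{prefix}.warehouse": f"bl://projects/{project_id}/catalogs/{federated_catalog}",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        f"{prefix}.rest-metrics-reporting-enabled": "false",
        f"{prefix}.header.x-goog-user-project": project_id,
    }
    # gcloud expects a comma-separated key=value list.
    return ",".join(f"{k}={v}" for k, v in props.items())

# Example with hypothetical names:
properties = iceberg_rest_properties("my_catalog", "my-project", "my_federated_catalog")
```

The resulting string can be passed wherever your submission path expects Spark properties, for example in a Dataproc job request body.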

What's next