Lakehouse for Apache Iceberg supports querying remote data and managing permissions through a cross-cloud Lakehouse configuration. Once configured, the system supports data access using standard SQL in BigQuery or Apache Spark in Managed Service for Apache Spark.
This page shows you how to query remote data and manage permissions after you have set up a cross-cloud Lakehouse.
Before you begin
Before you can query your data, you must complete the following:
- Set up cross-cloud Lakehouse.
- Ensure that you have data within your remote catalog, for example, a Databricks Unity Catalog catalog.
Required roles
To get the permissions that you need to query federated data, ask your administrator to grant you the following IAM roles on your project:
- Query data in BigQuery: BigQuery Data Viewer (roles/bigquery.dataViewer)
- Run BigQuery jobs: BigQuery Job User (roles/bigquery.jobUser)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
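As a rough illustration of what an administrator might run to grant the two roles listed above, the following Python sketch assembles one `gcloud projects add-iam-policy-binding` command per role. The helper name `grant_commands`, the project ID `my-project`, and the member `user:analyst@example.com` are hypothetical placeholders, and the sketch only prints the commands rather than running them.

```python
import shlex

# The two predefined roles named in this section.
REQUIRED_ROLES = [
    "roles/bigquery.dataViewer",  # query data in BigQuery
    "roles/bigquery.jobUser",     # run BigQuery jobs
]


def grant_commands(project_id, member):
    """Return one shell-quoted gcloud command string per required role."""
    commands = []
    for role in REQUIRED_ROLES:
        argv = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ]
        commands.append(shlex.join(argv))
    return commands


# Hypothetical project and principal, for illustration only.
commands = grant_commands("my-project", "user:analyst@example.com")
for command in commands:
    print(command)
```

Printing the commands instead of executing them lets an administrator review each binding before applying it.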
Query data
After you set up federation, you can query your remote data using standard SQL in BigQuery or Apache Spark in Managed Service for Apache Spark.
Lakehouse for Apache Iceberg handles the metadata translation and secure data access, which lets you treat remote Apache Iceberg tables as if they were local to your Google Cloud environment.
Query from BigQuery
To query federated Apache Iceberg tables, use standard BigQuery SQL. The table path follows a four-part structure: project.federated_catalog.namespace.table.
SELECT user_id, action, COUNT(*) AS total_actions
FROM `PROJECT_ID.FEDERATED_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME`
WHERE event_date >= '2026-04-01'
GROUP BY 1, 2;
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- FEDERATED_CATALOG_NAME: the name of the federated catalog.
- NAMESPACE_NAME: the namespace within the catalog.
- TABLE_NAME: the name of the table.
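If you generate queries from code, the four-part path can be assembled programmatically. The following Python sketch is illustrative only: the helper names `federated_table_path` and `build_actions_query` and the example values are hypothetical, the rejection of dots inside a path component is a simplifying assumption, and the sketch only builds the SQL text. Submitting the query to BigQuery is left out.

```python
from datetime import date


def federated_table_path(project_id, catalog, namespace, table):
    """Return the backtick-quoted four-part path for a federated table."""
    parts = [project_id, catalog, namespace, table]
    for part in parts:
        # Assumption: reject empty parts and characters that would
        # change the number of path components or break the quoting.
        if not part or "`" in part or "." in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "`" + ".".join(parts) + "`"


def build_actions_query(table_path, start_date):
    """Build the aggregation query from this section for a table path."""
    return (
        "SELECT user_id, action, COUNT(*) AS total_actions\n"
        f"FROM {table_path}\n"
        # isoformat() yields YYYY-MM-DD, so only a real date reaches the SQL.
        f"WHERE event_date >= '{start_date.isoformat()}'\n"
        "GROUP BY 1, 2;"
    )


# Hypothetical names, for illustration only.
path = federated_table_path(
    "my-project", "my_federated_catalog", "analytics", "events"
)
query = build_actions_query(path, date(2026, 4, 1))
print(query)
```

Taking the start date as a `date` object rather than a raw string keeps arbitrary text out of the generated SQL.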
Query from Managed Service for Apache Spark
You can read cross-cloud Lakehouse data directly from Managed Service for Apache Spark clusters without managing separate AWS credentials or S3 connectors. Spark connects natively to Lakehouse using the standard Apache Iceberg REST catalog interface, instead of using the BigQuery Storage Read API.
When you configure your Managed Service for Apache Spark job with the Lakehouse REST endpoint, Lakehouse automatically provides temporary, scoped S3 credentials to Spark through the X-Iceberg-Access-Delegation=vended-credentials header. This lets Spark read the underlying data securely while adhering to your catalog's configuration.
To configure your Managed Service for Apache Spark job with the Lakehouse REST endpoint:
Configure the Spark job.
When you submit your Spark job, pass the Apache Iceberg REST catalog properties. This maps your federated Lakehouse catalog to a local Spark catalog name (for example, my_catalog):
gcloud dataproc jobs submit pyspark PYSPARK_SCRIPT_PATH \
    --cluster="CLUSTER_NAME" \
    --region="REGION" \
    --properties="spark.sql.catalog.SPARK_CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,\
spark.sql.catalog.SPARK_CATALOG_NAME.type=rest,\
spark.sql.catalog.SPARK_CATALOG_NAME.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog,\
spark.sql.catalog.SPARK_CATALOG_NAME.warehouse=bl://projects/PROJECT_ID/catalogs/FEDERATED_CATALOG_NAME,\
spark.sql.catalog.SPARK_CATALOG_NAME.io-impl=org.apache.iceberg.aws.s3.S3FileIO,\
spark.sql.catalog.SPARK_CATALOG_NAME.header.X-Iceberg-Access-Delegation=vended-credentials,\
spark.sql.catalog.SPARK_CATALOG_NAME.rest-metrics-reporting-enabled=false,\
spark.sql.catalog.SPARK_CATALOG_NAME.header.x-goog-user-project=PROJECT_ID"
Query the data in PySpark.
This code uses Spark to read directly from the Lakehouse catalog, then runs a standard transformation.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.table("SPARK_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME")
df.filter(df.action == "purchase").show()
Replace the following:
- PYSPARK_SCRIPT_PATH: the path to your PySpark job.
- CLUSTER_NAME: the name of your Managed Service for Apache Spark cluster.
- REGION: the Google Cloud region. For example, us-east4.
- PROJECT_ID: your Google Cloud project ID.
- SPARK_CATALOG_NAME: the name you want to use for the Spark catalog.
- FEDERATED_CATALOG_NAME: the name of the federated catalog.
- NAMESPACE_NAME: the namespace within the catalog.
- TABLE_NAME: the name of the table.
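Because the same catalog prefix repeats in every property, it can help to generate the `--properties` value rather than type it by hand. The following Python sketch shows one way to do that under stated assumptions: the helper names `iceberg_rest_catalog_properties` and `to_gcloud_properties` and the example values are hypothetical, while the property keys and values are copied from the command in this section.

```python
def iceberg_rest_catalog_properties(spark_catalog, project_id, federated_catalog):
    """Return the Spark properties that map a federated catalog to a Spark catalog."""
    prefix = f"spark.sql.catalog.{spark_catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        f"{prefix}.warehouse": (
            f"bl://projects/{project_id}/catalogs/{federated_catalog}"
        ),
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Asks the REST catalog for temporary, scoped storage credentials.
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        f"{prefix}.rest-metrics-reporting-enabled": "false",
        f"{prefix}.header.x-goog-user-project": project_id,
    }


def to_gcloud_properties(properties):
    """Render the properties as one comma-separated key=value string."""
    return ",".join(f"{key}={value}" for key, value in properties.items())


# Hypothetical names, for illustration only.
props = iceberg_rest_catalog_properties(
    "my_catalog", "my-project", "my_federated_catalog"
)
properties_value = to_gcloud_properties(props)
print(properties_value)
```

The printed string is intended as the value of the `--properties` flag. Keeping the settings in a dictionary also makes it straightforward to inspect or reuse individual properties elsewhere.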