Lakehouse for Apache Iceberg supports querying remote data through a cross-cloud Lakehouse configuration. Once the configuration is in place, you can access the data using standard SQL in BigQuery or Apache Spark in Managed Service for Apache Spark.
This page shows you how to query remote data after you have set up a cross-cloud Lakehouse.
Before you begin
Before you can query your data, you must complete the following:
- Set up cross-cloud Lakehouse for AWS Glue or Databricks Unity Catalog.
- Ensure that you have data within your remote catalog.
Required roles
To get the permissions that you need to query federated data, ask your administrator to grant you the following IAM roles on your project:
- Query data in BigQuery: BigQuery Data Viewer (roles/bigquery.dataViewer)
- Run BigQuery jobs: BigQuery Job User (roles/bigquery.jobUser)
- Discover and read table metadata in Lakehouse catalogs: BigLake Viewer (roles/biglake.viewer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Query data
After you set up federation, you can query your remote data using standard SQL in BigQuery or Apache Spark in Managed Service for Apache Spark.
Lakehouse handles the metadata translation and secure data access, which lets you treat remote Apache Iceberg tables as if they were local to your Google Cloud environment.
Query from BigQuery
To query federated Apache Iceberg tables, use standard BigQuery
SQL. The table path has four parts:
project.federated_catalog.namespace.table. Caching, credential vending,
and CCI transit routing are handled automatically.
SELECT user_id, action, COUNT(*) AS total_actions
FROM `PROJECT_ID.FEDERATED_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME`
WHERE event_date >= '2026-04-01'
GROUP BY 1, 2;
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- FEDERATED_CATALOG_NAME: the name of the federated catalog.
- NAMESPACE_NAME: the namespace within the catalog.
- TABLE_NAME: the name of the table.
- REGION: the Google Cloud region. For example, us-east4.
You can also run the query using the bq command-line tool:
bq --location="REGION" --project_id="PROJECT_ID" query --use_legacy_sql=false \
  "SELECT * FROM \`PROJECT_ID.FEDERATED_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME\` LIMIT 10"
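If you generate these queries from a script, the four-part table path and the bq invocation can be assembled programmatically. The following Python sketch is illustrative only: the helper names `federated_table_path` and `bq_query_args` are hypothetical, and the identifier check is a deliberately conservative assumption, not the actual naming rules for projects, catalogs, namespaces, or tables.

```python
import re

# Assumption: a conservative allow-list used only for this illustration.
# The real naming rules may permit other characters.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def federated_table_path(project_id, catalog, namespace, table):
    """Return the backtick-quoted path project.catalog.namespace.table."""
    parts = [project_id, catalog, namespace, table]
    for part in parts:
        if not _SAFE_NAME.fullmatch(part):
            raise ValueError(f"unexpected characters in identifier: {part!r}")
    return "`" + ".".join(parts) + "`"


def bq_query_args(region, project_id, sql):
    """Build an argument list that mirrors the bq command shown above."""
    return [
        "bq",
        f"--location={region}",
        f"--project_id={project_id}",
        "query",
        "--use_legacy_sql=false",
        sql,
    ]


if __name__ == "__main__":
    # Hypothetical example values.
    path = federated_table_path(
        "my-project", "my_federated_catalog", "analytics", "events"
    )
    sql = f"SELECT * FROM {path} LIMIT 10"
    print(bq_query_args("us-east4", "my-project", sql))
```

Because the arguments are kept as a list, they can be passed to `subprocess.run` without a shell, so the backticks in the table path don't need the backslash escaping that the shell command requires.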
Query from Managed Service for Apache Spark
You can read cross-cloud Lakehouse data directly from Managed Service for Apache Spark clusters without managing separate AWS credentials or S3 connectors. Spark connects directly to Lakehouse using the standard Apache Iceberg REST catalog interface, instead of using the BigQuery Storage Read API.
When you configure your Managed Service for Apache Spark job with the
Lakehouse REST endpoint, Lakehouse automatically
provides temporary, scoped S3 credentials to Spark through the
X-Iceberg-Access-Delegation=vended-credentials header. This lets Spark read the
underlying data securely while adhering to your catalog's configuration.
To configure your Managed Service for Apache Spark job with the Lakehouse REST endpoint:
1. Configure the Spark job.

In this command, pass the Apache Iceberg REST catalog properties when submitting your Spark job. This maps your federated Lakehouse catalog to a local Spark catalog name (for example, my_catalog):

gcloud dataproc jobs submit pyspark PYSPARK_SCRIPT_PATH \
--cluster="CLUSTER_NAME" \
--region="REGION" \
--properties="spark.sql.catalog.SPARK_CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,\
spark.sql.catalog.SPARK_CATALOG_NAME.type=rest,\
spark.sql.catalog.SPARK_CATALOG_NAME.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog,\
spark.sql.catalog.SPARK_CATALOG_NAME.warehouse=bl://projects/PROJECT_ID/catalogs/FEDERATED_CATALOG_NAME,\
spark.sql.catalog.SPARK_CATALOG_NAME.io-impl=org.apache.iceberg.aws.s3.S3FileIO,\
spark.sql.catalog.SPARK_CATALOG_NAME.header.X-Iceberg-Access-Delegation=vended-credentials,\
spark.sql.catalog.SPARK_CATALOG_NAME.rest-metrics-reporting-enabled=false,\
spark.sql.catalog.SPARK_CATALOG_NAME.header.x-goog-user-project=PROJECT_ID"

2. Query the data in PySpark.

This command uses Spark to read directly from the Lakehouse catalog, then runs a standard transformation.

df = spark.table("SPARK_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME")
df.filter(df.action == "purchase").show()
Replace the following:
- PYSPARK_SCRIPT_PATH: the path to your PySpark job.
- CLUSTER_NAME: the name of your Managed Service for Apache Spark cluster.
- REGION: the Google Cloud region. For example, us-east4.
- PROJECT_ID: your Google Cloud project ID.
- SPARK_CATALOG_NAME: the name you want to use for the Spark catalog.
- FEDERATED_CATALOG_NAME: the name of the federated catalog.
- NAMESPACE_NAME: the namespace within the catalog.
- TABLE_NAME: the name of the table.
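The long --properties value is easy to mistype, so it can help to generate it. The following Python sketch builds the same catalog properties from the three placeholder values. The function names `iceberg_rest_catalog_properties` and `to_properties_flag` are hypothetical, and the property keys and values are copied from the command on this page rather than from a complete reference of supported settings.

```python
def iceberg_rest_catalog_properties(spark_catalog, project_id, federated_catalog):
    """Return the Spark catalog properties from this page as an ordered dict."""
    prefix = f"spark.sql.catalog.{spark_catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        f"{prefix}.warehouse": (
            f"bl://projects/{project_id}/catalogs/{federated_catalog}"
        ),
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        f"{prefix}.rest-metrics-reporting-enabled": "false",
        f"{prefix}.header.x-goog-user-project": project_id,
    }


def to_properties_flag(props):
    """Join the properties into a single comma-separated --properties flag."""
    for key, value in props.items():
        # A comma inside a key or value would be read as a separator.
        if "," in key or "," in value:
            raise ValueError(f"comma not allowed in property: {key}={value}")
    return "--properties=" + ",".join(f"{k}={v}" for k, v in props.items())


if __name__ == "__main__":
    # Hypothetical example values.
    props = iceberg_rest_catalog_properties(
        "my_catalog", "my-project", "my_federated_catalog"
    )
    print(to_properties_flag(props))
```

Keeping the properties in a dictionary also makes it possible to apply the same settings inside a PySpark script, for example by calling `SparkSession.builder.config(key, value)` for each entry, although this page only describes passing them at job submission.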