Use cross-cloud Lakehouse

Lakehouse for Apache Iceberg supports querying remote data through a cross-cloud Lakehouse configuration. After you configure it, you can access the data using standard SQL in BigQuery or Apache Spark in Managed Service for Apache Spark.

This page shows you how to query remote data after you have set up a cross-cloud Lakehouse.

Before you begin

Before you can query your data, you must complete the following:

  1. Set up cross-cloud Lakehouse for AWS Glue or Databricks Unity Catalog.
  2. Ensure that you have data within your remote catalog.

Required roles

To get the permissions that you need to query federated data, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Query data

After you set up federation, you can query your remote data using standard SQL in BigQuery or Apache Spark in Managed Service for Apache Spark.

Lakehouse handles the metadata translation and secure data access, which lets you treat remote Apache Iceberg tables as if they were local to your Google Cloud environment.

Query from BigQuery

To query federated Apache Iceberg tables, use standard BigQuery SQL. The table path has four parts: project.federated_catalog.namespace.table. Caching, credential vending, and CCI transit routing are handled automatically.

SELECT
  user_id,
  action,
  COUNT(*) AS total_actions
FROM `PROJECT_ID.FEDERATED_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME`
WHERE event_date >= '2026-04-01'
GROUP BY 1, 2;

Replace the following:

  • PROJECT_ID: your Google Cloud project ID.
  • FEDERATED_CATALOG_NAME: the name of the federated catalog.
  • NAMESPACE_NAME: the namespace within the catalog.
  • TABLE_NAME: the name of the table.
  • REGION: the Google Cloud region, used when you run the query with the bq command-line tool. For example, us-east4.
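
The four-part table path can also be assembled programmatically, for example when the same query runs against several federated tables. The following is a minimal Python sketch, not part of any product API: federated_table_path and build_count_query are hypothetical helper names, the column names come from the sample query above, and the identifier check is a deliberately conservative assumption (letters, digits, underscores, and hyphens) rather than the full BigQuery naming rules.

```python
import re

# Conservative pattern for illustration only (an assumption); the real naming
# rules for projects, catalogs, namespaces, and tables may allow more characters.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_-]+")
_ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def federated_table_path(project_id, catalog, namespace, table):
    """Return the backtick-quoted path project.catalog.namespace.table."""
    parts = [project_id, catalog, namespace, table]
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unexpected characters in identifier: {part!r}")
    return "`" + ".".join(parts) + "`"


def build_count_query(table_path, start_date):
    """Build the sample aggregation query from this page for a table path."""
    # Check the date format before placing the value in the SQL text.
    if not _ISO_DATE.fullmatch(start_date):
        raise ValueError(f"expected YYYY-MM-DD, got {start_date!r}")
    return (
        "SELECT user_id, action, COUNT(*) AS total_actions\n"
        f"FROM {table_path}\n"
        f"WHERE event_date >= '{start_date}'\n"
        "GROUP BY 1, 2"
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own project, catalog, namespace, and table.
    path = federated_table_path("my-project", "my_catalog", "analytics", "events")
    print(build_count_query(path, "2026-04-01"))
```

The resulting string is ordinary BigQuery SQL, so it can be submitted the same way as a hand-written query.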

You can also run the query using the bq command-line tool:

bq --location="REGION" --project_id="PROJECT_ID" query --use_legacy_sql=false \
  "SELECT * FROM \`PROJECT_ID.FEDERATED_CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME\` LIMIT 10"

Query from Managed Service for Apache Spark

Submit a PySpark batch workload to Managed Service for Apache Spark with credential vending enabled by setting the X-Iceberg-Access-Delegation header to vended-credentials. Spark then uses short-lived, scoped vended credentials to connect securely to Amazon S3, so you don't need to manage separate AWS credentials or S3 connectors.

  1. Enable outbound connectivity for Managed Service for Apache Spark.

    Managed Service for Apache Spark cannot connect to AWS S3 with its default network configuration. You must provision a Cloud Router and Cloud NAT.

    gcloud compute routers create lakehouse-router \
      --network=NETWORK_NAME \
      --region=REGION
    
    gcloud compute routers nats create lakehouse-nat \
      --router=lakehouse-router \
      --auto-allocate-nat-external-ips \
      --nat-all-subnet-ip-ranges \
      --region=REGION

    Replace the following:

    • NETWORK_NAME: the network for the Managed Service for Apache Spark batch workload (for example, default).
    • REGION: the region for the Managed Service for Apache Spark batch workload.
  2. Create a PySpark application file and run the PySpark job.

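    from pyspark.sql import SparkSession

    # The catalog settings are passed with --properties when the batch is submitted,
    # so the session only needs an application name here.
    spark = SparkSession.builder.appName("CATALOG_NAME").getOrCreate()

    # Read the federated table through the Spark catalog and print the first 10 rows.
    df = spark.table("CATALOG_NAME.NAMESPACE_NAME.TABLE_NAME")
    df.show(10, truncate=False)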

    Upload this file to Cloud Storage at PYSPARK_FILE, and then submit the batch workload:

    gcloud dataproc batches submit pyspark PYSPARK_FILE \
        --project=PROJECT_ID \
        --region=REGION \
        --version=RUNTIME_VERSION \
        --properties="\
        spark.sql.defaultCatalog=CATALOG_NAME,\
        spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,\
        spark.sql.catalog.CATALOG_NAME.type=rest,\
        spark.sql.catalog.CATALOG_NAME.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog,\
        spark.sql.catalog.CATALOG_NAME.warehouse=bl://projects/PROJECT_ID/catalogs/FEDERATED_CATALOG_NAME,\
        spark.sql.catalog.CATALOG_NAME.header.x-goog-user-project=PROJECT_ID,\
        spark.sql.catalog.CATALOG_NAME.rest.auth.type=org.apache.iceberg.gcp.auth.GoogleAuthManager,\
        spark.sql.catalog.CATALOG_NAME.io-impl=org.apache.iceberg.aws.s3.S3FileIO,\
        spark.sql.catalog.CATALOG_NAME.header.X-Iceberg-Access-Delegation=vended-credentials,\
        spark.sql.catalog.CATALOG_NAME.rest-metrics-reporting-enabled=false,\
        spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"

    Replace the following:

    • NAMESPACE_NAME: the namespace in the federated catalog.
    • TABLE_NAME: the name of the table in the federated catalog.
    • CATALOG_NAME: a name for the local Spark catalog (for example, my_catalog).
    • PYSPARK_FILE: the gs:// Cloud Storage path to your PySpark application file.
    • REGION: the region for the Managed Service for Apache Spark batch workload.
    • RUNTIME_VERSION: the Managed Service for Apache Spark runtime version (for example, 2.3).
    • PROJECT_ID: the project that is billed for using the Apache Iceberg REST catalog endpoint.
    • FEDERATED_CATALOG_NAME: the name of the federated catalog.
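
Because the --properties value repeats CATALOG_NAME on almost every line, it can be less error-prone to generate it than to edit it by hand. The following Python sketch builds the same comma-separated string as the command in this step. build_catalog_properties and to_properties_flag are hypothetical helper names, and the property keys and values are copied from that command rather than from a separate API reference.

```python
def build_catalog_properties(catalog_name, project_id, federated_catalog_name):
    """Return the Spark properties from the batch command as an ordered dict."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    warehouse = f"bl://projects/{project_id}/catalogs/{federated_catalog_name}"
    return {
        "spark.sql.defaultCatalog": catalog_name,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.header.x-goog-user-project": project_id,
        f"{prefix}.rest.auth.type": "org.apache.iceberg.gcp.auth.GoogleAuthManager",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Requests short-lived, scoped credentials from the catalog.
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        f"{prefix}.rest-metrics-reporting-enabled": "false",
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
    }


def to_properties_flag(properties):
    """Join the properties into the key=value,key=value form used by --properties."""
    return ",".join(f"{key}={value}" for key, value in properties.items())


if __name__ == "__main__":
    # Placeholder values; substitute your own catalog and project names.
    props = build_catalog_properties("my_catalog", "my-project", "my_federated_catalog")
    print(to_properties_flag(props))
```

The printed string can be pasted as the value of --properties. Keeping the settings in a dictionary also makes it easy to check that every catalog-scoped key uses the same CATALOG_NAME.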

What's next