Set up cross-cloud Lakehouse for AWS Glue

This document describes how to set up a cross-cloud Lakehouse for Apache Iceberg to query data from an AWS Glue catalog directly within Google Cloud. This capability unifies your data analytics by integrating your external data sources with your existing Google Cloud environment.

Afterward, you can use Lakehouse to manage access to your federated data.

Before you begin

  1. Review the Lakehouse overview to understand how Lakehouse manages access to data.
  2. Read About cross-cloud Lakehouse to understand how it works.
  3. Review the supported catalogs to verify table format requirements and supported configurations.
  4. Make sure that your AWS Administrator has permissions to create Identity and Access Management (IAM) roles and configure permissions policies.
  5. Optional: If you plan to route queries over a private interconnect between your Google Cloud VPC and your remote cloud provider's VPC (for example, AWS), make sure that you have an active account with your remote provider, provision a Cross-Cloud Interconnect or Partner Interconnect, establish BGP sessions with your Cloud Router, and verify that you have the required IAM permissions in both cloud environments.
  6. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  7. Verify that billing is enabled for your Google Cloud project.

  8. Enable the BigLake API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

Required roles

To get the permissions that you need to set up cross-cloud Lakehouse, ask your administrator to grant you the following IAM roles on your project:

  • Manage Lakehouse catalogs: BigLake Admin (roles/biglake.admin)
  • Route traffic over private interconnect: Compute Network Admin (roles/compute.networkAdmin), Service Directory Viewer (roles/servicedirectory.viewer), and Service Directory PSC Authorized Service (roles/servicedirectory.pscAuthorizedService)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Limitations and considerations

This section lists the limitations and considerations for using cross-cloud Lakehouse.

  • Supported cloud providers: Using a private interconnect with your cross-cloud Lakehouse is supported with Amazon Web Services (AWS) as the remote cloud provider. You can use either a Cross-Cloud Interconnect (CCI) or a Partner Interconnect.
  • Network routing: If a private interconnect (such as a customer-owned Cross-Cloud Interconnect or a Partner Interconnect) is not configured, queries route over the public internet. This might result in higher egress fees from your remote cloud provider and less predictable performance.
  • Data freshness: The --refresh-interval flag for the federated catalog determines how often metadata is synchronized. A shorter interval provides fresher data but might incur additional API costs from the remote catalog provider.
  • Iceberg Metrics Reporting: Iceberg Metrics Reporting isn't available for federated catalogs. Set the rest-metrics-reporting-enabled property to false in your Iceberg client when accessing a federated catalog.
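
As an illustration of the metrics reporting setting, the following is a minimal sketch that assumes your Iceberg client is Spark, where catalog properties are passed as spark.sql.catalog.CATALOG.PROPERTY. The helper name metrics_off_conf and the catalog name my_federated_catalog are hypothetical; other Iceberg clients set the same property through their own configuration mechanism.

```shell
# Hypothetical helper (not part of any CLI): prints the Spark --conf argument
# that sets rest-metrics-reporting-enabled to false for one catalog.
# Assumption: $1 is the catalog name used in your Spark session.
metrics_off_conf() {
  printf '%s\n' "--conf spark.sql.catalog.$1.rest-metrics-reporting-enabled=false"
}

# Example with a placeholder catalog name; replace it with your own.
metrics_off_conf my_federated_catalog
```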

General workflow

To set up and use cross-cloud Lakehouse, follow these general steps:

  • Set up Cross-Cloud Interconnect (Optional): Configure a private connection between your Google Cloud VPC and your remote cloud provider.
  • Set up federation: Configure authentication by creating an IAM role with a placeholder trust policy in your remote provider's account. Then, create a federated catalog in Lakehouse and update the trust policy.
  • Verify the connection: Make sure that Lakehouse can successfully connect to your remote catalog.
  • Query data: Run queries against your federated data by using BigQuery or Managed Service for Apache Spark. For more information, see Use cross-cloud Lakehouse.
  • Configure permissions: Use IAM to manage who can view and query the federated data.

Set up Cross-Cloud Interconnect (Optional)

Queries to your remote catalog travel over the public internet by default. To help enhance security and compliance, provide predictable performance, and reduce data transfer costs, use a private interconnect. This establishes a dedicated, private network connection between your Google Cloud Virtual Private Cloud (VPC) and your remote cloud provider's network (for example, AWS).

You can provision and configure either of the following private interconnect options between your Google Cloud VPC and your remote cloud provider's VPC (for example, AWS):

  • Cross-Cloud Interconnect
  • Partner Interconnect

Then, establish BGP sessions between your Cloud Router in Google Cloud and your remote cloud provider's VPC so that both sides exchange routes.

To enable private querying, you must configure a path from Lakehouse to your remote storage bucket (for example, an AWS Amazon S3 bucket) through your private interconnect. There are two architectural flows you can follow to configure this routing:

  • Internal Load Balancer (ILB) routing (Recommended): This flow uses a Google Cloud Internal Load Balancer to distribute requests across Hybrid Connectivity Network Endpoint Groups (NEGs) pointing to multiple AWS Elastic Network Interfaces (ENIs). Use this flow when you need load balancing, scalability, and high availability.
  • Direct endpoint routing: This flow connects Service Directory directly to a single AWS Interface VPC Endpoint IP address.

Select the configuration flow that matches your architecture requirements:

Internal Load Balancer

To configure an Internal Load Balancer (ILB) to distribute requests across multiple AWS ENIs for high availability and load balancing, follow these steps:

Configure AWS networking

First, create an Amazon S3 VPC Interface Endpoint (AWS PrivateLink):

  1. In the AWS VPC console, create an Interface Endpoint for Amazon S3.
  2. For the service name, specify com.amazonaws.<var>AWS_REGION</var>.s3.
  3. Select the VPC and subnets that are connected through Direct Connect to your Google Cloud VPC.
  4. Attach Security Groups to the endpoint to control inbound access.
  5. After AWS creates the endpoint, it provisions Elastic Network Interfaces (ENIs) in each selected subnet. Note the private IP addresses of these ENIs.

Next, configure Security Groups:

  • Ensure that the Security Group or groups attached to the Amazon S3 Endpoint ENIs allow inbound TCP traffic on port 443 from the relevant IP ranges of your Google Cloud VPC.

Configure Google Cloud networking

First, create Hybrid Connectivity Network Endpoint Groups (NEGs) for each AWS Amazon S3 ENI private IP address. These NEGs are zonal.

# Repeat for each AWS ENI IP address
gcloud compute network-endpoint-groups create NEG_NAME_1 \
    --project=PROJECT_ID \
    --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT \
    --zone=GCP_ZONE \
    --network=VPC_NETWORK \
    --default-port=443

gcloud compute network-endpoint-groups update NEG_NAME_1 \
    --project=PROJECT_ID \
    --zone=GCP_ZONE \
    --add-endpoint="ip=AWS_ENI_IP_1,port=443"

# Example for a second ENI
gcloud compute network-endpoint-groups create NEG_NAME_2 \
    --project=PROJECT_ID \
    --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT \
    --zone=GCP_ZONE \
    --network=VPC_NETWORK \
    --default-port=443

gcloud compute network-endpoint-groups update NEG_NAME_2 \
    --project=PROJECT_ID \
    --zone=GCP_ZONE \
    --add-endpoint="ip=AWS_ENI_IP_2,port=443"

Replace the following:

  • NEG_NAME_1, NEG_NAME_2: unique identifiers for your NEGs.
  • PROJECT_ID: your Google Cloud project ID.
  • GCP_ZONE: the Google Cloud zone. For example, us-east4-a.
  • VPC_NETWORK: the Google Cloud VPC network name.
  • AWS_ENI_IP_1, AWS_ENI_IP_2: the private IP addresses of your AWS ENIs.
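
If you have several ENIs, the "repeat for each AWS ENI IP address" step can be scripted. The following is a minimal dry-run sketch: the helper name print_neg_commands and the PREFIX-N naming scheme are assumptions for illustration. It only prints the same two gcloud commands shown above for each IP address, so you can review them before running anything.

```shell
# Hypothetical helper (not part of gcloud): prints, without running them,
# the create and update commands for one hybrid NEG per ENI IP address.
# Usage: print_neg_commands PROJECT_ID GCP_ZONE VPC_NETWORK NEG_PREFIX IP...
print_neg_commands() {
  project=$1; zone=$2; network=$3; prefix=$4
  shift 4
  i=1
  for ip in "$@"; do
    neg="${prefix}-${i}"   # assumed naming scheme: PREFIX-1, PREFIX-2, ...
    echo "gcloud compute network-endpoint-groups create ${neg} --project=${project} --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT --zone=${zone} --network=${network} --default-port=443"
    echo "gcloud compute network-endpoint-groups update ${neg} --project=${project} --zone=${zone} --add-endpoint=\"ip=${ip},port=443\""
    i=$((i + 1))
  done
}

# Example with placeholder values and sample IP addresses; replace them with your own.
print_neg_commands PROJECT_ID us-east4-a VPC_NETWORK s3-neg 10.0.1.45 10.0.2.45
```

Printing the commands instead of running them keeps the sketch safe to try; pipe the output to sh only after you have checked it.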

Next, configure the Internal Load Balancer (ILB):

  1. Create a health check:

    gcloud compute health-checks create tcp HEALTH_CHECK_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --port=443

    Replace the following:

    • HEALTH_CHECK_NAME: a unique identifier for your health check.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  2. Create a regional backend service:

    gcloud compute backend-services create BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --load-balancing-scheme=INTERNAL \
        --protocol=TCP \
        --region=REGION \
        --health-checks=HEALTH_CHECK_NAME \
        --health-checks-region=REGION

    Replace the following:

    • BACKEND_SERVICE_NAME: a unique identifier for your backend service.
  3. Add the NEGs to the backend service:

    gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --network-endpoint-group=NEG_NAME_1 \
        --network-endpoint-group-zone=GCP_ZONE \
        --balancing-mode=CONNECTION
    
    gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --network-endpoint-group=NEG_NAME_2 \
        --network-endpoint-group-zone=GCP_ZONE \
        --balancing-mode=CONNECTION

    Replace the following:

    • NEG_NAME_1, NEG_NAME_2: the names of your zonal NEGs created in the previous step.
    • GCP_ZONE: the Google Cloud zone where your NEGs are located. For example, us-east4-a.
  4. Create a forwarding rule:

    gcloud compute forwarding-rules create FORWARDING_RULE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --load-balancing-scheme=INTERNAL \
        --network=VPC_NETWORK \
        --subnet=GCP_SUBNET \
        --ip-protocol=TCP \
        --ports=443 \
        --backend-service=BACKEND_SERVICE_NAME \
        --backend-service-region=REGION

    After creation, note the internal IP address assigned to the forwarding rule. This is your ILB_IP_ADDRESS.

    Replace the following:

    • FORWARDING_RULE_NAME: a unique identifier for your forwarding rule.
    • VPC_NETWORK: the Google Cloud VPC network name.
    • GCP_SUBNET: the Google Cloud subnet name within your VPC network where the ILB will be provisioned.
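
To note the forwarding rule's internal IP address without opening the console, you can read it back with gcloud. The following is a small sketch; the wrapper function get_ilb_ip is a hypothetical name, and it assumes the forwarding rule was created in the same project and region used above.

```shell
# Hypothetical wrapper: prints the internal IP address assigned to a
# regional forwarding rule by reading its IPAddress field.
# Usage: get_ilb_ip PROJECT_ID REGION FORWARDING_RULE_NAME
get_ilb_ip() {
  gcloud compute forwarding-rules describe "$3" \
    --project="$1" \
    --region="$2" \
    --format="value(IPAddress)"
}

# Example (requires an authenticated gcloud CLI):
#   ILB_IP_ADDRESS=$(get_ilb_ip PROJECT_ID us-east4 FORWARDING_RULE_NAME)
```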

Configure Service Directory

Register the ILB's IP address in Service Directory, so Lakehouse can discover it.

  1. Create a namespace for your remote cloud:

    gcloud service-directory namespaces create NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • NAMESPACE: a unique identifier for your namespace.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  2. Create a service in the Service Directory namespace:

    gcloud service-directory services create SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • SERVICE_NAME: a unique identifier for your service.
  3. Create an endpoint for the ILB in the service:

    gcloud service-directory endpoints create ENDPOINT_NAME \
        --project=PROJECT_ID \
        --namespace=NAMESPACE \
        --service=SERVICE_NAME \
        --location=REGION \
        --network=projects/PROJECT_NUMBER/global/networks/VPC_NETWORK \
        --address=ILB_IP_ADDRESS \
        --port=443

    Replace the following:

    • ENDPOINT_NAME: a unique identifier for your endpoint.
    • PROJECT_NUMBER: your Google Cloud project number. Use your project number in the --network flag.
    • ILB_IP_ADDRESS: the internal IP address of your ILB forwarding rule.

Direct endpoint

To configure Service Directory to route traffic directly to a single AWS Interface VPC Endpoint IP address, follow these steps:

  1. Create an Interface VPC Endpoint for Amazon S3 inside your AWS VPC. Note the IP address and port of this endpoint.
  2. Create a namespace for your remote cloud:

    gcloud service-directory namespaces create NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • NAMESPACE: a unique identifier for your namespace.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  3. Create a service in the Service Directory namespace:

    gcloud service-directory services create SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • SERVICE_NAME: a unique identifier for your service.
  4. Create an endpoint in the service containing the routing information for your Amazon S3 Interface VPC Endpoint:

    gcloud service-directory endpoints create ENDPOINT_NAME \
        --service=SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION \
        --address=S3_VPCE_IP_ADDRESS \
        --port=S3_VPCE_PORT \
        --network=projects/PROJECT_NUMBER/global/networks/VPC_NETWORK

    Replace the following:

    • ENDPOINT_NAME: a unique identifier for your endpoint.
    • S3_VPCE_IP_ADDRESS: the IP address of your Amazon S3 Interface VPC Endpoint. For example, 10.0.1.45.
    • S3_VPCE_PORT: the port number of your Amazon S3 Interface VPC Endpoint. For example, 443.
    • PROJECT_NUMBER: your Google Cloud project number. Use your project number in the --network flag.
    • VPC_NETWORK: the Google Cloud VPC network name associated with your private interconnect.
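
The --network flag expects the project number, not the project ID. The following sketch shows one way to look up the number and assemble the flag value; the function names get_project_number and network_path are hypothetical helpers for illustration.

```shell
# Hypothetical helper: looks up the numeric project number for a project ID.
# Usage: get_project_number PROJECT_ID
get_project_number() {
  gcloud projects describe "$1" --format="value(projectNumber)"
}

# Hypothetical helper: builds the --network value from a project number
# and a VPC network name.
# Usage: network_path PROJECT_NUMBER VPC_NETWORK
network_path() {
  printf 'projects/%s/global/networks/%s\n' "$1" "$2"
}

# Example (requires an authenticated gcloud CLI):
#   network_path "$(get_project_number PROJECT_ID)" VPC_NETWORK
```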

Set up cross-cloud federation

To query your data, set up a Lakehouse federated catalog that connects to your remote AWS catalog.

Create the AWS IAM role with a placeholder trust policy

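Lakehouse provisions a Google service account ID only after you create the catalog, so the final trust policy values aren't available yet. Create the AWS IAM role with a placeholder trust policy first, and update the policy after you create the catalog.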

  1. Create a file named trust_policy.json:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "accounts.google.com"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "accounts.google.com:aud": [
                "PLACEHOLDER_VALUE"
              ],
              "accounts.google.com:sub": [
                "PLACEHOLDER_VALUE"
              ]
            }
          }
        }
      ]
    }
  2. Run the AWS CLI command to create the role with the placeholder trust policy. We recommend setting the maximum session duration to 12 hours (43200 seconds) to prevent credential expiration during long-running jobs:

    aws iam create-role \
      --role-name AWS_ROLE_NAME \
      --assume-role-policy-document file://trust_policy.json \
      --max-session-duration 43200

    Replace the following:

    • AWS_ROLE_NAME: a name for your AWS IAM role. For example, biglake_glue_federation_role.

Attach a permissions policy

Attach a permissions policy to your IAM role that lets Lakehouse access the AWS region's Glue Data Catalog and S3 buckets. This policy also grants access to S3 table buckets if you use the AWS Lake Formation S3 Tables integration. If you have upgraded from the default AWS Glue data permissions to the AWS Lake Formation model, you might need to grant additional permissions in Lake Formation instead.

  1. Create a file named permissions_policy.json with the following policy configuration.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "GlueRead",
          "Effect": "Allow",
          "Action": [
            "glue:GetCatalog",
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:GetTable",
            "glue:GetTables"
          ],
          "Resource": "arn:aws:glue:AWS_REGION:AWS_ACCOUNT_ID:*"
        },
        {
          "Sid": "S3Read",
          "Effect": "Allow",
          "Action": [
            "s3:ListBucket",
            "s3:GetObject"
          ],
          "Resource": [
            "arn:aws:s3:::*"
          ]
        },
        {
          "Sid": "S3TablesRead",
          "Effect": "Allow",
          "Action": [
            "s3tables:GetTableBucket",
            "s3tables:ListNamespaces",
            "s3tables:GetNamespace",
            "s3tables:ListTables",
            "s3tables:GetTable",
            "s3tables:GetTableMetadataLocation",
            "s3tables:GetTableData"
          ],
          "Resource": [
            "arn:aws:s3tables:AWS_REGION:AWS_ACCOUNT_ID:*"
          ]
        }
      ]
    }
  2. Attach this permissions policy to your IAM role:

    aws iam put-role-policy \
      --role-name AWS_ROLE_NAME \
      --policy-name AWS_POLICY_NAME \
      --policy-document file://permissions_policy.json

    Replace the following:

    • AWS_ROLE_NAME: the name of your AWS IAM role. For example, biglake_glue_federation_role.
    • AWS_POLICY_NAME: a name for your permissions policy. For example, biglake_glue_permissions.
    • AWS_REGION: the AWS region where your Glue catalog or S3 Tables reside. For example, us-east-1.
    • AWS_ACCOUNT_ID: your 12-digit AWS account ID string. For example, 123456789012.
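
Because AWS_REGION and AWS_ACCOUNT_ID appear inside the JSON file, they must be replaced in the file itself before you attach the policy. The following is a minimal sketch using sed; the function name render_policy and the template file name in the usage comment are assumptions, not part of the AWS CLI.

```shell
# Hypothetical helper: reads a policy template on stdin and writes it to
# stdout with the AWS_REGION and AWS_ACCOUNT_ID placeholders filled in.
# Usage: render_policy AWS_REGION AWS_ACCOUNT_ID < template.json > permissions_policy.json
render_policy() {
  region=$1; account=$2
  # Reject an account ID that is empty or contains a non-digit character.
  case $account in
    ''|*[!0-9]*) echo "account ID must be 12 digits" >&2; return 1 ;;
  esac
  # Reject an account ID that is not exactly 12 characters long.
  if [ "${#account}" -ne 12 ]; then
    echo "account ID must be 12 digits" >&2
    return 1
  fi
  sed -e "s/AWS_REGION/${region}/g" -e "s/AWS_ACCOUNT_ID/${account}/g"
}

# Example on a single line of the policy, using the sample values from this page.
echo '"Resource": "arn:aws:glue:AWS_REGION:AWS_ACCOUNT_ID:*"' \
  | render_policy us-east-1 123456789012
```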

Create a federated catalog

Establish the federated catalog on Google Cloud by using the gcloud CLI or REST API.

To prevent premature metadata synchronization failures while AWS trust relationships are propagating, initialize the catalog without specifying a refresh schedule (which defaults to 0s).

Google Cloud CLI

Public internet (no CCI)

If you don't configure CCI, the connection securely travels over the public internet.

gcloud biglake iceberg catalogs create FEDERATED_CATALOG_NAME \
    --project="PROJECT_ID" \
    --primary-location="REGION" \
    --catalog-type="federated" \
    --federated-catalog-type="glue" \
    --glue-warehouse="GLUE_OR_S3_TABLE_BUCKET_WAREHOUSE" \
    --glue-aws-region="AWS_REGION" \
    --glue-aws-role-arn="arn:aws:iam::AWS_ACCOUNT_ID:role/AWS_ROLE_NAME"

Customer-owned (CCI)

If you configured a private interconnect (such as Dedicated CCI or Partner Interconnect), provide the Service Directory service reference to make sure that Lakehouse routes traffic privately.

gcloud biglake iceberg catalogs create FEDERATED_CATALOG_NAME \
    --project="PROJECT_ID" \
    --primary-location="REGION" \
    --catalog-type="federated" \
    --federated-catalog-type="glue" \
    --glue-warehouse="GLUE_OR_S3_TABLE_BUCKET_WAREHOUSE" \
    --glue-aws-region="AWS_REGION" \
    --glue-aws-role-arn="arn:aws:iam::AWS_ACCOUNT_ID:role/AWS_ROLE_NAME" \
    --service-directory-name="projects/PROJECT_ID/locations/REGION/namespaces/NAMESPACE/services/SERVICE_NAME"

REST

curl -s -X POST \
  -H "x-goog-user-project: PROJECT_ID" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://biglake.googleapis.com/iceberg/v1/restcatalog/extensions/projects/PROJECT_ID/catalogs?iceberg_catalog_id=FEDERATED_CATALOG_NAME&primary_location=REGION" \
  -d '{
    "catalog_type": "CATALOG_TYPE_FEDERATED",
    "storage_regions": ["'"REGION"'"],
    "federated_catalog_options": {
      "glue_catalog_info": {
        "warehouse": "'"GLUE_OR_S3_TABLE_BUCKET_WAREHOUSE"'",
        "aws_region": "'"AWS_REGION"'",
        "aws_role_arn": "arn:aws:iam::'"AWS_ACCOUNT_ID"':role/'"AWS_ROLE_NAME"'"
      }
    }
  }'

Replace the following:

  • FEDERATED_CATALOG_NAME: a name for your federated catalog.
  • PROJECT_ID: your Google Cloud project ID.
  • REGION: the Lakehouse region where the federated catalog is created. For example, us-east4.
  • GLUE_OR_S3_TABLE_BUCKET_WAREHOUSE: your target warehouse catalog identifier. For an AWS region's Glue Data Catalog, enter your 12-digit AWS account ID string. For example, 123456789012. To use an S3 table bucket in the region, enter AWS_ACCOUNT_ID:s3tablescatalog/S3_TABLE_BUCKET. For example, 123456789012:s3tablescatalog/my-table-bucket.
  • AWS_ACCOUNT_ID: your 12-digit AWS account ID string. For example, 123456789012.
  • AWS_REGION: the AWS region where your Glue catalog or S3 table bucket resides. For example, us-east-1.
  • AWS_ROLE_NAME: the name of your AWS IAM role. For example, biglake_glue_federation_role.
  • NAMESPACE: (Optional) the Service Directory namespace you created during private interconnect setup.
  • SERVICE_NAME: (Optional) the Service Directory service name you created during private interconnect setup.
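
Several of these values are composite strings. The following sketch assembles them from their parts so that the formats described above are easy to check; all four function names are hypothetical helpers, and the sample values are the examples used on this page.

```shell
# Warehouse for an AWS region's Glue Data Catalog: the account ID on its own.
glue_warehouse() { printf '%s\n' "$1"; }

# Warehouse for an S3 table bucket: AWS_ACCOUNT_ID:s3tablescatalog/S3_TABLE_BUCKET.
s3tables_warehouse() { printf '%s:s3tablescatalog/%s\n' "$1" "$2"; }

# IAM role ARN in the form passed to --glue-aws-role-arn.
role_arn() { printf 'arn:aws:iam::%s:role/%s\n' "$1" "$2"; }

# Service Directory reference in the form passed to --service-directory-name.
# Arguments: PROJECT_ID REGION NAMESPACE SERVICE_NAME
service_directory_name() {
  printf 'projects/%s/locations/%s/namespaces/%s/services/%s\n' "$1" "$2" "$3" "$4"
}

# Examples with the sample values from this page.
s3tables_warehouse 123456789012 my-table-bucket
role_arn 123456789012 biglake_glue_federation_role
```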

Update the trust policy

When the catalog is created, Lakehouse provisions a unique service account for it, which is returned as the biglake-service-account-id field in the catalog creation response. You use this service account to establish the trust relationship.

  1. Run the following command to extract the biglake-service-account-id value into an active bash variable:

    BIGLAKE_SA_ID=$(gcloud biglake iceberg catalogs describe FEDERATED_CATALOG_NAME \
      --project="PROJECT_ID" \
      --format="value(biglake-service-account-id)")
  2. Update your AWS IAM role's trust policy to replace the placeholder with the service account ID that you retrieved in the previous step. The condition block requires both the sub and aud claims to match this ID. Write the policy to a file named trust_policy_comprehensive.json:

    cat > trust_policy_comprehensive.json << EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "accounts.google.com"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "accounts.google.com:aud": [
                "$BIGLAKE_SA_ID"
              ],
              "accounts.google.com:sub": [
                "$BIGLAKE_SA_ID"
              ]
            }
          }
        }
      ]
    }
    EOF
  3. Apply the finalized policy to your AWS role:

    aws iam update-assume-role-policy \
      --role-name AWS_ROLE_NAME \
      --policy-document file://trust_policy_comprehensive.json
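
The heredoc in step 2 uses an unquoted EOF delimiter, so the shell expands $BIGLAKE_SA_ID when it writes the file. If the variable is empty, the policy is written with empty strings and the trust relationship won't match. The following is a small guard you might run before steps 2 and 3; the function name require_service_account_id is hypothetical.

```shell
# Hypothetical guard: fails if the value is empty or is still the
# PLACEHOLDER_VALUE string from the placeholder trust policy.
# Usage: require_service_account_id "$BIGLAKE_SA_ID"
require_service_account_id() {
  case $1 in
    ''|PLACEHOLDER_VALUE)
      echo "service account ID is missing; re-run the describe command" >&2
      return 1
      ;;
  esac
  return 0
}

# Example: stop before applying the policy if the ID was not extracted.
#   require_service_account_id "$BIGLAKE_SA_ID" || exit 1
```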

Enable background refresh

Now that the trust relationship is established, update your catalog to turn on background refresh. Set the refresh interval to 300s (5 minutes) or longer.

gcloud CLI

gcloud biglake iceberg catalogs update FEDERATED_CATALOG_NAME \
  --project="PROJECT_ID" \
  --refresh-interval="300s"

REST

curl -s -X PATCH \
  -H "x-goog-user-project: PROJECT_ID" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://biglake.googleapis.com/iceberg/v1/restcatalog/extensions/projects/PROJECT_ID/catalogs/FEDERATED_CATALOG_NAME?updateMask=federated_catalog_options.refresh_options.refresh_schedule" \
  -d '{
    "federated_catalog_options": {
      "refresh_options": {
        "refresh_schedule": {
          "refresh_interval": "300s"
        }
      }
    }
  }'
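
The interval is expressed in seconds with an s suffix. The following sketch converts a value in minutes to that form and rejects anything below the 300s minimum described above; the function name refresh_interval_from_minutes is a hypothetical helper.

```shell
# Hypothetical helper: converts whole minutes to the "Ns" form and
# rejects intervals shorter than 300s.
# Usage: refresh_interval_from_minutes MINUTES
refresh_interval_from_minutes() {
  case $1 in
    ''|*[!0-9]*) echo "minutes must be a whole number" >&2; return 1 ;;
  esac
  seconds=$(( $1 * 60 ))
  if [ "$seconds" -lt 300 ]; then
    echo "refresh interval must be at least 300s" >&2
    return 1
  fi
  printf '%ss\n' "$seconds"
}

# Example: 15 minutes expressed as a refresh interval.
refresh_interval_from_minutes 15
```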

Verify the connection

Verify that the catalog background refresh cycle completed successfully and namespaces are syncing.

  1. Verify that the refresh status indicates success:

    gcloud biglake iceberg catalogs describe FEDERATED_CATALOG_NAME \
      --project="PROJECT_ID" \
      --location="REGION"
  2. Confirm that remote databases appear as synchronized namespaces:

    gcloud biglake iceberg namespaces list \
      --catalog="FEDERATED_CATALOG_NAME" \
      --project="PROJECT_ID" \
      --location="REGION"
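
The first background refresh can take some time, so the namespaces list might be empty at first. The following is a sketch of a retry loop around the list command shown above; the function name wait_for_namespaces and the default of 10 attempts with a 30-second pause are assumptions, not documented values.

```shell
# Hypothetical helper: retries the namespaces list command until it
# returns at least one line of output, then prints that output.
# Usage: wait_for_namespaces CATALOG PROJECT_ID REGION [ATTEMPTS] [PAUSE_SECONDS]
wait_for_namespaces() {
  attempts=${4:-10}; pause=${5:-30}
  n=1
  while [ "$n" -le "$attempts" ]; do
    # Treat a failed call the same as an empty result and try again.
    out=$(gcloud biglake iceberg namespaces list \
      --catalog="$1" --project="$2" --location="$3" 2>/dev/null) || out=""
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    n=$((n + 1))
    sleep "$pause"
  done
  echo "no namespaces found after ${attempts} attempts" >&2
  return 1
}

# Example (requires an authenticated gcloud CLI):
#   wait_for_namespaces FEDERATED_CATALOG_NAME PROJECT_ID us-east4
```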

What's next