Set up cross-cloud Lakehouse for Databricks Unity Catalog

This document describes how to set up a cross-cloud Lakehouse to query data from a Databricks Unity Catalog catalog directly within Google Cloud. This capability unifies your data analytics by integrating your external data sources with your existing Google Cloud environment.

Afterward, you can use Lakehouse for Apache Iceberg to manage access to your federated data.

Before you begin

  1. Review the Lakehouse overview to understand how Lakehouse manages access to data.
  2. Read About cross-cloud Lakehouse to understand how it works.
  3. Review the supported catalogs to verify external location requirements and supported configurations.
  4. Understand how to use regional Secret Manager secrets. This is required to set up a cross-cloud Lakehouse with Databricks Unity Catalog.
  5. Create a service principal with OAuth credentials (client ID and client secret) in your remote catalog provider, and grant it read access to the target catalog. This process is outside the scope of this documentation.
  6. Optional: If you plan to route queries over a private interconnect between your Google Cloud VPC and your remote cloud provider's VPC (for example, AWS), ensure that you have an active account with your remote provider, provision a Cross-Cloud Interconnect or Partner Interconnect, establish BGP sessions with your Cloud Router, and verify that you have the required IAM permissions in both cloud environments.
  7. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  8. Verify that billing is enabled for your Google Cloud project.

  9. Enable the BigLake and Secret Manager APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Required roles

To get the permissions that you need to set up cross-cloud Lakehouse, ask your administrator to grant you the following IAM roles on your project:

  • Manage Lakehouse catalogs: BigLake Admin (roles/biglake.admin)
  • Manage secrets: Secret Manager Admin (roles/secretmanager.admin)
  • Route traffic over private interconnect: Compute Network Admin (roles/compute.networkAdmin), Service Directory Viewer (roles/servicedirectory.viewer), and Service Directory PSC Authorized Service (roles/servicedirectory.pscAuthorizedService)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Supported catalog details

This guide provides instructions for setting up cross-cloud Lakehouse with a Databricks Unity Catalog catalog on Amazon Web Services (AWS) or Google Cloud. For detailed information regarding external location requirements and supported configurations, see Supported catalogs.

Limitations and considerations

This section lists the limitations and considerations for using cross-cloud Lakehouse.

  • Supported cloud providers: Using a private interconnect with your cross-cloud Lakehouse is supported with Amazon Web Services (AWS) as the remote cloud provider. You can use either a Cross-Cloud Interconnect or a Partner Interconnect.
  • Only Databricks Unity Catalog catalogs that use an external location on AWS or an external location on Google Cloud are supported. Unity Catalog catalogs that use default storage on AWS or default storage on Google Cloud are not supported.
  • You must enable external data access on the metastore used by Unity Catalog, which is disabled by default.
  • Network routing: If a private interconnect (such as customer-owned CCI or Partner Interconnect) is not configured, queries route over the public internet. This might result in higher egress fees from your remote cloud provider and less predictable performance.
  • Data freshness: The --refresh-interval flag for the federated catalog determines how often metadata is synchronized. A shorter interval provides fresher data but might incur additional API costs from the remote catalog provider.
  • Iceberg Metrics Reporting: Iceberg Metrics Reporting isn't available for federated catalogs. Set the rest-metrics-reporting-enabled property to false in your Iceberg client when accessing a federated catalog.
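
As an illustration of the last item, the following sketch prints the flag that would turn off Iceberg metrics reporting for one catalog. It assumes that the Iceberg client is Spark and that catalog properties are passed using the spark.sql.catalog.<name>.<property> convention. The metrics_conf helper and the catalog name lakehouse_fed are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the Spark --conf flag that disables Iceberg REST metrics
# reporting for a federated catalog.
# Assumption: the client is Spark and forwards catalog properties given as
# spark.sql.catalog.<name>.<property>. The catalog name is a placeholder.
metrics_conf() {
  local catalog="$1"
  echo "--conf spark.sql.catalog.${catalog}.rest-metrics-reporting-enabled=false"
}

metrics_conf lakehouse_fed
```

Other Iceberg clients set the same property through their own catalog configuration.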

General workflow

To set up and use cross-cloud Lakehouse, follow these general steps:

  • Set up Cross-Cloud Interconnect (Optional): Configure a private connection between your Google Cloud VPC and your remote cloud provider.
  • Set up federation: Create a secret in Secret Manager with your remote catalog credentials. Then, create a federated catalog in Lakehouse and grant it access to the secret.
  • Verify the connection: Verify that Lakehouse can successfully connect to your remote catalog.
  • Query data: Run queries against your federated data using BigQuery or Managed Service for Apache Spark. For more information, see Use cross-cloud Lakehouse.
  • Configure permissions: Use Identity and Access Management (IAM) to manage who can view and query the federated data.

Set up Cross-Cloud Interconnect (Optional)

Queries to your remote catalog travel over the public internet by default. To help enhance security and compliance, provide predictable performance, and reduce data transfer costs, use a private interconnect. This establishes a dedicated, private network connection between your Google Cloud Virtual Private Cloud (VPC) and your remote cloud provider's network (for example, AWS).

You can provision and configure either a Cross-Cloud Interconnect or a Partner Interconnect between your Google Cloud VPC and your remote cloud provider's VPC (for example, AWS).

Establish BGP sessions between your Cloud Router in Google Cloud and your remote cloud provider's VPC to ensure route exchange.

To enable private querying, you must configure a path from Lakehouse to your remote storage bucket (for example, an Amazon S3 bucket) through your private interconnect. You can configure this routing in one of two ways:

  • Internal Load Balancer (ILB) routing (Recommended): This flow uses a Google Cloud Internal Load Balancer to distribute requests across Hybrid Connectivity Network Endpoint Groups (NEGs) pointing to multiple AWS Elastic Network Interfaces (ENIs). Use this flow when you need load balancing, scalability, and high availability.
  • Direct endpoint routing: This flow connects Service Directory directly to a single AWS Interface VPC Endpoint IP address.

Select the configuration flow that matches your architecture requirements:

Internal Load Balancer

To configure an Internal Load Balancer (ILB) to distribute requests across multiple AWS ENIs for high availability and load balancing, follow these steps:

Configure AWS networking

First, create an Amazon S3 VPC Interface Endpoint (AWS PrivateLink):

  1. In the AWS VPC console, create an Interface Endpoint for Amazon S3.
  2. For the service name, specify com.amazonaws.<var>AWS_REGION</var>.s3.
  3. Select the VPC and subnets that are connected through Direct Connect to your Google Cloud VPC.
  4. Attach Security Groups to the endpoint to control inbound access.
  5. After the endpoint is created, AWS provisions Elastic Network Interfaces (ENIs) in each selected subnet. Note the private IP addresses of these ENIs.

Next, configure Security Groups:

  • Ensure that the Security Group or groups attached to the Amazon S3 Endpoint ENIs allow inbound TCP traffic on port 443 from the relevant IP ranges of your Google Cloud VPC.

Configure Google Cloud networking

First, create one Hybrid Connectivity Network Endpoint Group (NEG) for each private IP address of the Amazon S3 endpoint ENIs. These NEGs are zonal.

# Repeat for each AWS ENI IP address
gcloud compute network-endpoint-groups create NEG_NAME_1 \
    --project=PROJECT_ID \
    --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT \
    --zone=GCP_ZONE \
    --network=VPC_NETWORK \
    --default-port=443

gcloud compute network-endpoint-groups update NEG_NAME_1 \
    --project=PROJECT_ID \
    --zone=GCP_ZONE \
    --add-endpoint="ip=AWS_ENI_IP_1,port=443"

# Example for a second ENI
gcloud compute network-endpoint-groups create NEG_NAME_2 \
    --project=PROJECT_ID \
    --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT \
    --zone=GCP_ZONE \
    --network=VPC_NETWORK \
    --default-port=443

gcloud compute network-endpoint-groups update NEG_NAME_2 \
    --project=PROJECT_ID \
    --zone=GCP_ZONE \
    --add-endpoint="ip=AWS_ENI_IP_2,port=443"

Replace the following:

  • NEG_NAME_1, NEG_NAME_2: unique identifiers for your NEGs.
  • PROJECT_ID: your Google Cloud project ID.
  • GCP_ZONE: the Google Cloud zone. For example, us-east4-a.
  • VPC_NETWORK: the Google Cloud VPC network name.
  • AWS_ENI_IP_1, AWS_ENI_IP_2: the private IP addresses of your AWS ENIs.
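
When the endpoint has more than two ENIs, the create and update pair can be generated in a loop instead of copied by hand. The following is a minimal sketch that only prints the commands so that you can review them first. The neg_commands helper, the s3-neg-N naming scheme, and the example values are assumptions, not part of the product.

```shell
#!/usr/bin/env bash
# Sketch: emit one create/update command pair per AWS ENI IP address.
# The NEG naming scheme (s3-neg-1, s3-neg-2, ...) is an assumption.
neg_commands() {
  local project="$1" zone="$2" network="$3"
  shift 3
  local i=1 ip
  for ip in "$@"; do
    echo "gcloud compute network-endpoint-groups create s3-neg-${i} --project=${project} --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT --zone=${zone} --network=${network} --default-port=443"
    echo "gcloud compute network-endpoint-groups update s3-neg-${i} --project=${project} --zone=${zone} --add-endpoint=ip=${ip},port=443"
    i=$((i + 1))
  done
}

# Print the commands for two example ENI addresses.
neg_commands my-project us-east4-a my-vpc 10.0.1.45 10.0.2.45
```

After reviewing the output, you can run the printed commands one at a time or pipe them to bash.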

Next, configure the Internal Load Balancer (ILB):

  1. Create a health check:

    gcloud compute health-checks create tcp HEALTH_CHECK_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --port=443

    Replace the following:

    • HEALTH_CHECK_NAME: a unique identifier for your health check.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  2. Create a regional backend service:

    gcloud compute backend-services create BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --load-balancing-scheme=INTERNAL \
        --protocol=TCP \
        --region=REGION \
        --health-checks=HEALTH_CHECK_NAME \
        --health-checks-region=REGION

    Replace the following:

    • BACKEND_SERVICE_NAME: a unique identifier for your backend service.
  3. Add the NEGs to the backend service:

    gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --network-endpoint-group=NEG_NAME_1 \
        --network-endpoint-group-zone=GCP_ZONE \
        --balancing-mode=CONNECTION
    
    gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --network-endpoint-group=NEG_NAME_2 \
        --network-endpoint-group-zone=GCP_ZONE \
        --balancing-mode=CONNECTION

    Replace the following:

    • NEG_NAME_1, NEG_NAME_2: the names of your zonal NEGs created in the previous step.
    • GCP_ZONE: the Google Cloud zone where your NEGs are located. For example, us-east4-a.
  4. Create a forwarding rule:

    gcloud compute forwarding-rules create FORWARDING_RULE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --load-balancing-scheme=INTERNAL \
        --network=VPC_NETWORK \
        --subnet=GCP_SUBNET \
        --ip-protocol=TCP \
        --ports=443 \
        --backend-service=BACKEND_SERVICE_NAME \
        --backend-service-region=REGION

    After creation, note the internal IP address assigned to the forwarding rule. This is your ILB_IP_ADDRESS.

    Replace the following:

    • FORWARDING_RULE_NAME: a unique identifier for your forwarding rule.
    • VPC_NETWORK: the Google Cloud VPC network name.
    • GCP_SUBNET: the Google Cloud subnet name within your VPC network where the ILB will be provisioned.
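
One way to capture ILB_IP_ADDRESS without copying it from the console is to read it back from the forwarding rule. The following sketch wraps gcloud compute forwarding-rules describe in a hypothetical helper named ilb_ip. It assumes an authenticated gcloud CLI and that the rule already exists.

```shell
#!/usr/bin/env bash
# Sketch: read the internal IP address assigned to the forwarding rule.
# IPAddress is the address field of the forwarding rule resource.
ilb_ip() {
  local rule="$1" project="$2" region="$3"
  gcloud compute forwarding-rules describe "$rule" \
      --project="$project" \
      --region="$region" \
      --format='value(IPAddress)'
}

# Example (requires an authenticated gcloud CLI):
#   ILB_IP_ADDRESS="$(ilb_ip FORWARDING_RULE_NAME PROJECT_ID REGION)"
```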

Configure Service Directory

Register the ILB's IP address in Service Directory, so Lakehouse can discover it.

  1. Create a namespace for your remote cloud:

    gcloud service-directory namespaces create NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • NAMESPACE: a unique identifier for your namespace.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  2. Create a service in the Service Directory namespace:

    gcloud service-directory services create SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • SERVICE_NAME: a unique identifier for your service.
  3. Create an endpoint for the ILB in the service:

    gcloud service-directory endpoints create ENDPOINT_NAME \
        --project=PROJECT_ID \
        --namespace=NAMESPACE \
        --service=SERVICE_NAME \
        --location=REGION \
        --network=projects/PROJECT_NUMBER/global/networks/VPC_NETWORK \
        --address=ILB_IP_ADDRESS \
        --port=443

    Replace the following:

    • ENDPOINT_NAME: a unique identifier for your endpoint.
    • PROJECT_NUMBER: your Google Cloud project number. Use your project number in the --network flag.
    • ILB_IP_ADDRESS: the internal IP address of your ILB forwarding rule.
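
Because the --network flag takes the project number and not the project ID, it is easy to pass the wrong value. The following sketch builds the network path and rejects a non-numeric project number. The network_path helper and the example values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the value for the --network flag, which uses the project
# number rather than the project ID.
network_path() {
  local project_number="$1" vpc="$2"
  case "$project_number" in
    ''|*[!0-9]*)
      echo "project number must be numeric: $project_number" >&2
      return 1
      ;;
  esac
  echo "projects/${project_number}/global/networks/${vpc}"
}

# To look up the number for a project ID (requires an authenticated gcloud CLI):
#   gcloud projects describe PROJECT_ID --format='value(projectNumber)'
network_path 123456789012 my-vpc
```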

Direct endpoint

To configure Service Directory to route traffic directly to a single AWS Interface VPC Endpoint IP address, follow these steps:

  1. Create an Interface VPC Endpoint for Amazon S3 inside your AWS VPC. Note the IP address and port of this endpoint.
  2. Create a namespace for your remote cloud:

    gcloud service-directory namespaces create NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • NAMESPACE: a unique identifier for your namespace.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  3. Create a service in the Service Directory namespace:

    gcloud service-directory services create SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • SERVICE_NAME: a unique identifier for your service.
  4. Create an endpoint in the service containing the routing information for your Amazon S3 Interface VPC Endpoint:

    gcloud service-directory endpoints create ENDPOINT_NAME \
        --service=SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION \
        --address=S3_VPCE_IP_ADDRESS \
        --port=S3_VPCE_PORT \
        --network=projects/PROJECT_NUMBER/global/networks/VPC_NETWORK

    Replace the following:

    • ENDPOINT_NAME: a unique identifier for your endpoint.
    • S3_VPCE_IP_ADDRESS: the IP address of your Amazon S3 Interface VPC Endpoint. For example, 10.0.1.45.
    • S3_VPCE_PORT: the port number of your Amazon S3 Interface VPC Endpoint. For example, 443.
    • PROJECT_NUMBER: your Google Cloud project number. Use your project number in the --network flag.
    • VPC_NETWORK: the Google Cloud VPC network name associated with your private interconnect.

Set up federation

To query your data, you must set up a Lakehouse federated catalog that connects to your remote catalog.

Create a regional secret

Federation requires credentials to access the remote catalog. Lakehouse uses regional Secret Manager secrets to securely store and retrieve these credentials to authenticate with your remote provider.

For Databricks, you must create a Service Principal in your Databricks account and generate an OAuth client ID and client secret. Verify that this Service Principal has read access to the target Unity Catalog catalog. You then format these credentials as a JSON payload to store in Secret Manager.

  1. Create a JSON file named credentials.json with your payload:

    {
      "client_id": "CLIENT_ID",
      "client_secret": "CLIENT_SECRET"
    }

    Replace the following:

    • CLIENT_ID: The OAuth client ID for your Databricks Service Principal.
    • CLIENT_SECRET: The OAuth client secret for your Databricks Service Principal.
  2. Configure the regional endpoint for Secret Manager:

    By default, Secret Manager uses a global endpoint. However, cross-cloud Lakehouse requires that your secret be stored in the same region as your Lakehouse catalog. To work with regional secrets using the gcloud CLI, override the default API endpoint in your active gcloud CLI configuration with the regional endpoint. For example, secretmanager.us-east4.rep.googleapis.com.

    gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/

    Replace the following:

    • REGION: The Google Cloud region where your Secret Manager secret is stored. For example, us-east4. To avoid connectivity issues, create your secret and your catalog in the same region.
  3. Upload the payload to Secret Manager:

    gcloud secrets create DATABRICKS_SECRET_NAME \
      --location="REGION" \
      --project="PROJECT_ID" \
      --data-file=credentials.json

    Replace the following:

    • DATABRICKS_SECRET_NAME: A name for your Databricks secret.
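
Because credentials.json holds a client secret in plain text, you might want to create it with restrictive permissions and delete it after the upload. The following is a minimal sketch of that idea. The write_credentials helper is hypothetical, and it assumes that the credential values contain no double quotes or backslashes, so no JSON escaping is needed.

```shell
#!/usr/bin/env bash
# Sketch: write the credentials payload to a file that only the current
# user can read. Values containing " or \ are rejected because this
# helper does not perform JSON escaping.
write_credentials() {
  local client_id="$1" client_secret="$2" out="${3:-credentials.json}"
  case "${client_id}${client_secret}" in
    *\"*|*\\*)
      echo "credential values must not contain quotes or backslashes" >&2
      return 1
      ;;
  esac
  (
    umask 077
    printf '{\n  "client_id": "%s",\n  "client_secret": "%s"\n}\n' \
      "$client_id" "$client_secret" > "$out"
  )
}

# Demonstration with placeholder values and a temporary file.
tmp="$(mktemp)"
write_credentials "example-client-id" "example-client-secret" "$tmp"
cat "$tmp"
rm -f "$tmp"

# After uploading the real payload, you can clean up with:
#   rm -f credentials.json
#   gcloud config unset api_endpoint_overrides/secretmanager
```

Unsetting the endpoint override returns the gcloud CLI to the global Secret Manager endpoint. Set the override again before you run later commands that work with the regional secret.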

Create a federated catalog

Create the federated catalog using the Google Cloud console or the gcloud biglake iceberg catalogs create command.

Console

  1. In the Google Cloud console, go to Lakehouse.

    Go to Lakehouse

  2. Click Create catalog.

  3. Click Federated catalog.

    The Catalog configuration details appear.

  4. For Federated catalog source, select Unity (Databricks).

  5. For Data location, select the Lakehouse region where you want to create the federated catalog. For example, us-east4. To minimize latency, even over the public internet, do the following when selecting a region:

    • If your Unity Catalog catalog is on AWS, select the Google Cloud region closest to your AWS region.
    • If your Unity Catalog catalog is on Google Cloud, select the exact same region.
  6. Click Continue.

    The Connection details appear.

  7. In the Remote catalog details section, in the Unity instance name field, enter your target Databricks instance name. For example: abcd.cloud.databricks.com.

  8. In the Unity catalog name field, enter the name of the target Databricks Unity Catalog catalog to federate to.

  9. In the Authentication and network section, in the Secret field, enter the name of your Databricks secret. Use the following format projects/PROJECT_ID/locations/REGION/secrets/DATABRICKS_SECRET_NAME.

  10. Optional: In the Service directory name field, enter the path to your Service Directory service. For example: projects/PROJECT_ID/locations/REGION/namespaces/NAMESPACE/services/SERVICE_NAME. This is required only if you configured a private interconnect.

  11. Click Create.

gcloud CLI

Public internet (no CCI)

If you don't configure CCI, the connection securely travels over the public internet.

gcloud biglake iceberg catalogs create FEDERATED_CATALOG_NAME \
    --project="PROJECT_ID" \
    --primary-location="REGION" \
    --catalog-type="federated" \
    --federated-catalog-type="unity" \
    --secret-name="projects/PROJECT_ID/locations/REGION/secrets/DATABRICKS_SECRET_NAME" \
    --unity-instance-name="UNITY_INSTANCE_NAME" \
    --unity-catalog-name="UNITY_CATALOG_NAME" \
    --refresh-interval="REFRESH_INTERVAL" \
    --namespace-filters="NAMESPACE_FILTERS"

Replace the following:

  • PROJECT_ID: your Google Cloud project ID.
  • REGION: the Lakehouse region where the federated catalog is created. For example, us-east4. To minimize latency, do the following when selecting a region:
    • If your Unity Catalog catalog is on AWS, select the Google Cloud region closest to your AWS region.
    • If your Unity Catalog catalog is on Google Cloud, select the exact same region.
  • DATABRICKS_SECRET_NAME: the name of your Databricks secret.
  • UNITY_INSTANCE_NAME: your target Databricks instance name. For example: abcd.cloud.databricks.com.
  • UNITY_CATALOG_NAME: the name of the target Databricks Unity Catalog catalog to federate to.
  • REFRESH_INTERVAL: Specifies how often to update the catalog's information. Set this value as a duration, for example, 330s or 5m30s. Shorter intervals update data more often but can cost more in API calls. Longer intervals can cost less, but the queried data might not reflect your most current dataset. If omitted or if you set the value to 0s, then updates will be disabled.
  • NAMESPACE_FILTERS: Optional: A comma-separated list of namespaces to federate. For example, ns1,ns2. If omitted, all namespaces will be included.
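
The two example values for REFRESH_INTERVAL, 330s and 5m30s, describe the same duration: 5 × 60 + 30 = 330 seconds. The following sketch performs that conversion for values written in minutes and seconds. The to_seconds helper is hypothetical, and it handles only the m and s units shown in the examples. Whether the flag accepts other units is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: normalize a duration such as 5m30s to whole seconds.
# Only the m and s units from the examples above are handled.
to_seconds() {
  local d="$1" re='^([0-9]+m)?([0-9]+s)?$'
  if [[ -z "$d" || ! "$d" =~ $re ]]; then
    echo "invalid duration: $d" >&2
    return 1
  fi
  # Strip the unit suffix from each captured group; a missing group is empty.
  local mins="${BASH_REMATCH[1]%m}" secs="${BASH_REMATCH[2]%s}"
  # 10# forces base 10 so that values like 08 are not read as octal.
  echo "$(( 10#${mins:-0} * 60 + 10#${secs:-0} ))s"
}

to_seconds 5m30s   # prints 330s
```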

Customer-owned (CCI)

If you configured a private interconnect (such as Dedicated CCI or Partner Interconnect), provide the Service Directory service reference so that Lakehouse routes traffic privately.

gcloud biglake iceberg catalogs create FEDERATED_CATALOG_NAME \
    --project="PROJECT_ID" \
    --primary-location="REGION" \
    --catalog-type="federated" \
    --federated-catalog-type="unity" \
    --secret-name="projects/PROJECT_ID/locations/REGION/secrets/DATABRICKS_SECRET_NAME" \
    --unity-instance-name="UNITY_INSTANCE_NAME" \
    --unity-catalog-name="UNITY_CATALOG_NAME" \
    --refresh-interval="REFRESH_INTERVAL" \
    --namespace-filters="NAMESPACE_FILTERS" \
    --service-directory-name="projects/PROJECT_ID/locations/REGION/namespaces/NAMESPACE/services/SERVICE_NAME"

Replace the following:

  • PROJECT_ID: your Google Cloud project ID.
  • REGION: the Lakehouse region where the federated catalog is created. For example, us-east4. This must be the same region as the Service Directory namespace and the regional secret. To minimize latency, do the following when selecting a region:
    • If your Unity Catalog catalog is on AWS, select the Google Cloud region closest to your AWS region.
    • If your Unity Catalog catalog is on Google Cloud, select the exact same region.
  • DATABRICKS_SECRET_NAME: the name of your Databricks secret.
  • UNITY_INSTANCE_NAME: your target Databricks instance name. For example: abcd.cloud.databricks.com.
  • UNITY_CATALOG_NAME: the name of the target Databricks Unity Catalog catalog to federate.
  • REFRESH_INTERVAL: Specifies how often to update the catalog's information. Set this value as a duration, for example, 330s or 5m30s. Shorter intervals update data more often but can cost more in API calls. Longer intervals can cost less, but the queried data might not reflect your most current dataset. If omitted or if you set the value to 0s, then updates will be disabled.
  • NAMESPACE_FILTERS: Optional: A comma-separated list of namespaces to federate. For example, ns1,ns2. If omitted, all namespaces will be included.
  • NAMESPACE: the Service Directory namespace you created during private interconnect setup.
  • SERVICE_NAME: the Service Directory service name you created during private interconnect setup.
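
Because the catalog, the regional secret, and the Service Directory service must all be in the same region, a quick check of the resource names before you run the command can catch a mismatch early. The following is a minimal sketch. The region_of and check_same_region helpers and the example resource names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract the location segment from a resource name of the form
# projects/PROJECT/locations/REGION/...
region_of() {
  local name="$1" rest
  case "$name" in
    projects/*/locations/*/*) ;;
    *) echo "unexpected resource name: $name" >&2; return 1 ;;
  esac
  rest="${name#projects/*/locations/}"
  echo "${rest%%/*}"
}

# Succeed only if every resource name is in the expected region.
check_same_region() {
  local expected="$1" name r
  shift
  for name in "$@"; do
    r="$(region_of "$name")" || return 1
    if [ "$r" != "$expected" ]; then
      echo "region mismatch: $name is in $r, expected $expected" >&2
      return 1
    fi
  done
  echo "all resources are in $expected"
}

check_same_region us-east4 \
  "projects/my-project/locations/us-east4/secrets/databricks-secret" \
  "projects/my-project/locations/us-east4/namespaces/aws/services/s3"
```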

Grant the federated catalog access to the secret

When the catalog is created, Lakehouse provisions a unique service account for it (returned as biglake-service-account in the resource description).

Grant this service account permission to access the secret that you created earlier. IAM policy changes can take a few minutes to propagate.

# Required to use regional secrets
gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/
gcloud secrets add-iam-policy-binding DATABRICKS_SECRET_NAME \
  --project="PROJECT_ID" \
  --location="REGION" \
  --member="serviceAccount:$(gcloud biglake iceberg catalogs describe FEDERATED_CATALOG_NAME \
      --project="PROJECT_ID" \
      --location="REGION" \
      --format='value(biglake-service-account)')" \
  --role="roles/secretmanager.secretAccessor"

Verify the connection

To verify that the federated catalog service account has access to the secret, run the following command:

# Required to use regional secrets
gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/
gcloud secrets get-iam-policy DATABRICKS_SECRET_NAME \
     --project="PROJECT_ID" \
     --location="REGION"

In the output, verify that the biglake-service-account service account has the roles/secretmanager.secretAccessor role assigned to it.
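
If you prefer a scripted check over reading the policy output, the binding can be filtered with the standard gcloud --flatten, --filter, and --format flags. The following sketch wraps that in a hypothetical helper named has_secret_accessor. It assumes an authenticated gcloud CLI with the regional Secret Manager endpoint override already set.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if MEMBER holds roles/secretmanager.secretAccessor
# on the regional secret. MEMBER uses the IAM form serviceAccount:EMAIL.
has_secret_accessor() {
  local secret="$1" project="$2" region="$3" member="$4"
  gcloud secrets get-iam-policy "$secret" \
      --project="$project" \
      --location="$region" \
      --flatten='bindings[].members' \
      --filter='bindings.role:roles/secretmanager.secretAccessor' \
      --format='value(bindings.members)' \
    | grep -Fxq "$member"
}

# Example (requires an authenticated gcloud CLI):
#   has_secret_accessor DATABRICKS_SECRET_NAME PROJECT_ID REGION \
#     "serviceAccount:SERVICE_ACCOUNT_EMAIL" && echo "binding present"
```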

Next, verify that the catalog background refresh cycle completed successfully and namespaces are syncing.

  1. Verify that the refresh status indicates success:

    gcloud biglake iceberg catalogs describe FEDERATED_CATALOG_NAME \
      --project="PROJECT_ID" \
      --location="REGION"
  2. Confirm that remote databases appear as synchronized namespaces:

    gcloud biglake iceberg namespaces list \
      --catalog="FEDERATED_CATALOG_NAME" \
      --project="PROJECT_ID" \
      --location="REGION"

What's next