Set up Cross-cloud Lakehouse

This tutorial describes how to set up a Cross-cloud Lakehouse to query data from a Databricks Unity Catalog directly within Google Cloud. This capability unifies your data analytics by integrating external data sources with your existing Google Cloud environment.

Afterward, you can use Google Cloud Lakehouse to manage access to your federated data.

Before you begin

  1. Review the Lakehouse overview to understand how Lakehouse manages access to data.
  2. Read About Cross-cloud Lakehouse to understand how it works.
  3. Understand how to use regional Secret Manager secrets. This is required to set up a Cross-cloud Lakehouse.
  4. Generate an OAuth Service Principal (Client ID and Secret) within your remote catalog provider (for example, Databricks) that has read access to the target catalog. This process is outside the scope of this documentation.
  5. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  6. Verify that billing is enabled for your Google Cloud project.

  7. Enable the BigLake and Secret Manager APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

Required roles

To get the permissions that you need to set up Cross-cloud Lakehouse, ask your administrator to grant you the following IAM roles on your project:

  • Manage Lakehouse catalogs: BigLake Admin (roles/biglake.admin)
  • Manage secrets: Secret Manager Admin (roles/secretmanager.admin)
  • Route traffic over CCI: Service Directory Viewer (roles/servicedirectory.viewer) and Service Directory PSC Authorized Service (roles/servicedirectory.pscAuthorizedService)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Limitations and considerations

This section lists the limitations and considerations for using Cross-cloud Lakehouse.

  • Supported catalogs: Databricks Unity Catalog on Amazon Web Services (AWS) and Google Cloud.
  • BigQuery UI browsing: The BigQuery UI does not support browsing the federated catalog tree natively. You must verify the setup using the CLI and query tables using the 4-part table path.
  • Network routing: If a Cross-Cloud Interconnect (CCI) is not configured, queries route over the public internet. This might result in higher AWS egress fees and less predictable performance.
  • Data freshness: The --refresh-interval flag for the federated catalog determines how often metadata is synchronized. A shorter interval provides fresher data but might incur additional API costs from the remote catalog provider.
  • Iceberg Metrics Reporting isn't currently available for federated catalogs. Set the rest-metrics-reporting-enabled property to false in your Iceberg client when accessing a federated catalog.
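Because the BigQuery UI can't browse the federated catalog tree, queries reference tables by their 4-part path. The following sketch uses the bq CLI; every identifier in it is a hypothetical placeholder, not a value from this tutorial:

```shell
# Query a federated table by its 4-part path: project.catalog.namespace.table
# All names below (my-project, my_federated_catalog, sales_ns, orders) are
# illustrative placeholders.
bq query --use_legacy_sql=false \
  'SELECT * FROM `my-project.my_federated_catalog.sales_ns.orders` LIMIT 10'
```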

General workflow

To set up and use Cross-cloud Lakehouse, follow these general steps:

  • Set up Cross-Cloud Interconnect (Optional): Configure a private connection between your Google Cloud VPC and your remote cloud provider.
  • Set up federation: Create a secret in Secret Manager with your remote catalog credentials. Then, create a federated catalog in Lakehouse and grant it access to the secret.
  • Verify the connection: Ensure Lakehouse can successfully connect to your remote catalog.
  • Query data: Run queries against your federated data using BigQuery or Managed Service for Apache Spark. For more information, see Use a Cross-cloud Lakehouse.
  • Configure permissions: Use Identity and Access Management (IAM) to manage who can view and query the federated data. For more information, see Use a Cross-cloud Lakehouse.

Set up Cross-Cloud Interconnect (Optional)

Queries to your remote catalog travel over the public internet by default. To enhance security and compliance and to reduce latency, use a customer-owned Cross-Cloud Interconnect (CCI). This establishes a dedicated, private network connection between your Google Cloud Virtual Private Cloud (VPC) and your remote cloud provider's network (for example, AWS).

To set up this private connection, establish the physical and Border Gateway Protocol (BGP) links by following these steps:

  1. Provision the CCI (Dedicated or Partner Interconnect) connecting your Google Cloud VPC to your AWS VPC. For more information, see Partner Interconnect overview.
  2. Create an Interface VPC Endpoint for S3 inside your AWS VPC.
  3. Configure Service Directory to let Lakehouse route traffic to your specific endpoint. Make sure you specify the following:
    • The IP address of the Interface VPC Endpoint for S3.
    • The Google Cloud VPC that has a Cross-Cloud Interconnect to route to AWS.

To set up a CCI, follow these steps:

  1. Create a namespace for your remote cloud.

    gcloud service-directory namespaces create REMOTE_CLOUD_NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • REMOTE_CLOUD_NAMESPACE: a name for your remote cloud.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4.
  2. Create the S3 service in that namespace.

    gcloud service-directory services create s3 \
        --namespace=REMOTE_CLOUD_NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the placeholders with the same values that you used in the previous step.
  3. Register the PrivateLink IP endpoint with the service in your remote cloud namespace.

    gcloud service-directory services endpoints create privatelink-endpoint \
       --service=s3 \
       --namespace=REMOTE_CLOUD_NAMESPACE \
       --project=PROJECT_ID \
       --location=REGION \
       --address=IP_ADDRESS \
       --network=projects/PROJECT_ID/global/networks/VPC_NAME

    Replace the following:

    • IP_ADDRESS: the IP address of your PrivateLink endpoint. For example, 10.0.1.45.
    • VPC_NAME: the Google Cloud VPC that has a Cross-Cloud Interconnect to route to AWS.

Set up federation

To query your data, you must set up a Lakehouse federated catalog that connects to your remote catalog. The following examples demonstrate this process for a Databricks Unity Catalog.

Create a regional secret

Federation requires credentials to access the remote catalog. Lakehouse uses regional Secret Manager secrets to securely store and retrieve these credentials to authenticate with your remote provider.

For Databricks, you must create a Service Principal in your Databricks account and generate an OAuth Client ID and Client Secret. Ensure this Service Principal has read access to the target Unity Catalog. You then format these credentials as a JSON payload to store in Secret Manager.

  1. Create a JSON file named credentials.json with your payload:

    {
      "client_id": "CLIENT_ID",
      "client_secret": "CLIENT_SECRET"
    }

    Replace the following:

    • CLIENT_ID: the OAuth Client ID for your Databricks Service Principal.
    • CLIENT_SECRET: the OAuth Client Secret for your Databricks Service Principal.
  2. Configure the regional endpoint for Secret Manager:

    By default, Secret Manager uses a global endpoint. However, Cross-cloud Lakehouse requires that your secrets be stored in the same region as your Lakehouse catalog. To interact with regional secrets using the gcloud CLI, you must override the default API endpoint (for example, secretmanager.us-east4.rep.googleapis.com) for your current session or profile.

    gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/

    Replace the following:

    • REGION: the Google Cloud region where your Secret Manager secret is stored. For example, us-east4. To avoid connectivity issues, create your secret and your catalog in the same region.
  3. Upload the payload to Secret Manager:

    gcloud secrets create DATABRICKS_SECRET_NAME \
      --location="REGION" \
      --project="PROJECT_ID" \
      --data-file=credentials.json

    Replace the following:

    • DATABRICKS_SECRET_NAME: a name for your Databricks secret.
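If you later rotate the Databricks client secret, you can add a new version to the existing secret instead of recreating it. This is a sketch that assumes the regional endpoint override from step 2 is still in effect and that credentials.json contains the rotated payload:

```shell
# Add a new secret version containing the rotated credentials
gcloud secrets versions add DATABRICKS_SECRET_NAME \
  --project="PROJECT_ID" \
  --location="REGION" \
  --data-file=credentials.json
```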

Create a federated catalog

Create the federated catalog using the gcloud alpha biglake iceberg catalogs create command.

Public internet (no CCI)

If you don't configure CCI, the connection travels securely over the public internet.

gcloud alpha biglake iceberg catalogs create FEDERATED_CATALOG_NAME \
   --project="PROJECT_ID" \
   --primary-location="REGION" \
   --catalog-type="federated" \
   --secret-name="projects/PROJECT_ID/locations/REGION/secrets/DATABRICKS_SECRET_NAME" \
   --unity-instance-name="INSTANCE_NAME" \
   --unity-catalog-name="CATALOG_NAME" \
   --refresh-interval="REFRESH_INTERVAL" \
   --namespace-filters="NAMESPACE_FILTERS" \
   --federated-catalog-type="unity"

Customer-owned (CCI)

If you configured a Cross-Cloud Interconnect, provide the VPC and Service Directory endpoint references to ensure that Lakehouse routes traffic privately.

gcloud alpha biglake iceberg catalogs create FEDERATED_CATALOG_NAME \
   --project="PROJECT_ID" \
   --primary-location="REGION" \
   --catalog-type="federated" \
   --secret-name="projects/PROJECT_ID/locations/REGION/secrets/DATABRICKS_SECRET_NAME" \
   --unity-instance-name="INSTANCE_NAME" \
   --unity-catalog-name="CATALOG_NAME" \
   --network-vpc="VPC_NAME" \
   --service-directory-name="projects/PROJECT_ID/locations/REGION/namespaces/aws/services/s3/endpoints/privatelink-endpoint" \
   --refresh-interval="REFRESH_INTERVAL" \
   --namespace-filters="NAMESPACE_FILTERS"

Replace the following:

  • PROJECT_ID: your Google Cloud project ID.
  • REGION: the Lakehouse region where the federated catalog is created. For example, us-east4. To minimize latency (even over the public internet), do the following when selecting a region:
    • If your Unity Catalog is on AWS, select the Google Cloud region closest to your AWS region.
    • If your Unity Catalog is on Google Cloud, select the exact same region.
  • DATABRICKS_SECRET_NAME: the name of your Databricks secret.
  • INSTANCE_NAME: your target Databricks instance name. For example: abcd.cloud.databricks.com.
  • CATALOG_NAME: the name of the target Databricks Unity Catalog to federate.
  • REFRESH_INTERVAL: how often to update the catalog's metadata. Set this value as a duration, for example, 330s (330 seconds) or 5m30s (5 minutes and 30 seconds). Shorter intervals update data more often but can cost more in API calls. Longer intervals can cost less, but the queried data might not reflect your most current dataset.
  • NAMESPACE_FILTERS: a comma-separated list of namespaces to federate. For example, ns1,ns2.
  • VPC_NAME: the Google Cloud VPC used for transit when using a customer-owned CCI.

Grant the federated catalog access to the secret

When the catalog is created, Lakehouse provisions a unique service account for it (returned as biglake-service-account in the resource description).

You must grant this service account permission to access the secret you created earlier in this tutorial. Note that propagating IAM policies can take a few minutes.

Grant the catalog's service account permission to access the secret.

# Required to use regional secrets
gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/
gcloud secrets add-iam-policy-binding DATABRICKS_SECRET_NAME \
  --project="PROJECT_ID" \
  --location="REGION" \
  --member="serviceAccount:$(gcloud alpha biglake iceberg catalogs describe FEDERATED_CATALOG_NAME \
      --project="PROJECT_ID" \
      --location="REGION" \
      --format='value(biglake-service-account)')" \
  --role="roles/secretmanager.secretAccessor"

Verify the connection

Use the describe command to verify that Lakehouse can connect to your remote catalog:

gcloud alpha biglake iceberg catalogs describe FEDERATED_CATALOG_NAME \
     --project="PROJECT_ID" \
     --location="REGION"

To verify that the federated catalog service account has access to the secret, run the following command:

# Required to use regional secrets
gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/
gcloud secrets get-iam-policy DATABRICKS_SECRET_NAME \
     --project="PROJECT_ID" \
     --location="REGION"

In the output, verify that the biglake-service-account service account has the roles/secretmanager.secretAccessor role assigned to it.

What's next