Use cross-region replication and disaster recovery

This page describes how to use BigLake metastore cross-region replication and disaster recovery.

This feature is only available for catalogs that use Cloud Storage dual-region or multi-region buckets.

Before you begin

  1. Verify that billing is enabled for your Google Cloud project.

  2. Enable the BigLake API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API
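
    To enable the API from the command line instead, you can use the gcloud services enable command. This is a minimal sketch that assumes the BigLake API service name is biglake.googleapis.com:

    gcloud services enable biglake.googleapis.com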

Required roles

To get the permissions that you need to use the Iceberg REST catalog in BigLake metastore, ask your administrator to grant you the following IAM roles:

  • Perform administrative tasks, such as managing catalog user access, storage access, and the catalog's credential vending mode:
  • Read table data in credential vending mode: BigLake Viewer (roles/biglake.viewer) on the project
  • Write table data in credential vending mode: BigLake Editor (roles/biglake.editor) on the project
  • Read catalog resources and table data in non-credential vending mode:
  • Manage catalog resources and write table data in non-credential vending mode:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
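
As a sketch, granting one of these roles from the command line could look like the following gcloud projects add-iam-policy-binding command; the project ID and user email are illustrative placeholders:

gcloud projects add-iam-policy-binding example-project \
    --member="user:alex@example.com" \
    --role="roles/biglake.viewer"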

Replication and disaster recovery workflow

To use cross-region replication and disaster recovery, follow these general steps (a combined command sketch follows the list):

  1. View replication status: Identify your current primary and secondary regions to determine the target region for the failover.
  2. Check synchronization status: Verify the current state of your primary and secondary regions to ensure they are ready for a transition.
  3. Choose a failover mode: Decide between a soft failover (best for planned maintenance) and a hard failover (best for emergency recovery).
  4. Initiate the failover: Run the command corresponding to your chosen mode to switch your primary and secondary regions.

Prepare for failover

Identify your current primary region and verify the synchronization status of your secondary region. Then, initiate the failover.

View replication status

To determine the regions where your catalog is replicated, run the following gcloud alpha biglake iceberg catalogs describe command.

gcloud alpha biglake iceberg catalogs describe CATALOG_NAME

Replace CATALOG_NAME with the name of your catalog.
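
For example, with an illustrative catalog named my_catalog, you can print the catalog configuration as YAML by adding the standard gcloud --format flag:

gcloud alpha biglake iceberg catalogs describe my_catalog --format=yaml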

Check synchronization status

Before you initiate a failover, check the synchronization status of your secondary replica by running the gcloud alpha biglake iceberg catalogs failover command with the --validate_only flag:

gcloud alpha biglake iceberg catalogs failover CATALOG_NAME \
    --validate_only \
    --primary-replica PRIMARY_REPLICA_REGION

Replace the following:

  • CATALOG_NAME: the name of your catalog.
  • PRIMARY_REPLICA_REGION: the region to designate as the new primary replica.
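
A minimal sketch of using this check in a script, assuming illustrative names and that the command exits with a nonzero status when the secondary replica is not ready:

# Proceed only if the secondary replica is ready for failover.
if gcloud alpha biglake iceberg catalogs failover my_catalog \
    --validate_only \
    --primary-replica us-east1; then
  echo "Secondary replica is ready for failover."
else
  echo "Secondary replica is not ready; aborting." >&2
  exit 1
fi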

Initiate a failover

The disaster recovery feature uses metastore replication to designate primary and secondary regions. All table commit metadata is served from the primary region and replicated to the secondary region. You can switch the primary and secondary regions for the catalog using the failover operation.

Soft failover

To initiate a soft failover, run the gcloud alpha biglake iceberg catalogs failover command:

gcloud alpha biglake iceberg catalogs failover CATALOG_NAME \
    --primary-replica PRIMARY_REPLICA_REGION

Replace the following:

  • CATALOG_NAME: the name of your catalog.
  • PRIMARY_REPLICA_REGION: the region to designate as the new primary replica.
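
For example, with an illustrative catalog named my_catalog and us-east1 as the new primary region:

gcloud alpha biglake iceberg catalogs failover my_catalog \
    --primary-replica us-east1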

Hard failover

To initiate a hard failover, run the gcloud alpha biglake iceberg catalogs failover command with the --conditional-failover-replication-time flag:

gcloud alpha biglake iceberg catalogs failover CATALOG_NAME \
    --primary-replica PRIMARY_REPLICA_REGION \
    --conditional-failover-replication-time=REPLICATION_TIMESTAMP

Replace the following:

  • CATALOG_NAME: the name of your catalog.

  • PRIMARY_REPLICA_REGION: the region to designate as the new primary replica.

  • REPLICATION_TIMESTAMP: an RFC 3339 timestamp that acts as a replication checkpoint. The failover verifies that the secondary replica contains all data committed before this time; if it does not, the command fails. To force the failover regardless of any replication delay, set this timestamp to a date far in the past.
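
For example, the following sketch uses an illustrative catalog name, region, and checkpoint timestamp. The failover succeeds only if the secondary replica already contains every commit made before the checkpoint:

gcloud alpha biglake iceberg catalogs failover my_catalog \
    --primary-replica us-east1 \
    --conditional-failover-replication-time=2025-06-01T12:00:00Z

To force the failover regardless of replication delay, pass a timestamp far in the past instead, for example 1970-01-01T00:00:00Z.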

What's next