Cross-region replication and disaster recovery for the Lakehouse runtime catalog protects against regional outages. As part of Lakehouse for Apache Iceberg, this capability enables failover for tables that use the Apache Iceberg REST catalog endpoint.
When managing failovers, you can choose between soft failovers for planned testing or hard failovers to quickly restore service.
How it works
The Lakehouse runtime catalog automatically selects primary and secondary regions for catalog metadata. The primary region processes all table commit metadata and then replicates it to the secondary region for backup.
At any time, especially during a disaster, you can switch the primary and secondary regions for the catalog using the failover operation. This action switches the primary for the catalog and all contained namespaces and tables.
Cross-region replication
Cross-region replication involves two main components: data replication and metastore replication. The disaster recovery feature builds upon cross-region replication to enable failover.
Data replication: Cloud Storage automatically replicates your catalog data across multiple regions when you use a dual-region or multi-region bucket. If a regional outage occurs, your data remains accessible without changes to storage paths.
Metastore replication: For Iceberg REST catalog endpoints, the Lakehouse runtime catalog automatically replicates your metastore when you use a dual-region (or custom dual-region) bucket. Metastore replication begins when you create the catalog. The Lakehouse runtime catalog selects a primary and secondary region from the regions defined in your Cloud Storage configuration. The primary region serves all table commit metadata and replicates it to the secondary region for backup.
Disaster recovery with failover
The disaster recovery feature lets you switch the primary and secondary regions for a catalog. The failover operation switches the primary region for the catalog and all its namespaces and tables. Failovers have two modes: soft failover and hard failover.
Soft failover: A soft failover prevents data loss. In this mode, the new primary region begins to accept writes only after all outstanding data from the previous primary region has synchronized. Use a soft failover for disaster recovery testing or other planned scenarios.
Hard failover: A hard failover prioritizes availability over data consistency and is designed to restore service quickly. In this mode, the new primary region always takes over and accepts write traffic, regardless of the previous primary region's current state. For example, with a hard failover, the new primary region can take over even if the previous primary region is unreachable.
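The difference between the two modes can be illustrated with a small in-memory model. This is only a sketch of the semantics described above, not the service's API: the Region and Catalog classes, the failover method, and the region names are all hypothetical, and the model raises an error where a real soft failover would presumably wait for synchronization to finish.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for one region's copy of the catalog metadata."""
    name: str
    reachable: bool = True
    commits: list = field(default_factory=list)


@dataclass
class Catalog:
    """Toy model: one primary region that replicates commits to one secondary."""
    primary: Region
    secondary: Region

    def commit(self, entry: str) -> None:
        # All table commit metadata lands on the primary first.
        self.primary.commits.append(entry)

    def replicate(self) -> None:
        # Copy the commits the secondary has not yet received.
        missing = self.primary.commits[len(self.secondary.commits):]
        self.secondary.commits.extend(missing)

    def failover(self, mode: str) -> int:
        """Swap primary and secondary; return the number of commits left behind."""
        if mode == "soft":
            # Soft: the new primary accepts writes only after it has caught up,
            # which this toy model assumes requires a reachable old primary.
            if not self.primary.reachable:
                raise RuntimeError("soft failover cannot sync from an unreachable primary")
            self.replicate()
        elif mode != "hard":
            raise ValueError(f"unknown failover mode: {mode!r}")
        # Hard: no synchronization step, so unreplicated commits are left behind.
        left_behind = len(self.primary.commits) - len(self.secondary.commits)
        self.primary, self.secondary = self.secondary, self.primary
        return left_behind


def build_catalog_with_lag() -> Catalog:
    """Create a catalog whose secondary is one commit behind the primary."""
    catalog = Catalog(Region("region-a"), Region("region-b"))
    catalog.commit("commit-1")
    catalog.replicate()
    catalog.commit("commit-2")  # not yet replicated
    return catalog
```

In this model, calling failover("soft") on the catalog returned by build_catalog_with_lag() first copies the outstanding commit and leaves nothing behind, while failover("hard") swaps the roles immediately and leaves one commit behind on the old primary. That is the availability-versus-consistency trade-off the two modes represent.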
Limitations
While this feature is in Preview, the REPLICATION_TIMESTAMP tracks only the catalog metadata, not Cloud Storage files. To bound data loss for the files themselves, see the Cloud Storage Data availability and durability documentation.
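As a rough illustration of what this limitation means in practice, the sketch below estimates the window of catalog metadata that might not have reached the secondary region at the moment of an outage. It assumes that REPLICATION_TIMESTAMP marks the point in time up to which metadata has been replicated, which this page does not state explicitly. The function name and its parameters are hypothetical, and the result says nothing about the replication state of Cloud Storage files.

```python
from datetime import datetime, timedelta, timezone


def metadata_exposure_window(replication_timestamp: datetime,
                             outage_time: datetime) -> timedelta:
    """Return the time span of catalog metadata that may not be replicated.

    Assumption: replication_timestamp is the time through which catalog
    metadata has reached the secondary region. Cloud Storage files are
    not covered by this value.
    """
    if replication_timestamp.tzinfo is None or outage_time.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    # A timestamp at or after the outage means no metadata is exposed.
    return max(outage_time - replication_timestamp, timedelta(0))


# Example values chosen for illustration only.
last_replicated = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
outage_started = datetime(2024, 1, 1, 12, 0, 45, tzinfo=timezone.utc)
window = metadata_exposure_window(last_replicated, outage_started)
```

With these example values, window is 45 seconds: under the stated assumption, metadata committed during that span is what a hard failover could leave behind.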
What's next
- Use cross-region replication and disaster recovery with the Lakehouse runtime catalog endpoint.