OpenShift on Google Cloud: disaster recovery strategies for active-passive and active-inactive setups

This document describes how to plan and implement active-passive and active-inactive disaster recovery (DR) for OpenShift deployments on Google Cloud so that you can achieve minimal downtime and rapid recovery in the event of a disaster. It provides best practices for backing up data, managing configuration as code, and handling secrets to help ensure that you can quickly recover your applications.

This document is intended for system administrators, cloud architects, and application developers who are responsible for maintaining the availability and resilience of applications on an OpenShift Container Platform that's deployed on Google Cloud.

This document is part of a series that focuses on the application-level strategies that ensure your workloads remain highly available and quickly recoverable in the face of failures. It assumes that you have read Best practices for disaster recovery with OpenShift. The documents in this series are as follows:

Architectures for disaster recovery

This section describes architectures for active-passive and active-inactive disaster recovery scenarios.

Products used

Active-passive deployments

The following diagram shows an active-passive deployment scenario for OpenShift on Google Cloud.

Diagram: Active-passive deployment, explained in the following text.

As shown in the preceding diagram, in an active-passive deployment for disaster recovery, an OpenShift cluster in the primary region acts as the active site and handles all production traffic. A secondary cluster in a different region is the standby for disaster recovery. The secondary cluster is pre-provisioned and kept in a warm state, which means that it's set up with the necessary infrastructure and application components but doesn't actively serve traffic until it's needed. This setup lets the secondary cluster take over with minimal delay if the primary cluster fails. Application data is replicated to the passive cluster to minimize data loss and to align with your recovery point objective (RPO).

Description of components in an active-passive DR scenario

The architecture has the following configuration:

  • Primary OpenShift cluster (active): Located in the primary Google Cloud region, this cluster runs the production workload and actively serves all user traffic under normal operating conditions.
  • Secondary OpenShift cluster (passive): Located in a separate Google Cloud region for fault isolation, this cluster acts as the warm standby cluster. It's partially set up and running and is ready to take over if the primary system fails. It has the necessary infrastructure, OpenShift configuration, and application components deployed on it, but it doesn't serve live production traffic until a failover event is triggered.
  • Google Cloud regions: Geographically isolated locations that provide the foundation for disaster recovery. Using separate regions ensures that a large-scale event impacting one region doesn't affect the standby cluster.
  • Global External HTTPS Load Balancer: Acts as the single, global entry point for application traffic. Under normal conditions, it's configured to route all traffic to the primary (active) cluster. Its health checks monitor the primary cluster's availability.
  • Data replication mechanism: A continuous process or set of tools that copies essential application data (for example, database contents or persistent volume state) from the primary cluster to the secondary cluster. This approach ensures data consistency and minimizes data loss during a failover, helping you to meet your RPO.
  • Monitoring and health checks: Systems that continuously assess the health and availability of the primary cluster and its applications (for example, Cloud Monitoring, load balancer health checks, and internal cluster monitoring). These systems are important for the quick detection of any failures.
  • Failover mechanism: A predefined process (manual, semi-automated, or fully automated) to redirect traffic from the primary to the secondary cluster upon detection of an unrecoverable failure in the primary. This process typically involves updating the Global Load Balancer's backend configuration to target the secondary cluster, making it the new active site.
  • VPC Network: The underlying Google Cloud network infrastructure that creates the necessary connectivity between regions for data replication and management.

Active-inactive deployments

Active-inactive DR involves maintaining a secondary region as a standby that is activated only during disasters. Unlike active-passive setups, where data is continuously replicated, this strategy relies on periodic backups that are stored in Cloud Storage, with infrastructure provisioned and data restored during failover. This approach minimizes costs, which makes it ideal for applications that can tolerate longer recovery times. It can also help organizations to align with extended recovery time objectives (RTOs) and recovery point objectives (RPOs).

In an active-inactive DR scenario, data is regularly backed up to the standby region, but not actively replicated. The infrastructure is provisioned as part of the failover process, and data is restored from the most recent backup. You can use the OpenShift API for Data Protection (OADP), which is based on the Velero open-source project, to perform regular backups. We recommend that you store these backups in Cloud Storage buckets with versioning enabled. In the event of a disaster, you can use OADP to restore the contents of the cluster. This approach minimizes ongoing costs, but it results in a longer RTO and a potentially higher RPO compared to active-passive DR, which makes it suitable for applications with longer recovery time objectives.
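
To illustrate the backup part of this approach, you can express a recurring OADP backup as a Velero Schedule resource. The following is a minimal sketch that assumes a hypothetical application namespace and an existing backup storage location that targets your Cloud Storage bucket; adapt the names, cron expression, and retention period to your environment:

        apiVersion: velero.io/v1
        kind: Schedule
        metadata:
          name: daily-app-backup           # hypothetical name
          namespace: openshift-adp         # default OADP operator namespace
        spec:
          schedule: "0 2 * * *"            # standard five-field cron: daily at 02:00
          template:
            includedNamespaces:
              - my-app                     # hypothetical application namespace
            storageLocation: default       # BackupStorageLocation that targets Cloud Storage
            ttl: 720h0m0s                  # retain each backup for 30 days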

The following diagram shows an active-inactive deployment and the failover process.

Diagram: The active-inactive failover process.

The failover process is as follows:

  1. A DR event is triggered when a monitored service becomes unavailable.
  2. A pipeline automatically provisions infrastructure in the DR region.
  3. A new OpenShift cluster is provisioned.
  4. Application data, secrets, and objects are restored from the latest backup through OADP, as shown in the sketch after this list.
  5. The Cloud DNS record is updated to point to the regional load balancers in the DR region.
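
The following is a minimal sketch of the restore in step 4, expressed as a Velero Restore resource that you create in the newly provisioned cluster. The backup name and application namespace are hypothetical:

        apiVersion: velero.io/v1
        kind: Restore
        metadata:
          name: dr-restore                 # hypothetical name
          namespace: openshift-adp
        spec:
          backupName: daily-app-backup-20240101020000   # hypothetical name of the latest backup
          restorePVs: true                 # also restore persistent volume data
          includedNamespaces:
            - my-app                       # hypothetical application namespace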

As shown in the preceding diagram, two separate OpenShift regional clusters are deployed, each in a different Google Cloud region, such as us-central1 and europe-west1. Each cluster must be highly available within its region and use multiple zones to allow for redundancy.

Description of components in an active-inactive DR scenario

The architecture has the following configuration:

  • Primary region (Region A): Contains the fully operational OpenShift cluster serving production traffic.
  • Secondary region (Region B): Initially contains minimal resources (VPC and subnets). Infrastructure (Compute Engine instances and OCP) is provisioned during failover.
  • Backup storage: Cloud Storage buckets store periodic backups (OADP or Velero for application objects, as well as PV and database backups). We recommend that you enable versioning and cross-region replication for the buckets.
  • Configuration management: A Git repository stores infrastructure as code (IaC) definitions (for example, Terraform) and Kubernetes or OpenShift manifests (for GitOps).
  • Backup tooling: OADP (Velero) is configured in the primary cluster to perform scheduled backups to Cloud Storage, as shown in the sketch after this list.
  • Orchestration: Scripts or automation tools trigger infrastructure provisioning and restore processes during failover.
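
The following is a minimal sketch of the backup tooling configuration, expressed as an OADP DataProtectionApplication resource. The bucket name and credentials secret are assumptions, and field names can vary between OADP versions, so verify them against the Red Hat documentation:

        apiVersion: oadp.openshift.io/v1alpha1
        kind: DataProtectionApplication
        metadata:
          name: dr-dpa                     # hypothetical name
          namespace: openshift-adp
        spec:
          configuration:
            velero:
              defaultPlugins:
                - gcp                      # object storage plugin for Cloud Storage
                - openshift
                - csi
            nodeAgent:
              enable: true
              uploaderType: kopia          # file-level backups of persistent volumes
          backupLocations:
            - velero:
                provider: gcp
                default: true
                credential:
                  name: cloud-credentials-gcp   # secret that holds the service account key
                  key: cloud
                objectStorage:
                  bucket: example-dr-backups    # hypothetical Cloud Storage bucket
                  prefix: velero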

Use cases

This section provides examples of the different use cases for active-passive and active-inactive deployments.

Active-passive DR use cases

Active-passive DR is recommended for the following use cases:

  • Applications that require a lower RTO (for example, minutes to hours) than would be achievable with cold restores, where data is restored from a backup that is not immediately accessible.
  • Systems where continuous data replication is feasible and the RPO must be minimized (for example, minutes to seconds).
  • Regulated industries with strict downtime limits and critical business applications where the cost of maintaining a warm standby cluster is justified by the business impact of downtime.

Active-inactive DR use cases

Active-inactive DR is recommended for the following use cases:

  • Applications that can tolerate longer RTOs (for example, several minutes to hours).
  • Environments where cost optimization is important, and the expense of a continuously running standby cluster is prohibitive. The primary ongoing cost is for object storage rather than for running compute instances.
  • Development, testing, or less critical production workloads.
  • Archival or batch processing systems where recovery time is less critical.

Design considerations

This section describes design factors, best practices, and design recommendations that you should consider when you use this reference architecture to develop a topology that meets your specific requirements for security, reliability, cost, and performance.

Active-passive design considerations

This section describes the design considerations for an active-passive DR scenario.

Safeguarding application state and configuration

OpenShift Container Platform provides OADP, which offers comprehensive disaster recovery protection for applications that run in clusters. You can use it to back up the Kubernetes and OpenShift objects that are used by both containerized applications and virtual machines (for example, deployments, services, routes, PVCs, ConfigMaps, secrets, and CRDs). However, OADP does not support full cluster backup and restore. To learn how to configure and schedule backups, and how to perform restore operations, see the Red Hat documentation.

OADP also provides backup and restore processes for persistent volumes that rely on the block storage and NFS stores that your applications use. You can perform these backups by using tools such as Restic or Kopia to take snapshots or to perform file-level backups.

OADP is useful for backing up object definitions, ensuring configuration consistency, and, if needed, restoring specific applications or namespaces. It complements data replication.
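
As a sketch of how object definitions and persistent volume contents can be captured together, the following Backup resource uses the file-system backup path. The names are hypothetical, and the exact fields depend on your OADP and Velero versions:

        apiVersion: velero.io/v1
        kind: Backup
        metadata:
          name: app-backup-with-volumes    # hypothetical name
          namespace: openshift-adp
        spec:
          includedNamespaces:
            - my-app                       # hypothetical application namespace
          defaultVolumesToFsBackup: true   # back up PV contents with Kopia or Restic
          storageLocation: default
          ttl: 720h0m0s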

To further reduce RPO and RTO in an active-passive configuration, we recommend that you configure data replication between primary and secondary regions.

Data replication is important for ensuring that the secondary cluster can take over seamlessly. As outlined in the following sections, how you implement data replication from the primary to the secondary cluster depends on the storage type that the application uses.

Block storage (persistent volumes)

Use Persistent Disk Asynchronous Replication to copy data from the primary to the secondary region. In this approach, you create a primary disk in the primary region and a secondary disk in the secondary region, and you set up replication between them. The use of consistent groups ensures that both disks contain replicated data from a common point in time, which is then used for DR. To learn more, see Configure Persistent Disk Asynchronous Replication.

PersistentVolumes objects

In OpenShift, create PersistentVolume objects in both clusters that link to these disks, and make sure that applications use the same PersistentVolumeClaims (PVCs) in both clusters.
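
The following is a minimal sketch of a statically provisioned PersistentVolume that binds the replicated disk in the secondary region to a claim. The project, disk path, and resource names are placeholders; verify the CSI driver name and volume handle format for your cluster:

        apiVersion: v1
        kind: PersistentVolume
        metadata:
          name: replicated-app-data        # hypothetical name
        spec:
          capacity:
            storage: 100Gi
          accessModes:
            - ReadWriteOnce
          storageClassName: ""             # static binding, no dynamic provisioning
          csi:
            driver: pd.csi.storage.gke.io  # GCP Persistent Disk CSI driver used by OpenShift
            # Points at the pre-created (replicated) disk in this cluster's region.
            volumeHandle: projects/example-project/zones/us-central1-a/disks/app-data-disk
            fsType: ext4
          claimRef:
            name: app-data-pvc             # PVC name, identical in both clusters
            namespace: my-app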

Application-level replication

Some applications (for example, databases and message queues) have built-in replication features that you can configure across clusters. You can also use a managed service like Pub/Sub to facilitate replication for specific types of application data or events.

Database backups

Applications can depend on different kinds of database products. To help outline design considerations for database backups, this document uses PostgreSQL as an example database.

Self-hosted backups using an in-cluster database operator

Database operators like the CloudNative PostgreSQL Operator can facilitate scheduled backups and disaster recovery for PostgreSQL clusters. The CloudNative PostgreSQL Operator natively integrates with tools like pg_basebackup and supports streaming replication backups. You can store backups in a cloud storage service such as Cloud Storage for durability and recovery.

You can set up streaming replication between the primary and secondary regional clusters to ensure that data is available even if there's an outage in the primary region. This streaming replication is typically synchronous within a region and asynchronous across regions. For detailed configuration steps, see the CloudNativePG documentation.

In the case of a disaster, you can restore backups to a new PostgreSQL cluster, which helps to minimize downtime and data loss. The following is an example configuration snippet for enabling scheduled backups using the CloudNative PostgreSQL Operator:

        apiVersion: postgresql.cnpg.io/v1
        kind: ScheduledBackup
        metadata:
          name: backup-example
        spec:
          schedule: "0 0 0 * * *"
          backupOwnerReference: self
          cluster:
            name: pg-backup
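
The ScheduledBackup object references a Cluster resource, which defines where backups are written. The following sketch shows one way to direct backups to a Cloud Storage bucket; the bucket path, credentials secret, and retention policy are assumptions, and the exact fields depend on your operator version:

        apiVersion: postgresql.cnpg.io/v1
        kind: Cluster
        metadata:
          name: pg-backup                  # matches the cluster name in the ScheduledBackup
        spec:
          instances: 3
          storage:
            size: 20Gi
          backup:
            retentionPolicy: "30d"
            barmanObjectStore:
              destinationPath: gs://example-dr-backups/postgres   # hypothetical bucket path
              googleCredentials:
                applicationCredentials:
                  name: gcs-backup-creds   # secret with a service account key
                  key: credentials.json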

Managed services

Managed databases like Cloud SQL have built-in backup and replication features. We recommend that you set up asynchronous replication from the primary database instance to a replica in the secondary region. To learn more, see About replication in Cloud SQL. In OpenShift, configure secrets or config maps to point to the correct database connection strings for each cluster.
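
As a sketch of this per-cluster configuration, each cluster can carry a Secret with its own regional connection string. The names and values in the following example are placeholders:

        apiVersion: v1
        kind: Secret
        metadata:
          name: db-connection              # hypothetical name, same in both clusters
          namespace: my-app
        type: Opaque
        stringData:
          # In the secondary cluster, this value points at the regional read replica
          # (or at the promoted primary after failover).
          DATABASE_URL: postgresql://appuser:CHANGE_ME@10.20.0.5:5432/appdb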

Because asynchronous replication results in a non-zero RPO, the most recent data writes might be lost during a failover. You must design your application to tolerate this potential data loss. Alternatively, consider using another replication method.

We also recommend that you enable Cloud SQL automated backups. To learn more, see Create and manage on-demand and automatic backups.

Failover process

In the event of primary cluster failure, Cloud DNS automatically redirects traffic to the secondary regional cluster based on health checks and failover policies.

When the secondary database instance is promoted from a read replica to a primary, the secondary site can take over as the active site and serve production traffic. This promotion is necessary so that the database can accept writes.

To set up DR for Cloud SQL, follow the steps that are described in the Cloud SQL disaster recovery documentation. Because asynchronous database or storage replication results in a non-zero RPO, ensure that your application can tolerate the loss of the most recent writes. Alternatively, consider using another replication method.

Secure secret management

Secrets like database passwords, API keys, and TLS certificates are important aspects of DR. You must be able to restore these secrets securely and reliably in a new cluster.

Common approaches to secret management are as follows:

  • Use external secrets: Use a tool like the External Secrets Operator to pull secrets from Secret Manager (see the sketch after this list).
  • Back up secrets with the OADP Operator: If you don't use an external store, ensure that secrets are included in your backups.
  • Regular rotation: Rotate secrets regularly and ensure your secret management strategy accommodates DR scenarios.
  • Testing: Test secret restoration in a staged environment to confirm that all services can start with the provided credentials.
  • Validation: Validate that your DR cluster has the necessary IAM roles or authentication methods to retrieve secrets from external stores.
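
The following manifests are a minimal sketch of the external-store approach. They assume that the External Secrets Operator is installed and can authenticate to Secret Manager (for example, through Workload Identity); the resource names, project ID, and API version are assumptions:

        apiVersion: external-secrets.io/v1beta1
        kind: SecretStore
        metadata:
          name: gcp-secret-manager         # hypothetical name
          namespace: my-app
        spec:
          provider:
            gcpsm:
              projectID: example-project   # hypothetical Google Cloud project
        ---
        apiVersion: external-secrets.io/v1beta1
        kind: ExternalSecret
        metadata:
          name: db-credentials
          namespace: my-app
        spec:
          refreshInterval: 1h
          secretStoreRef:
            name: gcp-secret-manager
            kind: SecretStore
          target:
            name: db-credentials           # Kubernetes Secret created in the cluster
          data:
            - secretKey: password
              remoteRef:
                key: prod-db-password      # secret name in Secret Manager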

Networking and traffic management

Use Google Cloud's Global External HTTPS Load Balancer as the primary ingress point to distribute traffic between multiple OpenShift clusters (for example, primary and secondary clusters). This global service directs user requests to the appropriate backend cluster based on proximity, health, and availability.

To connect the Global Load Balancer to your OpenShift clusters, you can use either of the following approaches:

  • Use regional load balancers (internet NEGs): Configure Google Cloud internet network endpoint groups (NEGs) to point to the external IP addresses of the regional load balancers that expose each OpenShift cluster's ingress services (OCP Routers). The Global Load Balancer then routes traffic to these regional load balancer IPs. This approach provides an abstraction layer but involves an additional network hop.
  • Direct pod routing (GCE_VM_IP_PORT NEGs): Configure the OpenShift Ingress Controller integration to use Google Cloud network endpoint groups (NEGs) of type GCE_VM_IP_PORT. This approach allows the Global Load Balancer to target the OpenShift Ingress Controller (Router) pods directly by using their internal PodIP:TargetPort. This method bypasses the extra hop and additional node proxying. It typically results in lower latency, and enables more direct health checking from the global load balancer.

Both setups allow the Global Load Balancer to manage traffic distribution effectively across clusters in different regions. To learn more, see Set up a global external Application Load Balancer with an external backend.

VPCs

We recommend the following approaches for VPC management:

  • Shared VPC: Use a Shared VPC to centralize network management for both primary and secondary clusters. This approach simplifies administration and ensures consistent network policies across regions.
  • Global Dynamic Routing: Enable Global Dynamic Routing within your VPCs to automatically propagate routes between regions, ensuring seamless connectivity between clusters.
  • Custom Mode VPCs: Use custom mode VPCs and create specific subnets in the regions where your clusters run. This is often necessary for VPC-native pod networking required by methods like GCE_VM_IP_PORT routing.
  • VPC Network Peering: If it's necessary for you to use separate VPC networks for each region and cluster, use VPC Network Peering to connect the regions and clusters.

Subnets and IP addresses

Create regional subnets in each region to maintain network segmentation and avoid IP address conflicts.

Ensure that there are non-overlapping IP ranges between regions to prevent routing issues.

Cross-cluster traffic with Red Hat Service Mesh

OpenShift supports Service Mesh federation, which enables communication between services deployed across multiple OpenShift clusters. This function is particularly useful for DR scenarios where services might need to communicate across clusters during failover or data replication.

To learn how to set up Service Mesh federation between primary and secondary clusters, see the Red Hat documentation.

Active-inactive design considerations

This section describes the design considerations for an active-inactive DR solution.

Application configuration as code (GitOps)

We recommend that you adopt a GitOps approach and store all cluster and application configurations in a Git repository. This approach enables quick restoration in a DR scenario because you can sync another cluster to a configuration that is known to run reliably. Backups give you snapshots of your runtime state; however, you also need a reliable way to rapidly redeploy application logic, manifests, and infrastructure definitions after a disaster.

Use the OpenShift GitOps Operator

The OpenShift GitOps operator, based on Argo CD, provides a Red Hat-supported way to implement GitOps patterns directly within an OpenShift environment. It automates the process of continuously reconciling your cluster state with the configuration that you store in a Git repository.

The OpenShift GitOps operator's controller continuously ensures that the cluster's state matches the configuration defined in this repository. If resources drift or are missing, it automatically reconciles them. To learn more, see About Red Hat OpenShift GitOps.

DR scenario execution

In the event of a disaster, do the following:

  • Set up a new OpenShift cluster in another region.
  • Install the OpenShift GitOps operator.
  • Apply the same Application manifest referencing your Git repository.

The operator synchronizes the cluster state to match your repository, quickly redeploying deployments, services, routes, operators, and any other resources that are defined in your code.
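
The following is a minimal sketch of such an Application manifest; the repository URL, path, and namespaces are placeholders:

        apiVersion: argoproj.io/v1alpha1
        kind: Application
        metadata:
          name: my-app                     # hypothetical name
          namespace: openshift-gitops
        spec:
          project: default
          source:
            repoURL: https://example.com/example-org/app-config.git   # hypothetical repository
            targetRevision: stable         # tag or branch that you consider DR-ready
            path: overlays/dr
          destination:
            server: https://kubernetes.default.svc
            namespace: my-app
          syncPolicy:
            automated:
              prune: true
              selfHeal: true
            syncOptions:
              - CreateNamespace=true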

To help avoid any issues during DR, we recommend that you do the following:

  • Maintain strict branching and tagging strategies in your Git repository so you can identify stable configurations suitable for DR.
  • Check that your DR cluster has network connectivity and appropriate permissions to access the Git repository.
  • Include all resource types as code to avoid manual intervention during failover (for example, infrastructure components, application workloads, and configurations).

Firewall rules

Define unified firewall policies and apply them consistently across both clusters to control traffic flow and enhance security.

Follow the least privilege principle, which means that you restrict the inbound and outbound traffic to only what's necessary for application functionality.
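
VPC firewall rules are typically defined outside the cluster in your IaC. Inside the cluster, you can apply the same least-privilege principle with Kubernetes NetworkPolicy objects and keep them in Git so that both clusters stay consistent. The following is a minimal sketch that allows ingress only from the OpenShift router; the namespace and label are assumptions that you should verify for your network plugin:

        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        metadata:
          name: allow-ingress-from-router  # hypothetical name
          namespace: my-app
        spec:
          podSelector: {}                  # applies to all pods in the namespace
          policyTypes:
            - Ingress
          ingress:
            - from:
                - namespaceSelector:
                    matchLabels:
                      # Label that OpenShift applies to its default ingress namespace;
                      # verify the label for your network plugin.
                      network.openshift.io/policy-group: ingress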

Deployment

To learn how to deploy a topology based on this reference architecture, see the Red Hat documentation.

What's next