kubectl command-line tool.
Overview
You enable HA in your database cluster by configuring the AlloyDB Omni Kubernetes operator to create standby replicas of your primary database instance. The AlloyDB Omni operator configures your database cluster to continuously update the data on each standby replica so that it matches all data changes on your primary instance.
Enable HA
Before you enable HA on your database cluster, ensure that your Kubernetes cluster has the following:
- Storage for two complete copies of your data 
- Compute resources for two database instances running in parallel 
To enable HA, follow these steps:
- Modify the database cluster's manifest to include an availability section under its spec section. This section uses the numberOfStandbys parameter to specify the number of standbys to add.
  spec:
    availability:
      numberOfStandbys: NUMBER_OF_STANDBYS
  Replace NUMBER_OF_STANDBYS with the number of standbys that you want to add. The maximum value is 5. If you're unsure about the number of standbys you need, then start by setting the value to 1 or 2.
- Re-apply the manifest. 
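As a rough sketch of the last step, the following shell functions re-apply a manifest file and then read back the numberOfStandbys value that the cluster resource holds. The function names are hypothetical and chosen only for illustration; the JSONPath expression mirrors the spec.availability.numberOfStandbys path from the manifest snippet in the previous step.

```shell
# Illustrative helpers; the function names are hypothetical.

# Re-apply a DBCluster manifest file.
apply_dbcluster_manifest() {
  manifest_file="$1"
  kubectl apply -f "$manifest_file"
}

# Print the number of standbys currently set in the cluster's spec.
get_number_of_standbys() {
  db_cluster_name="$1"
  namespace="$2"
  kubectl get dbcluster.alloydbomni.dbadmin.goog "$db_cluster_name" \
    -n "$namespace" \
    -o jsonpath='{.spec.availability.numberOfStandbys}'
}
```

For example, run apply_dbcluster_manifest with the path to your manifest file, and then run get_number_of_standbys DB_CLUSTER_NAME NAMESPACE to confirm that the change was stored.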
Disable HA
To disable HA, follow these steps:
- Set numberOfStandbys to 0 in the cluster's manifest:
  spec:
    availability:
      numberOfStandbys: 0
- Re-apply the manifest. 
Verify HA on a database cluster
To check the status of HA on a database cluster, check its HAReady status. If
the status of HAReady is True, then HA is enabled and working on the
cluster. If it's False, then it's enabled but not ready because it's in the
process of being set up.
To check the HAReady status using the kubectl command-line tool, run the
following command:
kubectl get dbcluster.alloydbomni.dbadmin.goog DB_CLUSTER_NAME -o jsonpath='{.status.conditions[?(@.type=="HAReady")]}' -n NAMESPACE
Replace the following:
- DB_CLUSTER_NAME: the name of the database cluster.
- NAMESPACE: the namespace of the database cluster.
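Because HA can take a while to become ready after you enable it, it might be convenient to block until the condition is met instead of re-running the query. The following is a minimal sketch that uses kubectl wait. It assumes that HAReady is reported as a regular entry in status.conditions whose status becomes True, as the query above suggests. The function name and the 300s default timeout are hypothetical.

```shell
# Illustrative helper; the function name and default timeout are hypothetical.
# Blocks until the HAReady condition is True or the timeout expires.
wait_for_ha_ready() {
  db_cluster_name="$1"
  namespace="$2"
  timeout="${3:-300s}"
  kubectl wait "dbcluster.alloydbomni.dbadmin.goog/$db_cluster_name" \
    -n "$namespace" \
    --for=condition=HAReady \
    --timeout="$timeout"
}
```

kubectl wait exits with a non-zero status if the timeout expires first, so the function can be used directly in a script condition.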
Fail over to a standby instance
If your primary instance becomes unavailable for a configurable period of time, then the AlloyDB Omni operator automatically fails over from the primary database instance to the standby instance. The following parameters determine when to start an automatic failover:
- The time between health checks (default is 30 seconds) 
- The number of consecutive failed health checks (default is 3) 
An automatic failover starts after the specified number of consecutive failed health checks occur, with the specified time between each failed health check. If the default values are retained, then an automatic failover occurs after 3 consecutive health check failures that are each 30 seconds apart.
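To make the timing concrete, the following sketch estimates how long an outage can last before an automatic failover starts. It assumes that health checks run at fixed intervals and that a failed check is reported immediately, and it ignores the time that the failover itself takes. The function names are hypothetical.

```shell
# Illustrative arithmetic only; the function names are hypothetical.

# Longest wait: the outage begins just after a successful health check,
# so a full period passes before each of the failed checks.
max_detection_seconds() {
  period="$1"
  threshold="$2"
  echo $((period * threshold))
}

# Shortest wait: the outage begins just before a health check,
# so the first failure is observed almost immediately.
min_detection_seconds() {
  period="$1"
  threshold="$2"
  echo $((period * (threshold - 1)))
}
```

With the default values of 30 seconds and 3 failures, the estimate is between 60 and 90 seconds from the start of the outage to the start of the failover.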
Performing a manual failover is a good option when you want to quickly recover from an unexpected failure and minimize downtime.
The AlloyDB Omni operator supports both automatic and manual failover. Automatic failover is enabled by default.
Failover results in the following sequence of events:
- The AlloyDB Omni operator takes the primary database instance offline. 
- The AlloyDB Omni operator promotes the standby replica to be the new primary database instance. 
- The AlloyDB Omni operator deletes the previous primary database instance. 
- The AlloyDB Omni operator creates a new standby replica. 
Disable automatic failover
Automatic failovers are enabled by default on database clusters.
To disable automatic failover, follow these steps:
- Set enableAutoFailover to false in the cluster's manifest:
  spec:
    availability:
      enableAutoFailover: false
Adjust automatic failover trigger settings
You can use settings to adjust automatic failovers for each database cluster.
The AlloyDB Omni operator issues regular health checks to the primary
database instance as well as to all standby replicas. The default frequency for
health checks is 30 seconds. If an instance reaches the automatic
failover trigger threshold, then the AlloyDB Omni operator triggers an
automatic failover.
The threshold value is the number of consecutive failures for the health check
before a failover is triggered. To change the health check period or the
threshold value, set the healthcheckPeriodSeconds and
autoFailoverTriggerThreshold fields to integer values in the cluster's
manifest:
spec:
  availability:
    healthcheckPeriodSeconds: HEALTHCHECK_PERIOD
    autoFailoverTriggerThreshold: AUTOFAILOVER_TRIGGER_THRESHOLD
Replace the following:
- HEALTHCHECK_PERIOD: an integer value that indicates the number of seconds to wait between each health check. The default value is 30. The minimum value is 1 and the maximum value is 86400 (equivalent to a day).
- AUTOFAILOVER_TRIGGER_THRESHOLD: an integer value for the number of consecutive failures for the health check before a failover is triggered. The default value is 3. The minimum value is 0. There is no maximum value. If this field is set to 0, then the default value of 3 is used instead.
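If you generate manifests from a script, a small check on these two values before you apply the manifest can catch out-of-range input early. The following sketch encodes only the limits listed above; the function names are hypothetical, and the operator performs its own validation regardless.

```shell
# Illustrative helpers; the function names are hypothetical.

# Succeeds only if the argument is a non-negative integer.
is_non_negative_integer() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Prints the effective "period threshold" pair, or fails with a message.
validate_failover_settings() {
  period="$1"
  threshold="$2"
  if ! is_non_negative_integer "$period" || ! is_non_negative_integer "$threshold"; then
    echo "both values must be non-negative integers" >&2
    return 1
  fi
  if [ "$period" -lt 1 ] || [ "$period" -gt 86400 ]; then
    echo "healthcheckPeriodSeconds must be between 1 and 86400" >&2
    return 1
  fi
  # A threshold of 0 means that the default value of 3 is used.
  if [ "$threshold" -eq 0 ]; then
    threshold=3
  fi
  echo "$period $threshold"
}
```

For example, validate_failover_settings 30 0 prints 30 3, because a threshold of 0 falls back to the default.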
Trigger a manual failover
To trigger a manual failover, create and apply a manifest for a new failover resource:
apiVersion: alloydbomni.dbadmin.goog/v1
kind: Failover
metadata:
  name: FAILOVER_NAME
  namespace: NAMESPACE
spec:
  dbclusterRef: DB_CLUSTER_NAME
Replace the following:
- FAILOVER_NAME: a name for this resource, for example, failover-1.
- NAMESPACE: the namespace for this failover resource, which must match the namespace of the database cluster that it applies to.
- DB_CLUSTER_NAME: the name of the database cluster to fail over.
To monitor the failover, run the following command:
kubectl get failover FAILOVER_NAME -o jsonpath={.status.state} -n NAMESPACE
Replace the following:
- FAILOVER_NAME: the name that you assigned the failover resource when you created it.
- NAMESPACE: the namespace of the database cluster.
The command returns Success after the new primary database instance is ready
for use. To monitor the status of the new standby instance, see the next section.
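Instead of re-running the monitoring command by hand, you can poll it in a loop. The following is a minimal sketch that repeats the query until the state is Success or a retry limit is reached. The function name, the default of 60 attempts, and the 5-second interval are hypothetical, and the sketch treats every state other than Success as "not finished yet" because the other possible state values aren't listed here.

```shell
# Illustrative helper; the function name and defaults are hypothetical.
# Polls the failover resource until its state is Success.
wait_for_failover() {
  failover_name="$1"
  namespace="$2"
  attempts="${3:-60}"
  interval="${4:-5}"
  state=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(kubectl get failover "$failover_name" -n "$namespace" \
      -o jsonpath='{.status.state}')"
    if [ "$state" = "Success" ]; then
      echo "$state"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "gave up waiting; last state: $state" >&2
  return 1
}
```

For example, wait_for_failover failover-1 NAMESPACE returns with exit status 0 as soon as the new primary database instance is ready for use.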
Switch over to a standby instance
Perform a switchover when you want to test your disaster recovery setup or carry out any other planned activity that requires switching the roles of the primary database and the standby replica.
After the switchover completes, the direction of replication and the roles of the primary database instance and the standby replica are reversed. Use switchovers to gain more control over testing your disaster recovery setup with no data loss.
The AlloyDB Omni operator supports manual switchover. A switchover results in the following sequence of events:
- The AlloyDB Omni operator takes the primary database instance offline. 
- The AlloyDB Omni operator promotes the standby replica to be the new primary database instance. 
- The AlloyDB Omni operator switches the previous primary database instance to be a standby replica. 
Perform a switchover
Before you begin a switchover, do the following:
- Verify that both your primary database instance and the standby replica are in a healthy state. For more information, see Manage and monitor AlloyDB Omni. 
- Verify that the current HA status is in the HAReady condition. For more information, see Verify HA on a database cluster.
To perform a switchover, create and apply a manifest for a new switchover resource:
apiVersion: alloydbomni.dbadmin.goog/v1
kind: Switchover
metadata:
  name: SWITCHOVER_NAME
spec:
  dbclusterRef: DB_CLUSTER_NAME
  newPrimary: STANDBY_REPLICA_NAME
Replace the following:
- SWITCHOVER_NAME: a name for this switchover resource, for example, switchover-1.
- DB_CLUSTER_NAME: the name of the database cluster that the switchover operation applies to.
- STANDBY_REPLICA_NAME: the name of the database instance that you want to promote to be the new primary.
  To identify the standby replica name, run the following command:
  kubectl get instances.alloydbomni.internal.dbadmin.goog
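If you prefer not to keep a separate file for each switchover, one option is to generate the manifest on the fly and pipe it to kubectl. The following sketch prints the same manifest that is shown above with the three placeholders filled in. The function name is hypothetical, and the manifest contains only the fields from that example.

```shell
# Illustrative helper; the function name is hypothetical.
# Prints a Switchover manifest for the given names.
render_switchover_manifest() {
  switchover_name="$1"
  db_cluster_name="$2"
  standby_replica_name="$3"
  cat <<EOF
apiVersion: alloydbomni.dbadmin.goog/v1
kind: Switchover
metadata:
  name: ${switchover_name}
spec:
  dbclusterRef: ${db_cluster_name}
  newPrimary: ${standby_replica_name}
EOF
}
```

To apply the result, pipe it to kubectl, for example: render_switchover_manifest switchover-1 DB_CLUSTER_NAME STANDBY_REPLICA_NAME | kubectl apply -f -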
Heal a standby instance automatically
If a standby instance becomes unavailable, then the AlloyDB Omni operator heals the instance by deleting the old standby replica and creating a new standby replica that takes its place. The default time to trigger an automatic heal is 90 seconds.
Automatically healing the database cluster helps maintain healthy, continuous replication of the primary database.
Disable healing the instance automatically
By default, healing a standby instance automatically is enabled on database clusters.
To disable automatic heals, follow these steps:
- In the cluster's manifest, set enableAutoHeal to false:
  spec:
    availability:
      enableAutoHeal: false
Adjust automatic heal trigger settings
For each database cluster, you can use settings to adjust automatic heals.
The AlloyDB Omni operator issues regular health checks that you can configure. For more information, see Adjust automatic failover trigger settings. If a standby replica reaches the automatic heal trigger threshold, then the AlloyDB Omni operator triggers an automatic heal.
The threshold value is the number of consecutive failures for the health check
before a heal is triggered. To change the threshold value, set
autoHealTriggerThreshold in the cluster's manifest:
spec:
  availability:
    autoHealTriggerThreshold: AUTOHEAL_TRIGGER_THRESHOLD
Replace the following:
- AUTOHEAL_TRIGGER_THRESHOLD: an integer value for the number of consecutive failures for the health check before a heal is triggered. The default value is 5. The minimum value is 2 because of possible transient, one-time errors on standby health checks.
Troubleshoot instance healing
If you use a large number of database clusters in a single Kubernetes cluster,
or if you have database clusters that are underprovisioned, automatic healing
might cause unavailability for the AlloyDB Omni operator or the database
clusters. If you receive an error that starts with
HealthCheckProber: health check for instance failed and the error references a
timeout or failure to connect, try the following recommended fixes:
- Reduce the number of database clusters you're managing in your Kubernetes cluster. 
- Increase the healthcheckPeriodSeconds value so that health checks occur less frequently. For more information, see Adjust automatic failover trigger settings.
- Increase the autoHealTriggerThreshold value so that automatic healing occurs less frequently. For more information, see Adjust automatic heal trigger settings.
- Disable automatic healing on database clusters. For more information, see Disable healing the instance automatically. 
- Increase the CPU limit, the memory limit, or both, for the AlloyDB Omni operator. For more information, see About automatic memory management and Analyze AlloyDB Omni operator memory heap usage. 
- Increase the resource limits for the AlloyDB Omni database clusters. For more information, see Resize your Kubernetes-based database cluster. 
The following are examples of errors that might be caused by overwhelming automatic healing. These examples omit environment details that help you identify the error source, such as the name of a cluster or an IP address.
- HealthCheckProber: health check for instance failed" err="DBSE0005: DBDaemon Client Error. secure dbdaemon connection failed: context deadline exceeded...
- HealthCheckProber: health check for instance failed" err="rpc error: code = Code(10303) desc = DBSE0303: Healthcheck: Health check table query failed. dbdaemon/healthCheck: read healthcheck table: timeout...
- HealthCheckProber: health check for instance failed" err="rpc error: code = Code(12415) desc = DBSE2415: Postgres: failed to connect to database. dbdaemon/healthCheck: failed to connect...
Use standby replica as a read-only instance
To use a standby replica as a read-only instance, complete the following steps:
- Set enableStandbyAsReadReplica to true in the manifest of the database cluster:
  spec:
    availability:
      enableStandbyAsReadReplica: true
- Re-apply the manifest. 
- Verify that the read-only endpoint is reported in the status field of the DBCluster object:
  kubectl describe dbcluster -n NAMESPACE DB_CLUSTER_NAME
  The following example response shows the endpoint of the read-only instance:
  Status:
    [...]
    Primary:
      [...]
      Endpoints:
        Name: Read-Write
        Value: 10.128.0.81:5432
        Name: Read-Only
        Value: 10.128.0.82:5432
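To pass the read-only endpoint to a client or script, you can extract it from the describe output. The following sketch is a small awk filter that prints the Value that follows the Name: Read-Only entry. It assumes that the output has the Name and Value layout of the example response above; the function name is hypothetical.

```shell
# Illustrative helper; the function name is hypothetical.
# Reads `kubectl describe dbcluster` output on stdin and prints the
# value that belongs to the endpoint named Read-Only.
read_only_endpoint() {
  awk '
    $1 == "Name:"  { is_read_only = ($2 == "Read-Only"); next }
    $1 == "Value:" && is_read_only { print $2; exit }
  '
}
```

For example: kubectl describe dbcluster -n NAMESPACE DB_CLUSTER_NAME | read_only_endpoint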