This page describes what happens during an experiment with a Failover Cloud SQL fault type.
How does the Failover Cloud SQL fault work
The Failover Cloud SQL fault triggers a high availability (HA) failover for a
Cloud SQL instance by calling the failover method in the Cloud SQL Admin API. The
purpose of this fault is to allow you to test your service's behavior and
resilience when the primary Cloud SQL instance transitions to the standby
instance in another zone within the same region.
The target for this fault is a Cloud SQL instance.
What happens during experiment execution
As the experiment progresses through the states, the resources involved undergo the following changes.
Resource |
|
|
|
Instance |
None |
Promote HA standby instance |
Not applicable |
HA failover validation and retries
The Fault Injection Testing backend performs validation checks before initiating the
failover to ensure the environment is suitable. This process occurs during the
INJECTING stage, before sending the sqladmin.instances.failover API call to
the Cloud SQL control plane:
- The system checks that the Cloud SQL instance
state is
RUNNABLE. If the instance is in any other state during a running experiment, Fault Injection Testing notifies you. - It verifies that the
cloudsql.googleapis.com/database/available_for_failovermetric isTRUE.
If these conditions are not met, the failover request is enqueued for retry.
Handling failures
If the failover is already in progress or encounters an error:
- During injection: If the failover call is rejected by the Cloud SQL
control plane (for example, if the standby instance is unhealthy), the
experiment moves to the
STOPPINGstate. - Post-failover monitoring: If the failover operation does not reach a
DONEstatus within a predetermined 10-minute duration, the Fault Injection Testing backend queries the operation errors, moves the experiment to theSTOPPINGstate, and marks the experiment as having encountered an error.
Fault Injection Testing relies on the Cloud SQL control plane as the source of truth. The control plane rejects any failover attempt if the standby instance is unhealthy. Fault Injection Testing uses pre-flight checks and status monitoring to avoid or report these failures rather than attempting automated retries after a rejection.
Manual failback
Because failover is a stateful change, stopping the experiment does not
automatically fail back to the original primary instance (as indicated by 'Not
applicable' in the REVERTING stage). The database will continue to run in the
secondary zone.
To return the instance to its original zone after the experiment completes, you
must manually initiate another failover using the Google Cloud console or the
gcloud CLI.