Fault Injection Testing overview

Fault Injection Testing lets you run fault injection experiments, in which you deliberately introduce faults into a system to test its resilience before a real, unexpected failure impacts your customers. With Fault Injection Testing, you can inject faults into various components in your Google Cloud environment to verify that your application handles them in a predictable way.

For the initial release of Fault Injection Testing, faults are generally equivalent to a failure of the target resource. If you've designed fault tolerance into your application, these failures should trigger it to redirect traffic to healthy instances.

You are expected to observe your application before, during, and after injecting the fault to verify the application handled the fault as expected.
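
As a rough illustration of observing an application before, during, and after a fault, the following Python sketch counts successful health probes in each phase and compares them against a threshold. It is not part of Fault Injection Testing: the names PhaseResult, observe_phase, and handled_as_expected, the probe callable, and the thresholds are all assumptions made for this example. In practice the probe might be an HTTP request to your application's health endpoint.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhaseResult:
    """Outcome of probing the application during one phase (hypothetical type)."""
    phase: str       # "before", "during", or "after"
    checks: int      # number of probes issued
    successes: int   # number of probes that succeeded

    @property
    def success_rate(self) -> float:
        return self.successes / self.checks if self.checks else 0.0


def observe_phase(phase: str, probe: Callable[[], bool], checks: int) -> PhaseResult:
    """Call the probe a fixed number of times and count the successes."""
    successes = sum(1 for _ in range(checks) if probe())
    return PhaseResult(phase, checks, successes)


def handled_as_expected(results: List[PhaseResult], min_rate_during: float) -> bool:
    """Assumed pass criteria: fully healthy before and after the fault,
    and at least min_rate_during of probes succeeding while it is active."""
    by_phase = {r.phase: r for r in results}
    return (
        by_phase["before"].success_rate == 1.0
        and by_phase["during"].success_rate >= min_rate_during
        and by_phase["after"].success_rate == 1.0
    )
```

The pass criteria here are only one possible choice; an application with stricter availability targets would set a higher threshold for the "during" phase.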

Why use Fault Injection Testing?

Fault Injection Testing lets you run experiments on the resiliency of your applications in Google Cloud across a full spectrum of failure scenarios. A primary part of this spectrum is approximating the failure of Google Cloud zones and regions, which is difficult or impossible to fully reproduce on your own. Additionally, Fault Injection Testing serves as a valuable development and improvement tool, allowing you to run experiments on your designed resilience mechanisms before introducing them in a production environment. Catching problems before they surface in production allows for faster improvement of designs, avoids costly downtime and reputational loss, and improves your overall experience.

Without a Google Cloud-native fault injection product, you have to do your own experimentation. This is problematic: Google Cloud is a shared environment, and in many cases you lack direct access to the underlying services and infrastructure. The result can be inadequate, toilsome experimentation that ultimately fails to properly test your application's resilience. Fault Injection Testing reduces the effort needed to automate these experiments, lets you induce failure modes that you can't otherwise access, and improves the fidelity of your experimentation efforts.

For regulated customers, regular experimentation is often a requirement for remaining compliant with industry governing bodies. In these cases, the experimentation often takes the form of disaster recovery testing, which shows that the failure of a zone or region won't stop the application from continuing to operate effectively.

Faults available for experiments

The following faults are available:

  • Failover Cloud SQL - fails a database over from the primary to the standby.
  • Degrade Application Traffic - simulates traffic degradation through a layer 7 load balancer.

These faults are designed to restrict the scope of the experiment to a single project's resources in a single region or narrower. The experiment won't affect a wider scope than intended. However, these faults cause real failures of the targeted resources in their environment.
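
The scope rule above (one project, one region or narrower) can also be mirrored in your own tooling as an optional pre-flight check before an experiment is configured. The following Python sketch is only an illustration: the Target type, its fields, and the within_single_scope function are hypothetical and do not correspond to any Fault Injection Testing API or resource format.

```python
from typing import Iterable, NamedTuple


class Target(NamedTuple):
    """Hypothetical description of a resource chosen for an experiment."""
    project: str
    region: str
    name: str


def within_single_scope(targets: Iterable[Target]) -> bool:
    """Return True only if there is at least one target and all targets
    share the same project and the same region."""
    scopes = {(t.project, t.region) for t in targets}
    return len(scopes) == 1
```

A check like this does not replace the product's own scoping; it only catches a misconfigured target list earlier in your workflow.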

By targeting infrastructure components such as a Cloud SQL database or an Application Load Balancer, as well as other components that will be supported in the future, you can reasonably approximate a zone or region failure in the context of your own application.

How to use Fault Injection Testing

Before using Fault Injection Testing, ensure the following:

  • The environments to be experimented on have some redundancy in place, so that when a fault is injected, the application in that environment can continue to run on the redundant infrastructure.
  • For Admins or Owners of Google Cloud resources: Be thoughtful about who gets permission to set up and run Fault Injection Testing experiments. Injecting faults into running infrastructure components in your Google Cloud environment causes disruption. Therefore, grant permission only to operators who understand how their cloud environment is architected and how to safely experiment with the resiliency of that architecture, so that they can configure experiments in a way that doesn't cause unintended disruptions.

Using Fault Injection Testing involves setting up an experiment. To set up a new experiment, you first create an experiment template, which defines the fault to be injected and the target resources. Next, you run the experiment from the template. An experiment is the set of actions defined in the template that run against the resources you selected in the template.
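
The relationship between a template and an experiment can be pictured with a small Python model. This is a conceptual sketch only, not the product's API: the ExperimentTemplate and Experiment classes, their fields, and the fault and target strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ExperimentTemplate:
    """Hypothetical template: the fault to inject and the resources to target."""
    name: str
    fault: str
    targets: Tuple[str, ...]


@dataclass
class Experiment:
    """Hypothetical experiment: one run of the actions defined by a template."""
    template: ExperimentTemplate
    log: List[str] = field(default_factory=list)

    def run(self) -> List[str]:
        # Record one action per target; a real run would inject the fault.
        for target in self.template.targets:
            self.log.append(f"inject {self.template.fault} into {target}")
        return self.log


# Step 1: create the template. Step 2: run an experiment from it.
template = ExperimentTemplate(
    name="sql-failover-drill",       # hypothetical template name
    fault="cloud-sql-failover",      # hypothetical fault identifier
    targets=("orders-db",),          # hypothetical target resource
)
actions = Experiment(template).run()
```

Making the template immutable in this model reflects the idea that a template is a reusable definition, while each experiment is a separate run created from it.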