Fail Compute fault reference

This page describes what happens during an experiment that uses the Fail Compute fault type, and the steps to take if Fault Injection Testing cannot stop the experiment.

How the Fail Compute fault works

The Fail Compute fault marks VMs with resource tags and applies a firewall policy whose rules target those tags, blocking all ingress and egress traffic. The targeted resources then appear to be experiencing an outage, while remaining undamaged and quickly recoverable.

The target for this fault can be the following:

What happens during experiment execution

As the experiment progresses through the states, the resources involved undergo the following changes.

| Resource | PREPARING | INJECTING | REVERTING |
|---|---|---|---|
| VMs | None | Bind resource tag | Unbind resource tag |
| MIGs / RMIGs | None | Turn off autohealing and autoscaling (if they are active) | Restore autohealing and autoscaling settings |
| Tag | Create unique TagValue resource for experiment | None | Delete TagValue |
| Firewall Policy | Create SYSTEM-level FirewallPolicy resource | Populate with DENY rules and bind policy to relevant VPCs | Unbind from VPCs and delete FirewallPolicy |

Emergency manual recovery

If a catastrophic backend failure prevents Fault Injection Testing from stopping an experiment automatically, you can restore connectivity to the affected VMs by manually removing the resource tags bound to them. The system firewall policy targets those tags, so it no longer applies once they are removed, which releases the VMs from the isolation fault.

Required permissions

You need the following IAM permissions:

  • Required to view VM instances:
    • compute.instances.get
    • compute.instances.list
  • Required to view and remove the tag bindings:
    • resourcemanager.tagValueBindings.list
    • resourcemanager.tagValueBindings.delete

Manual recovery using the Google Cloud console

To perform a recovery using the Google Cloud console:

  1. Navigate to the VM instances page in the Google Cloud console.
  2. Select the specific VM instances affected by the isolation fault.
  3. On the instance details page, go to the section that manages tags.
  4. Identify the experiment-specific tag binding associated with the Fault Injection Testing Fail Compute fault.
  5. Remove the tag binding from the VM.

Once the tag is removed, the associated firewall policy will no longer target the VM, and connectivity will be restored.

Manual recovery using the gcloud CLI

You can manually remove the experiment tag binding from a VM using the Google Cloud CLI. In the following commands, replace TAG_VALUE_NAME, PROJECT_NUMBER, ZONE, and VM_NAME with the specific values for your environment.
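Both commands take the same full resource name for the VM, so it can help to compose it once. The following is a minimal sketch, not part of the product tooling: `vm_resource_name` is a hypothetical helper name, and the project number, zone, and VM name in the example call are made-up values.

```shell
# Hypothetical helper: compose the full resource name of a zonal VM
# in the form used by the tag binding commands on this page.
# Usage: vm_resource_name PROJECT_NUMBER ZONE VM_NAME
vm_resource_name() {
  # Refuse to build a name from missing or empty arguments.
  if [ -z "${1:-}" ] || [ -z "${2:-}" ] || [ -z "${3:-}" ]; then
    echo "usage: vm_resource_name PROJECT_NUMBER ZONE VM_NAME" >&2
    return 1
  fi
  printf '//compute.googleapis.com/projects/%s/zones/%s/instances/%s\n' "$1" "$2" "$3"
}

# Example call with made-up values.
vm_resource_name 123456789012 us-central1-a my-vm
```

You can capture the result in a variable, for example `RESOURCE="$(vm_resource_name 123456789012 us-central1-a my-vm)"`, and reuse it in each command.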

First, retrieve the experiment-specific tag value name by listing the current tag bindings for the VM:

gcloud resource-manager tags bindings list --parent=//compute.googleapis.com/projects/PROJECT_NUMBER/zones/ZONE/instances/VM_NAME --location=ZONE

Use the output of the previous command to determine the TAG_VALUE_NAME required for the deletion step:

gcloud resource-manager tags bindings delete --tag-value=TAG_VALUE_NAME --parent=//compute.googleapis.com/projects/PROJECT_NUMBER/zones/ZONE/instances/VM_NAME --location=ZONE

Once the tag is removed, the associated firewall policy will no longer target the VM, and connectivity will be restored.
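If several VMs are affected, the deletion step can be repeated in a loop. The sketch below is one possible way to do that, under stated assumptions: `unbind_tag` and `DRY_RUN` are hypothetical names introduced here, all values in the example loop are made up, and the script assumes the `gcloud resource-manager tags bindings delete` command with the `--tag-value`, `--parent`, and `--location` flags (confirm with `gcloud resource-manager tags bindings delete --help` for your installed version). By default it only prints the commands so you can review them first.

```shell
# Sketch: print (or run) the tag binding delete command for several VMs.
# DRY_RUN defaults to 1, so nothing changes unless you set DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper.
# Usage: unbind_tag TAG_VALUE_NAME PROJECT_NUMBER ZONE VM_NAME
unbind_tag() {
  # Full resource name of the VM that the tag value is bound to.
  parent="//compute.googleapis.com/projects/$2/zones/$3/instances/$4"
  if [ "$DRY_RUN" = "1" ]; then
    # Dry run: show the command instead of executing it.
    echo "gcloud resource-manager tags bindings delete --tag-value=$1 --parent=$parent --location=$3"
  else
    gcloud resource-manager tags bindings delete \
      --tag-value="$1" --parent="$parent" --location="$3"
  fi
}

# Made-up example values: one experiment tag value, two VMs in one zone.
for vm in vm-a vm-b; do
  unbind_tag tagValues/1234567890 123456789012 us-central1-a "$vm"
done
```

Keeping the dry run as the default is a deliberate choice for an emergency procedure: you can check the printed commands against the tag bindings listed for each VM, then rerun with `DRY_RUN=0` to remove the bindings.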