This page describes what happens during an experiment with a Fail Compute fault, and what to do if Fault Injection Testing cannot stop the experiment.
How the Fail Compute fault works
The Fail Compute fault uses resource tags to mark VMs and a firewall policy with rules that target those tags to block all ingress and egress traffic. This setup makes the targeted resources appear as if they are experiencing an outage while ensuring they remain undamaged and quickly recoverable.
The target for this fault can be the following:
- Virtual machine instances
- VMs based on a specified resource tag
- VMs within a specified zonal or regional Managed Instance Group (MIG)
- Non-MIG VMs in a zone and VMs in zonal MIGs
- Non-MIG VMs in a region and VMs in all MIGs
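The isolation mechanism described in this section can be pictured with a small toy model. This is only an illustrative sketch, not how Fault Injection Testing or Cloud firewall policies are implemented: the `VM` and `DenyRule` classes, the `is_blocked` function, and the tag value name are all hypothetical. It shows why binding the experiment tag isolates a VM and why unbinding it restores traffic without touching the VM itself.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    tags: set[str] = field(default_factory=set)  # tag values bound to the VM


@dataclass(frozen=True)
class DenyRule:
    target_tag: str  # the rule applies only to VMs that carry this tag value
    direction: str   # "INGRESS" or "EGRESS"


def is_blocked(vm: VM, rules: list[DenyRule], direction: str) -> bool:
    """Return True if a deny rule for this direction targets a tag on the VM."""
    return any(
        rule.direction == direction and rule.target_tag in vm.tags
        for rule in rules
    )


# Hypothetical experiment tag value and the deny rules that target it.
EXPERIMENT_TAG = "tagValues/example-experiment"
rules = [DenyRule(EXPERIMENT_TAG, "INGRESS"), DenyRule(EXPERIMENT_TAG, "EGRESS")]

vm = VM("vm-1")
untagged_vm = VM("vm-2")

vm.tags.add(EXPERIMENT_TAG)      # fault injected: tag bound to the VM
print(is_blocked(vm, rules, "INGRESS"), is_blocked(vm, rules, "EGRESS"))
print(is_blocked(untagged_vm, rules, "INGRESS"))  # untagged VMs are unaffected

vm.tags.discard(EXPERIMENT_TAG)  # recovery: tag unbound, rules stop applying
print(is_blocked(vm, rules, "INGRESS"))
```

In this model the rules never change; only the tag binding decides whether a VM is isolated.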
What happens during experiment execution
As the experiment progresses through the states, the resources involved undergo the following changes.
| Resource | | | |
| --- | --- | --- | --- |
| VMs | None | Bind resource tag | Unbind resource tag |
| MIGs / RMIGs | None | Turn off autohealing and autoscaling (if they are active) | Restore autohealing and autoscaling settings |
| Tag | Create unique TagValue resource for experiment | None | Delete TagValue |
| Firewall Policy | Create SYSTEM-level FirewallPolicy resource | Populate with DENY rules and bind policy to relevant VPCs | Unbind from VPCs and delete FirewallPolicy |
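The MIG entry above implies a save-and-restore pattern: whichever settings were active before the fault are turned off, and the same settings are put back afterwards. A minimal sketch of that pattern follows. It is a toy model, and the `MigSettings` class and the `suspend` and `restore` functions are hypothetical names, not part of any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class MigSettings:
    autohealing: bool
    autoscaling: bool


def suspend(mig: MigSettings) -> MigSettings:
    """Turn off both features and return a snapshot of the prior settings."""
    snapshot = MigSettings(mig.autohealing, mig.autoscaling)
    mig.autohealing = False
    mig.autoscaling = False
    return snapshot


def restore(mig: MigSettings, snapshot: MigSettings) -> None:
    """Put back exactly what was active before the fault, and nothing more."""
    mig.autohealing = snapshot.autohealing
    mig.autoscaling = snapshot.autoscaling


# A group that had autohealing on and autoscaling off before the experiment.
mig = MigSettings(autohealing=True, autoscaling=False)
saved = suspend(mig)
print(mig)    # both features are off while the fault is active
restore(mig, saved)
print(mig)    # autoscaling stays off because it was off beforehand
```

Restoring from a snapshot, instead of switching everything back on, keeps a feature that was inactive before the experiment inactive afterwards.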
Emergency manual recovery
If a catastrophic backend failure prevents Fault Injection Testing from stopping an experiment automatically, you can restore connectivity to the VMs by manually removing the resource tags bound to them. The system firewall policy targets those tags, so it no longer applies once they are removed, which releases the VMs from the isolation fault.
Required permissions
You need the following IAM permissions:
- Required to view VM instances:
  - compute.instances.get
  - compute.instances.list
- Required to view and remove the tag bindings:
  - resourcemanager.tagValueBindings.list
  - resourcemanager.tagValueBindings.delete
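Before an emergency, it can help to confirm that the responder's account holds all four permissions listed above. The sketch below only does the comparison. How the list of granted permissions is obtained is left out and is an assumption on your side (for example, from your own IAM tooling), and `missing_permissions` is a hypothetical helper name.

```python
# The four permissions named in this section.
REQUIRED_PERMISSIONS = frozenset({
    "compute.instances.get",
    "compute.instances.list",
    "resourcemanager.tagValueBindings.list",
    "resourcemanager.tagValueBindings.delete",
})


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


# Hypothetical input: an account that can view VMs and bindings but not delete.
granted_example = [
    "compute.instances.get",
    "compute.instances.list",
    "resourcemanager.tagValueBindings.list",
]
print(missing_permissions(granted_example))
```

An empty result means the account has everything this procedure needs.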
Manual recovery using the Google Cloud console UI
To perform a recovery using the Google Cloud console:
- Navigate to the VM instances page in the Google Cloud console.
- Click the name of a VM instance affected by the isolation fault to open its details page.
- On the instance details page, go to the section that manages tags.
- Identify the experiment-specific tag binding associated with the Fault Injection Testing Fail Compute fault.
- Remove the tag binding from the VM.
Once the tag is removed, the associated firewall policy will no longer target the VM, and connectivity will be restored.
Manual recovery using the gcloud CLI
You can manually remove the experiment tag binding from a VM using the
Google Cloud CLI. In the following commands, replace TAG_VALUE_NAME,
PROJECT_NUMBER, ZONE, and VM_NAME with the specific values for your
environment.
First, retrieve the experiment-specific tag value name by listing the current tag bindings for the VM:
gcloud resource-manager tags bindings list --parent=//compute.googleapis.com/projects/PROJECT_NUMBER/zones/ZONE/instances/VM_NAME --location=ZONE
Use the output of the previous command to determine the TAG_VALUE_NAME
required for the deletion step:
gcloud resource-manager tags bindings delete --tag-value=TAG_VALUE_NAME --parent=//compute.googleapis.com/projects/PROJECT_NUMBER/zones/ZONE/instances/VM_NAME --location=ZONE
Once the tag is removed, the associated firewall policy will no longer target the VM, and connectivity will be restored.
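When many VMs are affected, typing the full resource name for each one is error-prone. The following sketch builds the list and delete commands for a batch of VMs and prints them for review instead of running them. It assumes the `gcloud resource-manager tags bindings` command group with its `--parent` and `--location` flags, so verify the syntax with `gcloud resource-manager tags bindings --help` before relying on it. The function names, project number, zones, VM names, and tag value are all hypothetical.

```python
import shlex

# Assumption: the "tags bindings" command group and its --parent/--location
# flags. Check `gcloud resource-manager tags bindings --help` for your version.
GCLOUD_BINDINGS = ["gcloud", "resource-manager", "tags", "bindings"]


def vm_resource_name(project_number: str, zone: str, vm_name: str) -> str:
    """Build the full resource name of a VM, as used by the commands above."""
    return (
        f"//compute.googleapis.com/projects/{project_number}"
        f"/zones/{zone}/instances/{vm_name}"
    )


def list_bindings_cmd(project_number: str, zone: str, vm_name: str) -> list[str]:
    """Argument list that lists the tag bindings on one VM."""
    parent = vm_resource_name(project_number, zone, vm_name)
    return GCLOUD_BINDINGS + ["list", f"--parent={parent}", f"--location={zone}"]


def delete_binding_cmd(
    tag_value_name: str, project_number: str, zone: str, vm_name: str
) -> list[str]:
    """Argument list that removes one tag binding from one VM."""
    parent = vm_resource_name(project_number, zone, vm_name)
    return GCLOUD_BINDINGS + [
        "delete",
        f"--tag-value={tag_value_name}",
        f"--parent={parent}",
        f"--location={zone}",
    ]


if __name__ == "__main__":
    # Hypothetical values; replace them with those for your environment.
    project_number = "123456789012"
    tag_value_name = "tagValues/EXAMPLE_ID"
    affected_vms = [("us-central1-a", "vm-1"), ("us-central1-b", "vm-2")]

    # Dry run: print each command so it can be reviewed before execution.
    for zone, vm_name in affected_vms:
        print(shlex.join(list_bindings_cmd(project_number, zone, vm_name)))
        print(shlex.join(
            delete_binding_cmd(tag_value_name, project_number, zone, vm_name)
        ))
```

To execute a command instead of printing it, pass the argument list to `subprocess.run(cmd, check=True)`. Keeping the dry run as the default makes it harder to delete the wrong binding during an incident.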