VM resiliency

This document gives an overview of virtual machine (VM) resiliency and optional checks that application operators can enable for deeper insights from within a VM in Google Distributed Cloud (GDC) air-gapped.

This document is for developers within the application operator group who operate VMs. For more information, see Audiences for GDC air-gapped documentation.

VMs in GDC provide high availability (HA) to improve service continuity if the underlying infrastructure or the guest fails. You can also configure the system to emit optional health signals that offer deeper insight into VM status.

VM availability checks

The system provides the following VM availability checks:

Check name | Description | Mitigation action support | Signal availability
Guest Health Check | Verifies guest OS health. A prerequisite for other in-guest checks. | Yes | In-Guest Check
Storage Check | Verifies the VM's underlying storage health. | Yes | In-Guest Check
Egress Check | Verifies connectivity to a well-known internal endpoint. | No | In-Guest Check and Outside
Ingress Check | Verifies VM accessibility using its configured ingress (VirtualMachineExternalAccess). | No | In-Guest Check

For checks that support mitigation, the mitigation actions can restart the VM and, on frequent failures, reschedule it to another node.

Request permissions and access

To perform the tasks listed on this page, you must have the Project VirtualMachine Admin role. Follow the steps to verify that you have the Project VirtualMachine Admin (project-vm-admin) role in the namespace of the project where the VM resides.

Enable in-guest checks

By default, the in-guest check is disabled if the guestHealthCheck is not present.

To enable or disable the in-guest check for a VM, update the guestEnvironment field in the spec of the VM. When enabled, the check collects metrics from inside the VM, provided the guest agent is installed.

  1. Open your VM's configuration file.
  2. Navigate to the spec: section.
  3. Add or modify the guestEnvironment: and guestHealthCheck: fields to enable the check.
  4. Set the enable field to true.

Here is an example of the configuration in a YAML file:

spec:
  compute:
    virtualMachineType: n2-standard-2-gdc
  guestEnvironment:
    guestHealthCheck:
      enable: true
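
As an alternative to editing the file by hand, the same fields can be set with a merge patch. The following is a minimal sketch, not a documented procedure: it assumes the VM resource is addressable under the gvm short name used in the verification command on this page and that it accepts a JSON merge patch. The function name enable_guest_health_check and the VM_NAME argument are hypothetical.

```shell
# Sketch: enable the in-guest health check on one VM with a JSON merge patch.
# Assumptions (not confirmed by this page): the `gvm` resource accepts
# `kubectl patch --type merge`; the function name and arguments are illustrative.
#   $1: path to the Management API server kubeconfig
#   $2: namespace of the project where the VM resides
#   $3: name of the VM to update
enable_guest_health_check() {
  kubectl --kubeconfig "$1" \
    -n "$2" \
    patch gvm "$3" \
    --type merge \
    -p '{"spec":{"guestEnvironment":{"guestHealthCheck":{"enable":true}}}}'
}
```

Example call: enable_guest_health_check MANAGEMENT_API_SERVER NAMESPACE_NAME VM_NAME. Setting "enable" to false in the patch body would turn the check off again under the same assumptions.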

Verify the checks

After you configure your VM, verify the status of the availability checks by inspecting the conditions in the virtual machine's status:

kubectl --kubeconfig MANAGEMENT_API_SERVER \
  -n NAMESPACE_NAME \
  get gvm -o yaml

The output shows the status of the various checks. For example, if the guestHealthCheck is enabled, the gvm status conditions are populated with the VMGuestHealth signal.
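
To read a single signal instead of scanning the full YAML, a JSONPath filter can help. This is a minimal sketch that assumes the conditions follow the common Kubernetes layout, where each entry under status.conditions carries type and status fields; this page does not confirm that layout. The function name get_vm_condition and the VM_NAME argument are hypothetical.

```shell
# Sketch: print the status of one condition type for a single VM.
# Assumption: conditions live under .status.conditions with `type` and
# `status` fields, as is conventional for Kubernetes resources.
#   $1: path to the Management API server kubeconfig
#   $2: namespace of the project where the VM resides
#   $3: name of the VM
#   $4: condition type to look up, for example VMGuestHealth
get_vm_condition() {
  kubectl --kubeconfig "$1" \
    -n "$2" \
    get gvm "$3" \
    -o jsonpath="{.status.conditions[?(@.type==\"$4\")].status}"
}
```

Example call: get_vm_condition MANAGEMENT_API_SERVER NAMESPACE_NAME VM_NAME VMGuestHealth. An empty result would suggest that the condition is not populated, for instance because the in-guest check is disabled.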

Disable VM high availability

By default, VM high availability is enabled when the annotation is not present. To explicitly disable HA for a specific VM, add the annotation:

  1. Open your VM's configuration file.
  2. Add the highavailability.virtualmachine.gdc.goog/enable: false annotation to the VM's metadata to disable high availability.

Here is an example of the annotation in a YAML file:

metadata:
  annotations:
    highavailability.virtualmachine.gdc.goog/enable: "false"
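
The annotation can also be set from the command line with kubectl annotate, which always stores the value as a string. This is a minimal sketch, assuming the VM resource is addressable under the gvm short name used in the verification command on this page; the function name disable_vm_ha and the VM_NAME argument are hypothetical.

```shell
# Sketch: disable HA for one VM by setting the annotation with kubectl.
# Assumption: the VM resource is addressable as `gvm`; the function name
# and arguments are illustrative.
#   $1: path to the Management API server kubeconfig
#   $2: namespace of the project where the VM resides
#   $3: name of the VM
disable_vm_ha() {
  kubectl --kubeconfig "$1" \
    -n "$2" \
    annotate gvm "$3" \
    highavailability.virtualmachine.gdc.goog/enable=false \
    --overwrite
}
```

Example call: disable_vm_ha MANAGEMENT_API_SERVER NAMESPACE_NAME VM_NAME. The --overwrite flag lets the command replace an existing value of the annotation instead of failing.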

Node failure mitigation

Automated mitigation actions address VM failures and maintain HA. When the underlying infrastructure can no longer support a running VM, the system attempts to fence the unhealthy node and reschedule the VM to a healthy node. The following scenarios can trigger this node-level mitigation:

  • Node partition from API server: The bare metal node that hosts the VM is partitioned from the Management API server due to a condition like the following:

    • Loss of network connectivity between the API server and the node.
    • The kubelet agent on the node is down.
    • The node experiences a power failure.
  • User cluster VM partition: A user cluster worker VM is partitioned from its cluster's Management API server.