For version information, see the Distributed Cloud connected release notes.

Troubleshoot Distributed Cloud connected

Google remotely monitors and maintains the Google Distributed Cloud connected hardware. For this purpose, Google engineers have Secure Shell (SSH) access to the Distributed Cloud connected hardware. If Google detects an issue, a Google engineer contacts you to troubleshoot and resolve it. If you have identified an issue yourself, contact Google Support immediately to diagnose and resolve it.

Distributed Cloud connected machine connectivity

This section describes how to check the internet and Google Cloud connectivity for your Distributed Cloud connected machines using the Metrics Explorer feature of Cloud Monitoring.

This procedure uses the following Monitoring metrics:

Machine Connected (/machine/connected): indicates whether the machine is connected to Google Cloud.
Network Connectivity (/machine/network/connectivity): indicates whether the machine's primary network interface has internet connectivity.

To complete the steps in this section, you must satisfy the following prerequisites:

Access to the Google Cloud console and your Distributed Cloud connected Google Cloud project.
The Monitoring Viewer IAM role, which lets you view Monitoring metrics.
(Optional) the machine_id value of the target Distributed Cloud connected machine for filtering the returned results.

Use Metrics Explorer to validate machine connectivity

Navigate to Metrics Explorer:
1. In the Google Cloud console, navigate to the Monitoring section.
2. In the left-hand navigation tree, click Metrics Explorer.
Select the target resource type:
1. In the Metrics Explorer page, navigate to the Queries page.
2. Use the search bar to search for the Machine resource type. You can also use the full resource identifier edgecontainer.googleapis.com/Machine.
3. In the returned results, click the Machine resource type.
Validate the machine's connection to Google Cloud:
1. In the Metric section, search for the connected value.
2. Select the Machine Connected metric. Its full path is edgecontainer.googleapis.com/machine/connected.
3. (Optional) Filter by the target machine_id value using the Filter section.
4. In the time chart that appears, verify that the Healthy line remains at 100% continuously. If at any point this value is 0% or Unhealthy, the machine lost connectivity with Google Cloud at the indicated time.
Validate the machine's internet connectivity:
1. In the Metric section, search for the connectivity value.
2. Select the Network Connectivity metric. Its full path is edgecontainer.googleapis.com/machine/network/connectivity.
3. (Optional) Filter by the target machine_id value using the Filter section.
4. In the time chart that appears, verify that the Healthy line remains at 100% continuously. If at any point this value is 0% or Unhealthy, the machine lost internet connectivity at the indicated time.
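
If you prefer to type a filter rather than click through the search bar, the same selection can be written as a Monitoring filter string. The following is a minimal sketch that only assembles that string. It assumes that machine_id is exposed as a resource label on the Machine resource type, which this page does not state; the helper name build_filter and the sample ID are hypothetical.

```python
# Sketch: build Cloud Monitoring filter strings for the two metrics above.
# Assumption: machine_id is a resource label on the Machine resource type.

RESOURCE_TYPE = "edgecontainer.googleapis.com/Machine"
METRICS = {
    "connected": "edgecontainer.googleapis.com/machine/connected",
    "connectivity": "edgecontainer.googleapis.com/machine/network/connectivity",
}


def build_filter(metric_key, machine_id=None):
    """Return a filter for one metric, optionally narrowed to one machine."""
    parts = [
        f'metric.type = "{METRICS[metric_key]}"',
        f'resource.type = "{RESOURCE_TYPE}"',
    ]
    if machine_id:
        # Hypothetical label path; adjust if machine_id is exposed differently.
        parts.append(f'resource.labels.machine_id = "{machine_id}"')
    return " AND ".join(parts)


print(build_filter("connected", machine_id="my-machine-id"))
```

Omitting machine_id returns a filter that covers every machine in the project, which matches the optional filtering step above.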

Understand validation results

The following table explains the results returned by Metrics Explorer.

Machine state	Machine Connected value	Network Connectivity value	Diagnosis	Resolution
Healthy	`1`	`1`	Normal operation.	None.
Disconnected	`0`	`1`	Machine has internet connectivity but can't connect to Google Cloud.	Check your [firewall rules](distributed-cloud/edge/latest/docs/requirements#connected_management_and_monitoring_traffic) for Google services and API endpoints. Verify that Distributed Cloud connected agents are running on the machine.
Isolated	`0`	`0`	Machine has no internet connectivity.	Check power and network cabling, local network configuration, and machine LED status. Verify your VLAN and routing configuration.
Intermittent	Alternates between `0` and `1`	Alternates between `0` and `1`	Unstable network connection, packet loss, or excessive latency.	Check your local network for congestion and faulty hardware.

If you notice sustained values of 0 for either metric, follow the troubleshooting steps described in the table to resolve them. If the problem persists, contact Google Support with the affected machine's machine_id value and the timestamp of the outage.
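
The table can also be read as a small lookup over the two metric series. The following is a minimal sketch of that lookup, not a Google-provided tool. The function name is hypothetical, and two choices are assumptions: a series that contains both 0 and 1 is treated as alternating, and the combination that the table does not list (connected 1, network 0) is reported as Unknown.

```python
# Sketch: map samples of the two metrics to a machine state from the table.

STATES = {
    (1, 1): "Healthy",
    (0, 1): "Disconnected",
    (0, 0): "Isolated",
}


def classify_machine_state(connected, network):
    """Classify 0/1 samples of Machine Connected and Network Connectivity."""
    connected_values = set(connected)
    network_values = set(network)
    if not connected_values or not network_values:
        return "Unknown"  # no samples to judge
    # Assumption: a series with both 0 and 1 counts as "alternating".
    if len(connected_values) > 1 or len(network_values) > 1:
        return "Intermittent"
    key = (connected_values.pop(), network_values.pop())
    # Combinations missing from the table fall back to "Unknown".
    return STATES.get(key, "Unknown")


print(classify_machine_state([0, 0, 0], [1, 1, 1]))  # Disconnected
```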

Virtual machines stuck in `Pending` state

A virtual machine workload can get stuck in the Pending state and fail to schedule on a node if one of the following happens:

Distributed Cloud connected cannot allocate the requested resources, such as CPU time, memory, or disk space, to the virtual machine.
There is a fault in the virtual machine's configuration.
There is a fault with the virtual machine's storage.
The target node is tainted.

To resolve this issue, do the following:

Obtain cluster credentials as described in Obtain credentials for a cluster.

Get information about the affected virtual machine:

kubectl describe virtualmachine VM_NAME -n NAMESPACE

Replace the following:

VM_NAME: The name of the target virtual machine.
NAMESPACE: The namespace of the target virtual machine.

The command returns output similar to the following:

Status:
...
State:                    Pending
...
Events:
Type     Reason                  Age   From                       Message
----     ------                  ----  ----                       -------
Normal   SuccessfulCreate        15m   virtualmachine-controller  Created virtual machine my-stuck-vm
Warning  DiskProvisioningFailed  14m   virtualmachine-controller  Failed to provision disk: DataVolume my-stuck-vm-data-disk not ready
Warning  PVCNotBound             14m   virtualmachine-controller  PersistentVolumeClaim my-stuck-vm-data-disk is in phase Pending
Warning  VMINotCreated           10m   virtualmachine-controller  VirtualMachineInstance cannot be created: dependencies not ready

The output of the command contains messages that might indicate resource constraints, scheduling failures, storage faults, and other problems.
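
When the output is long, it can help to pull out only the Warning rows of the Events table. The following is a minimal sketch that works on saved command output as text. It assumes the column layout shown in the sample above (columns separated by two or more spaces, with the message last); the function name warning_events is hypothetical.

```python
import re


def warning_events(describe_output):
    """Return (reason, message) pairs for Warning rows under 'Events:'."""
    events = []
    in_events = False
    for line in describe_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Events:"):
            in_events = True
            continue
        if not in_events or not stripped.startswith("Warning"):
            continue
        # Type, Reason, Age, From, Message: split on runs of 2+ spaces.
        columns = re.split(r"\s{2,}", stripped, maxsplit=4)
        if len(columns) == 5:
            events.append((columns[1], columns[4]))
    return events


sample = """Status:
State:                    Pending
Events:
Type     Reason                  Age   From                       Message
----     ------                  ----  ----                       -------
Normal   SuccessfulCreate        15m   virtualmachine-controller  Created virtual machine my-stuck-vm
Warning  PVCNotBound             14m   virtualmachine-controller  PersistentVolumeClaim my-stuck-vm-data-disk is in phase Pending
"""

for reason, message in warning_events(sample):
    print(f"{reason}: {message}")
```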

Examine the output to determine the causes of the scheduling failure as explained in the next sections.

Insufficient resources

You might see a message indicating insufficient resources, such as CPU, memory, or disk space. For example:

5/8 nodes are available: 3 Insufficient memory, 3 Insufficient CPU.

To remedy this problem, check the resources allocated to the affected virtual machine and to other workloads scheduled on the node, then do one or more of the following, depending on your business needs:

Scale down other workloads scheduled on the node.
Reduce the amount of resources allocated to the affected virtual machine.
Add more machines to the affected cluster.
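
The scheduler message in this section, and the ones in the Tainted nodes and Storage faults sections, share one shape: a node count followed by a list of counted reasons. The following is a minimal sketch that splits such a message and maps each reason to the section that covers it. It relies only on the message format shown on this page; the function names and the keyword matching are illustrative assumptions.

```python
import re


def parse_scheduler_message(message):
    """Return (available, total, {reason: node_count}) for one message."""
    match = re.match(r"\s*(\d+)/(\d+) nodes are available: (.*)", message)
    if match is None:
        raise ValueError("unrecognized scheduler message")
    available, total = int(match.group(1)), int(match.group(2))
    rest = match.group(3).strip().rstrip(".")
    reasons = {}
    # Split only at commas that begin a new "<count> <reason>" item, so that
    # commas inside a reason (as in the taint message) are preserved.
    for item in re.split(r",\s+(?=\d+\s)", rest):
        count, _, reason = item.partition(" ")
        reasons[reason] = int(count)
    return available, total, reasons


def likely_cause(reason):
    """Map a reason to a section of this page (keyword match, assumed)."""
    text = reason.lower()
    if "insufficient" in text:
        return "resources"
    if "taint" in text:
        return "taint"
    if "volume" in text or "persistentvolumeclaim" in text:
        return "storage"
    return "other"


msg = "5/8 nodes are available: 3 Insufficient memory, 3 Insufficient CPU."
_, _, found = parse_scheduler_message(msg)
for reason, count in found.items():
    print(count, reason, "->", likely_cause(reason))
```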

Tainted nodes

You might see a message indicating the target node is tainted. For example:

5/8 nodes are available: 3 node(s) had taint {<taint-key>:<taint-value>}, that the pod didn't tolerate.

To remedy this problem, do the following:

Use the following command to check for taints on the node:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

The command returns output similar to the following:

NAME                           TAINTS
node-name-1   [map[effect:PreferNoSchedule key:node-role.kubernetes.io/master] map[effect:PreferNoSchedule key:node-role.kubernetes.io/control-plane]]
node-name-2   <none>

Do one of the following:
- For unexpected taints, remove them as described in Taints and Tolerations.
- For expected taints, add corresponding tolerations to the virtual machine's configuration, as described in Taints and Tolerations.
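
To judge whether a set of tolerations covers a node's taints, the standard Kubernetes matching rule can be applied by hand. The following is a minimal sketch of that rule, using plain dictionaries with the same field names as Kubernetes taints and tolerations (key, value, effect, operator). The function names are hypothetical, and the sketch does not read the virtual machine's configuration for you. It treats only NoSchedule and NoExecute as blocking, because PreferNoSchedule is a soft preference.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False  # an empty toleration effect matches every effect
    key = toleration.get("key", "")
    if key and key != taint.get("key", ""):
        return False  # an empty toleration key matches every key
    operator = toleration.get("operator", "") or "Equal"  # default: Equal
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def blocking_taints(taints, tolerations):
    """Return the hard taints that none of the tolerations cover."""
    return [
        taint
        for taint in taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in tolerations)
    ]


# Hypothetical taint and toleration, for illustration only.
node_taints = [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]
vm_tolerations = [{"key": "dedicated", "operator": "Exists"}]
print(blocking_taints(node_taints, vm_tolerations))  # []
```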

Storage faults

You might see a message indicating a fault with the virtual machine's storage. For example:

5/8 nodes are available: 3 node(s) had volume node affinity conflict, 3 node(s) had unbound immediate PersistentVolumeClaims.

This message might indicate that the corresponding persistent volume fails to mount on the target node.

To remedy this problem, do the following:

Use the following command to obtain the status of the persistent volume claims (PVCs) in the affected virtual machine's namespace:

kubectl get pvc -n NAMESPACE

Replace NAMESPACE with the name of the target namespace.

The command returns output similar to the following:

NAME                                               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
windows-robin-disk-0                               Bound     pvc-b1a1d264-84bf-4e58-857d-f37f629d5082   25Gi       RWX            robin-block-immediate   30h
windows-robin-disk-1                               Bound     pvc-0130b9a8-7fed-4df0-8226-d79273792a16   25Gi       RWX            robin-block-immediate   30h
windows-robin-vm-0-restored-windows-robin-disk-0   Pending                                                                        gce-pd-gkebackup-in     26m

Verify that the corresponding PVC has a Bound status; if the status is Pending, the storage subsystem has failed to provision the volume. In such cases, you must troubleshoot the storage subsystem configuration and ensure that the appropriate StorageClass is available.
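
In a namespace with many claims, scanning the table by eye is error-prone. The following is a minimal sketch that reads saved output of the command above and lists every claim whose status is not Bound. It assumes the default column layout shown in the sample, where the name and status are the first two columns; the function name unbound_pvcs is hypothetical.

```python
def unbound_pvcs(pvc_table):
    """Return (name, status) for each PVC row whose status is not Bound."""
    rows = []
    for line in pvc_table.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[1]
        if status != "Bound":
            rows.append((name, status))
    return rows


sample = """NAME                                               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
windows-robin-disk-0                               Bound     pvc-b1a1d264-84bf-4e58-857d-f37f629d5082   25Gi       RWX            robin-block-immediate   30h
windows-robin-vm-0-restored-windows-robin-disk-0   Pending                                                                        gce-pd-gkebackup-in     26m
"""

for name, status in unbound_pvcs(sample):
    print(f"{name} is {status}")
```

Only the first two columns are read, so rows with empty VOLUME and CAPACITY cells, as in the Pending row above, are still parsed correctly.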
