Google remotely monitors and maintains the Google Distributed Cloud connected hardware. For this purpose, Google engineers have Secure Shell (SSH) access to the Distributed Cloud connected hardware. If Google detects an issue, a Google engineer contacts you to troubleshoot and resolve it. If you have identified an issue yourself, contact Google Support immediately to diagnose and resolve it.
Distributed Cloud connected software upgrades
This section describes how to use Metrics Explorer to check whether a Distributed Cloud connected cluster is undergoing a software upgrade.
This procedure uses the following Monitoring metrics:
- Current Cluster Version (`/edge_cluster/current_cluster_version`): indicates the current version of the Distributed Cloud connected software running on the cluster.
- Target Cluster Version (`/edge_cluster/target_cluster_version`): indicates the target version of the Distributed Cloud connected software to which the cluster is upgrading.
To complete the steps in this section, you must satisfy the following prerequisites:
- Access to the Google Cloud console and your Distributed Cloud connected Google Cloud project.
- The Monitoring Viewer IAM role, which lets you view Monitoring metrics.
- (Optional) the `machine_id` value of the target Distributed Cloud connected machine for filtering the returned results.
Use Metrics Explorer to check the cluster's current and target software versions
Navigate to Metrics Explorer:
In the Google Cloud console, navigate to the Monitoring section.
In the left-hand navigation tree, click Metrics Explorer.
Select the target resource type:
In the Metrics Explorer page, navigate to the Configuration page.
Click Select a metric.
Use the search bar to search for the Cluster resource type. You can also use the full resource identifier `edgecontainer.googleapis.com/Cluster`.
In the returned results, click the Cluster resource type.
Obtain the cluster's current software version:
In the Metric section, search for the `current_cluster_version` value.
Select the Current Cluster Version metric. Its full path is `edgecontainer.googleapis.com/edge_cluster/current_cluster_version`.
(Optional) Filter by the target `machine_id` value using the Filter section.
Obtain the cluster's target software version:
Click Add Query.
In the Metric section, search for the `target_cluster_version` value.
Select the Target Cluster Version metric. Its full path is `edgecontainer.googleapis.com/edge_cluster/target_cluster_version`.
(Optional) Filter by the target `machine_id` value using the Filter section.
Check the cluster's software upgrade status in the chart that appears.
If the Current Cluster Version and Target Cluster Version lines indicate different values, the cluster is undergoing a software upgrade.
If the Current Cluster Version and Target Cluster Version lines indicate the same value, the cluster is not undergoing a software upgrade.
Verify the result from the previous step using the following command:
gcloud edge-cloud container clusters describe CLUSTER_ID --location=REGION
Replace the following:
- CLUSTER_ID: the ID of the target cluster.
- REGION: the Google Cloud region in which the cluster has been created.
In the command's output, note the values of the following fields:
- If the value of the `status` field is `UPDATING`, the cluster is undergoing a software upgrade.
- If the values of the `clusterVersion` and `targetVersion` fields are different, check them against the values returned by Metrics Explorer.
Understand the results
The following table explains the results returned by Metrics Explorer and the gcloud command.
| Cluster state | Diagnosis | Resolution |
|---|---|---|
| Healthy<br>`currentVersion` and `targetVersion` values match<br>`status` value is `RUNNING` | Cluster is running the target version of the Distributed Cloud connected software. | None. |
| Upgrading<br>`currentVersion` value is lower than `targetVersion`<br>`status` value is `UPDATING` | Cluster is upgrading to the target version of the Distributed Cloud connected software. | Monitor the cluster in Metrics Explorer until the current and target cluster version values match. |
| Stuck<br>`currentVersion` value is lower than `targetVersion` indefinitely<br>`status` value is `UPDATING` indefinitely | Upgrade to the target version of the Distributed Cloud connected software has failed on at least one node in the cluster. | Check machine connectivity and system logs; contact Google for assistance. |
| Rolling back<br>`currentVersion` value is higher than `targetVersion`<br>`status` value is `UPDATING` | Cluster is rolling back to a previous version of the Distributed Cloud connected software. | Contact Google to identify the reason for the rollback. |
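The decision table above can be expressed as a small classifier. The following is an illustrative sketch, not part of the product: it assumes you have already extracted the `currentVersion`, `targetVersion`, and `status` field values from the `gcloud` output, and that versions are simple dotted-integer strings.

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version string such as "1.7.0" into a comparable integer tuple."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def upgrade_state(current_version: str, target_version: str, status: str) -> str:
    """Classify the cluster per the decision table above (single-sample view)."""
    current = parse_version(current_version)
    target = parse_version(target_version)
    if current == target and status == "RUNNING":
        return "Healthy"
    if current < target and status == "UPDATING":
        return "Upgrading"  # diagnosed as "Stuck" only if this state persists indefinitely
    if current > target and status == "UPDATING":
        return "Rolling back"
    return "Unknown"


print(upgrade_state("1.6.2", "1.7.0", "UPDATING"))  # Upgrading
```

Note that a single sample cannot distinguish Upgrading from Stuck; that diagnosis requires observing the same state over an extended period.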
If the software upgrade on the cluster has failed or the cluster has rolled back to a previous software version, check the following:
- Node health. Verify that each physical Distributed Cloud connected machine has network connectivity and is reporting uptime as described in the next section.
- Maintenance windows. Verify whether the software upgrade has been paused due to a maintenance exclusion window.
- System logs. Examine your system logs to identify possible reasons for the software upgrade failure, such as Pod eviction timeouts.
If the resolution steps listed in the table do not resolve the issue, contact Google Support with the affected machine's `machine_id` value and the timestamp of the outage.
Distributed Cloud connected machine restarts
This section describes how to use Metrics Explorer to check whether a Distributed Cloud connected physical machine has restarted, and determine the reason for the restart. Monitoring restarts helps determine whether they were part of planned maintenance or the result of a hardware failure or power disruption.
This procedure uses the following Monitoring metrics:
- Machine Uptime (`/machine/uptime`): indicates the time, in seconds, since the last restart.
- Machine Restarts (`/machine/restart_count`): indicates the total number of restarts for the target machine since its deployment.
To complete the steps in this section, you must satisfy the following prerequisites:
- Access to the Google Cloud console and your Distributed Cloud connected Google Cloud project.
- The Monitoring Viewer IAM role, which lets you view Monitoring metrics.
- (Optional) the `machine_id` value of the target Distributed Cloud connected machine for filtering the returned results.
Use Metrics Explorer to check machine uptime and restart count
Navigate to Metrics Explorer:
In the Google Cloud console, navigate to the Monitoring section.
In the left-hand navigation tree, click Metrics Explorer.
Select the target resource type:
In the Metrics Explorer page, navigate to the Configuration page.
Click Select a metric.
Use the search bar to search for the Machine resource type. You can also use the full resource identifier `edgecontainer.googleapis.com/Machine`.
In the returned results, click the Machine resource type.
Check the machine's uptime:
In the Metric section, search for the `uptime` value.
Select the Machine Uptime metric. Its full path is `edgecontainer.googleapis.com/machine/uptime`.
(Optional) Filter by the target `machine_id` value using the Filter section.
In the time chart that appears, verify that the uptime graph is continuously rising. If at any point the uptime value drops to zero and begins rising again, the machine has restarted.
Check the machine's restart count:
In the Metric section, search for the `restart_count` value.
Select the Machine Restarts metric. Its full path is `edgecontainer.googleapis.com/machine/restart_count`.
(Optional) Filter by the target `machine_id` value using the Filter section.
In the time chart that appears, verify that the graph line remains at `0`, which indicates that no restarts have occurred. If at any point the line spikes to `1`, the machine has restarted; note the exact timestamp of the restart for further troubleshooting.
(Optional) To view individual events instead of a graph, navigate to the Aggregation section of the page, set the Alignment period field to `1 minute` and the Per-series aligner field to Delta.
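The chart check above can also be performed on raw data points. The following is a minimal sketch under the assumption that you have already exported the `machine/uptime` time series as (timestamp, seconds) pairs, for example through the Monitoring API: a restart shows up as a drop in the otherwise monotonically rising uptime counter.

```python
def find_restarts(uptime_samples):
    """Return the timestamps at which uptime dropped, meaning the machine restarted.

    uptime_samples: list of (timestamp, uptime_seconds) tuples, oldest first.
    """
    restarts = []
    for (_, prev), (ts, curr) in zip(uptime_samples, uptime_samples[1:]):
        if curr < prev:  # uptime counter reset: the machine rebooted
            restarts.append(ts)
    return restarts


# Illustrative samples: the uptime counter resets between 12:00 and 12:05.
samples = [("11:50", 3000), ("11:55", 3300), ("12:00", 3600),
           ("12:05", 120), ("12:10", 420)]
print(find_restarts(samples))  # ['12:05']
```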
Understand the results
The following table explains the results returned by Metrics Explorer.
| Machine state | Diagnosis | Resolution |
|---|---|---|
| Stable<br>"Machine Uptime" metric rises steadily<br>"Machine Restarts" metric delta is 0 | Machine has not restarted. | None. |
| Clean restart<br>"Machine Uptime" metric drops to 0<br>"Machine Restarts" metric spikes to 1 | Machine has successfully restarted and reconnected to Google Cloud. | Check system logs to determine the reason for the restart. |
| Power failure<br>"Machine Uptime" metric graph has a break with no data<br>"Machine Restarts" metric has not changed during the break in machine uptime | Machine lost power or network connectivity before it could restart. | Check power and network cabling, local network configuration, LED indicator status. |
| Intermittent<br>"Machine Connected" metric value alternates between 0 and 1<br>"Network Connectivity" metric value alternates between 0 and 1 | Unstable network connection, packet loss, or excessive latency. | Check your local network for congestion and faulty hardware. |
If the resolution steps listed in the table do not resolve the issue, contact Google Support with the affected machine's `machine_id` value and the timestamp of the outage.
Distributed Cloud connected machine connectivity
This section describes how to check the internet and Google Cloud connectivity for your Distributed Cloud connected machines using the Metrics Explorer feature of Cloud Monitoring.
This procedure uses the following Monitoring metrics:
- Machine Connected (`/machine/connected`): indicates whether the machine is connected to Google Cloud.
- Network Connectivity (`/machine/network/connectivity`): indicates whether the machine's primary network interface has internet connectivity.
To complete the steps in this section, you must satisfy the following prerequisites:
- Access to the Google Cloud console and your Distributed Cloud connected Google Cloud project.
- The Monitoring Viewer IAM role, which lets you view Monitoring metrics.
- (Optional) the `machine_id` value of the target Distributed Cloud connected machine for filtering the returned results.
Use Metrics Explorer to check machine connectivity
Navigate to Metrics Explorer:
In the Google Cloud console, navigate to the Monitoring section.
In the left-hand navigation tree, click Metrics Explorer.
Select the target resource type:
In the Metrics Explorer page, navigate to the Queries page.
Use the search bar to search for the Machine resource type. You can also use the full resource identifier `edgecontainer.googleapis.com/Machine`.
In the returned results, click the Machine resource type.
Check the machine's connection to Google Cloud:
In the Metric section, search for the `connected` value.
Select the Machine Connected metric. Its full path is `edgecontainer.googleapis.com/machine/connected`.
(Optional) Filter by the target `machine_id` value using the Filter section.
In the time chart that appears, verify that the Healthy line remains at 100% continuously. If at any point this value is 0% or Unhealthy, the machine lost connectivity with Google Cloud at the indicated time.
Check the machine's internet connectivity:
In the Metric section, search for the `connectivity` value.
Select the Network Connectivity metric. Its full path is `edgecontainer.googleapis.com/machine/network/connectivity`.
(Optional) Filter by the target `machine_id` value using the Filter section.
In the time chart that appears, verify that the Healthy line remains at 100% continuously. If at any point this value is 0% or Unhealthy, the machine lost internet connectivity at the indicated time.
Understand the results
The following table explains the results returned by Metrics Explorer.
| Machine state | Diagnosis | Resolution |
|---|---|---|
| Healthy<br>"Machine Connected" metric value is 1<br>"Network Connectivity" metric value is 1 | Normal operation. | None. |
| Disconnected<br>"Machine Connected" metric value is 0<br>"Network Connectivity" metric value is 1 | Machine has internet connectivity but can't connect to Google Cloud. | Check your firewall rules for Google services and API endpoints. Verify that Distributed Cloud connected agents are running on the machine. |
| Isolated<br>"Machine Connected" metric value is 0<br>"Network Connectivity" metric value is 0 | Machine has no internet connectivity. | Check power and network cabling, local network configuration, LED indicator status. Verify your VLAN and routing configuration. |
| Intermittent<br>"Machine Connected" metric value alternates between 0 and 1<br>"Network Connectivity" metric value alternates between 0 and 1 | Unstable network connection, packet loss, or excessive latency. | Check your local network for congestion and faulty hardware. |
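The table above amounts to a simple two-bit state machine. The following is an illustrative sketch, assuming you have already read the latest `machine/connected` and `machine/network/connectivity` values (0 or 1) for a machine:

```python
def connectivity_state(connected: int, network: int) -> str:
    """Map the two metric values to the machine states in the table above."""
    if connected == 1 and network == 1:
        return "Healthy"
    if connected == 0 and network == 1:
        return "Disconnected"  # internet reachable, but Google Cloud is not
    if connected == 0 and network == 0:
        return "Isolated"      # no internet connectivity at all
    return "Unknown"           # connected == 1 with network == 0 is unexpected


print(connectivity_state(0, 1))  # Disconnected
```

A single sample cannot identify the Intermittent state; that diagnosis requires the values to alternate across the chart's time range.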
If you notice sustained values of 0 for either metric, follow the troubleshooting steps described
in the table to resolve them. If the problem persists, contact Google Support
with the affected machine's machine_id value and the timestamp of the outage.
Virtual machines stuck in Pending state
A virtual machine workload can get stuck in the Pending state and fail to schedule on a node if one of the following happens:
- Distributed Cloud connected cannot allocate the requested resources, such as CPU time, memory, or disk space, to the virtual machine.
- There is a fault in the virtual machine's configuration.
- There is a fault with the virtual machine's storage.
- The target node is tainted.
To resolve this issue, do the following:
Obtain cluster credentials as described in Obtain credentials for a cluster.
Get information about the affected virtual machine:
kubectl describe virtualmachine VM_NAME -n NAMESPACE
Replace the following:
- VM_NAME: the name of the target virtual machine.
- NAMESPACE: the namespace of the target virtual machine.
The command returns output similar to the following:
```
Status:
  ...
  State: Pending
...
Events:
  Type     Reason                  Age  From                       Message
  ----     ------                  ---- ----                       -------
  Normal   SuccessfulCreate        15m  virtualmachine-controller  Created virtual machine my-stuck-vm
  Warning  DiskProvisioningFailed  14m  virtualmachine-controller  Failed to provision disk: DataVolume my-stuck-vm-data-disk not ready
  Warning  PVCNotBound             14m  virtualmachine-controller  PersistentVolumeClaim my-stuck-vm-data-disk is in phase Pending
  Warning  VMINotCreated           10m  virtualmachine-controller  VirtualMachineInstance cannot be created: dependencies not ready
```

The output of the command contains messages that might indicate resource constraints, scheduling failures, storage faults, and other problems.
Examine the output to determine the causes of the scheduling failure as explained in the next sections.
Insufficient resources
You might see a message indicating insufficient resources, such as CPU, memory, or disk space. For example:
5/8 nodes are available: 3 Insufficient memory, 3 Insufficient CPU.
To remedy this problem, check the resources allocated to the affected virtual machines and other workloads scheduled on the node, then do the following depending on your business needs:
- Scale down other workloads scheduled on the node.
- Reduce the amount of resources allocated to the affected virtual machine.
- Add more machines to the affected cluster.
Tainted nodes
You might see a message indicating the target node is tainted. For example:
5/8 nodes are available: 3 node(s) had taint {<taint-key>:<taint-value>}, that the pod didn't tolerate.
To remedy this problem, do the following:
Use the following command to check for taints on the node:
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
The command returns output similar to the following:
```
NAME          TAINTS
node-name-1   [map[effect:PreferNoSchedule key:node-role.kubernetes.io/master] map[effect:PreferNoSchedule key:node-role.kubernetes.io/control-plane]]
node-name-2   <none>
```

Do one of the following:
- For unexpected taints, remove them as described in Taints and Tolerations.
- For expected taints, add corresponding tolerations to the virtual machine's configuration, as described in Taints and Tolerations.
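The matching rule the scheduler applies can be sketched roughly as follows: a workload can schedule on a node only if every hard taint on the node is tolerated. This is a deliberately simplified illustration, not the full Kubernetes semantics; it ignores taint values, operators, and the softer `PreferNoSchedule` effect.

```python
def untolerated_taints(node_taints, tolerations):
    """Return the node taints that the workload does not tolerate.

    Both arguments are lists of {"key": ..., "effect": ...} dicts, a
    simplified subset of the Kubernetes taint and toleration fields.
    """
    tolerated = {(t["key"], t["effect"]) for t in tolerations}
    return [t for t in node_taints if (t["key"], t["effect"]) not in tolerated]


taints = [{"key": "dedicated", "effect": "NoSchedule"}]
print(untolerated_taints(taints, []))  # the workload tolerates nothing, so the taint is returned
print(untolerated_taints(taints, [{"key": "dedicated", "effect": "NoSchedule"}]))  # []
```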
Storage faults
You might see a message indicating a fault with the virtual machine's storage. For example:
5/8 nodes are available: 3 node(s) had volume node affinity conflict, 3 node(s) had unbound immediate PersistentVolumeClaims.
This message might indicate that the corresponding persistent volume fails to mount on the target node.
To remedy this problem, do the following:
Use the following command to obtain the status of the persistent volume claims (PVCs) in the affected virtual machine's namespace:
kubectl get pvc -n NAMESPACE
Replace NAMESPACE with the name of the target namespace.
The command returns output similar to the following:

```
NAME                                               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
windows-robin-disk-0                               Bound     pvc-b1a1d264-84bf-4e58-857d-f37f629d5082   25Gi       RWX            robin-block-immediate   30h
windows-robin-disk-1                               Bound     pvc-0130b9a8-7fed-4df0-8226-d79273792a16   25Gi       RWX            robin-block-immediate   30h
windows-robin-vm-0-restored-windows-robin-disk-0   Pending                                                                        gce-pd-gkebackup-in     26m
```

Verify that the corresponding PVC has a `Bound` status. If the status is `Pending`, the storage subsystem has failed to provision the volume. In such cases, you must troubleshoot the storage subsystem configuration and ensure that the appropriate `StorageClass` is available.
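As a sketch of the check above, assuming you export the PVC list in machine-readable form with `kubectl get pvc -n NAMESPACE -o json` (standard kubectl output), the following flags any claims that are not yet `Bound`:

```python
import json


def pending_pvcs(pvc_list_json: str):
    """Return the names of PVCs whose status.phase is not "Bound"."""
    items = json.loads(pvc_list_json)["items"]
    return [item["metadata"]["name"]
            for item in items
            if item["status"]["phase"] != "Bound"]


# Illustrative, trimmed-down kubectl output.
sample = json.dumps({"items": [
    {"metadata": {"name": "windows-robin-disk-0"}, "status": {"phase": "Bound"}},
    {"metadata": {"name": "restored-disk-0"}, "status": {"phase": "Pending"}},
]})
print(pending_pvcs(sample))  # ['restored-disk-0']
```

Any name this returns is a claim the storage subsystem has not provisioned, and a candidate cause of the stuck virtual machine.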