Troubleshoot Distributed Cloud connected

Google remotely monitors and maintains the Google Distributed Cloud connected hardware. For this purpose, Google engineers have Secure Shell (SSH) access to the Distributed Cloud connected hardware. If Google detects an issue, a Google engineer contacts you to troubleshoot and resolve it. If you have identified an issue yourself, contact Google Support immediately to diagnose and resolve it.

Corrupt BGP sessions in Cloud Router resources used by VPN connections

Distributed Cloud VPN connections rely on BGP sessions established and managed by their corresponding Cloud Router resources to advertise routes between the Distributed Cloud connected cluster and Google Cloud. If you modify the configuration of a Cloud Router resource associated with a Distributed Cloud VPN connection, that connection can stop functioning.

To recover from a corrupted BGP session configuration on the affected Cloud Router, complete the following steps:

  1. In the Google Cloud console, get the name of the corrupted BGP session. For example:

    INTERFACE=anthos-mcc-34987234
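
    Alternatively, you can list the BGP sessions configured on the Cloud Router with the gcloud CLI. ROUTER_NAME, REGION, and VPC_PROJECT_ID are placeholders for your own Cloud Router name, region, and project ID:

      gcloud compute routers describe ROUTER_NAME \
         --region=REGION \
         --project=VPC_PROJECT_ID \
         --format="yaml(bgpPeers)"

    The output lists the name, peer IP address, and peer ASN of each BGP session configured on the Cloud Router, which you also need in the next step.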
    
  2. Get the peer BGP IP address and the Cloud Router BGP IP address for the corrupted BGP session, as well as the peer ASN used by the affected Distributed Cloud VPN connection. For example:

    GDCE_BGP_IP=169.254.208.74
    CLOUD_ROUTER_BGP_IP=169.254.208.73
    PEER_ASN=65506
    

    If you deleted the BGP session, get this information from the Distributed Cloud connected cluster instead:

    1. Get the cluster credentials:

      gcloud edge-cloud container clusters get-credentials CLUSTER_ID \
        --location REGION \
        --project PROJECT_ID
      

      Replace the following:

      • CLUSTER_ID: the name of the target cluster.
      • REGION: the Google Cloud region in which the target cluster is created.
      • PROJECT_ID: the ID of the target Google Cloud project.
    2. Get the configuration of the MultiClusterConnectivityConfig resource:

      kubectl get multiclusterconnectivityconfig -A
      

      The command returns output similar to the following:

       NAMESPACE     NAME                   LOCAL ASN              PEER ASN
       kube-system   MultiClusterConfig1    65506                   65505
      
    3. Get the peer BGP IP address, the Cloud Router IP address, and the BGP session ASN:

      kubectl describe multiclusterconnectivityconfig -n kube-system MCC_CONFIG_NAME   
      

      Replace MCC_CONFIG_NAME with the name of the MultiClusterConnectivityConfig resource that you obtained in the previous step.

      The command returns output similar to the following:

       Spec:
         Asns:
           Peer:  65505
           Self:  65506 # GDCE ASN
         Tunnels:
           Ike Key:
             Name:       MCC_CONFIG_NAME-0
             Namespace:  kube-system
           Peer:
             Bgp IP:      169.254.208.73 # Cloud Router BGP IP
             Private IP:  34.157.98.148
             Public IP:   34.157.98.148
           Self:
             Bgp IP:      169.254.208.74 # GDCE BGP IP
             Private IP:  10.100.29.49
             Public IP:   208.117.254.68
      
  3. In the Google Cloud console, get the name, region, and Google Cloud project ID for the corrupted VPN tunnel. For example:

    VPN_TUNNEL=vpn-tunnel-1
    REGION=us-east1
    VPC_PROJECT_ID=vpc-project-1
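
    Alternatively, you can list the VPN tunnels in the project with the gcloud CLI:

      gcloud compute vpn-tunnels list --project=VPC_PROJECT_ID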
    
  4. Delete the corrupted BGP session from the Cloud Router configuration.
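
    For example, assuming that ROUTER_NAME is the name of the affected Cloud Router and INTERFACE is the BGP session name that you obtained in step 1, you can delete the BGP session and its Cloud Router interface with the gcloud CLI:

      gcloud compute routers remove-bgp-peer ROUTER_NAME \
         --peer-name=INTERFACE \
         --region=REGION \
         --project=VPC_PROJECT_ID

      gcloud compute routers remove-interface ROUTER_NAME \
         --interface-name=INTERFACE \
         --region=REGION \
         --project=VPC_PROJECT_ID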

  5. Create a new Cloud Router interface:

    gcloud compute routers add-interface ROUTER_NAME \
       --interface-name=INTERFACE_NAME \
       --vpn-tunnel=TUNNEL_NAME \
       --ip-address=ROUTER_BGP_IP \
       --mask-length=30 \
       --region=REGION \
       --project=VPC_PROJECT_ID
    

    Replace the following:

    • ROUTER_NAME: the name of the affected Cloud Router.
    • INTERFACE_NAME: a descriptive name that uniquely identifies this interface.
    • TUNNEL_NAME: the name of the VPN tunnel that you obtained in the previous step.
    • ROUTER_BGP_IP: the BGP IP address of the Cloud Router that you obtained earlier in this procedure.
    • REGION: the Google Cloud region in which the VPN tunnel and Cloud Router are located.
    • VPC_PROJECT_ID: the ID of the target VPC Google Cloud project.
  6. Create the BGP peer:

    gcloud compute routers add-bgp-peer ROUTER_NAME \
       --interface=INTERFACE_NAME \
       --peer-name=TUNNEL_NAME \
       --peer-ip-address=GDCE_BGP_IP \
       --peer-asn=GDCE_BGP_ASN \
       --advertised-route-priority=100 \
       --advertisement-mode=DEFAULT \
       --region=REGION \
       --project=VPC_PROJECT_ID
    

    Replace the following:

    • ROUTER_NAME: the name of the Cloud Router on which you created the interface in the previous step.
    • INTERFACE_NAME: the name of the interface that you created in the previous step.
    • TUNNEL_NAME: the name of the VPN tunnel that you used to create the interface in the previous step.
    • REGION: the Google Cloud region in which the VPN tunnel and Cloud Router are located.
    • VPC_PROJECT_ID: the ID of the target VPC Google Cloud project.
    • GDCE_BGP_IP: the Distributed Cloud peer BGP IP address that you obtained earlier in this procedure.
    • GDCE_BGP_ASN: the Distributed Cloud peer BGP ASN that you obtained earlier in this procedure.

At this point, the BGP session is back up and operational.
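
To verify that the session has been re-established, check the BGP peer status on the Cloud Router. The placeholder values below match the ones used earlier in this procedure:

  gcloud compute routers get-status ROUTER_NAME \
     --region=REGION \
     --project=VPC_PROJECT_ID

In the output, the bgpPeerStatus field reports the state of each BGP session; a healthy session typically shows a state of Established and a status of UP.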

Virtual machines stuck in Pending state

A virtual machine workload can get stuck in the Pending state and fail to schedule on a node if one of the following happens:

  • Distributed Cloud connected cannot allocate the requested resources, such as CPU time, memory, or disk space, to the virtual machine.
  • There is a fault in the virtual machine's configuration.
  • There is a fault with the virtual machine's storage.
  • The target node is tainted.

To resolve this issue, do the following:

  1. Obtain cluster credentials as described in Obtain credentials for a cluster.

  2. Get information about the affected virtual machine:

    kubectl describe virtualmachine VM_NAME -n NAMESPACE
    

    Replace the following:

    • VM_NAME: the name of the target virtual machine.
    • NAMESPACE: the namespace of the target virtual machine.

    The command returns output similar to the following:

    Status:
      ...
      State:                    Pending
      ...
    Events:
      Type     Reason                  Age   From                       Message
      ----     ------                  ----  ----                       -------
      Normal   SuccessfulCreate        15m   virtualmachine-controller  Created virtual machine my-stuck-vm
      Warning  DiskProvisioningFailed  14m   virtualmachine-controller  Failed to provision disk: DataVolume my-stuck-vm-data-disk not ready
      Warning  PVCNotBound             14m   virtualmachine-controller  PersistentVolumeClaim my-stuck-vm-data-disk is in phase Pending
      Warning  VMINotCreated           10m   virtualmachine-controller  VirtualMachineInstance cannot be created: dependencies not ready
    

    The output of the command contains messages that might indicate resource constraints, scheduling failures, storage faults, and other problems.

  3. Examine the output to determine the cause of the scheduling failure, as explained in the following sections.

Insufficient resources

You might see a message indicating insufficient resources, such as CPU, memory, or disk space. For example:

5/8 nodes are available: 3 Insufficient memory, 3 Insufficient CPU.

To remedy this problem, check the resources allocated to the affected virtual machine and to the other workloads scheduled on the node, for example as shown after the following list. Then, depending on your business needs, do one of the following:

  • Scale down other workloads scheduled on the node.
  • Reduce the amount of resources allocated to the affected virtual machine.
  • Add more machines to the affected cluster.
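
To check a node's allocatable capacity and the resources already requested by its scheduled workloads, you can describe the node. NODE_NAME is a placeholder for the name of the affected node:

  kubectl describe node NODE_NAME

The Allocated resources section of the output shows how much CPU and memory the workloads on the node have already requested.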

Tainted nodes

You might see a message indicating the target node is tainted. For example:

5/8 nodes are available: 3 node(s) had taint {<taint-key>:<taint-value>}, that the pod didn't tolerate.

To remedy this problem, do the following:

  1. Use the following command to check for taints on the node:

    kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
    

    The command returns output similar to the following:

    NAME                           TAINTS
    node-name-1   [map[effect:PreferNoSchedule key:node-role.kubernetes.io/master] map[effect:PreferNoSchedule key:node-role.kubernetes.io/control-plane]]
    node-name-2   <none>
    
  2. Depending on your needs, do one of the following:

    • If the taint is no longer needed, remove it from the node, for example as shown after this list.
    • If the taint is intentional, add a matching toleration to the affected virtual machine's specification so that it can be scheduled on the tainted node.
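
    For example, assuming that the taint is no longer needed, you can remove it with kubectl. TAINT_KEY and TAINT_EFFECT are placeholders for the taint key and effect that you found in the previous step:

      kubectl taint nodes NODE_NAME TAINT_KEY:TAINT_EFFECT-

    The trailing hyphen (-) removes the taint instead of adding it.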
Storage faults

You might see a message indicating a fault with the virtual machine's storage. For example:

5/8 nodes are available: 3 node(s) had volume node affinity conflict, 3 node(s) had unbound immediate PersistentVolumeClaims.

This message might indicate that the corresponding persistent volume fails to mount on the target node.

To remedy this problem, do the following:

  1. Use the following command to obtain the status of the persistent volume claims (PVCs) in the affected virtual machine's namespace:

    kubectl get pvc -n NAMESPACE
    

    Replace NAMESPACE with the name of the target namespace.

    The command returns output similar to the following:

    NAME                                               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
    windows-robin-disk-0                               Bound     pvc-b1a1d264-84bf-4e58-857d-f37f629d5082   25Gi       RWX            robin-block-immediate   30h
    windows-robin-disk-1                               Bound     pvc-0130b9a8-7fed-4df0-8226-d79273792a16   25Gi       RWX            robin-block-immediate   30h
    windows-robin-vm-0-restored-windows-robin-disk-0   Pending                                                                        gce-pd-gkebackup-in     26m
    
  2. Verify that the corresponding PVC has a Bound status. If the status is Pending, the storage subsystem has failed to provision the volume; in that case, troubleshoot the storage subsystem configuration and ensure that the appropriate StorageClass is available, for example by describing the PVC as shown after this step.
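
To investigate why a PVC is stuck in the Pending phase, you can describe it and inspect its events. PVC_NAME and NAMESPACE are placeholders for the name and namespace of the affected PVC:

  kubectl describe pvc PVC_NAME -n NAMESPACE

The Events section of the output typically indicates why provisioning failed, such as a missing or misconfigured StorageClass.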