Resolving traffic management issues in Cloud Service Mesh
This section explains common Cloud Service Mesh problems and how to resolve them. If you need additional assistance, see Getting support.
API server connection errors in istiod logs
If you see errors similar to the following in the istiod logs, istiod cannot contact the API server:
error k8s.io/client-go@v0.18.0/tools/cache/reflector.go:125: Failed to watch *crd.IstioSomeCustomResource: …dial tcp 10.43.240.1:443: connect: connection refused
You can use the regular expression /error.*cannot list resource/ to find this error in the logs.
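For example, assuming a typical in-cluster control plane where istiod runs as a deployment in the istio-system namespace (adjust the namespace and deployment name for your installation), you could search recent logs like this:

```shell
# Search recent istiod logs for API server connection errors.
# The namespace and deployment name below are common defaults, not
# guaranteed for every installation.
kubectl logs -n istio-system deployment/istiod --since=1h \
  | grep -E 'error.*cannot list resource'
```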
This error is usually transient; by the time you retrieve the proxy logs using kubectl, the issue might already be resolved. The error is typically caused by events that make the API server temporarily unavailable, such as when an API server that is not in a high-availability configuration restarts for an upgrade or autoscaling change.
The istio-init container crashes
This problem can occur when the pod iptables rules are not applied to the pod network namespace. This can be caused by:
- An incomplete istio-cni installation
- Insufficient workload pod permissions (missing CAP_NET_ADMIN permission)
If you use the Istio CNI plugin, verify that you followed the instructions
completely. Verify that the istio-cni-node container is ready, and check the
logs. If the problem persists, connect to the host node over SSH, search the node logs for nsenter commands, and check whether any errors are present.
If you don't use the Istio CNI plugin, verify that the workload pod has
CAP_NET_ADMIN permission, which is automatically set by the sidecar injector.
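One way to check this (a sketch; the pod name and namespace are placeholders, and the init container name assumes the default sidecar injection template) is to inspect the capabilities granted to the istio-init container:

```shell
# Print the Linux capabilities added to the istio-init container, which
# applies the pod's iptables rules. POD_NAME and NAMESPACE are placeholders.
kubectl get pod POD_NAME -n NAMESPACE \
  -o jsonpath='{.spec.initContainers[?(@.name=="istio-init")].securityContext.capabilities.add}'
# Expect the output to include NET_ADMIN when the sidecar injector has
# configured the permissions correctly.
```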
Connection refused after pod starts
When a Pod starts and gets connection refused trying to connect to an
endpoint, the problem might be that the application container started before
the istio-proxy container. In this case, the application container sends the
request to istio-proxy, but the connection is refused because istio-proxy
isn't listening on the port yet.
In this case, you can:
- Modify your application's startup code to make continuous requests to the istio-proxy health endpoint until the application receives a 200 code. The istio-proxy health endpoint is: http://localhost:15020/healthz/ready
- Add a retry request mechanism to your application workload.
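As a sketch of the first option, an entrypoint wrapper can block until the sidecar reports ready before starting the application. The retry count and sleep interval here are illustrative values, not recommendations:

```shell
#!/bin/sh
# Wait for the istio-proxy sidecar to report ready before starting the app.
# The health endpoint is the one documented above; 30 attempts at 1-second
# intervals is an illustrative timeout.
i=0
until curl -fsS http://localhost:15020/healthz/ready > /dev/null; do
  i=$((i + 1))
  if [ "$i" -ge 30 ]; then
    echo "istio-proxy did not become ready in time" >&2
    exit 1
  fi
  sleep 1
done
exec "$@"   # start the real application once the sidecar is ready
```

Using `exec "$@"` keeps the application as PID 1 of the container so that signals from the kubelet are delivered to it directly.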
Listing gateways returns empty
Symptom: When you list Gateways using kubectl get gateway --all-namespaces
after successfully creating a Cloud Service Mesh Gateway, the command returns
No resources found.
This problem can happen on GKE 1.20 and later because the GKE Gateway controller
automatically installs the GKE Gateway.networking.x-k8s.io/v1alpha1 resource
in clusters. To work around the issue:
Check if there are multiple gateway custom resources in the cluster:
kubectl api-resources | grep gateway
Example output:
gateways         gw    networking.istio.io/v1beta1    true    Gateway
gatewayclasses   gc    networking.x-k8s.io/v1alpha1   false   GatewayClass
gateways         gtw   networking.x-k8s.io/v1alpha1   true    Gateway
If the list shows entries other than Gateways with the apiVersion networking.istio.io/v1beta1, use the full resource name or a distinguishable short name in the kubectl command. For example, run kubectl get gw or kubectl get gateways.networking.istio.io instead of kubectl get gateway to make sure Istio Gateways are listed.
For more information on this issue, see Kubernetes Gateways and Istio Gateways.
Envoy proxy hanging on initialization
If debug logs indicate that the Envoy proxy is hanging during initialization, you can use the following command to identify what is blocking the process:
kubectl exec -it POD_NAME -n NAMESPACE -c istio-proxy -- /usr/local/bin/pilot-agent request POST /init_dump
Troubleshooting 5xx HTTP response errors
You may encounter 5xx HTTP response errors when accessing applications through the Istio Ingress Gateway. Follow these steps to diagnose and resolve the issue.
- Identify the control plane
- Verify potential misconfigurations
- Verify pod discovery
- Analyze access logs
Identify the control plane
Determine the control plane version and configuration because different versions may influence diagnostic processes. To verify the state of the managed control plane, run the following command:
gcloud container fleet mesh describe --project PROJECT_ID
An ACTIVE state indicates the managed control plane is running normally.
Verify potential misconfigurations
Common misconfigurations can lead to routing failures:
- Namespace Mismatch: The VirtualService must be in the same namespace as the backend GKE service.
- Gateway Reference Mismatch: The VirtualService must explicitly reference the correct Gateway. For example, if the VirtualService references istio-system/istio-ingressgateway but the gateway is in the default namespace, traffic won't route correctly.
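To spot both kinds of mismatch, you can list the VirtualServices alongside the namespaces and gateways they reference and compare them with the Gateways that actually exist (a sketch; no resource names are assumed):

```shell
# Show each VirtualService with its namespace, referenced gateways, and hosts,
# so namespace and gateway-reference mismatches stand out at a glance.
kubectl get virtualservices -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,GATEWAYS:.spec.gateways[*],HOSTS:.spec.hosts[*]'

# List the Istio Gateways and the namespaces they live in for comparison.
kubectl get gateways.networking.istio.io -A
```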
Verify pod discovery
Ensure the application pod is correctly discovered by the mesh:
istioctl proxy-config endpoints POD_NAME
Analyze access logs
Enable access logs to determine if errors originate from the backend application
or the proxy. Key fields include RESPONSE_FLAGS, UPSTREAM_LOCAL_ADDRESS, and
RESPONSE_CODE.
If logs indicate upstream errors, perform a direct curl test to the
GKE service from a pod in the same namespace. If the curl
request returns the same 5xx error, the issue originates from the backend
application itself.
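For example, assuming a test pod with curl available in the same namespace as the service (the pod, service, namespace, and port names are placeholders), the direct test could look like:

```shell
# Bypass the ingress gateway and call the GKE service directly from inside
# the namespace. TEST_POD, NAMESPACE, SERVICE_NAME, and PORT are placeholders.
kubectl exec -n NAMESPACE TEST_POD -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  http://SERVICE_NAME.NAMESPACE.svc.cluster.local:PORT/
# A 5xx status here as well points at the backend application rather
# than the mesh.
```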
Troubleshooting SSL certificate issues
SSL certificate errors at the Istio Ingress Gateway can be caused by expired certificates, protocol mismatches, or incorrect secret configurations.
Identify SSL errors
Review the logs from the Istio Ingress Gateway pod:
kubectl logs -l app=istio-ingressgateway -n GATEWAY_NAMESPACE
Verify certificates and keys
Ensure the certificate and private key are valid and match using MD5 hashes:
openssl x509 -noout -modulus -in CERTIFICATE.CRT | openssl md5
openssl rsa -noout -modulus -in PRIVATE_KEY | openssl md5
Confirm the certificate is not expired and that the Common Name (CN) or Subject Alternative Name (SAN) matches the domain.
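The expiry and name fields can also be checked with openssl, for example:

```shell
# Print the validity window and the subject of the certificate.
openssl x509 -noout -dates -subject -in CERTIFICATE.CRT

# Print the Subject Alternative Name extension to compare against the domain.
openssl x509 -noout -ext subjectAltName -in CERTIFICATE.CRT
```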
Check TLS protocol and secret configuration
Verify the TLS version and cipher suites in the Gateway CRD. Ensure the
Kubernetes secret containing the certificate and key is in the same namespace as
the Ingress Gateway and that the credentialName matches the secret name.
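To confirm this (gateway and namespace names are placeholders), you can compare the credentialName values declared in the Gateway with the secrets that exist in the ingress gateway's namespace:

```shell
# Print the credentialName values referenced by the Gateway's TLS servers.
kubectl get gateway GATEWAY_NAME -n GATEWAY_NAMESPACE \
  -o jsonpath='{.spec.servers[*].tls.credentialName}'; echo

# Each credentialName must match a secret in the same namespace as the
# ingress gateway pods.
kubectl get secrets -n GATEWAY_NAMESPACE
```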
Troubleshooting intermittent timeouts to external endpoints
Intermittent timeouts may occur when requests are made to external endpoints that resolve to multiple IP addresses, such as a third-party network load balancer (NLB).
Potential cause: ServiceEntry resolution
If spec.resolution is set to DNS, Istio uses "strict DNS" resolution, which load balances across all resolved IP addresses. Some NLBs don't support this behavior.
Resolution
To resolve this, set the resolution on the ServiceEntry to DNS_ROUND_ROBIN.
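For example, an existing ServiceEntry could be patched in place (the resource name and namespace are placeholders):

```shell
# Switch the ServiceEntry from strict DNS to round-robin DNS resolution,
# so each new connection uses one resolved IP address at a time.
kubectl patch serviceentry SERVICE_ENTRY_NAME -n NAMESPACE \
  --type merge -p '{"spec":{"resolution":"DNS_ROUND_ROBIN"}}'
```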
Troubleshooting uneven traffic distribution
If traffic is not distributed evenly across application pods, check the following:
- Pod Health: Ensure all application pods were healthy during the imbalance.
- Load Balancing Algorithm: Review the DestinationRule configuration for the loadBalancer setting (for example, consistentHash or ROUND_ROBIN).
- Locality Load Balancing: Verify whether localityLbSetting is enabled. Note that this is not supported with the TRAFFIC_DIRECTOR control plane.
Troubleshooting service propagation and quota issues
If new services or networking configurations are not being reflected in the mesh, it may be due to resource quotas being reached in the fleet project.
Symptoms
- Networking configurations (such as VirtualService or DestinationRule) are not being pushed to sidecar proxies.
- New services appear "invisible" to the mesh despite being correctly defined.
Steps to identify and resolve
- Check resource quotas: Verify whether the fleet project has reached its quota for BackendService resources by checking the Global internal traffic director backend services quota in the fleet project. Cloud Service Mesh typically creates one BackendService per Kubernetes service port.
- Review scale limitations: Ensure that your configuration remains within the supported scale limitations for Cloud Service Mesh.
- Increase quotas: If quotas have been reached, request an increase for the affected resource in the fleet project to restore normal service propagation.
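As a starting point for the quota check, the project-wide quota metrics can be listed with gcloud; the relevant metric name in the output is an assumption, so treat the console Quotas page as authoritative:

```shell
# List quota metrics, current usage, and limits for the fleet project.
# Look for backend-services related entries approaching their limits.
gcloud compute project-info describe --project PROJECT_ID \
  --flatten=quotas \
  --format='table(quotas.metric,quotas.usage,quotas.limit)'
```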
Troubleshooting VirtualService evaluation
If a VirtualService is not behaving as expected, remember that routes are evaluated in the order they are listed. When multiple VirtualServices exist for the same host, their routes are merged, and routes from older VirtualServices are placed before routes from newer ones in the merged list. Combined with the first-match rule, this means a route in an older VirtualService can shadow a route in a newer one.
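To see the merged, ordered route list that Envoy actually received (the pod name, namespace, and port are placeholders), you can dump the routes from a proxy:

```shell
# Show the route configuration Envoy received for port 80, in the order
# routes are evaluated. POD_NAME and NAMESPACE are placeholders.
istioctl proxy-config routes POD_NAME -n NAMESPACE --name 80 -o json
```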