In VPC-native Google Kubernetes Engine (GKE) clusters, nodes without external IP addresses enhance security. These nodes use Cloud NAT for outbound internet connections. Packet loss can occur if a Node VM exhausts its allocated Cloud NAT ports and IP addresses, often under high outbound load, disrupting egress traffic.
Use this document to learn how to log and monitor for dropped packets, investigate your Cloud NAT configuration, and implement measures to reduce future packet loss.
This information is important for Platform admins and operators and Network administrators who manage GKE clusters with nodes without internet access and rely on Cloud NAT for external connectivity. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Diagnose packet loss
The following sections explain how to log dropped packets by using Cloud Logging and how to diagnose the cause of dropped packets by using Cloud Monitoring.
Log dropped packets
You can log dropped packets with the following query in Cloud Logging:
resource.type="nat_gateway" resource.labels.region=REGION resource.labels.gateway_name=GATEWAY_NAME jsonPayload.allocation_status="DROPPED"
Replace the following:
REGION: the name of the region that the cluster is in.
GATEWAY_NAME: the name of the Cloud NAT gateway.
This query returns a list of all packets dropped by the Cloud NAT gateway, but doesn't identify the cause.
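As a minimal sketch, the filter above can be assembled programmatically so that both placeholders are filled in consistently. The helper name `build_dropped_packets_filter` and the example values are hypothetical; only the filter fields come from the query above.

```python
def build_dropped_packets_filter(region: str, gateway_name: str) -> str:
    """Build the Cloud Logging filter for packets dropped by one Cloud NAT gateway.

    Hypothetical helper: only the four filter fields come from the document.
    """
    clauses = [
        'resource.type="nat_gateway"',
        f'resource.labels.region="{region}"',
        f'resource.labels.gateway_name="{gateway_name}"',
        'jsonPayload.allocation_status="DROPPED"',
    ]
    # Clauses separated by whitespace are combined with an implicit AND.
    return " ".join(clauses)


if __name__ == "__main__":
    # Example values only; substitute your own region and gateway name.
    print(build_dropped_packets_filter("us-central1", "my-nat-gateway"))
```

The resulting string can be pasted into the Logs Explorer query pane in place of the hand-edited query.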
Monitor causes for packet loss
To identify causes for dropped packets, run queries in Metrics Explorer in Cloud Monitoring. Packets drop for one of three reasons:
OUT_OF_RESOURCES
ENDPOINT_INDEPENDENT_CONFLICT
NAT_ALLOCATION_FAILED
To identify packets dropped due to OUT_OF_RESOURCES or ENDPOINT_INDEPENDENT_CONFLICT error codes, use the following query:
fetch nat_gateway
| metric 'router.googleapis.com/nat/dropped_sent_packets_count'
| filter (resource.gateway_name == GATEWAY_NAME)
| align rate(1m)
| every 1m
| group_by [metric.reason],
    [value_dropped_sent_packets_count_aggregate:
      aggregate(value.dropped_sent_packets_count)]
If you identify packets that drop because of these reasons, see Packets dropped with reason: out of resources and Packets dropped with reason: endpoint independent conflict for troubleshooting advice.
To identify packets dropped due to the NAT_ALLOCATION_FAILED error code, use
the following query:
fetch nat_gateway
| metric 'router.googleapis.com/nat/nat_allocation_failed'
| group_by 1m,
    [value_nat_allocation_failed_count_true:
      count_true(value.nat_allocation_failed)]
| every 1m
If you identify packets that dropped for this reason, see Need to allocate more IP addresses.
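The mapping from drop reason to troubleshooting section can be summarized in a short sketch. The reason names and section titles are copied from this document; the `drop_rates` input is a hypothetical dictionary of values that you read off the charts yourself, and the label values that your charts show might be spelled differently.

```python
# Reason names and section titles as they appear in this document.
TROUBLESHOOTING_SECTIONS = {
    "OUT_OF_RESOURCES": "Packets dropped with reason: out of resources",
    "ENDPOINT_INDEPENDENT_CONFLICT": (
        "Packets dropped with reason: endpoint independent conflict"
    ),
    "NAT_ALLOCATION_FAILED": "Need to allocate more IP addresses",
}


def sections_for(drop_rates):
    """Return (reason, section) pairs for reasons with a non-zero rate.

    drop_rates is a hypothetical dict of {reason: observed rate}.
    Results are ordered from the highest rate to the lowest.
    """
    active = [(reason, rate) for reason, rate in drop_rates.items() if rate > 0]
    active.sort(key=lambda item: item[1], reverse=True)
    return [
        (reason, TROUBLESHOOTING_SECTIONS.get(reason, "Unknown reason"))
        for reason, _ in active
    ]


if __name__ == "__main__":
    observed = {"OUT_OF_RESOURCES": 2.5, "NAT_ALLOCATION_FAILED": 0}
    for reason, section in sections_for(observed):
        print(f"{reason}: see '{section}'")
```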
Investigate Cloud NAT configuration
If the previous queries return empty results and GKE Pods still can't communicate with external IP addresses, use the following table to help you troubleshoot your configuration:
| Configuration | Troubleshooting |
| --- | --- |
| Cloud NAT configured to apply only to the subnet's primary IP address range. | When Cloud NAT is configured only for the subnet's primary IP address range, packets sent from the cluster to external IP addresses must have a source node IP address. In this Cloud NAT configuration: |
| Cloud NAT configured to apply only to the subnet's secondary IP address range used for Pod IPs. | When Cloud NAT is configured only for the subnet's secondary IP address range used by the cluster's Pod IPs, packets sent from the cluster to external IP addresses must have a source Pod IP address. In this Cloud NAT configuration: |
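The source-address requirement in the table can be illustrated with a short sketch. The CIDR ranges and IP addresses below are made-up example values, and `nat_applies` is a hypothetical helper: it only tests whether a packet's source address falls inside a range that Cloud NAT is configured for, and doesn't call any Google Cloud API.

```python
import ipaddress
from typing import Iterable


def nat_applies(source_ip: str, nat_ranges: Iterable[str]) -> bool:
    """Return True if source_ip is inside any range that Cloud NAT covers."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in nat_ranges)


# Example values only (assumptions, not taken from a real cluster).
PRIMARY_RANGE = "10.0.0.0/24"  # subnet primary range, used for node IPs
POD_RANGE = "10.4.0.0/14"      # subnet secondary range, used for Pod IPs

if __name__ == "__main__":
    # Cloud NAT configured only for the primary range:
    print(nat_applies("10.0.0.5", [PRIMARY_RANGE]))  # node source IP -> True
    print(nat_applies("10.4.1.7", [PRIMARY_RANGE]))  # Pod source IP -> False
    # Cloud NAT configured only for the Pod secondary range:
    print(nat_applies("10.4.1.7", [POD_RANGE]))      # Pod source IP -> True
```

A `False` result corresponds to the case where the packet's source address doesn't match the range that the gateway is configured for.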
Reduce packet loss
After you have diagnosed the cause of your packet loss, consider the following recommendations to reduce the likelihood of the issue recurring:
Configure the Cloud NAT gateway to use dynamic port allocation and increase the maximum number of ports per VM.
If you're using static port allocation, increase the minimum number of ports per VM.
Reduce your application's outbound packet rate. When an application makes multiple outbound connections to the same destination IP address and port, it can quickly consume all of the connections that Cloud NAT can make to that destination with the allocated NAT source address and source port tuples.
For details about how Cloud NAT uses NAT source addresses and source ports to make connections, including limits on the number of simultaneous connections to a destination, refer to Ports and connections.
To reduce the rate of outbound connections from the application, reuse open connections. Common methods of reusing connections include connection pooling, multiplexing connections by using protocols such as HTTP/2, and establishing persistent connections that are reused for multiple requests.
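To see why many connections to one destination exhaust ports quickly, consider a rough calculation. This sketch rests on assumptions that you should confirm in Ports and connections: each NAT IP address provides 64,512 usable source ports per protocol, each simultaneous connection to the same destination IP address, port, and protocol needs its own NAT source address and source port tuple, and all Pods on a node share that node VM's port allocation. The function names and example numbers are illustrative.

```python
# Assumption: 65,536 ports minus the first 1,024, per NAT IP and protocol.
USABLE_PORTS_PER_NAT_IP = 64_512


def ports_needed_per_node(pods_per_node: int, connections_per_pod: int) -> int:
    """Ports one node VM needs if every Pod holds connections_per_pod
    simultaneous connections to the same destination IP, port, and protocol."""
    return pods_per_node * connections_per_pod


def nodes_per_nat_ip(ports_per_node: int) -> int:
    """How many node VMs a single NAT IP address can serve at that allocation."""
    if ports_per_node <= 0:
        raise ValueError("ports_per_node must be positive")
    return USABLE_PORTS_PER_NAT_IP // ports_per_node


if __name__ == "__main__":
    needed = ports_needed_per_node(pods_per_node=30, connections_per_pod=50)
    print(needed)                    # 1500
    print(nodes_per_nat_ip(needed))  # 43
```

Under these assumptions, halving the number of simultaneous connections per Pod through connection reuse halves the ports each node needs and roughly doubles the number of nodes that one NAT IP address can serve.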
What's next
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.