You're viewing Apigee and Apigee hybrid documentation.
There is no equivalent Apigee Edge documentation for this topic.
Symptom
When starting up, the Metrics pods remain in the CrashLoopBackOff state. This can cause periodic gaps in your metrics and graphs as the pods restart, and discrepancies with Analytics data because some sections of data are missing.
This issue can occur if your hybrid installation produces a large amount of metrics data. This can result from a high traffic load (which leads to a large number of underlying resources emitting metrics, for example message processors (MPs)) or from a large number of monitored Apigee resources (for example, proxies, targets, environments, and policies).
Error messages
When you use kubectl to view the pod states, you see that one or more Metrics pods are in the CrashLoopBackOff state. For example:
kubectl get pods -n NAMESPACE
NAME                                              READY   STATUS             RESTARTS   AGE
apigee-metrics-default-telemetry-proxy-1a2b3c4    0/1     CrashLoopBackOff   10         10m
apigee-metrics-adapter-apigee-telemetry-a2b3c4d   0/1     CrashLoopBackOff   10         10m
...
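In a busy namespace it can help to filter the listing down to the failing pods. The following is a minimal sketch, not part of the Apigee tooling: the function name `crashlooping_pods` is hypothetical, and it assumes the default `kubectl get pods` column order shown above (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print the name and restart count of every pod whose STATUS column
# reads CrashLoopBackOff. Reads `kubectl get pods` output on stdin.
# Assumption: default column order NAME READY STATUS RESTARTS AGE.
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1, $4 }'
}

# Usage (NAMESPACE is a placeholder for your Apigee namespace):
#   kubectl get pods -n NAMESPACE | crashlooping_pods
```

The function only reads text from standard input, so you can also run it against a saved copy of the listing.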
Possible causes
Cause | Description | Troubleshooting instructions applicable for |
---|---|---|
Metrics pods are out of memory | Telemetry pods are in CrashLoopBackOff because of insufficient memory | Apigee hybrid |
Cause 1
Metrics pods are out of memory (OOM), with the error reason OOMKilled.
Diagnosis
Confirm that the issue is occurring by inspecting the description of the failing pod:
- List the pods to get the ID of the Metrics pod that is failing:
kubectl get pods -n APIGEE_NAMESPACE -l "app in (app, proxy, collector)"
- Describe the failing pod to view the state of its containers:
kubectl -n APIGEE_NAMESPACE describe pods POD_NAME
For example:
kubectl describe -n apigee pods apigee-metrics-default-telemetry-proxy-1a2b3c4
Investigate the apigee-prometheus-agg section of the output. Output like the following indicates that the container is repeatedly hitting an OOM condition:
Containers:
  apigee-prometheus-agg:
    Container ID:  docker://cd893dbb06c2672c41a7d6f3f7d0de4d76742e68cef70d4250bf2d5cdfcdeae6
    Image:         us.gcr.io/apigee-saas-staging-repo/thirdparty/apigee-prom-prometheus/master:v2.9.2
    Image ID:      docker-pullable://us.gcr.io/apigee-saas-staging-repo/thirdparty/apigee-prom-prometheus/master@sha256:05350e0d1a577674442046961abf56b3e883dcd82346962f9e73f00667958f6b
    Port:          19090/TCP
    Host Port:     0/TCP
    Args:
      --config.file=/etc/prometheus/agg/prometheus.yml
      --storage.tsdb.path=/prometheus/agg/
      --storage.tsdb.retention=48h
      --web.enable-admin-api
      --web.listen-address=127.0.0.1:19090
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 21 Oct 2020 16:53:42 +0000
      Finished:     Wed, 21 Oct 2020 16:54:28 +0000
    Ready:          False
    Restart Count:  1446
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     100m
      memory:  256Mi
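As an alternative to reading the full describe output, the last termination reason of each container can be pulled out with a kubectl JSONPath query. The sketch below is illustrative only: the helper name `oom_killed_containers` is hypothetical, and it assumes the query prints one "container name, tab, reason" line per container (containers that have never terminated produce an empty reason).

```shell
# Print the names of containers whose last termination reason was OOMKilled.
# Expects "container-name<TAB>reason" lines on stdin.
oom_killed_containers() {
  awk -F '\t' '$2 == "OOMKilled" { print $1 }'
}

# Usage (APIGEE_NAMESPACE and POD_NAME are placeholders):
#   kubectl get pod POD_NAME -n APIGEE_NAMESPACE \
#     -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed_containers
```

If the function prints apigee-prometheus-agg, that matches the OOMKilled state shown in the describe output above.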
Resolution
- Check the current container limits with the following command:
kubectl -n APIGEE_NAMESPACE describe pods POD_NAME
- Configure the metrics pod limits in your overrides.yaml file with the following properties:
metrics:
  aggregator: # The apigee-prometheus-agg container in the "proxy" pod
    resources:
      limits:
        memory: 32Gi # default: 3Gi
  app: # The apigee-prometheus-app container in the "app" pod
    resources:
      limits:
        memory: 16Gi # default: 1Gi
- Apply the changes with helm upgrade:
helm upgrade telemetry apigee-telemetry/ \
  --install \
  --namespace APIGEE_NAMESPACE \
  -f OVERRIDES_FILE
If the OOM error is still encountered after increasing the limit, you can upsize the underlying nodes to allow for more memory usage.
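After the upgrade, you may want to confirm that the restarted pods actually picked up the new limits. The following is a rough sketch under stated assumptions: the helper names `mem_to_mib` and `containers_below_limit` are hypothetical, and only quantities with a Mi or Gi suffix (as used in this topic) are handled.

```shell
# Convert a Kubernetes memory quantity with a Mi or Gi suffix to MiB.
# Assumption: other suffixes (Ki, M, G, ...) are not needed here and are rejected.
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Read "container<TAB>memory-limit" lines on stdin and print the containers
# whose limit is still below the wanted quantity passed as $1.
containers_below_limit() {
  want=$(mem_to_mib "$1") || return 1
  while IFS="$(printf '\t')" read -r name limit; do
    [ -n "$limit" ] || continue                  # skip containers with no limit set
    have=$(mem_to_mib "$limit") || continue      # skip quantities we cannot parse
    [ "$have" -lt "$want" ] && echo "$name $limit"
  done
  return 0
}

# Usage (APIGEE_NAMESPACE and POD_NAME are placeholders):
#   kubectl get pod POD_NAME -n APIGEE_NAMESPACE \
#     -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}' \
#     | containers_below_limit 32Gi
```

No output means every container with a parseable limit is at or above the value you asked for.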
Related documents
Manage runtime plane components
Must gather diagnostic information
Gather the following diagnostic information, and then contact Apigee Support:
Gather Data from Prometheus Container for Troubleshooting
Start port-forwarding for the Prometheus container. Repeat for both the app and proxy pods. For example:
kubectl port-forward -n apigee apigee-metrics-apigee-telemetry-app-1a2-b3c4-d5ef 8081:9090
Use the script below within your cluster to collect data:
#!/bin/bash
set -e

# Check that jq is installed. The check runs inside `if` so that `set -e`
# does not abort the script before the error message is printed.
if ! jq --version &> /dev/null; then
  echo "jq not installed"
  exit 1
fi

# Check that curl is installed
if ! curl --version &> /dev/null; then
  echo "curl not installed"
  exit 1
fi

# Simple check for missing arguments
if [[ $# -eq 0 ]]; then
  echo 'No arguments provided'
  exit 1
fi

# Simple check for the wrong number of arguments
if [[ $# -ne 3 ]]; then
  echo 'Illegal number of arguments'
  exit 1
fi

FORWARDED_PORT=${1}
DEST_DIR=${2}
CASE_NUMBER=${3}
DIR_FULL_PATH=${DEST_DIR}/${CASE_NUMBER}_$(date +%Y_%m_%d_%H_%M_%S)
CURRENT_DATE=$(date +%Y-%m-%d)
# Default start date for the queries: 10 days before the current date (GNU date syntax)
START_DATE=$(date +%Y-%m-%d -d "10 days ago")

mkdir -pv "${DIR_FULL_PATH}"

set -x
curl -s '127.0.0.1:'${FORWARDED_PORT}'/status' \
  | tee ${DIR_FULL_PATH}/prometheus_status_$(hostname)-$(date +%Y.%m.%d_%H.%M.%S).txt
curl -s '127.0.0.1:'${FORWARDED_PORT}'/config' \
  | tee ${DIR_FULL_PATH}/prometheus_config_$(hostname)-$(date +%Y.%m.%d_%H.%M.%S).txt
curl -s '127.0.0.1:'${FORWARDED_PORT}'/api/v1/targets' \
  | tee ${DIR_FULL_PATH}/prometheus_targets_$(hostname)-$(date +%Y.%m.%d_%H.%M.%S).json
curl -s '127.0.0.1:'${FORWARDED_PORT}'/api/v1/status/config' | jq . \
  | tee ${DIR_FULL_PATH}/prometheus_status_config_$(hostname)-$(date +%Y.%m.%d_%H.%M.%S).json
curl -s '127.0.0.1:'${FORWARDED_PORT}'/debug/pprof/heap' \
  --output ${DIR_FULL_PATH}/prometheus_heap_$(date +%Y.%m.%d_%H.%M.%S).hprof
curl -s '127.0.0.1:'${FORWARDED_PORT}'/debug/pprof/heap?debug=1' \
  | tee ${DIR_FULL_PATH}/prometheus_heap_$(date +%Y.%m.%d_%H.%M.%S).txt
curl -s '127.0.0.1:'${FORWARDED_PORT}'/debug/pprof/goroutine' \
  --output ${DIR_FULL_PATH}/prometheus_goroutine_$(date +%Y.%m.%d_%H.%M.%S)
curl -s '127.0.0.1:'${FORWARDED_PORT}'/debug/pprof/goroutine?debug=1' \
  | tee ${DIR_FULL_PATH}/prometheus_goroutine_$(date +%Y.%m.%d_%H.%M.%S).txt
curl -s '127.0.0.1:'${FORWARDED_PORT}'/debug/pprof/profile?seconds=10' \
  --output ${DIR_FULL_PATH}/prometheus_profile_10_seconds_$(date +%Y.%m.%d_%H.%M.%S)
curl -s '127.0.0.1:'${FORWARDED_PORT}'/api/v1/query?query=topk(30%2C%20count%20by%20(__name__)(%7B__name__%3D~%22.%2B%22%7D))&timeout=5s&start='${START_DATE}'T00:00:00.000Z&end='${CURRENT_DATE}'T23:59:59.00Z&step=15s' | jq . \
  | tee ${DIR_FULL_PATH}/prometheus_topk_count_by_name_$(hostname)-$(date +%Y.%m.%d_%H.%M.%S).json
curl -s '127.0.0.1:'${FORWARDED_PORT}'/api/v1/query?query=topk(30%2C%20count%20by%20(__name__%2C%20job)(%7B__name__%3D~%22.%2B%22%7D))&timeout=5s&start='${START_DATE}'T00:00:00.000Z&end='${CURRENT_DATE}'T23:59:59.00Z&step=15s' | jq . \
  | tee ${DIR_FULL_PATH}/prometheus_topk_group_by_job_$(hostname)-$(date +%Y.%m.%d_%H.%M.%S).json
curl -s '127.0.0.1:'${FORWARDED_PORT}'/api/v1/query?query=topk(30%2C%20count%20by%20(job)(%7B__name__%3D~%22.%2B%22%7D))&timeout=5s&start='${START_DATE}'T00:00:00.000Z&end='${CURRENT_DATE}'T23:59:59.00Z&step=15s' | jq . \
  | tee ${DIR_FULL_PATH}/prometheus_topk_job_most_timeseries_$(hostname)-$(date +%Y.%m.%d_%H.%M.%S).json
set +x

ls -latrh "${DIR_FULL_PATH}"
tar -cvzf "${DIR_FULL_PATH}.tar.gz" "${DIR_FULL_PATH}/"
exit 0
Arguments:
The script takes three positional arguments.
- Port number: The local port that you forwarded to the Prometheus container (for example, 8081).
- Directory: Base directory for output files.
- Case number: Support case number, used in the name of the generated output directory and archive.
Example Usage
./prometheus_gather.sh 8081 . 1510679
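The gather script checks only how many arguments it receives, not their values, so a mistyped port surfaces later as failed curl calls. As an optional precaution, a check along the following lines could be run first. This is a sketch only: the helper name `is_valid_port` is hypothetical and is not part of the script above.

```shell
# Return success only if $1 is an integer in the valid TCP port range 1-65535.
is_valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# Usage: validate the port before calling the gather script.
#   is_valid_port 8081 && ./prometheus_gather.sh 8081 . 1510679
```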