This document shows operators how to configure Google Cloud Managed Service for Prometheus to filter and collect specific Kubernetes metrics from Google Kubernetes Engine (GKE) clusters and workloads.
You should already be familiar with the following:
Metrics collection in GKE
System components in GKE clusters emit and expose Prometheus metrics. By default, all clusters ingest system metrics that are specific to GKE. Additional built-in metrics packages are available to ingest curated collections of metrics for various components.
Independently of the built-in GKE metrics packages, you can manually configure Google Cloud Managed Service for Prometheus to ingest specific Prometheus metrics. This custom configuration works well for use cases like the following:
- Monitor for specific signals in your cluster and workloads.
- Optimize costs by disabling metrics packages that you don't need and ingesting only a subset of metrics.
- Collect metrics that are emitted by components but aren't included in a metrics package.
For operators, node metrics, which are emitted by the kubelet process on each node, might be useful for monitoring workloads and nodes for specific signals.
Metric collection custom resources
To filter and ingest a specific metric, you configure Prometheus relabeling rules in one of the following custom resources:
- PodMonitoring: configure metrics collection from Pods in a specific namespace.
- ClusterPodMonitoring: configure metrics collection from Pods in any namespace.
- ClusterNodeMonitoring: configure metrics collection from node components like the kubelet and cAdvisor.
Metric duplication
If you filter for a metric that's also included in an active metrics package for
a cluster, then the metric appears multiple times in
Google Cloud Managed Service for Prometheus. For example, if your cluster has the
KUBELET metrics package enabled and you configure the collection of the
kubelet_running_containers metric, you'll see duplicate values.
To avoid duplication, verify that the metrics that you're collecting aren't included in one of the active metrics packages. For more information about the included metrics, see the following metrics packages:
- GKE system metrics
- Control plane metrics
- Kubernetes object state metrics
- cAdvisor and kubelet metrics
- NVIDIA DCGM metrics
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
- Verify that you have the permissions that are required for this document.
- Use an existing GKE cluster or create a new cluster.
Required roles
To get the permissions that you need to collect and view metrics, ask your administrator to grant you the following IAM roles on your project:
- Create Kubernetes resources in your cluster: Kubernetes Engine Developer (roles/container.developer)
- View metrics in Monitoring: Monitoring Viewer (roles/monitoring.viewer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Enable Google Cloud Managed Service for Prometheus
Google Cloud Managed Service for Prometheus is enabled by default in all GKE clusters and can't be disabled in Autopilot clusters. If you're using an Autopilot cluster, then skip to the Configure metrics collection section.
To verify that Google Cloud Managed Service for Prometheus is enabled in your Standard cluster, select one of the following options:
Console
In the Google Cloud console, go to the Kubernetes clusters page.
In the row for the cluster that you want to check, click Actions > Edit. The Cluster details page opens.
In the Features section, check the value in the Managed Service for Prometheus field. If the value is Enabled, skip to the Configure metrics collection section. If the value is Disabled, enable Google Cloud Managed Service for Prometheus:
- Click Edit Managed Service for Prometheus. The Edit Managed Service for Prometheus dialog opens.
- Select the Enable Managed Service for Prometheus checkbox.
- Click Save changes.
gcloud
Check whether Google Cloud Managed Service for Prometheus is enabled in the cluster:
gcloud container clusters describe CLUSTER_NAME \
--location=CONTROL_PLANE_LOCATION \
--format='value(monitoringConfig.managedPrometheusConfig.enabled)'
Replace the following:
- CLUSTER_NAME: the name of your Standard cluster.
- CONTROL_PLANE_LOCATION: the region or zone of the cluster control plane, such as us-central1 or us-central1-a.
If the output is True, then Google Cloud Managed Service for Prometheus is enabled.
If the output is False, then enable Google Cloud Managed Service for Prometheus:
gcloud container clusters update CLUSTER_NAME \
--location=CONTROL_PLANE_LOCATION \
--enable-managed-prometheus
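The check-then-enable flow above can also be scripted. The following is a minimal sketch rather than an official procedure: the helper names should_enable_gmp and ensure_gmp_enabled are hypothetical, and the script uses only the two gcloud commands shown above. It assumes that any output other than True means that the feature is not enabled.

```shell
# Sketch: enable Managed Service for Prometheus only when the describe
# command doesn't report True. Helper names are illustrative assumptions.

# Returns 0 (enable needed) unless the describe output is exactly "True".
should_enable_gmp() {
  [ "$1" != "True" ]
}

ensure_gmp_enabled() {
  local cluster="$1" location="$2" enabled
  # Stop early if the describe call itself fails (for example, wrong name).
  enabled="$(gcloud container clusters describe "$cluster" \
    --location="$location" \
    --format='value(monitoringConfig.managedPrometheusConfig.enabled)')" || return 1
  if should_enable_gmp "$enabled"; then
    gcloud container clusters update "$cluster" \
      --location="$location" \
      --enable-managed-prometheus
  else
    echo "Managed Service for Prometheus is already enabled on $cluster."
  fi
}

# Usage (replace the placeholders):
# ensure_gmp_enabled CLUSTER_NAME CONTROL_PLANE_LOCATION
```

Keeping the decision in a separate function makes it possible to check the logic without calling gcloud.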
Configure metrics collection
To configure Google Cloud Managed Service for Prometheus to ingest a specific emitted metric, you need the following information about the metric:
- The metric name, such as kubelet_working_pods.
- The endpoint that exposes the metric, such as /metrics/cadvisor or /metrics.
To find metrics and configure collection, follow these steps:
Find a metric in the reference documentation for the specific component. For example, you can use the following resources to find a node metric:
Check the active metrics packages for the cluster:
gcloud container clusters describe CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --flatten=monitoringConfig.componentConfig \
    --format='value(enableComponents)'
The output is similar to the following:
SYSTEM_COMPONENTS;STORAGE;POD;DEPLOYMENT;STATEFULSET;DAEMONSET;HPA;JOBSET;CADVISOR;KUBELET;DCGM
Check whether the metric that you want to collect is in one of these active packages by using the information in Available metrics. If your metric is included in an active metrics package, duplicate metrics might appear in Monitoring. If your metric isn't in an active metrics package, proceed to the next step.
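To make the package check repeatable, a small helper can test whether a given package name appears in the semicolon-separated output. This is a sketch under assumptions: the helper name package_active is hypothetical, and the input is assumed to have the same format as the sample output above.

```shell
# Returns 0 if package name $2 appears as a whole entry in the
# semicolon-separated list $1, and 1 otherwise.
package_active() {
  case ";$1;" in
    *";$2;"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample value copied from the example output; in practice, capture the
# output of the gcloud describe command instead.
components='SYSTEM_COMPONENTS;STORAGE;POD;DEPLOYMENT;STATEFULSET;DAEMONSET;HPA;JOBSET;CADVISOR;KUBELET;DCGM'

if package_active "$components" KUBELET; then
  echo "KUBELET package is active: kubelet metrics that you collect manually might be duplicated."
fi
```

Wrapping both strings in semicolons forces a whole-entry match, so a query for POD doesn't match a longer entry that merely contains POD.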
Optional: To verify that the metric is available at the metrics endpoint, query the endpoint by using the kubectl get --raw command. For example, the following command queries the /metrics endpoint for the kubelet_working_pods metric:
kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/metrics" | grep "kubelet_working_pods"
Replace NODE_NAME with the name of the node to query.
The output is similar to the following, which indicates that the metric is available:
# HELP kubelet_working_pods [ALPHA] Number of pods the kubelet is actually running, broken down by lifecycle phase, whether the pod is desired, orphaned, or runtime only (also orphaned), and whether the pod is static. An orphaned pod has been removed from local configuration or force deleted in the API and consumes resources that are not otherwise visible.
# TYPE kubelet_working_pods gauge
kubelet_working_pods{config="desired",lifecycle="sync",static=""} 8
kubelet_working_pods{config="desired",lifecycle="sync",static="true"} 1
kubelet_working_pods{config="desired",lifecycle="terminated",static=""} 0
kubelet_working_pods{config="desired",lifecycle="terminated",static="true"} 0
kubelet_working_pods{config="desired",lifecycle="terminating",static=""} 0
kubelet_working_pods{config="desired",lifecycle="terminating",static="true"} 0
After you identify the metric that you want to collect, use a Prometheus relabeling rule in a PodMonitoring, ClusterPodMonitoring, or ClusterNodeMonitoring custom resource to filter for that metric. The custom resource that you use depends on the properties of the metric, as described in the Metric collection custom resources section.
For example, review the following ClusterNodeMonitoring manifest:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterNodeMonitoring
metadata:
  name: kubelet-pod-counts
spec:
  selector:
    matchLabels: {}
  endpoints:
  - path: "/metrics"
    scheme: "https"
    interval: "30s"
    tls:
      insecureSkipVerify: true
    # metricRelabeling specifies the metrics to ingest. The metrics must be
    # available at the specified endpoint.
    metricRelabeling:
    - action: keep
      sourceLabels: [__name__]
      regex: kubelet_(active|working)_pods
This manifest has the following properties:
- The selector.matchLabels field is an empty set ({}), which matches all nodes.
- The endpoints.path field identifies the metrics endpoint to scrape.
- The metricRelabeling field identifies the specific metrics to collect. In this example, Google Cloud Managed Service for Prometheus collects metrics that have the name kubelet_active_pods and kubelet_working_pods from the endpoint.
When you deploy the custom resource, Google Cloud Managed Service for Prometheus ingests metrics that match your specified configuration. You can view and query these metrics by using the same methods that you use for other metrics, such as Cloud Monitoring.
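Prometheus anchors relabeling regular expressions at both ends, so a keep rule retains a series only when the whole metric name matches the pattern. Before you deploy a manifest like the one above, you can preview which metric names at an endpoint a rule would retain. The following is a minimal sketch: preview_keep is a hypothetical helper, and it assumes that the pattern uses only syntax that behaves the same in Prometheus regular expressions and in grep -E, which holds for simple alternations like the example.

```shell
# Reads Prometheus exposition text on stdin and prints the unique metric
# names that an anchored keep regex ($1) would retain.
preview_keep() {
  grep -v '^#' |          # drop HELP and TYPE comment lines
    sed 's/[{ ].*$//' |   # strip labels and values, leaving the metric name
    sort -u |
    grep -Ex -- "$1"      # -x requires the whole name to match
}

# Usage against a live node (replace NODE_NAME):
# kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/metrics" \
#   | preview_keep 'kubelet_(active|working)_pods'
```

If the helper prints nothing, the rule would drop every series at that endpoint, which usually points to a typo in the pattern or to the wrong endpoint path.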
Verify metrics collection for an example workload
The following steps show you how to configure Google Cloud Managed Service for Prometheus to collect pressure stall information (PSI) metrics from cAdvisor. In these example steps, you create a ClusterNodeMonitoring (because cAdvisor is a node component), create a test Pod that triggers PSI metrics, and then verify that the metrics appear in Monitoring. To collect the PSI metrics in this example, your cluster must run GKE version 1.35 or later.
Create a test Pod that triggers PSI metrics from cAdvisor:
Save the following Pod as psi-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: cpu-pressure-pod
spec:
  restartPolicy: Never
  containers:
  - name: cpu-stress
    image: registry.k8s.io/e2e-test-images/agnhost:2.47
    args:
    - "stress"
    - "--cpus"
    - "1"
    resources:
      limits:
        cpu: "500m"
      requests:
        cpu: "500m"
Create the Pod:
kubectl apply -f psi-pod.yaml
Verify that PSI metrics aren't already in Monitoring:
In the Google Cloud console, go to the Metrics explorer page.
In the Queries section, in the Metric field, try to select the prometheus/container_pressure_cpu_stalled_seconds_total/counter metric. If you don't see a search result, PSI metrics aren't being collected.
Create a ClusterNodeMonitoring to filter for PSI metrics:
Save the following ClusterNodeMonitoring manifest as psi-prometheus-collector.yaml:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterNodeMonitoring
metadata:
  name: psi-collector
spec:
  selector:
    matchLabels: {}
  endpoints:
  - path: /metrics/cadvisor
    scheme: https
    interval: 30s
    tls:
      insecureSkipVerify: true
    metricRelabeling:
    - action: keep
      sourceLabels: [__name__]
      regex: container_pressure_(cpu|memory|io)_(waiting|stalled)_seconds_total
Create the ClusterNodeMonitoring:
kubectl apply -f psi-prometheus-collector.yaml
The metric might take up to 10 minutes to appear in Monitoring.
In the Google Cloud console, verify that the PSI metrics appear in Monitoring:
Go to the Metrics explorer page.
In the Queries section, in the Metric field, select the prometheus/container_pressure_cpu_stalled_seconds_total/counter metric.
Optional: In the Filter drop-down menu, select your cluster.
The Results section shows a chart with the collected PSI metric.
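If the chart stays empty, it can help to confirm that the node itself exposes PSI series, by following the same kubectl get --raw pattern used earlier in this document. The following is a sketch, not part of the documented procedure: psi_series is a hypothetical helper, and the example assumes that the cAdvisor endpoint is reachable through the API server node proxy at the path shown in the usage comment.

```shell
# Filters exposition text on stdin down to PSI series, using the same
# name pattern as the ClusterNodeMonitoring keep rule. The trailing
# [{ ] ensures that the metric name ends where the pattern ends.
psi_series() {
  grep -E '^container_pressure_(cpu|memory|io)_(waiting|stalled)_seconds_total[{ ]'
}

# Usage (replace NODE_NAME with the node that runs cpu-pressure-pod):
# kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/metrics/cadvisor" | psi_series
```

If the command prints PSI series but Metrics explorer shows nothing after the ingestion delay, the ClusterNodeMonitoring configuration is the more likely place to look. If the command prints nothing, the node isn't emitting PSI metrics.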
What's next
- Explore the available metrics packages in GKE
- Configure automatic application monitoring for workloads
- Create and manage custom Monitoring dashboards