Bug 1955445
| Field | Value |
|---|---|
| Summary | Drop crio image metrics with high cardinality |
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.6 |
| Target Milestone | --- |
| Target Release | 4.8.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Simon Pasquier <spasquie> |
| Assignee | Simon Pasquier <spasquie> |
| QA Contact | Junqi Zhao <juzhao> |
| CC | alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa |
| Doc Type | No Doc Update |
| Story Points | --- |
| Clone Of | |
| | 1955449 |
| Last Closed | 2021-07-27 23:05:13 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1951052, 1955449 |
Checked with 4.8.0-0.nightly-2021-05-05-030749: the container_runtime_crio_image_pulls_by_name_skipped metric is still present.

    # token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
    # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep container_runtime_crio_image
        "container_runtime_crio_image_pulls_by_name_skipped",

    # oc -n openshift-monitoring get ServiceMonitor kubelet -oyaml | grep "container_runtime_crio_image_pulls_by_digest" -C3
        interval: 30s
        metricRelabelings:
        - action: drop
          regex: container_runtime_crio_image_pulls_by_digest|container_runtime_crio_image_layer_reuse|container_runtime_crio_image_pulls_by_name|container_runtime_crio_image_pulls_successes
          sourceLabels:
          - __name__
        port: https-metrics

    # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=container_runtime_crio_image_pulls_by_name_skipped' | jq
    {
      "status": "success",
      "data": {
        "resultType": "vector",
        "result": [
          {
            "metric": {
              "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
              "endpoint": "crio",
              "instance": "10.0.173.144:9537",
              "job": "crio",
              "name": "registry.redhat.io/redhat/certified-operator-index:v4.8",
              "namespace": "kube-system",
              "node": "ip-10-0-173-144.us-east-2.compute.internal",
              "service": "kubelet"
            },
            "value": [ 1620272999.899, "64270768884" ]
          },
          {
            "metric": {
              "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
              "endpoint": "crio",
              "instance": "10.0.173.144:9537",
              "job": "crio",
              "name": "registry.redhat.io/redhat/community-operator-index:v4.8",
              "namespace": "kube-system",
              "node": "ip-10-0-173-144.us-east-2.compute.internal",
              "service": "kubelet"
            },
            "value": [ 1620272999.899, "54917569163" ]
          },
          {
            "metric": {
              "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
              "endpoint": "crio",
              "instance": "10.0.173.144:9537",
              "job": "crio",
              "name": "registry.redhat.io/redhat/redhat-marketplace-index:v4.8",
              "namespace": "kube-system",
              "node": "ip-10-0-173-144.us-east-2.compute.internal",
              "service": "kubelet"
            },
            "value": [ 1620272999.899, "57770318501" ]
          },
          {
            "metric": {
              "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
              "endpoint": "crio",
              "instance": "10.0.173.144:9537",
              "job": "crio",
              "name": "registry.redhat.io/redhat/redhat-operator-index:v4.8",
              "namespace": "kube-system",
              "node": "ip-10-0-173-144.us-east-2.compute.internal",
              "service": "kubelet"
            },
            "value": [ 1620272999.899, "75398626548" ]
          },
          {
            "metric": {
              "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
              "endpoint": "crio",
              "instance": "10.0.216.91:9537",
              "job": "crio",
              "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1c2098ca46c151a2e4403d3dfc144fe7ed73d9ded76d1061d646296ae8bf8287",
              "namespace": "kube-system",
              "node": "ip-10-0-216-91.us-east-2.compute.internal",
              "service": "kubelet"
            },
            "value": [ 1620272999.899, "249164649" ]
          },
          {
            "metric": {
              "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
              "endpoint": "crio",
              "instance": "10.0.216.91:9537",
              "job": "crio",
              "name": "registry.redhat.io/redhat/certified-operator-index:v4.8",
              "namespace": "kube-system",
              "node": "ip-10-0-216-91.us-east-2.compute.internal",
              "service": "kubelet"
            },
            "value": [ 1620272999.899, "2063096475" ]
          },
          {
            "metric": {
              "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
              "endpoint": "crio",
              "instance": "10.0.216.91:9537",
              "job": "crio",
              "name": "registry.redhat.io/redhat/community-operator-index:v4.8",
              "namespace": "kube-system",
              "node": "ip-10-0-216-91.us-east-2.compute.internal",
              "service": "kubelet"
            },
            "value": [ 1620272999.899, "1565126265" ]
          }
        ]
      }
    }

Tested with 4.8.0-0.nightly-2021-05-09-105430: no container_runtime_crio_image_* metrics are present now.

    # token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
    # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep container_runtime_crio_image
    no result

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
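The first verification above found container_runtime_crio_image_pulls_by_name_skipped even though the ServiceMonitor already carried a drop rule. One likely explanation is that Prometheus anchors relabeling regexes at both ends, so an alternative listing container_runtime_crio_image_pulls_by_name does not cover a longer name that merely starts with it. The sketch below approximates that behaviour with Python's `re.fullmatch`; it is an illustration, not the Prometheus implementation. `PREFIX_REGEX` is a hypothetical pattern introduced here for comparison; the bug report does not say how the rule was actually changed.

```python
import re

# Regex copied from the kubelet ServiceMonitor output in the first verification comment.
DROP_REGEX = (
    "container_runtime_crio_image_pulls_by_digest"
    "|container_runtime_crio_image_layer_reuse"
    "|container_runtime_crio_image_pulls_by_name"
    "|container_runtime_crio_image_pulls_successes"
)

# Hypothetical prefix pattern (an assumption, not taken from the bug report).
PREFIX_REGEX = "container_runtime_crio_image_.*"


def is_dropped(metric_name, regex=DROP_REGEX):
    """Approximate an anchored 'drop' relabel rule on the __name__ label."""
    # fullmatch requires the whole metric name to match, like an anchored regex.
    return re.fullmatch(regex, metric_name) is not None


if __name__ == "__main__":
    for name in (
        "container_runtime_crio_image_pulls_by_name",
        "container_runtime_crio_image_pulls_by_name_skipped",
    ):
        print(name, is_dropped(name), is_dropped(name, PREFIX_REGEX))
```

With the explicit list, only the exact names are dropped; the `_skipped` variant slips through, which matches what the first verification observed.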
Description of problem:
The CRI-O service exposes metrics about container images, which can have very high cardinality on clusters running many different images (e.g. CI clusters). These metrics aren't actually used anywhere (neither in rules nor in dashboards), and storing them in Prometheus increases memory usage considerably (on some clusters, they account for more than 50% of the total series).

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Always

Steps to Reproduce:
1. Open the Prometheus UI, go to the Status > TSDB status page and look at the "Top 10 series count by metric names" section.

Actual results:
The following metrics (or at least some of them) are listed:
- container_runtime_crio_image_pulls_by_digest
- container_runtime_crio_image_layer_reuse
- container_runtime_crio_image_pulls_by_name
- container_runtime_crio_image_pulls_successes

Expected results:
These metrics aren't present.

Additional info:
The following query returns the number of series per CRI-O metric:

    sort_desc(count by(__name__) ({job="crio"}))

It can be used to verify that the container_runtime_crio_image_* metrics aren't present anymore.
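For checking a saved query response offline, the per-metric series count can be approximated in a few lines of Python. This is a minimal sketch that assumes the input is the JSON body of an instant query for `{job="crio"}` in the standard Prometheus HTTP API shape (`data.result[].metric`). The helper names and the sample payload are hypothetical; `example_other_metric` is a placeholder, not a real CRI-O metric.

```python
from collections import Counter

IMAGE_PREFIX = "container_runtime_crio_image_"


def series_per_metric(response):
    """Return (metric name, series count) pairs, highest count first."""
    names = (s["metric"].get("__name__", "") for s in response["data"]["result"])
    return sorted(Counter(names).items(), key=lambda item: item[1], reverse=True)


def leftover_image_metrics(response):
    """List image metric names that are still stored (empty once they are dropped)."""
    return sorted(
        name for name, _ in series_per_metric(response) if name.startswith(IMAGE_PREFIX)
    )


# Abbreviated, illustrative payload; label sets are trimmed for readability.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "container_runtime_crio_image_pulls_by_name_skipped",
                        "instance": "10.0.173.144:9537"}, "value": [0, "1"]},
            {"metric": {"__name__": "container_runtime_crio_image_pulls_by_name_skipped",
                        "instance": "10.0.216.91:9537"}, "value": [0, "1"]},
            {"metric": {"__name__": "example_other_metric",
                        "instance": "10.0.216.91:9537"}, "value": [0, "1"]},
        ],
    },
}

if __name__ == "__main__":
    print(series_per_metric(SAMPLE_RESPONSE))
    print(leftover_image_metrics(SAMPLE_RESPONSE))
```

An empty list from `leftover_image_metrics` corresponds to the expected result in this report.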