1955445 – Drop crio image metrics with high cardinality

Summary: Drop crio image metrics with high cardinality

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Simon Pasquier
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1951052 1955449

Reported:	2021-04-30 07:19 UTC by Simon Pasquier
Modified:	2021-07-27 23:06 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1955449
Environment:
Last Closed:	2021-07-27 23:05:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1133	None	open	Bug 1955445: fix dropped crio metrics	2021-04-30 07:22:11 UTC
Github	openshift cluster-monitoring-operator pull 1148	None	open	Bug 1955445: drop more CRI-O metrics	2021-05-06 09:00:53 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 23:06:02 UTC

Description Simon Pasquier 2021-04-30 07:19:39 UTC

Description of problem:

The CRI-O service exposes metrics about container images which can have very high cardinality for clusters running many different images (e.g. CI clusters).

These metrics aren't actually used anywhere (neither in rules nor in dashboards), and storing them in Prometheus significantly increases memory usage (on some clusters, they account for more than 50% of the total series).

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Always

Steps to Reproduce:

1. Open the Prometheus UI, go to the Status > TSDB status page and look at the "Top 10 series count by metric names" section.

Actual results:

At least some of the following metrics are listed:
- container_runtime_crio_image_pulls_by_digest
- container_runtime_crio_image_layer_reuse
- container_runtime_crio_image_pulls_by_name
- container_runtime_crio_image_pulls_successes

Expected results:
These metrics aren't present.

Additional info:

The following query returns the number of series per CRI-O metric:
sort_desc(count by(__name__) ({job="crio"}))

It can be used to verify that the container_runtime_crio_image_* metrics aren't present anymore.
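
As a rough illustration of that check, the sketch below parses the JSON returned by the Prometheus instant-query API for the query above and lists any container_runtime_crio_image_* metrics that remain. The function names, the sample response, and its series counts are made up for the example; only the metric prefix and the query come from this report.

```python
import json
from typing import Dict, List

IMAGE_METRIC_PREFIX = "container_runtime_crio_image_"


def series_per_metric(response: dict) -> Dict[str, int]:
    """Map metric name -> series count from an instant-query response.

    Expects the response of:
        sort_desc(count by(__name__) ({job="crio"}))
    """
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % (response,))
    counts = {}
    for sample in response["data"]["result"]:
        name = sample["metric"]["__name__"]
        # Sample values are [timestamp, "<number as string>"].
        counts[name] = int(float(sample["value"][1]))
    return counts


def remaining_image_metrics(response: dict) -> List[str]:
    """Return the image metrics still stored; an empty list means success."""
    return sorted(
        name for name in series_per_metric(response)
        if name.startswith(IMAGE_METRIC_PREFIX)
    )


# Hypothetical response: the counts and the second metric name are invented.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "container_runtime_crio_image_pulls_by_name"},
       "value": [1620272999.899, "1200"]},
      {"metric": {"__name__": "example_other_crio_metric"},
       "value": [1620272999.899, "35"]}
    ]
  }
}
""")

if __name__ == "__main__":
    print(remaining_image_metrics(SAMPLE))
```

On a fixed cluster, remaining_image_metrics would be expected to return an empty list.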

Comment 2 Junqi Zhao 2021-05-06 03:50:45 UTC

Checked with 4.8.0-0.nightly-2021-05-05-030749: the container_runtime_crio_image_pulls_by_name_skipped metric is still present.
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep container_runtime_crio_image
    "container_runtime_crio_image_pulls_by_name_skipped",

#  oc -n openshift-monitoring get ServiceMonitor kubelet -oyaml | grep "container_runtime_crio_image_pulls_by_digest" -C3
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: container_runtime_crio_image_pulls_by_digest|container_runtime_crio_image_layer_reuse|container_runtime_crio_image_pulls_by_name|container_runtime_crio_image_pulls_successes
      sourceLabels:
      - __name__
    port: https-metrics
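
A likely reason the _skipped metric survives this rule is that Prometheus anchors relabeling regexes at both ends, so container_runtime_crio_image_pulls_by_name does not match container_runtime_crio_image_pulls_by_name_skipped. The sketch below emulates that behaviour with Python's re.fullmatch (Prometheus itself uses RE2, which behaves the same for patterns this simple). The helper name and the broader PREFIX_REGEX are hypothetical; this report does not state which pattern the follow-up pull request uses.

```python
import re

# The drop regex from the ServiceMonitor shown above.
DROP_REGEX = (
    "container_runtime_crio_image_pulls_by_digest"
    "|container_runtime_crio_image_layer_reuse"
    "|container_runtime_crio_image_pulls_by_name"
    "|container_runtime_crio_image_pulls_successes"
)

# Hypothetical broader pattern covering every metric with the image prefix.
PREFIX_REGEX = "container_runtime_crio_image_.*"


def is_dropped(metric_name: str, regex: str = DROP_REGEX) -> bool:
    """Emulate a 'drop' relabel action on __name__.

    re.fullmatch mirrors the implicit ^(?:...)$ anchoring that
    Prometheus applies to relabeling regexes.
    """
    return re.fullmatch(regex, metric_name) is not None


if __name__ == "__main__":
    name = "container_runtime_crio_image_pulls_by_name_skipped"
    print(is_dropped(name))                # False: the suffix defeats the anchored match
    print(is_dropped(name, PREFIX_REGEX))  # True: the prefix pattern covers it
```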


# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=container_runtime_crio_image_pulls_by_name_skipped' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
          "endpoint": "crio",
          "instance": "10.0.173.144:9537",
          "job": "crio",
          "name": "registry.redhat.io/redhat/certified-operator-index:v4.8",
          "namespace": "kube-system",
          "node": "ip-10-0-173-144.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1620272999.899,
          "64270768884"
        ]
      },
      {
        "metric": {
          "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
          "endpoint": "crio",
          "instance": "10.0.173.144:9537",
          "job": "crio",
          "name": "registry.redhat.io/redhat/community-operator-index:v4.8",
          "namespace": "kube-system",
          "node": "ip-10-0-173-144.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1620272999.899,
          "54917569163"
        ]
      },
      {
        "metric": {
          "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
          "endpoint": "crio",
          "instance": "10.0.173.144:9537",
          "job": "crio",
          "name": "registry.redhat.io/redhat/redhat-marketplace-index:v4.8",
          "namespace": "kube-system",
          "node": "ip-10-0-173-144.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1620272999.899,
          "57770318501"
        ]
      },
      {
        "metric": {
          "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
          "endpoint": "crio",
          "instance": "10.0.173.144:9537",
          "job": "crio",
          "name": "registry.redhat.io/redhat/redhat-operator-index:v4.8",
          "namespace": "kube-system",
          "node": "ip-10-0-173-144.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1620272999.899,
          "75398626548"
        ]
      },
      {
        "metric": {
          "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
          "endpoint": "crio",
          "instance": "10.0.216.91:9537",
          "job": "crio",
          "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1c2098ca46c151a2e4403d3dfc144fe7ed73d9ded76d1061d646296ae8bf8287",
          "namespace": "kube-system",
          "node": "ip-10-0-216-91.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1620272999.899,
          "249164649"
        ]
      },
      {
        "metric": {
          "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
          "endpoint": "crio",
          "instance": "10.0.216.91:9537",
          "job": "crio",
          "name": "registry.redhat.io/redhat/certified-operator-index:v4.8",
          "namespace": "kube-system",
          "node": "ip-10-0-216-91.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1620272999.899,
          "2063096475"
        ]
      },
      {
        "metric": {
          "__name__": "container_runtime_crio_image_pulls_by_name_skipped",
          "endpoint": "crio",
          "instance": "10.0.216.91:9537",
          "job": "crio",
          "name": "registry.redhat.io/redhat/community-operator-index:v4.8",
          "namespace": "kube-system",
          "node": "ip-10-0-216-91.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1620272999.899,
          "1565126265"
        ]
      }
    ]
  }
}
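
The result above also shows where the cardinality comes from: the same image name appears once per node, so each (node, image) pair is its own series. A minimal sketch of that count follows; it takes the "result" array of such a response, and the function name and the trimmed three-series sample (labels copied from the output above) are chosen only for illustration.

```python
from typing import Dict, List


def cardinality_summary(result: List[dict]) -> Dict[str, int]:
    """Count series, distinct image names, and distinct nodes in a query result."""
    images = {sample["metric"]["name"] for sample in result}
    nodes = {sample["metric"]["node"] for sample in result}
    return {"series": len(result), "images": len(images), "nodes": len(nodes)}


# Trimmed sample: three of the series above, keeping only the labels used here.
SAMPLE_RESULT = [
    {"metric": {"name": "registry.redhat.io/redhat/certified-operator-index:v4.8",
                "node": "ip-10-0-173-144.us-east-2.compute.internal"}},
    {"metric": {"name": "registry.redhat.io/redhat/community-operator-index:v4.8",
                "node": "ip-10-0-173-144.us-east-2.compute.internal"}},
    {"metric": {"name": "registry.redhat.io/redhat/certified-operator-index:v4.8",
                "node": "ip-10-0-216-91.us-east-2.compute.internal"}},
]

if __name__ == "__main__":
    print(cardinality_summary(SAMPLE_RESULT))
```

In the worst case the series count approaches the number of nodes multiplied by the number of distinct images, which is consistent with the impact described for CI clusters.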

Comment 4 Junqi Zhao 2021-05-10 02:18:08 UTC

Tested with 4.8.0-0.nightly-2021-05-09-105430: no container_runtime_crio_image_* metrics are present anymore.
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep container_runtime_crio_image
no result

Comment 7 errata-xmlrpc 2021-07-27 23:05:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
