Bug 1955454

Summary: [4.6] Drop crio image metrics with high cardinality
Product: OpenShift Container Platform Reporter: Simon Pasquier <spasquie>
Component: Monitoring Assignee: Prashant Balachandran <pnair>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: high Docs Contact:
Priority: high    
Version: 4.6 CC: alegrand, anpicker, erooth, juzhao, kakkoyun, lcosic, pkrupa
Target Milestone: --- Keywords: EasyFix
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1955449 Environment:
Last Closed: 2021-06-01 12:10:08 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1955449    
Bug Blocks: 1955452    

Description Simon Pasquier 2021-04-30 07:29:42 UTC
+++ This bug was initially created as a clone of Bug #1955449 +++

+++ This bug was initially created as a clone of Bug #1955445 +++

Description of problem:

The CRI-O service exposes metrics about container images which can have very high cardinality for clusters running many different images (e.g. CI clusters).

These metrics aren't actually used anywhere (neither in rules nor in dashboards), and storing them in Prometheus significantly increases memory usage (on some clusters, they account for more than 50% of the total series).

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Always

Steps to Reproduce:

1. Open the Prometheus UI, go to the Status > TSDB status page and look at the "Top 10 series count by metric names" section.

Actual results:

The following metrics (or at least some of them) are listed:
- container_runtime_crio_image_pulls_by_digest
- container_runtime_crio_image_layer_reuse
- container_runtime_crio_image_pulls_by_name
- container_runtime_crio_image_pulls_successes

Expected results:
These metrics aren't present.

Additional info:

The following query returns the number of series per CRI-O metric:
sort_desc(count by(__name__) ({job="crio"}))

It can be used to verify that the container_runtime_crio_image_* metrics aren't present anymore.
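
As a rough illustration of that check, the Python sketch below parses the JSON body that the Prometheus HTTP API returns when the query above is run as an instant query, and lists any remaining container_runtime_crio_image_* metrics with their series counts. The function name and the sample payload (including its counts) are made up for illustration; only the response layout follows the Prometheus API.

```python
import json

PREFIX = "container_runtime_crio_image_"


def crio_image_series(response_body: str) -> dict:
    """Map each container_runtime_crio_image_* metric to its series count.

    Expects the JSON body of an instant query for
    count by(__name__) ({job="crio"}).
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    counts = {}
    for sample in payload["data"]["result"]:
        name = sample["metric"].get("__name__", "")
        if name.startswith(PREFIX):
            # An instant-vector sample is [timestamp, "value-as-string"].
            counts[name] = int(float(sample["value"][1]))
    return counts


# Illustrative payload only: the counts are invented, not taken from a cluster.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "container_runtime_crio_image_pulls_by_digest"},
             "value": [1619767782, "1200"]},
            {"metric": {"__name__": "up"},
             "value": [1619767782, "6"]},
        ],
    },
})

if __name__ == "__main__":
    # An empty dict would mean no image metrics are left for job="crio".
    print(crio_image_series(SAMPLE))
```

An empty result from this helper would correspond to the expected state after the fix.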

Comment 1 Junqi Zhao 2021-05-19 09:16:58 UTC
reproduced with 4.6.0-0.nightly-2021-05-15-131411
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep container_runtime_crio_image
    "container_runtime_crio_image_layer_reuse",
    "container_runtime_crio_image_pulls_by_digest",
    "container_runtime_crio_image_pulls_by_name",
    "container_runtime_crio_image_pulls_failures",
    "container_runtime_crio_image_pulls_successes",

Comment 2 Junqi Zhao 2021-05-19 09:17:26 UTC
Tested with the PR, which is not merged yet: no container_runtime_crio_image_* metrics are present now.
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep container_runtime_crio_image
no result
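
The grep check used in comments 1 and 2 could also be scripted. The following is a minimal Python sketch, assuming the JSON body returned by the /api/v1/label/__name__/values endpoint has already been captured as a string. The function name is hypothetical, and the sample list reuses the metric names reported in comment 1 plus the generic `up` metric.

```python
import json

PREFIX = "container_runtime_crio_image_"


def leftover_crio_image_metrics(label_values_body: str) -> list:
    """Return the sorted metric names that still start with PREFIX."""
    payload = json.loads(label_values_body)
    # The label-values endpoint returns a flat list of strings under "data".
    return sorted(name for name in payload.get("data", [])
                  if name.startswith(PREFIX))


# Names as listed in comment 1 (before the fix), plus one unrelated metric.
BEFORE_FIX = json.dumps({
    "status": "success",
    "data": [
        "container_runtime_crio_image_layer_reuse",
        "container_runtime_crio_image_pulls_by_digest",
        "container_runtime_crio_image_pulls_by_name",
        "container_runtime_crio_image_pulls_failures",
        "container_runtime_crio_image_pulls_successes",
        "up",
    ],
})

# Illustrative "after" payload: the image metrics are gone.
AFTER_FIX = json.dumps({"status": "success", "data": ["up"]})

if __name__ == "__main__":
    print(leftover_crio_image_metrics(BEFORE_FIX))
    print(leftover_crio_image_metrics(AFTER_FIX))
```

An empty list plays the role of the "no result" outcome above.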

Comment 4 Junqi Zhao 2021-05-25 05:48:53 UTC
The fix is in 4.6.0-0.nightly-2021-05-24-230019 and later builds; moving to VERIFIED.

Comment 7 errata-xmlrpc 2021-06-01 12:10:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.31 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2100