Bug 1858249 - Metrics produce high unbound cardinality
Summary: Metrics produce high unbound cardinality
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Lukas Vlcek
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1859991
 
Reported: 2020-07-17 11:15 UTC by Rick Rackow
Modified: 2020-10-27 16:16 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: High-cardinality index-level metrics from Elasticsearch were causing issues with Prometheus.
Consequence: The retention time had to be reduced in Prometheus to keep it functional.
Fix: Index-level metrics are turned off for now.
Result: High-cardinality Elasticsearch index-level metrics are no longer stored in Prometheus.
Clone Of:
: 1859991 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:15:30 UTC
Target Upstream Version:
Embargoed:


Attachments
prometheus graph (234.00 KB, image/png)
2020-07-17 11:15 UTC, Rick Rackow


Links
GitHub openshift/elasticsearch-operator pull 427 (closed): Bug 1858249: Disable indices level metrics in Prometheus exporter output. Last updated 2021-02-12 16:35:11 UTC.
Red Hat Product Errata RHBA-2020:4196. Last updated 2020-10-27 16:16:01 UTC.

Description Rick Rackow 2020-07-17 11:15:33 UTC
Created attachment 1701539 [details]
prometheus graph

Description of problem:
The cluster logging stack is producing twice as many metrics as the kubelet on OSD clusters, which is not expected and causes the Prometheus volume to fill up.

Version-Release number of selected component (if applicable):
4.3.25

How reproducible:
reliably

Steps to Reproduce:
1. Install cluster logging.
2. Use Prometheus to check the top 10 metric producers: topk(10, count by (job) ({__name__=~".+"})) (see the sketch below).
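
For reference, a minimal sketch of running that query against the in-cluster Prometheus API from a workstation. Assumptions on my side (none of this comes from the report itself): port-forwarding to the prometheus-k8s-0 pod in openshift-monitoring works on the cluster, and curl/jq are available locally.

  # Forward the Prometheus API port from the monitoring pod (assumed pod name/namespace)
  oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &

  # Count the series per job and keep the 10 biggest producers
  curl -s http://localhost:9090/api/v1/query \
    --data-urlencode 'query=topk(10, count by (job) ({__name__=~".+"}))' \
    | jq '.data.result'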


Additional info:
I attached the query result from one customer cluster.

This is high severity as it is breaking the cluster monitoring stack on customer clusters.

Comment 1 Lili Cosic 2020-07-17 11:19:57 UTC
Just my two cents: this looks to me like high unbounded-cardinality metrics, and if that's the case then, from the monitoring team's point of view, this is considered a high-severity Bugzilla as well. If this was not fixed in later versions, we should fix it in all of them as well.

Comment 3 Lukas Vlcek 2020-07-17 19:13:35 UTC
To fix this we can disable metrics at the index level. Right now these metrics are turned on by default, and they should be turned off. If any customer needs these metrics, they can be given a workaround to turn them back on (a rough sketch follows below).
With 4.5 we switched to a different data model in logging that results in far fewer indices, so even the Prometheus metrics will have lower cardinality, and we can re-evaluate for 4.5 and later how many index-level metrics we can expose without blowing up Prometheus storage.
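
For the record, a rough sketch of what that per-customer workaround could look like. Assumptions on my side: the bundled Prometheus exporter plugin gates its index-level metrics behind a dynamic cluster setting named prometheus.indices, the es_util helper in the elasticsearch container wraps curl with the right certificates, and the pod selector below matches the deployment; verify all three against the operator/exporter version actually running.

  # Pick one Elasticsearch pod (hypothetical selector, adjust to the cluster at hand)
  ES_POD=$(oc -n openshift-logging get pod -l component=elasticsearch -o name | head -n1)

  # Hypothetical: re-enable index-level metrics for a customer who needs them
  # ("false" keeps them disabled, which is the default after this fix)
  oc -n openshift-logging exec -c elasticsearch "${ES_POD}" -- \
    es_util --query=_cluster/settings -X PUT -H 'Content-Type: application/json' \
    -d '{"persistent": {"prometheus.indices": "true"}}'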

Comment 8 Anping Li 2020-07-24 14:41:45 UTC
Verified using the CI images.

Comment 9 Anping Li 2020-07-24 15:02:54 UTC
Please ignore Comment 8; moving back to ON_QA.

Comment 10 Anping Li 2020-07-27 11:05:08 UTC
Verified using the latest code. No index-level metrics are created (a quick verification query is sketched below).
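
For anyone re-checking this later, a quick way to confirm from Prometheus, reusing the port-forward sketch from the description and assuming the exporter's index-level series use the es_index_ prefix (both are assumptions on my side):

  # Should return an empty result set once index-level metrics are disabled
  curl -s http://localhost:9090/api/v1/query \
    --data-urlencode 'query=count({__name__=~"es_index_.*"})' \
    | jq '.data.result'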

Comment 11 Lukas Vlcek 2020-07-27 14:58:47 UTC
Anli, I believe we need to verify blocking BZs first.

Maybe I cloned the BZs in the wrong order, but there is a series of BZs to bring this change all the way down from 4.6 (master) to 4.3.

BZs:
1) Master: https://bugzilla.redhat.com/show_bug.cgi?id=1858249 (Verified)
2) 4.5: https://bugzilla.redhat.com/show_bug.cgi?id=1859991    (New)   
3) 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1860156    (New)
4) 4.3: https://bugzilla.redhat.com/show_bug.cgi?id=1860164    (New)

The chain can be seen here:
https://bugzilla.redhat.com/buglist.cgi?bug_id=1858249&bug_id_type=andblocked&format=tvp&list_id=11246685&tvp_dir=blocked

Relevant PRs:
1) https://github.com/openshift/elasticsearch-operator/pull/427 (Merged)
2) https://github.com/openshift/elasticsearch-operator/pull/431 (Open)
3) https://github.com/openshift/elasticsearch-operator/pull/432 (Open)
4) https://github.com/openshift/elasticsearch-operator/pull/433 (Open)

Unfortunately, when I originally created the first PR (https://github.com/openshift/elasticsearch-operator/pull/427) I left an incorrect BZ number in the title, which I fixed later, so I am not sure what is actually holding the whole chain back from merging now.

Do you think you can help with BZs 2), 3), and 4)? (See the list above for the BZ numbers.)

Comment 13 errata-xmlrpc 2020-10-27 16:15:30 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

