Bug 1858249

Summary: Metrics produce high unbound cardinality
Product: OpenShift Container Platform
Reporter: Rick Rackow <rrackow>
Component: Logging
Assignee: Lukas Vlcek <lvlcek>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.3.z
CC: aos-bugs, cshereme, lcosic, lvlcek
Target Milestone: ---
Keywords: ServiceDeliveryBlocker
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: High-cardinality index-level metrics from Elasticsearch were causing issues with Prometheus. Consequence: The retention time in Prometheus had to be reduced to keep it functional. Fix: The index-level metrics are turned off for now. Result: High-cardinality Elasticsearch index-level metrics are no longer stored in Prometheus.
Story Points: ---
Clone Of:
Clones: 1859991 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:15:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1859991
Attachments: prometheus graph

Description Rick Rackow 2020-07-17 11:15:33 UTC
Created attachment 1701539 [details]
prometheus graph

Description of problem:
The cluster logging stack is producing twice as many metrics as the kubelet on OSD clusters, which is not normal and causes the Prometheus volume to fill up.

Version-Release number of selected component (if applicable):
4.3.25

How reproducible:
reliably

Steps to Reproduce:
1. Install cluster logging.
2. Use Prometheus to check the top 10 metric-producing jobs: topk(10, count by (job) ({__name__=~".+"})) (see the example queries below).
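For reference, the query from step 2 plus a variant that groups by metric name instead of by job; both are generic PromQL sketches that can be pasted into the Prometheus UI (note that they can be expensive on large setups):

    # top 10 jobs by number of active series
    topk(10, count by (job) ({__name__=~".+"}))

    # top 10 metric names by number of active series (variant added here to help narrow down the offending metrics)
    topk(10, count by (__name__) ({__name__=~".+"}))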


Additional info:
I attached the result of said query from one customer cluster.

This is high severity as it is breaking the cluster monitoring stack on customer clusters.

Comment 1 Lili Cosic 2020-07-17 11:19:57 UTC
Just my two cents: this looks to me like high unbound-cardinality metrics, and if that is the case, then from the monitoring team's point of view this is considered a high-severity Bugzilla as well. If this was not fixed in later versions, we should fix it in all of them too.

Comment 3 Lukas Vlcek 2020-07-17 19:13:35 UTC
To fix this we can disable metrics at the index level. Right now these metrics are turned on by default, and they should be turned off. If any customer needs these metrics, they can be given a workaround for turning them back on.
With 4.5 we switched to a different data model in logging that results in far fewer indices, so the Prometheus metrics will have lower cardinality as well, and we can re-evaluate for 4.5 and later how many index-level metrics we can expose without blowing up Prometheus storage.
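For anyone who wants to gauge how much the index-level metrics contribute before and after the change, a rough PromQL sketch, assuming the index-level metrics share an es_index_ prefix and carry an index label (neither is confirmed in this BZ):

    # total number of active series coming from index-level metrics (es_index_ prefix is an assumption)
    count({__name__=~"es_index_.*"})

    # the same count broken down per index, to show how it scales with the number of indices (index label is an assumption)
    count by (index) ({__name__=~"es_index_.*"})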

Comment 8 Anping Li 2020-07-24 14:41:45 UTC
Verified using the CI images.

Comment 9 Anping Li 2020-07-24 15:02:54 UTC
Please ignore Comment 8. Move back to ON_QA.

Comment 10 Anping Li 2020-07-27 11:05:08 UTC
Verified using the latest code. No index level metrics are created.
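One possible spot check for this, again assuming the index-level metrics share an es_index_ prefix, is an absent() query that returns 1 only when no such series exist:

    # returns 1 when no index-level series are present (es_index_ prefix is an assumption)
    absent({__name__=~"es_index_.*"})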

Comment 11 Lukas Vlcek 2020-07-27 14:58:47 UTC
Anli, I believe we need to verify blocking BZs first.

Maybe I cloned the BZs in the wrong order, but there is a series of BZs to bring this change all the way down from 4.6 (master) to 4.3.

BZs:
1) Master: https://bugzilla.redhat.com/show_bug.cgi?id=1858249 (Verified)
2) 4.5: https://bugzilla.redhat.com/show_bug.cgi?id=1859991    (New)   
3) 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1860156    (New)
4) 4.3: https://bugzilla.redhat.com/show_bug.cgi?id=1860164    (New)

The chain can be seen here:
https://bugzilla.redhat.com/buglist.cgi?bug_id=1858249&bug_id_type=andblocked&format=tvp&list_id=11246685&tvp_dir=blocked

Relevant PRs:
1) https://github.com/openshift/elasticsearch-operator/pull/427 (Merged)
2) https://github.com/openshift/elasticsearch-operator/pull/431 (Open)
3) https://github.com/openshift/elasticsearch-operator/pull/432 (Open)
4) https://github.com/openshift/elasticsearch-operator/pull/433 (Open)

Unfortunately, when I originally created the first PR (https://github.com/openshift/elasticsearch-operator/pull/427) I left an incorrect BZ number in the title, which I fixed later, so I am not sure what is actually holding the whole chain back from merging now.

Do you think you can help with BZs 2), 3), and 4)? (See above for the BZ numbers.)

Comment 13 errata-xmlrpc 2020-10-27 16:15:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196