Created attachment 1701539 [details]
prometheus graph

Description of problem:
The cluster logging stack is producing twice as many metrics as the kubelet on OSD clusters, which is not normal and causes the Prometheus volume to fill up.

Version-Release number of selected component (if applicable):
4.3.25

How reproducible:
Reliably

Steps to Reproduce:
1. Install cluster logging.
2. Use Prometheus to check the top 10 metric producers:
   topk(10, count by (job) ({__name__=~".+"}))

Additional info:
I attached the result of said query from one customer cluster.

This is high severity, as it is breaking the cluster monitoring stack on customer clusters.
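For anyone reproducing step 2 without console access, here is a minimal sketch of running the same query against the Prometheus HTTP API. The route URL and bearer token are placeholders, not values from an actual cluster:

import requests

PROM_URL = "https://prometheus-k8s.example.com"  # hypothetical route
TOKEN = "sha256~..."                              # hypothetical bearer token

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": 'topk(10, count by (job) ({__name__=~".+"}))'},
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # test clusters often use self-signed certificates
)
resp.raise_for_status()

# Each result is one scrape job and the number of active series it exposes.
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("job", "<none>"), result["value"][1])

On the affected clusters, the logging jobs show up at roughly twice the series count of the kubelet.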
Just my two cents: this looks to me like metrics with high, unbounded cardinality, and if that's the case, then from the monitoring team's point of view this is considered a high-severity Bugzilla as well. If this has not been fixed in later versions, we should fix it in all of them too.
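To back up the cardinality theory, a variation of the query above breaks the series count down by metric name inside a single job. The job label value "elasticsearch" is an assumption; substitute whichever job dominates the topk() result:

import requests

PROM_URL = "https://prometheus-k8s.example.com"  # hypothetical route
# Group active series by metric name within one job to see which
# metric families carry the cardinality.
QUERY = 'topk(10, count by (__name__) ({job="elasticsearch"}))'

# Auth headers omitted for brevity; see the snippet in the description.
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"]["__name__"], result["value"][1])

If the top metric families carry a per-index label, the series count grows with the number of indices, which is exactly the unbounded-cardinality pattern described above.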
To fix this we can disable metrics at the index level. Right now these metrics are turned on by default, and they should be turned off. Any customer who needs them can be given a workaround to turn them back on. With 4.5 we switched to a different data model in logging that results in far fewer indices, so the Prometheus metrics will be less cardinal as well, and we can re-evaluate for 4.5 and later how many index-level metrics we can expose without blowing up Prometheus storage.
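For reference, a sketch of what the opt-in workaround could look like, assuming the Elasticsearch Prometheus exporter plugin is what serves these metrics and that it exposes a dynamic cluster setting for index-level stats. The setting name "prometheus.indices" is an assumption based on the upstream prometheus-exporter plugin; the PRs linked later in this bug show the actual mechanism the operator uses:

import requests

ES_URL = "https://elasticsearch.openshift-logging.svc:9200"  # hypothetical in-cluster URL

resp = requests.put(
    f"{ES_URL}/_cluster/settings",
    # "prometheus.indices" is an assumed plugin setting name; True would
    # re-enable index-level metrics for customers who explicitly need them.
    json={"persistent": {"prometheus.indices": True}},
    # Client-certificate auth omitted; OpenShift Elasticsearch requires it.
)
resp.raise_for_status()
print(resp.json())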
Verified using the CI images.
Please ignore Comment 8. Moving this back to ON_QA.
Verified using the latest code. No index-level metrics are created.
Anli, I believe we need to verify the blocking BZs first. Maybe I cloned the BZs in the wrong order, but there is a series of BZs to bring this change all the way down from 4.6 (master) to 4.3.

BZs:
1) Master: https://bugzilla.redhat.com/show_bug.cgi?id=1858249 (Verified)
2) 4.5: https://bugzilla.redhat.com/show_bug.cgi?id=1859991 (New)
3) 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1860156 (New)
4) 4.3: https://bugzilla.redhat.com/show_bug.cgi?id=1860164 (New)

The chain can be seen here: https://bugzilla.redhat.com/buglist.cgi?bug_id=1858249&bug_id_type=andblocked&format=tvp&list_id=11246685&tvp_dir=blocked

Relevant PRs:
1) https://github.com/openshift/elasticsearch-operator/pull/427 (Merged)
2) https://github.com/openshift/elasticsearch-operator/pull/431 (Open)
3) https://github.com/openshift/elasticsearch-operator/pull/432 (Open)
4) https://github.com/openshift/elasticsearch-operator/pull/433 (Open)

Unfortunately, when I originally created the first PR (https://github.com/openshift/elasticsearch-operator/pull/427), I left an incorrect BZ number in the title, which I fixed later, so I am not sure what is actually holding the whole chain back from merging now.

Do you think you can help with BZs 2), 3) and 4)? (See the BZ numbers above.)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196