Bug 1858249

Summary: Metrics produce high unbound cardinality
Product: OpenShift Container Platform
Reporter: Rick Rackow <rrackow>
Component: Logging
Assignee: Lukas Vlcek <lvlcek>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.3.z
CC: aos-bugs, cshereme, lcosic, lvlcek
Target Milestone: ---
Keywords: ServiceDeliveryBlocker
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: High-cardinality index-level metrics from Elasticsearch were causing issues with Prometheus. Consequence: The retention time in Prometheus had to be reduced to keep it functional. Fix: The index-level metrics are turned off for now. Result: High-cardinality Elasticsearch index-level metrics are no longer stored in Prometheus.
Story Points: ---
Clone Of:
Clones: 1859991 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:15:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1859991
Attachments: prometheus graph

Description Rick Rackow 2020-07-17 11:15:33 UTC
Created attachment 1701539 [details]
prometheus graph

Description of problem:
The cluster logging stack is producing twice as many metrics as the kubelet on OSD clusters, which is not normal and causes the Prometheus volume to fill up.

Version-Release number of selected component (if applicable):
4.3.25

How reproducible:
reliably

Steps to Reproduce:
1. Install cluster logging.
2. Use Prometheus to check the top 10 metric-producing jobs: topk(10, count by (job) ({__name__=~".+"})) (see the example queries below).
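For reference, the query from step 2 plus a variant that groups by metric name instead of by job; both are generic PromQL sketches that can be pasted into the Prometheus UI (note that they can be expensive on large setups):

    # top 10 jobs by number of active series
    topk(10, count by (job) ({__name__=~".+"}))

    # top 10 metric names by number of active series (variant added here to help narrow down the offending metrics)
    topk(10, count by (__name__) ({__name__=~".+"}))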


Additional info:
I attached the result of said query from one customer cluster.

This is high severity as it is breaking the cluster monitoring stack on customer clusters.

Comment 1 Lili Cosic 2020-07-17 11:19:57 UTC
Just my two cents: this looks to me like high unbound-cardinality metrics, and if that is the case, then from the monitoring team's point of view this is considered a high-severity Bugzilla as well. If this was not fixed in later versions, we should fix it in all of them too.

Comment 3 Lukas Vlcek 2020-07-17 19:13:35 UTC
To fix this we can disable metrics at the index level. Right now these metrics are turned on by default, and they should be turned off. If any customer needs these metrics, they can be given a workaround for turning them back on.
With 4.5 we switched to a different data model in logging that results in far fewer indices, so the Prometheus metrics will have lower cardinality as well, and we can re-evaluate for 4.5 and later how many index-level metrics we can expose without blowing up Prometheus storage.
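For anyone who wants to gauge how much the index-level metrics contribute before and after the change, a rough PromQL sketch, assuming the index-level metrics share an es_index_ prefix and carry an index label (neither is confirmed in this BZ):

    # total number of active series coming from index-level metrics (es_index_ prefix is an assumption)
    count({__name__=~"es_index_.*"})

    # the same count broken down per index, to show how it scales with the number of indices (index label is an assumption)
    count by (index) ({__name__=~"es_index_.*"})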

Comment 8 Anping Li 2020-07-24 14:41:45 UTC
Verified using the CI images.

Comment 9 Anping Li 2020-07-24 15:02:54 UTC
Please ignore Comment 8. Move back to ON_QA.

Comment 10 Anping Li 2020-07-27 11:05:08 UTC
Verified using the latest code. No index level metrics are created.
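One possible spot check for this, again assuming the index-level metrics share an es_index_ prefix, is an absent() query that returns 1 only when no such series exist:

    # returns 1 when no index-level series are present (es_index_ prefix is an assumption)
    absent({__name__=~"es_index_.*"})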

Comment 11 Lukas Vlcek 2020-07-27 14:58:47 UTC
Anli, I believe we need to verify blocking BZs first.

Maybe I cloned the BZs in the wrong order, but there is a series of BZs to bring this change all the way down from 4.6 (master) to 4.3.

BZs:
1) Master: https://bugzilla.redhat.com/show_bug.cgi?id=1858249 (Verified)
2) 4.5: https://bugzilla.redhat.com/show_bug.cgi?id=1859991    (New)   
3) 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1860156    (New)
4) 4.3: https://bugzilla.redhat.com/show_bug.cgi?id=1860164    (New)

The chain can be seen here:
https://bugzilla.redhat.com/buglist.cgi?bug_id=1858249&bug_id_type=andblocked&format=tvp&list_id=11246685&tvp_dir=blocked

Relevant PRs:
1) https://github.com/openshift/elasticsearch-operator/pull/427 (Merged)
2) https://github.com/openshift/elasticsearch-operator/pull/431 (Open)
3) https://github.com/openshift/elasticsearch-operator/pull/432 (Open)
4) https://github.com/openshift/elasticsearch-operator/pull/433 (Open)

Unfortunately, when I originally created the first PR (https://github.com/openshift/elasticsearch-operator/pull/427) I left an incorrect BZ number in the title, which I fixed later, so I am not sure what is actually holding the whole chain back from merging now.

Do you think you can help with BZs 2), 3), and 4)? (See above for the BZ numbers.)

Comment 13 errata-xmlrpc 2020-10-27 16:15:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196