Created attachment 1701539 [details]
prometheus graph

Description of problem:
The cluster logging stack is producing twice as many metrics as the kubelet on OSD clusters, which is not normal and causes the Prometheus volume to fill up.

Version-Release number of selected component (if applicable):
4.3.25

How reproducible:
Reliably

Steps to Reproduce:
1. Install cluster logging.
2. Use Prometheus to check the top 10 metric producers:
   topk(10, count by (job) ({__name__=~".+"}))

Additional info:
I attached the result of said query from one customer cluster.

This is high severity, as it is breaking the cluster monitoring stack on customer clusters.
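For anyone reproducing step 2 without console access, here is a minimal sketch of running the same query against the Prometheus HTTP API. The route URL and bearer token are placeholders, not values from an actual cluster:

import requests

PROM_URL = "https://prometheus-k8s.example.com"  # hypothetical route
TOKEN = "sha256~..."                              # hypothetical bearer token

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": 'topk(10, count by (job) ({__name__=~".+"}))'},
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # test clusters often use self-signed certificates
)
resp.raise_for_status()

# Each result is one scrape job and the number of active series it exposes.
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("job", "<none>"), result["value"][1])

On the affected clusters, the logging jobs show up at roughly twice the series count of the kubelet.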
Just my two cents: this looks to me like metrics with high, unbounded cardinality, and if that's the case, then from the monitoring team's point of view this is considered a high-severity Bugzilla as well. If this has not been fixed in later versions, we should fix it in all of them too.
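To back up the cardinality theory, a variation of the query above breaks the series count down by metric name inside a single job. The job label value "elasticsearch" is an assumption; substitute whichever job dominates the topk() result:

import requests

PROM_URL = "https://prometheus-k8s.example.com"  # hypothetical route
# Group active series by metric name within one job to see which
# metric families carry the cardinality.
QUERY = 'topk(10, count by (__name__) ({job="elasticsearch"}))'

# Auth headers omitted for brevity; see the snippet in the description.
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"]["__name__"], result["value"][1])

If the top metric families carry a per-index label, the series count grows with the number of indices, which is exactly the unbounded-cardinality pattern described above.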
To fix this we can disable metrics at the index level. Right now these metrics are turned on by default, and they should be turned off. Any customer who needs them can be given a workaround to turn them back on. With 4.5 we switched to a different data model in logging that results in far fewer indices, so the Prometheus metrics will be less cardinal as well, and we can re-evaluate for 4.5 and later how many index-level metrics we can expose without blowing up Prometheus storage.
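For reference, a sketch of what the opt-in workaround could look like, assuming the Elasticsearch Prometheus exporter plugin is what serves these metrics and that it exposes a dynamic cluster setting for index-level stats. The setting name "prometheus.indices" is an assumption based on the upstream prometheus-exporter plugin; the PRs linked later in this bug show the actual mechanism the operator uses:

import requests

ES_URL = "https://elasticsearch.openshift-logging.svc:9200"  # hypothetical in-cluster URL

resp = requests.put(
    f"{ES_URL}/_cluster/settings",
    # "prometheus.indices" is an assumed plugin setting name; True would
    # re-enable index-level metrics for customers who explicitly need them.
    json={"persistent": {"prometheus.indices": True}},
    # Client-certificate auth omitted; OpenShift Elasticsearch requires it.
)
resp.raise_for_status()
print(resp.json())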
Verified using the CI images.
Please ignore Comment 8. Moving this back to ON_QA.
Verified using the latest code. No index-level metrics are created.
Anli, I believe we need to verify the blocking BZs first. Maybe I cloned the BZs in the wrong order, but there is a series of BZs to bring this change all the way down from 4.6 (master) to 4.3.

BZs:
1) Master: https://bugzilla.redhat.com/show_bug.cgi?id=1858249 (Verified)
2) 4.5: https://bugzilla.redhat.com/show_bug.cgi?id=1859991 (New)
3) 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1860156 (New)
4) 4.3: https://bugzilla.redhat.com/show_bug.cgi?id=1860164 (New)

The chain can be seen here: https://bugzilla.redhat.com/buglist.cgi?bug_id=1858249&bug_id_type=andblocked&format=tvp&list_id=11246685&tvp_dir=blocked

Relevant PRs:
1) https://github.com/openshift/elasticsearch-operator/pull/427 (Merged)
2) https://github.com/openshift/elasticsearch-operator/pull/431 (Open)
3) https://github.com/openshift/elasticsearch-operator/pull/432 (Open)
4) https://github.com/openshift/elasticsearch-operator/pull/433 (Open)

Unfortunately, when I originally created the first PR (https://github.com/openshift/elasticsearch-operator/pull/427), I left an incorrect BZ number in the title, which I fixed later, so I am not sure what is actually holding the whole chain back from merging now.

Do you think you can help with BZs 2), 3) and 4)? (See the BZ numbers above.)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196