Bug 1899760

Summary: etcd_request_duration_seconds_bucket metric has excessive cardinality
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: kube-apiserver
Assignee: Abu Kashem <akashem>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Priority: low
Version: 4.7
CC: akashem, aos-bugs, mfojtik, sbatsche, vrutkovs, wking, xxia
Target Milestone: ---
Flags: mfojtik: needinfo?
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: LifecycleReset
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 15:35:02 UTC
Type: Bug

Description Clayton Coleman 2020-11-19 23:02:46 UTC
The metric etcd_request_duration_seconds_bucket in 4.7 has 25k series on an empty cluster.  It exposes 41(!) buckets and includes every resource (150) and every verb (10). 

A metric cannot be allowed such extensive cardinality. It needs to be capped, probably at something closer to 1-3k series even on a heavily loaded cluster.
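The numbers above can be sanity-checked with a back-of-the-envelope calculation. This is only a sketch: `seriesUpperBound` is a hypothetical helper, not apiserver code, and the 41/150/10 inputs are the figures quoted in this report. A histogram exposes one `_bucket` series per explicit bucket per label combination, plus the implicit `le="+Inf"` bucket and the `_sum`/`_count` series.

```go
package main

import "fmt"

// seriesUpperBound gives a worst-case series count for a histogram:
// every (resource, verb) combination contributes one series per explicit
// bucket, plus the implicit le="+Inf" bucket and the _sum/_count series.
func seriesUpperBound(buckets, resources, verbs int) int {
	perCombo := buckets + 3 // explicit buckets + le="+Inf" + _sum + _count
	return perCombo * resources * verbs
}

func main() {
	fmt.Println(seriesUpperBound(41, 150, 10)) // 41 buckets, as shipped in 4.7 → 66000
	fmt.Println(seriesUpperBound(11, 150, 10)) // 11 buckets, as proposed → 21000
}
```

The observed ~25k series on an empty cluster is well below the 66k worst case only because not every (resource, verb) combination actually occurs.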

Comment 1 Abu Kashem 2020-11-20 20:01:00 UTC
We opened a PR upstream to reduce the number of buckets for the 'etcd_request_duration_seconds' metric: https://github.com/kubernetes/kubernetes/pull/96754
This takes us back to what we had before (around 11 buckets).

According to Clayton:
> so this takes us from 40k entries to 10k
> 10k is a lot 
> (on an idle cluster, which was 20% of total series out of 200k)


An open question: the 'apiserver_request_duration_seconds' metric has around 37 buckets and, with all of its labels, looks like another cardinality explosion - https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L103-L104

Do we want to reduce the number of buckets for this metric as well?


Clayton's thoughts:
> in general these two metrics are 1/4 of all series on a normal cluster
> they probably should be more like 1/16 or lower
> if we don't do that upstream we should do that when we scrape those clusters (getting david or stefan to weigh in on which parts of cardinality in those metrics we don't need)

I did some further investigation on the "etcd_request_duration_seconds" metric. It has two labels 
> "operation", "type"
> https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/metrics/metrics.go#L46

Here the 'type' label represents the underlying etcd type, and in some cases the etcd key name.
I checked how many values the 'type' label has on my 4.7 dev cluster:

> count(count by (type) (etcd_request_duration_seconds_bucket))
> 257

So 'type' does not correlate with the Kubernetes 'resource' label, for which we have:
> count (count by (resource) (apiserver_request_duration_seconds_bucket))
> 145

'type' appears to be unbounded, since it includes the etcd key names for CRDs as well.

Suggestions/Comments from Clayton:
> we could do the reduction on the scrape side if we had to
> or select only a subset of key resources to track
> pods, events, namespaces, a few representative crds
> operation type definitely is useful
> resource type seems more arbitrary, certainly crd vs not
> maybe all crds should be using the same resource type
> maybe it should be by apigroup
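The "reduction on the scrape side" idea above could be sketched as a Prometheus metric_relabel_configs entry. This is a hypothetical fragment, not the configuration OpenShift ships: it drops the per-bucket series of the histogram entirely while keeping _sum and _count, so average latency stays computable but quantile estimates are lost.

```yaml
# Hypothetical scrape-side reduction (sketch only): drop the high-cardinality
# _bucket series for this metric, keeping _sum and _count intact.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'etcd_request_duration_seconds_bucket'
    action: drop
```

A whitelist of representative types (pods, events, namespaces) would need a different rule shape, since RE2 has no negative lookahead for "type not in list".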

Comment 2 Michal Fojtik 2020-12-20 20:58:21 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 3 Abu Kashem 2021-01-08 19:06:28 UTC
upstream PR: https://github.com/kubernetes/kubernetes/pull/96754

Comment 4 Michal Fojtik 2021-01-08 19:38:51 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 6 Ke Wang 2021-01-20 09:46:11 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-19-095812   True        False         58m     Cluster version is 4.7.0-0.nightly-2021-01-19-095812

On OCP 4.6, ran an etcd_request_duration_seconds_bucket query with a 1h time range in Prometheus:
Total time series: 25707

On OCP 4.7, ran the same query with a 1h time range in Prometheus:
Total time series: 9786

The series count is lower than on 4.6.

Per the PR https://github.com/openshift/kubernetes/pull/515, the etcd_request_duration_seconds_bucket metric should only expose buckets from Buckets: []float64{0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 15.0, 30.0, 60.0}. Queried the metric from the web console, and the values of the 'le' label indeed all fall within this bucket slice, so moving the bug to VERIFIED.
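The manual web-console check above can be sketched in code. This is an assumption-laden sketch: `leIsExpected` is a hypothetical helper, the bucket list is the one from openshift/kubernetes#515 quoted in this comment, and it assumes Prometheus renders bucket upper bounds in the `le` label with minimal float formatting (e.g. `1.0` appears as `"1"`).

```go
package main

import "fmt"

// expectedBuckets is the reduced bucket list from openshift/kubernetes#515.
var expectedBuckets = []float64{0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 15.0, 30.0, 60.0}

// leIsExpected reports whether an observed `le` label value matches one of
// the configured buckets, or the implicit +Inf bucket every histogram has.
func leIsExpected(le string) bool {
	if le == "+Inf" {
		return true
	}
	for _, b := range expectedBuckets {
		// %g mirrors the minimal formatting assumed for `le` values.
		if fmt.Sprintf("%g", b) == le {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(leIsExpected("0.25"), leIsExpected("8")) // → true false
}
```

A stray `le` value such as `"8"` would indicate that old bucket definitions were still being scraped somewhere.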

Comment 9 errata-xmlrpc 2021-02-24 15:35:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633