2092880 – etcdHighNumberOfLeaderChanges returns incorrect number of leadership changes

Summary: etcdHighNumberOfLeaderChanges returns incorrect number of leadership changes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Thomas Jungblut
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	2010989
Depends On:
Blocks:	2102793

Reported:	2022-06-02 13:21 UTC by Thomas Jungblut
Modified:	2022-08-16 16:04 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:	job=periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial
Last Closed:	2022-08-10 11:15:49 UTC
Target Upstream Version:
Embargoed:

Attachments
screenshot (186.03 KB, image/png) 2022-06-02 13:21 UTC, Thomas Jungblut

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 851	0	None	open	Bug 2092880: avoid extrapolation in leaderhip alert	2022-06-10 06:35:24 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 11:16:05 UTC

Description Thomas Jungblut 2022-06-02 13:21:43 UTC

Created attachment 1886098
screenshot

The current query:

> increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m])

returns bogus results: it alerts on values of 5.x when the number of leadership changes was actually only 4, as verified against the metric etcd_server_is_leader.

I believe this is an issue of extrapolation in the increase function, as described here: https://prometheus.io/docs/prometheus/latest/querying/functions/#increase
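
To illustrate how extrapolation can turn an integer counter delta into a fractional value, here is a simplified Python sketch of the behavior described in the Prometheus documentation linked above. It is not the Prometheus implementation, and the function name, the sample timestamps, and the counter values are all made up for illustration; it does not try to reproduce the exact 5.x value from the screenshot.

```python
def extrapolated_increase(samples, range_start, range_end):
    """Approximate an increase() over [range_start, range_end].

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Simplified, hypothetical model of the documented extrapolation.
    """
    if len(samples) < 2:
        return None  # not enough samples to compute an increase

    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]

    # Raw increase between first and last sample, correcting for resets.
    raw = last_v - first_v
    prev = first_v
    for _, value in samples[1:]:
        if value < prev:
            raw += prev
        prev = value

    sampled = last_t - first_t
    avg_gap = sampled / (len(samples) - 1)
    to_start = first_t - range_start
    to_end = range_end - last_t

    # A counter cannot be extrapolated below zero.
    if raw > 0 and first_v >= 0:
        to_start = min(to_start, sampled * (first_v / raw))

    # Extend to the window edge only if the edge is close to the samples;
    # otherwise extend by half an average sample gap.
    threshold = avg_gap * 1.1
    extended = sampled
    extended += to_start if to_start < threshold else avg_gap / 2
    extended += to_end if to_end < threshold else avg_gap / 2

    return raw * extended / sampled


# Assumed data: 15 one-minute samples in a 15-minute window,
# with exactly 4 real leader changes (counter goes from 10 to 14).
values = [10, 10, 10, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14, 14]
samples = [(60 * (i + 1), v) for i, v in enumerate(values)]
result = extrapolated_increase(samples, range_start=0, range_end=900)
print(result)
```

In this made-up case the samples span 14 minutes of a 15-minute window, so the raw delta of 4 is scaled by 900/840 and the reported value is no longer an integer.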

Comment 1 W. Trevor King 2022-06-02 20:32:37 UTC

Extrapolation is often helpful, because we may not have metrics coverage over the whole window to calculate the exact number of leader elections.  I think we should keep the extrapolation, but adjust the wording from [1]:

  {{ $value }} leader changes within the last 15 minutes.

to talk about the extrapolated rate:

  Around {{ $value }} leader changes per 15 minutes.

Alternatively, you could flip it around and do something like:

  Leader elections every {{ FIXME: syntax }} minutes, averaging over the past 15 minutes.

with some gymnastics to get '15 / $value' in there.

[1]: https://github.com/openshift/cluster-etcd-operator/blob/d0ac0559067390d877af995039432481a9d44901/manifests/0000_90_etcd-operator_03_prometheusrule.yaml#L162-L163
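
The '15 / $value' arithmetic for the flipped wording can be sketched as follows. This is plain Python showing only the calculation, not alert-template syntax (which the comment above leaves open); the function name and the zero guard are assumptions for illustration.

```python
def election_interval_message(value, window_minutes=15):
    """Render the flipped wording from an extrapolated change count.

    value: extrapolated number of leader changes over the window.
    Hypothetical helper, not part of any operator or Prometheus API.
    """
    if value <= 0:
        # No changes observed, so there is no meaningful interval.
        return f"No leader elections over the past {window_minutes} minutes."
    interval = window_minutes / value
    return (
        f"Leader elections every {interval:.1f} minutes, "
        f"averaging over the past {window_minutes} minutes."
    )


message = election_interval_message(5.0)
print(message)
```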

Comment 3 W. Trevor King 2022-06-06 02:48:49 UTC

4.10 run [1] has etcdHighNumberOfLeaderChanges firing early on.  I suspect the static pod controller should grow a new metric for config revision, and the etcdHighNumberOfLeaderChanges expr could be updated to say "when the leader churn is higher than what I'd expect given the revision churn" [2].  But that particular post-install situation would also be mitigated by [3].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-cgroupsv2/1533617810465361920
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2010989#c8
[3]: https://github.com/openshift/cluster-etcd-operator/pull/804

Comment 4 Thomas Jungblut 2022-06-07 08:06:06 UTC

TODO: evaluate whether we also need the initial 1h installation guard: https://github.com/openshift/cluster-etcd-operator/pull/843#discussion_r889135867

Comment 14 Thomas Jungblut 2022-07-20 13:14:25 UTC

*** Bug 2010989 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2022-08-10 11:15:49 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
