
Bug 2092880

Summary: etcdHighNumberOfLeaderChanges returns incorrect number of leadership changes
Product: OpenShift Container Platform
Component: Etcd
Version: 4.11
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Thomas Jungblut <tjungblu>
Assignee: Thomas Jungblut <tjungblu>
QA Contact: ge liu <geliu>
CC: dgoodwin, sippy, wking
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Environment: job=periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial
Type: Bug
Last Closed: 2022-08-10 11:15:49 UTC
Bug Blocks: 2102793
Attachments: screenshot (flags: none)

Description Thomas Jungblut 2022-06-02 13:21:43 UTC
Created attachment 1886098: screenshot

The current query:

> increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m])

returns misleading results: it alerts on values such as 5.x when, judging by the metric etcd_server_is_leader, there were actually only 4 leadership changes.

I believe this is an issue of extrapolation in the increase function, as described here: https://prometheus.io/docs/prometheus/latest/querying/functions/#increase
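
To make the extrapolation effect concrete, here is a minimal Python sketch of the arithmetic. It assumes a simplified model of how increase() stretches the observed counter delta to cover the whole query window. The function name, the sample layout, and the 1.1 threshold factor are assumptions for illustration, and counter resets are ignored. It is not Prometheus's actual code and it does not try to reproduce the exact 5.x value from the alert.

```python
def extrapolated_increase(samples, range_start, range_end):
    """Simplified model of increase() over one window (assumption, not Prometheus code).

    samples: list of (timestamp_seconds, counter_value), sorted by time,
    all inside [range_start, range_end]. Counter resets are not handled.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a delta
    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]
    raw = last_v - first_v                 # increase actually observed
    sampled = last_t - first_t             # time actually covered by samples
    avg_gap = sampled / (len(samples) - 1)
    to_start = first_t - range_start       # uncovered time before first sample
    to_end = range_end - last_t            # uncovered time after last sample
    # A counter cannot be negative, so do not extrapolate back past zero.
    if raw > 0 and first_v >= 0:
        to_start = min(to_start, sampled * first_v / raw)
    threshold = avg_gap * 1.1              # assumed threshold factor
    covered = sampled
    covered += to_start if to_start < threshold else avg_gap / 2
    covered += to_end if to_end < threshold else avg_gap / 2
    return raw * covered / sampled


# Hypothetical data: a 15m window (0..900s) with 15 one-minute subquery
# samples, and a counter that goes from 10 to 14 (4 real leader changes).
timestamps = list(range(30, 900, 60))
values = [10 + sum(1 for k in (3, 6, 9, 12) if i >= k) for i in range(15)]
samples = list(zip(timestamps, values))

result = extrapolated_increase(samples, 0, 900)
print(f"raw increase: {values[-1] - values[0]}, extrapolated: {result:.3f}")
```

In this model the 15 samples only span 14 minutes, so the 4 observed changes are scaled by 900/840 and reported as a non-integer value above 4, which is the kind of inflation described above.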

Comment 1 W. Trevor King 2022-06-02 20:32:37 UTC
Extrapolation is often helpful, because we may not have metrics coverage over the whole window and so cannot always calculate the exact number of leader elections.  I think we should keep the extrapolation, but adjust the wording from [1]:

  {{ $value }} leader changes within the last 15 minutes.

to talk about the extrapolated rate:

  Around {{ $value }} leader changes per 15 minutes.

Alternatively, you could flip it around and do something like:

  Leader elections every {{ FIXME: syntax }} minutes, averaging over the past 15 minutes.

with some gymnastics to get '15 / $value' in there.

[1]: https://github.com/openshift/cluster-etcd-operator/blob/d0ac0559067390d877af995039432481a9d44901/manifests/0000_90_etcd-operator_03_prometheusrule.yaml#L162-L163
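
The '15 / $value' arithmetic from the comment above can be sketched in Python as follows. This only illustrates the calculation and the resulting wording; the helper names are hypothetical, and it is not the alert template syntax, which the comment leaves as a FIXME.

```python
def election_interval_minutes(value, window_minutes=15.0):
    """Average minutes between leader elections, given the extrapolated
    number of changes (value) over the window. Hypothetical helper."""
    if value <= 0:
        return None  # no elections observed, so no meaningful interval
    return window_minutes / value


def describe(value, window_minutes=15.0):
    """Render the flipped-around wording suggested in the comment."""
    interval = election_interval_minutes(value, window_minutes)
    if interval is None:
        return f"No leader elections over the past {window_minutes:g} minutes."
    return (f"Leader elections every {interval:.1f} minutes, "
            f"averaging over the past {window_minutes:g} minutes.")


message = describe(5.0)
print(message)
```

The zero guard matters because the expression in the query can evaluate to 0 when the metric is absent, and dividing by that value would fail.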

Comment 3 W. Trevor King 2022-06-06 02:48:49 UTC
4.10 run [1] has etcdHighNumberOfLeaderChanges firing early on.  I suspect the static pod controller should grow a new metric for config revision, and the etcdHighNumberOfLeaderChanges expr could be updated to say "when the leader churn is higher than what I'd expect given the revision churn" [2].  But that particular post-install situation would also be mitigated by [3].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-cgroupsv2/1533617810465361920
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2010989#c8
[3]: https://github.com/openshift/cluster-etcd-operator/pull/804

Comment 4 Thomas Jungblut 2022-06-07 08:06:06 UTC
TODO: evaluate whether we also need the initial 1h installation guard: https://github.com/openshift/cluster-etcd-operator/pull/843#discussion_r889135867

Comment 14 Thomas Jungblut 2022-07-20 13:14:25 UTC
*** Bug 2010989 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2022-08-10 11:15:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069