Created attachment 1886098 [details]
screenshot

The current query:

> increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m])

returns bogus results: it alerts on values of 5.x when, checked against the metric etcd_server_is_leader, there were actually only 4 leadership changes.

I believe this is caused by extrapolation in the increase function, as described here: https://prometheus.io/docs/prometheus/latest/querying/functions/#increase
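To illustrate how extrapolation can turn an integer counter delta into a fractional value, here is a minimal Python sketch of a simplified extrapolation model. It is not the Prometheus implementation: it ignores counter resets and the clamping Prometheus applies near a counter's zero point, and the 1.1 threshold and half-gap fallback are my understanding of the upstream logic, so treat them as assumptions. The sample data is hypothetical and is not meant to reproduce the 5.x value from the screenshot.

```python
def extrapolated_increase(samples, range_start, range_end):
    """Simplified model of increase() extrapolation.

    samples: list of (timestamp_seconds, counter_value), sorted by time,
             all inside [range_start, range_end], with no counter resets.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None

    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]

    raw_increase = last_v - first_v          # what actually happened between samples
    sampled_interval = last_t - first_t      # time covered by the samples
    avg_gap = sampled_interval / (len(samples) - 1)

    to_start = first_t - range_start         # uncovered time at the window start
    to_end = range_end - last_t              # uncovered time at the window end

    # Assumption: extend to the window boundary only if the uncovered time is
    # close to one sample gap; otherwise extend by half a gap.
    threshold = avg_gap * 1.1
    extended = sampled_interval
    extended += to_start if to_start < threshold else avg_gap / 2
    extended += to_end if to_end < threshold else avg_gap / 2

    return raw_increase * extended / sampled_interval


# Hypothetical data: 15 samples, one per minute, in a 15 minute (900 s) window.
# The counter goes from 10 to 14, so exactly 4 leader changes were observed.
samples = [(30 + 60 * i, 10 + min(i, 4)) for i in range(15)]
estimate = extrapolated_increase(samples, 0, 900)
print(round(estimate, 2))  # 4.29
```

In this model the samples cover only 14 of the 15 minutes, so the observed delta of 4 is scaled by 900/840. The fewer samples fall inside the window, the larger that scaling factor becomes.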
Extrapolation is often helpful, because we may not have the metrics coverage over the whole window that we would need to calculate the exact number of leader elections. I think we should keep the extrapolation, but adjust the wording from [1]:

  {{ $value }} leader changes within the last 15 minutes.

to talk about the extrapolated rate:

  Around {{ $value }} leader changes per 15 minutes.

Alternatively, you could flip it around and do something like:

  Leader elections every {{ FIXME: syntax }} minutes, averaging over the past 15 minutes.

with some gymnastics to get '15 / $value' in there.

[1]: https://github.com/openshift/cluster-etcd-operator/blob/d0ac0559067390d877af995039432481a9d44901/manifests/0000_90_etcd-operator_03_prometheusrule.yaml#L162-L163
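The '15 / $value' arithmetic for the flipped wording can be sketched in Python. This only shows the calculation and the resulting message; it does not answer the template-syntax question above, and the function name and the handling of a zero value are assumptions made for illustration.

```python
def minutes_between_elections(extrapolated_changes, window_minutes=15.0):
    """Average minutes between leader elections over the window.

    Returns None when there were no changes, since the interval is undefined.
    """
    if extrapolated_changes <= 0:
        return None
    return window_minutes / extrapolated_changes


def describe(extrapolated_changes):
    """Build the flipped alert wording for a given extrapolated value."""
    interval = minutes_between_elections(extrapolated_changes)
    if interval is None:
        return "No leader elections over the past 15 minutes."
    return (
        f"Leader elections every {interval:.1f} minutes, "
        "averaging over the past 15 minutes."
    )


print(describe(5.0))
```

One design point to settle with this wording is what the message should say as the value approaches zero, because the interval grows without bound.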
A 4.10 run [1] has etcdHighNumberOfLeaderChanges firing early on. I suspect the static pod controller should grow a new metric for config revision, and the etcdHighNumberOfLeaderChanges expr could be updated to say "when the leader churn is higher than what I'd expect given the revision churn" [2]. But that particular post-install situation would also be mitigated by [3].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-cgroupsv2/1533617810465361920
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2010989#c8
[3]: https://github.com/openshift/cluster-etcd-operator/pull/804
TODO: evaluate whether we also need the initial 1h installation guard:
https://github.com/openshift/cluster-etcd-operator/pull/843#discussion_r889135867
*** Bug 2010989 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069