Bug 2092880
| Summary: | etcdHighNumberOfLeaderChanges returns incorrect number of leadership changes | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Thomas Jungblut <tjungblu> | ||||
| Component: | Etcd | Assignee: | Thomas Jungblut <tjungblu> | ||||
| Status: | CLOSED ERRATA | QA Contact: | ge liu <geliu> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 4.11 | CC: | dgoodwin, sippy, wking | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.11.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | job=periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial | |||||
| Last Closed: | 2022-08-10 11:15:49 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 2102793 | ||||||
| Attachments: | |||||||
Description
Thomas Jungblut
2022-06-02 13:21:43 UTC
Extrapolation is often helpful, because we may not have metrics coverage over the whole window and so cannot calculate the exact number of leader elections. I think we should keep the extrapolation, but adjust the wording from [1]:
{{ $value }} leader changes within the last 15 minutes.
to talk about the extrapolated rate:
Around {{ $value }} leader changes per 15 minutes.
Alternatively, you could flip it around and do something like:
Leader elections every {{ FIXME: syntax }} minutes, averaging over the past 15 minutes.
with some gymnastics to get '15 / $value' in there.
[1]: https://github.com/openshift/cluster-etcd-operator/blob/d0ac0559067390d877af995039432481a9d44901/manifests/0000_90_etcd-operator_03_prometheusrule.yaml#L162-L163
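To make the arithmetic behind the two proposed wordings concrete, here is a minimal Python sketch. It assumes a deliberately simplified extrapolation model (scale the observed increase linearly to the full window); Prometheus's own extrapolation is more conservative near the window edges, so the numbers are illustrative only. The function names `extrapolated_increase` and `minutes_between_elections` are hypothetical and do not exist in the operator or in Prometheus.

```python
def extrapolated_increase(first_value, last_value, covered_minutes, window_minutes=15.0):
    """Scale the observed counter increase up to the full window.

    Simplified assumption: linear scaling by window / covered time.
    This is not Prometheus's exact extrapolation algorithm.
    """
    if covered_minutes <= 0:
        raise ValueError("covered_minutes must be positive")
    observed = last_value - first_value
    return observed * (window_minutes / covered_minutes)


def minutes_between_elections(changes_per_window, window_minutes=15.0):
    """Return the average gap between leader changes, or None if there were none."""
    if changes_per_window <= 0:
        return None  # avoid dividing by zero when no changes were seen
    return window_minutes / changes_per_window


# Example: 2 leader changes observed, but only 10 minutes of samples exist.
value = extrapolated_increase(first_value=3, last_value=5, covered_minutes=10)
interval = minutes_between_elections(value)

print(f"Around {value:.1f} leader changes per 15 minutes.")
print(f"Leader elections every {interval:.1f} minutes, averaging over the past 15 minutes.")
```

With only 10 minutes of coverage, the 2 observed changes scale to 3 per 15 minutes, which is why "within the last 15 minutes" can report a count that never actually occurred. The guard for a zero or negative value matters for the flipped wording, since '15 / $value' is undefined when there were no leader changes.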
4.10 run [1] has etcdHighNumberOfLeaderChanges firing early on. I suspect the static pod controller should grow a new metric for config revision, and the etcdHighNumberOfLeaderChanges expr could be updated to say "when the leader churn is higher than what I'd expect given the revision churn" [2]. But that particular post-install situation would also be mitigated by [3].
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-cgroupsv2/1533617810465361920
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2010989#c8
[3]: https://github.com/openshift/cluster-etcd-operator/pull/804
TODO evaluate whether we also need the initial 1h installation guard: https://github.com/openshift/cluster-etcd-operator/pull/843#discussion_r889135867
*** Bug 2010989 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069