Bug 1972948 - Consecutive updates can trigger etcdHighNumberOfLeaderChanges
Summary: Consecutive updates can trigger etcdHighNumberOfLeaderChanges
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: melbeher
QA Contact: ge liu
URL:
Whiteboard: tag-ci LifecycleStale
Duplicates: 1968030
Depends On:
Blocks:
 
Reported: 2021-06-17 00:00 UTC by W. Trevor King
Modified: 2023-03-14 04:48 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-08 12:12:24 UTC
Target Upstream Version:
Embargoed:


Links:
  Red Hat Issue Tracker OCPPLAN-6697 (Unprioritized, To Do): Review open bugzillas and close them. Last updated 2021-06-24 16:46:07 UTC.

Description W. Trevor King 2021-06-17 00:00:29 UTC
We're seeing etcdHighNumberOfLeaderChanges in a few jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=alert.etcdHighNumberOfLeaderChanges+fired+for.*seconds+with+labels' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 57 runs, 88% failed, 32% of failures match = 28% impact
pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade (all) - 14 runs, 79% failed, 9% of failures match = 7% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 10 runs, 100% failed, 10% of failures match = 10% impact
rehearse-12581-periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact

Picking on a CVO presubmit [1]:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	2h13m53s
    Jun 14 21:17:13.764: Unexpected alerts fired or pending during the upgrade:

    alert etcdHighNumberOfLeaderChanges fired for 180 seconds with labels: {endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-1ql564c3-7ee27-kpf89-master-1", service="etcd", severity="warning"}

The CVO runs A->B->A rollback tests.  In that job, the issue seems to be:

1. Test suite starts updating A->B
2. A->B chugging along
3. 20:03: master-1 comes back from the MCO roll
4. 20:03: new etcd on the recovered node
5. 20:08: master-2 comes back from the MCO roll
6. 20:08: presumably a new etcd on that node too
7. 20:12: etcd operator finishes transitioning pods back to version A
8. The alert, which runs 'increase' over a 15m window [2] (a rough sketch follows below), looks back at all of the above and fires

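For context, the rule in [2] has roughly this shape; this is a paraphrase rather than a verbatim copy, so treat the exact selectors, subquery form, and threshold as assumptions and check the linked YAML for the authoritative version:

  # Paraphrased sketch of the alert described in step 8; see [2] for the real expression.
  - alert: etcdHighNumberOfLeaderChanges
    expr: increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) >= 4
    for: 5m
    labels:
      severity: warning

Because the lookback is 15 minutes, two control-plane rolls a few minutes apart can each contribute a leader election or two for the same pod, and the window sums them.  An ad-hoc console query such as increase(etcd_server_leader_changes_seen_total{pod="etcd-ci-op-1ql564c3-7ee27-kpf89-master-1"}[15m]) over the 20:03-20:12 span should show the combined count crossing the threshold even though no single restart was unusual.
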
I'm not sure yet how to adjust the alert so it doesn't fire during these tightly-chained updates, but I'm filing this bug in case we want to use it to back a temporary openshift/origin e2e exception to unblock CVO CI.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/547/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1404502903409872896
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/7f4925a7203622d70b3007fbddfb6bc5cce6c1d9/assets/control-plane/etcd-prometheus-rule.yaml#L49

Comment 2 W. Trevor King 2021-06-25 17:52:07 UTC
We should revert [1] once this is fixed.

[1]: https://github.com/openshift/release/pull/19396

Comment 3 Michal Fojtik 2021-07-25 18:23:03 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 4 W. Trevor King 2021-08-04 05:00:25 UTC
I've opened [1] with the revert that restores rollback testing for CVO presubmits.  We should not close this bug without landing that.  And we can use its rehearsals to demonstrate that those rollback jobs are still impacted by the current alert logic, which is a bit too picky about what constitutes acceptable leader-election density.

[1]: https://github.com/openshift/release/pull/20875

Comment 5 Michal Fojtik 2021-08-04 05:47:13 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 9 Tim Rozet 2021-09-30 20:43:18 UTC
*** Bug 1968030 has been marked as a duplicate of this bug. ***

Comment 14 Michal Fojtik 2022-02-24 04:12:43 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

