Bug 1972948

Summary:	Consecutive updates can trigger etcdHighNumberOfLeaderChanges
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Etcd	Assignee:	melbeher
Status:	CLOSED NEXTRELEASE	QA Contact:	ge liu <geliu>
Severity:	low	Docs Contact:
Priority:	low
Version:	4.9	CC:	anpicker, htariq, jluhrsen, lmohanty, melbeher, tjungblu
Target Milestone:	---	Keywords:	Upgrades
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	tag-ci LifecycleStale
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-09-08 12:12:24 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description W. Trevor King 2021-06-17 00:00:29 UTC

We're seeing etcdHighNumberOfLeaderChanges in a few jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=alert.etcdHighNumberOfLeaderChanges+fired+for.*seconds+with+labels' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 57 runs, 88% failed, 32% of failures match = 28% impact
pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade (all) - 14 runs, 79% failed, 9% of failures match = 7% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 10 runs, 100% failed, 10% of failures match = 10% impact
rehearse-12581-periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
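
For anyone reading the percentages above: the "impact" figure appears to be the share of all runs that both failed and matched the search, i.e. matched / runs. A minimal sketch of that arithmetic follows. The raw counts (3 failed and 1 matching out of 4 runs; 50 failed and 16 matching out of 57 runs) are my own inference from the rounded percentages, not numbers reported by the search tool, and the impact() helper is hypothetical.

```python
def impact(runs, failed, matched):
    """Return (failed %, % of failures matching, impact %) as rounded integers.

    Assumption: impact is matched / runs, which is how the search output
    above appears to derive it.
    """
    failed_pct = round(100 * failed / runs)
    match_pct = round(100 * matched / failed) if failed else 0
    impact_pct = round(100 * matched / runs)
    return failed_pct, match_pct, impact_pct


# Counts inferred from the rounded percentages above (assumption, not reported data).
print(impact(4, 3, 1))     # rollback periodic: (75, 33, 25)
print(impact(57, 50, 16))  # CVO presubmit:     (88, 32, 28)
```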

Picking on a CVO presubmit [1]:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	2h13m53s
    Jun 14 21:17:13.764: Unexpected alerts fired or pending during the upgrade:

    alert etcdHighNumberOfLeaderChanges fired for 180 seconds with labels: {endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-1ql564c3-7ee27-kpf89-master-1", service="etcd", severity="warning"}

The CVO runs A->B->A rollback tests.  In that job, the issue seems to be:

1. Test suite starts updating A->B
2. A->B chugging along
3. 20:03: master-1 comes back from the MCO roll
4. 20:03: new etcd on the recovered node
5. 20:08: master-2 comes back from the MCO roll
6. 20:08: presumably a new etcd on that node too
7. 20:12: etcd operator finishes transitioning pods back to version A
8. Alert, which is running 'increase' over 15m [2], looks back at all of that^ and freaks out
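
A rough sketch of why the 15m lookback matters here, under some loud assumptions: I'm treating each timestamp in the timeline above as one leader change (the real counter may move more or less than that), and THRESHOLD is a hypothetical value, not the one in the actual rule [2]. The helper names are made up for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)  # the 15m range the alert's 'increase' uses [2]
THRESHOLD = 3                   # hypothetical; check the real rule for the actual value


def at(hhmm):
    """Parse an HH:MM string from the timeline into a datetime."""
    return datetime.strptime(hhmm, "%H:%M")


def changes_in_window(events, now, window=WINDOW):
    """Count events in the half-open interval (now - window, now]."""
    start = now - window
    return sum(1 for t in events if start < t <= now)


# Assumption: one leader change per timeline entry (20:03, 20:08, 20:12).
events = [at("20:03"), at("20:08"), at("20:12")]

for now in ("20:13", "20:19"):
    count = changes_in_window(events, at(now))
    print(now, count, count >= THRESHOLD)
```

With these assumed inputs, a window ending at 20:13 still covers all three events, while one ending at 20:19 has already dropped the 20:03 event. That is the sense in which tightly-chained updates stack up inside a single lookback.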

I'm not sure how to adjust the alert to avoid firing in these tightly-chained updates yet, but filing the bug in case we want to use it to back a temporary openshift/origin e2e exception to unblock CVO CI.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/547/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1404502903409872896
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/7f4925a7203622d70b3007fbddfb6bc5cce6c1d9/assets/control-plane/etcd-prometheus-rule.yaml#L49

Comment 2 W. Trevor King 2021-06-25 17:52:07 UTC

We should revert [1] once this is fixed.

[1]: https://github.com/openshift/release/pull/19396

Comment 3 Michal Fojtik 2021-07-25 18:23:03 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 4 W. Trevor King 2021-08-04 05:00:25 UTC

I've opened [1] with the revert that restores rollback testing for CVO presubmits.  We should not close this bug without landing that.  And we can use its rehearsals to demonstrate that those rollback jobs are still impacted by the current alert logic being a bit too picky about what constitutes acceptable leader-election density.

[1]: https://github.com/openshift/release/pull/20875

Comment 5 Michal Fojtik 2021-08-04 05:47:13 UTC

The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 9 Tim Rozet 2021-09-30 20:43:18 UTC

*** Bug 1968030 has been marked as a duplicate of this bug. ***

Comment 14 Michal Fojtik 2022-02-24 04:12:43 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.