We're seeing etcdHighNumberOfLeaderChanges firing in a few jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=alert.etcdHighNumberOfLeaderChanges+fired+for.*seconds+with+labels' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 57 runs, 88% failed, 32% of failures match = 28% impact
pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade (all) - 14 runs, 79% failed, 9% of failures match = 7% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 10 runs, 100% failed, 10% of failures match = 10% impact
rehearse-12581-periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact

Picking on a CVO presubmit [1]:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 2h13m53s
Jun 14 21:17:13.764: Unexpected alerts fired or pending during the upgrade:

alert etcdHighNumberOfLeaderChanges fired for 180 seconds with labels: {endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-1ql564c3-7ee27-kpf89-master-1", service="etcd", severity="warning"}

The CVO presubmit runs A->B->A rollback tests. In that job, the sequence seems to be:

1. Test suite starts updating A->B.
2. A->B chugs along.
3. 20:03: master-1 comes back from the MCO roll.
4. 20:03: new etcd pod on the recovered node.
5. 20:08: master-2 comes back from the MCO roll.
6. 20:08: presumably a new etcd pod on that node too.
7. 20:12: the etcd operator finishes transitioning pods back to version A.
8. The alert, which runs 'increase' over a 15m window [2], looks back at all of that and fires.

I'm not sure yet how to adjust the alert to avoid firing during these tightly-chained updates, but I'm filing this bug in case we want to use it to back a temporary openshift/origin e2e exception to unblock CVO CI.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/547/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1404502903409872896
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/7f4925a7203622d70b3007fbddfb6bc5cce6c1d9/assets/control-plane/etcd-prometheus-rule.yaml#L49
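To illustrate why the tightly-chained rollback trips the alert, here's a rough sketch of what an 'increase'-over-15m lookback sees given the timeline above. This is a hypothetical simulation, not the actual rule from [2]: the timestamps are taken from the timeline, the assumption that each pod restart produces at least one leader change is mine, and no particular alert threshold is claimed.

```python
# Hypothetical sketch: why back-to-back update/rollback transitions can
# crowd a 15m increase() window. Not the actual rule from [2]; each etcd
# pod restart is assumed (by me) to cause at least one leader election.
from datetime import datetime, timedelta

# Approximate leader-change times from the timeline above.
changes = [
    datetime(2021, 6, 14, 20, 3),   # master-1 rejoins after the MCO roll
    datetime(2021, 6, 14, 20, 8),   # master-2 rejoins after the MCO roll
    datetime(2021, 6, 14, 20, 12),  # etcd operator rolls pods back to A
]

def increase_15m(now, events):
    """Count events in the trailing 15-minute window, loosely mimicking
    increase(etcd_server_leader_changes_seen_total[15m])."""
    return sum(1 for t in events if now - timedelta(minutes=15) < t <= now)

# At 20:12 the window still reaches back past 20:03, so every leader
# change from the chained transitions lands in a single evaluation.
print(increase_15m(datetime(2021, 6, 14, 20, 12), changes))  # -> 3
```

In a steady-state cluster those elections would be spread across separate windows; the rollback job packs them into one, which is the density the alert reads as trouble.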
We should revert [1] once this is fixed. [1]: https://github.com/openshift/release/pull/19396
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
I've opened [1] with the revert that restores rollback testing for CVO presubmits. We should not close this bug without landing that, and we can use its rehearsals to demonstrate that those rollback jobs are still impacted by the current alert logic being a bit too picky about what constitutes acceptable leader-election density. [1]: https://github.com/openshift/release/pull/20875
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
*** Bug 1968030 has been marked as a duplicate of this bug. ***
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.