Bug 2053380

Summary: alert/etcdHighNumberOfLeaderChanges should not be at or above info
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Etcd
Assignee: Allen Ray <alray>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: unspecified
Priority: unspecified
Version: 4.10
CC: bparees
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2022-04-05 23:04:10 UTC

Description W. Trevor King 2022-02-11 06:34:10 UTC
Seen in 4.10 CI [1]:

  : [bz-etcd][invariant] alert/etcdHighNumberOfLeaderChanges should not be at or above info	0s
  etcdHighNumberOfLeaderChanges was at or above info for at least 13m27s on platformidentification.JobType{Release:"4.10", FromRelease:"4.10", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=5m8.4s): pending for 9m3s, firing for 13m27s:

  Feb 10 07:18:21.616 - 299s  W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-168-211.ec2.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-168-211.ec2.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
  Feb 10 07:18:51.616 - 269s  W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-171-133.ec2.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-171-133.ec2.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
  Feb 10 07:19:21.616 - 239s  W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-198-128.ec2.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-198-128.ec2.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
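
For context: the alert itself comes from the etcd monitoring mixin (it watches etcd_server_leader_changes_seen_total), but what fails here is origin's alert invariant, which appears to sum how long the alert was at or above info severity and compare the total against a per-JobType allowance derived from historical CI data (the maxAllowed in the message).  A rough sketch of that comparison, with hypothetical types (the real implementation lives in openshift/origin's test framework):

  package main

  import (
      "fmt"
      "time"
  )

  // Interval is a hypothetical stand-in for the firing windows the test
  // collects from the ALERTS series; the real types live in openshift/origin.
  type Interval struct {
      Alert    string
      Duration time.Duration
  }

  // checkAlertBudget mirrors the shape of the failure message: sum how long
  // the alert was at or above info severity and compare the total against a
  // per-JobType allowance derived from historical CI data (maxAllowed).
  func checkAlertBudget(alert string, intervals []Interval, maxAllowed time.Duration) error {
      var firing time.Duration
      for _, iv := range intervals {
          if iv.Alert == alert {
              firing += iv.Duration
          }
      }
      if firing > maxAllowed {
          return fmt.Errorf("%s was at or above info for at least %v (maxAllowed=%v)",
              alert, firing, maxAllowed)
      }
      return nil
  }

  func main() {
      // The three per-pod firing windows from the job above: 299s, 269s, 239s.
      intervals := []Interval{
          {"etcdHighNumberOfLeaderChanges", 299 * time.Second},
          {"etcdHighNumberOfLeaderChanges", 269 * time.Second},
          {"etcdHighNumberOfLeaderChanges", 239 * time.Second},
      }
      // 299s + 269s + 239s = 807s = 13m27s, well over the 5m8.4s budget.
      err := checkAlertBudget("etcdHighNumberOfLeaderChanges",
          intervals, 5*time.Minute+8400*time.Millisecond)
      fmt.Println(err)
  }

Note that the windows are per-pod and overlap in wall-clock time, so three roughly four-to-five-minute alerts on different members sum to the 13m27s in the message, blowing well past a 5m8.4s budget.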

This test case seems to have a tunable threshold, so sometimes it gets marked as a flake and sometimes it is fatal, and I'm not sure my regexp correctly distinguishes those cases.  But whatever is going on is very common in 4.10+ jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=etcdHighNumberOfLeaderChanges+was+at+or+above+info+for+at+least' | grep 'failures match' | grep -v 'pull-ci-' | sort
...
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 130 runs, 75% failed, 7% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-upgrade (all) - 95 runs, 54% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws (all) - 11 runs, 36% failed, 50% of failures match = 18% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-crun (all) - 16 runs, 19% failed, 33% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-csi (all) - 8 runs, 13% failed, 200% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy (all) - 9 runs, 44% failed, 25% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-serial (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 87 runs, 45% failed, 18% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-azure (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-gcp-csi (all) - 8 runs, 38% failed, 67% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere (all) - 12 runs, 42% failed, 40% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-csi (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-upi-serial (all) - 12 runs, 42% failed, 40% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-aws (all) - 13 runs, 31% failed, 175% of failures match = 54% impact
...many more jobs...
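
For what it's worth, the impact column in those lines looks like the failure rate multiplied by the match rate, and match rates above 100% (the csi and 4.11-e2e-aws rows) presumably mean a single failed run produced more than one matching junit failure.  Checking that arithmetic against two of the rows above:

  package main

  import "fmt"

  // impact reproduces what the CI-search summary lines appear to compute:
  // impact = (fraction of runs that failed) * (fraction of failures that match).
  // Match rates above 100% can occur when one failed run contains more than
  // one matching junit failure.
  func impact(failedPct, matchPct float64) float64 {
      return failedPct * matchPct / 100
  }

  func main() {
      fmt.Printf("%.0f%%\n", impact(36, 50))  // nightly-4.10-e2e-aws: 18% impact
      fmt.Printf("%.0f%%\n", impact(31, 175)) // nightly-4.11-e2e-aws: 54% impact
  }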

There are a number of existing bugs around etcdHighNumberOfLeaderChanges, but it feels like this one is different:

* Bug 1972948 is about consecutive updates, and bug 2008313 is about rollback jobs.  But many of the hits above are not update jobs at all.
* Bug 2010989 is about a Late check.  That's a different test case from the one that's biting me, but maybe the test case was renamed at some point?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1491663535468449792

Comment 2 W. Trevor King 2022-04-05 23:04:10 UTC
Hunting around, it seems like bug 2010989 is also tracking this issue.  Let's consolidate there.

*** This bug has been marked as a duplicate of bug 2010989 ***