Bug 2053380
| Summary: | alert/etcdHighNumberOfLeaderChanges should not be at or above info | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Etcd | Assignee: | Allen Ray <alray> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.10 | CC: | bparees |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-04-05 23:04:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
This is a pretty common failure (3.58% of all failures as of right now):

https://search.ci.openshift.org/?search=alert%2FetcdHighNumberOfLeaderChanges+should+not+be+at+or+above+info&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Recent occurrence:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1508817579408363520

Hunting around, it seems like bug 2010989 is also working on this issue. Let's consolidate there.

*** This bug has been marked as a duplicate of bug 2010989 ***
Seen in 4.10 CI [1]:

```
[bz-etcd][invariant] alert/etcdHighNumberOfLeaderChanges should not be at or above info 0s

etcdHighNumberOfLeaderChanges was at or above info for at least 13m27s on platformidentification.JobType{Release:"4.10", FromRelease:"4.10", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=5m8.4s): pending for 9m3s, firing for 13m27s:

Feb 10 07:18:21.616 - 299s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-168-211.ec2.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-168-211.ec2.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Feb 10 07:18:51.616 - 269s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-171-133.ec2.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-171-133.ec2.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
Feb 10 07:19:21.616 - 239s W alert/etcdHighNumberOfLeaderChanges ns/openshift-etcd pod/etcd-ip-10-0-198-128.ec2.internal ALERTS{alertname="etcdHighNumberOfLeaderChanges", alertstate="firing", endpoint="etcd-metrics", job="etcd", namespace="openshift-etcd", pod="etcd-ip-10-0-198-128.ec2.internal", prometheus="openshift-monitoring/k8s", service="etcd", severity="warning"}
```

This test case seems to have a tunable threshold, so sometimes it gets marked as a flake and sometimes it is fatal, and I'm not sure I have the regexp correct to distinguish those cases. But whatever's going on is very common in 4.10+ jobs:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=etcdHighNumberOfLeaderChanges+was+at+or+above+info+for+at+least' | grep 'failures match' | grep -v 'pull-ci-' | sort
...
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 130 runs, 75% failed, 7% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-upgrade (all) - 95 runs, 54% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws (all) - 11 runs, 36% failed, 50% of failures match = 18% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-crun (all) - 16 runs, 19% failed, 33% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-csi (all) - 8 runs, 13% failed, 200% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy (all) - 9 runs, 44% failed, 25% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-serial (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 87 runs, 45% failed, 18% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-azure (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-gcp-csi (all) - 8 runs, 38% failed, 67% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere (all) - 12 runs, 42% failed, 40% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-csi (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-upi-serial (all) - 12 runs, 42% failed, 40% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-aws (all) - 13 runs, 31% failed, 175% of failures match = 54% impact
...many more jobs...
```
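The impact figure in each summary line is roughly matches divided by runs (which is why match percentages above 100% are possible: a single failed run can match more than once). For anyone who wants to post-process these ci-search summary lines, a minimal sketch; `parse_impact_line` is a hypothetical helper, not part of any existing tooling:

```python
import re

# Pattern for one ci-search "groupBy=job" summary line, e.g.
# "periodic-...-e2e-aws (all) - 11 runs, 36% failed, 50% of failures match = 18% impact"
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, (?P<match>\d+)% of failures match "
    r"= (?P<impact>\d+)% impact$"
)

def parse_impact_line(line):
    """Return a dict of job name and percentages, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    d = m.groupdict()
    return {"job": d["job"],
            **{k: int(d[k]) for k in ("runs", "failed", "match", "impact")}}

sample = ("periodic-ci-openshift-release-master-nightly-4.10-e2e-aws (all) "
          "- 11 runs, 36% failed, 50% of failures match = 18% impact")
print(parse_impact_line(sample))
```

Feeding it the `w3m | grep | sort` output above and sorting by the `impact` key would give a quick worst-first view of affected jobs.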
There are a number of existing bugs around etcdHighNumberOfLeaderChanges, but it feels like this one is different:

* Bug 1972948 is about consecutive updates, and bug 2008313 is about rollback jobs. But a bunch of the above hits are not even update jobs at all.
* Bug 2010989 is about a Late check. That's a different test case than the one that's biting me, but maybe the test case was renamed or something?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1491663535468449792
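For context, the alert itself comes from etcd's Prometheus monitoring rules. A sketch of what such a rule typically looks like, loosely modeled on the upstream etcd mixin; the exact expression, threshold, and `for` duration shipped by OpenShift's cluster-monitoring stack may differ, so treat every number here as illustrative:

```yaml
# Illustrative alerting rule for etcdHighNumberOfLeaderChanges.
# The expression and threshold are assumptions for illustration, not
# the rule OpenShift actually ships.
groups:
  - name: etcd
    rules:
      - alert: etcdHighNumberOfLeaderChanges
        # Counter of leader elections seen by each member; a burst of
        # increases within the window indicates election churn.
        expr: increase(etcd_server_leader_changes_seen_total{job="etcd"}[15m]) >= 4
        for: 5m
        labels:
          severity: warning
        annotations:
          description: >-
            etcd member {{ $labels.pod }} has seen {{ $value }} leader changes
            in the last 15 minutes; frequent elections often point to
            slow disks or an unreliable network.
```

Lowering the `expr` threshold or the test's `maxAllowed` duration is the "tuneable threshold" trade-off mentioned above: stricter values catch churn earlier but turn more CI runs into failures instead of flakes.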