Bug 1952268
| Summary: | etcd operator should not set Degraded=True EtcdMembersDegraded on healthy machine-config node reboots | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED ERRATA | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.8 | CC: | kgarriso, rphillips, sbatsche, wlewis |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1976988 | Environment: | |
| Last Closed: | 2021-07-27 23:02:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1976988 | | |
*** Bug 1946784 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

I'm looking into another bug but am still seeing this get hit a lot:

https://search.ci.openshift.org/?search=clusteroperator%2Fetcd+should+not+change+condition%2FDegraded&maxAge=336h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

  periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 73 runs, 85% failed, 115% of failures match = 97% impact

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
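For readers unfamiliar with the CI search summary format, here is a minimal Python sketch that parses the job line quoted above. The names `SUMMARY`, `parse_summary`, and `estimated_impact` are hypothetical helpers, not part of any CI tooling. The product used in `estimated_impact` is only an approximation inferred from the numbers quoted in this bug; it is an assumption, not a documented formula. A match rate above 100% is read here as the test also matching in runs that did not fail overall.

```python
import re

# Shape of one per-job summary line, as quoted in this bug.
SUMMARY = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = "
    r"(?P<impact>\d+)% impact$"
)


def parse_summary(line):
    """Return the fields of a summary line, or None if it does not match."""
    m = SUMMARY.match(line.strip())
    if m is None:
        return None
    return {
        "job": m["job"],
        "runs": int(m["runs"]),
        "failed": int(m["failed"]),
        "match": int(m["match"]),
        "impact": int(m["impact"]),
    }


def estimated_impact(failed, match):
    """Approximate impact as failed% x match% (an assumption, see lead-in)."""
    return failed * match / 100


line = (
    "periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-"
    "e2e-aws-upgrade (all) - 73 runs, 85% failed, 115% of failures match = 97% impact"
)
row = parse_summary(line)
print(row["runs"], row["impact"], estimated_impact(row["failed"], row["match"]))
```

The estimate lands within a point of the reported impact because the displayed percentages are already rounded.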
From CI runs like [1]:

  : [bz-Etcd] clusteroperator/etcd should not change condition/Degraded
  Run #0: Failed 0s
  3 unexpected clusteroperator state transitions during e2e test run

  Apr 21 14:28:20.460 - 112s E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-243-65.us-east-2.compute.internal is unhealthy
  Apr 21 14:34:21.216 - 68s E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-156-170.us-east-2.compute.internal is unhealthy
  Apr 21 14:39:43.199 - 45s E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-177-210.us-east-2.compute.internal is unhealthy

Handy interval chart down at the bottom of [1] shows the Degraded=True intervals occurring as the machine-config operator rolls each of the control-plane nodes. Very popular in updates ending in 4.8+:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/etcd+should+not+change+condition/Degraded' | grep 'failures match' | sort
  periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 15 runs, 93% failed, 93% of failures match = 87% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 16 runs, 75% failed, 133% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 15 runs, 100% failed, 87% of failures match = 87% impact
  periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 9 runs, 89% failed, 50% of failures match = 44% impact
  periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact
  periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 86% failed, 100% of failures match = 86% impact
  periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 89% of failures match = 89% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1384851693719523328
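To make the pattern in the three Degraded=True intervals quoted in this report easier to see, the following Python sketch parses those event lines and reports one unhealthy node per interval plus the total time spent Degraded. `INTERVAL`, `UNHEALTHY`, and `parse_interval` are hypothetical names introduced for illustration, and the line format is inferred only from the three samples in this bug. The event lines carry no year, so `year=2021` is an assumption. `%b` parsing also assumes an English or C locale.

```python
import re
from datetime import datetime, timedelta

# Shape of one interval line, inferred from the samples in this bug.
INTERVAL = re.compile(
    r"^(?P<start>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) - (?P<secs>\d+)s\s+"
    r"(?P<level>[A-Z]) clusteroperator/(?P<operator>\S+) "
    r"condition/(?P<condition>\S+) status/(?P<status>\S+) "
    r"reason/(?P<reason>[^:]+): (?P<message>.*)$"
)
UNHEALTHY = re.compile(r"(\S+) is unhealthy$")


def parse_interval(line, year=2021):
    """Parse one interval line; the year is assumed, it is not in the log."""
    m = INTERVAL.match(line.strip())
    if m is None:
        return None
    start = datetime.strptime(f"{year} {m['start']}", "%Y %b %d %H:%M:%S.%f")
    seconds = int(m["secs"])
    node = UNHEALTHY.search(m["message"])
    return {
        "operator": m["operator"],
        "condition": m["condition"],
        "status": m["status"],
        "reason": m["reason"],
        "start": start,
        "end": start + timedelta(seconds=seconds),
        "seconds": seconds,
        "node": node.group(1) if node else None,
    }


EVENTS = [
    "Apr 21 14:28:20.460 - 112s E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-243-65.us-east-2.compute.internal is unhealthy",
    "Apr 21 14:34:21.216 - 68s E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-156-170.us-east-2.compute.internal is unhealthy",
    "Apr 21 14:39:43.199 - 45s E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-177-210.us-east-2.compute.internal is unhealthy",
]

intervals = [parse_interval(line) for line in EVENTS]
total_seconds = sum(i["seconds"] for i in intervals)
for i in intervals:
    print(i["node"], i["start"].time(), "to", i["end"].time())
print("total Degraded seconds:", total_seconds)  # 112 + 68 + 45 = 225
```

Each interval names a different node and the intervals do not overlap, which is consistent with the report's reading that they track the control-plane nodes being rolled one at a time.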