Bug 1946784
| Summary: | EtcdMembers_UnhealthyMembers in 4.8 update CI | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Etcd | Assignee: | Suresh Kolichala <skolicha> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.8 | CC: | skolicha |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | --- | Flags: | mfojtik: needinfo? |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | LifecycleReset | ||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-07 16:58:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
W. Trevor King
2021-04-06 20:07:40 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Not as bad as when I opened the bug, but we still see these in CI today:

    $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=reason/EtcdMembers_UnhealthyMembers' | grep 'failures match' | sort
    periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact
    periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    rehearse-18322-periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
    rehearse-18336-periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
    rehearse-18383-periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

This problem is addressed by the recently implemented fix for bug https://bugzilla.redhat.com/show_bug.cgi?id=1952268. It appears that rebooting a node and getting it to the point where the kubelet is running static pods can take more than the default 2 minutes. This causes the etcd operator to set Degraded=True on healthy machine-config node reboots. As a workaround, we set a custom inertia of 5 minutes on the NodeControllerDegraded and EtcdMembersDegraded conditions. Closing as a duplicate of the above bug.

*** This bug has been marked as a duplicate of bug 1952268 ***

With bug 1952268 landed a few days ago, confirming that CI isn't seeing this reason as often:

    $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=reason/EtcdMembers_UnhealthyMembers' | grep 'failures match' | sort
    periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 20 runs, 90% failed, 6% of failures match = 5% impact
    periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 11% of failures match = 11% impact
    pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-upgrade (all) - 32 runs, 84% failed, 4% of failures match = 3% impact
    pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade (all) - 29 runs, 90% failed, 4% of failures match = 3% impact
    rehearse-18322-pull-ci-openshift-cluster-etcd-operator-release-4.9-e2e-gcp-disruptive-ovn (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
    rehearse-18409-pull-ci-openshift-cluster-etcd-operator-release-4.9-e2e-gcp-disruptive-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

So some rehearsals and PR presubmits (where all sorts of things can break), some 4.7->4.8 updates (4.7 hasn't been fixed yet), and... maybe periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade is still in here because we haven't accepted a candidate CI build since the fix landed in master/4.8? Anyhow, looks much better now, and possibly completely fixed in 4.8; thanks :) |
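
To illustrate why raising the inertia from 2 to 5 minutes hides a slow but healthy reboot, here is a minimal sketch of the idea in Python. It is not the etcd operator's code (the operator is written in Go); the class `InertiaTracker` and the helper `degraded_during_outage` are hypothetical names, and the 3-minute reboot and 30-second polling interval are assumed values chosen only for the example.

```python
from datetime import datetime, timedelta


class InertiaTracker:
    """Report Degraded only after a condition has been bad for `inertia`."""

    def __init__(self, inertia):
        self.inertia = inertia
        self.unhealthy_since = None  # start of the current bad streak

    def observe(self, healthy, now):
        """Record one observation; return True if Degraded should be set."""
        if healthy:
            self.unhealthy_since = None
            return False
        if self.unhealthy_since is None:
            self.unhealthy_since = now
        return now - self.unhealthy_since >= self.inertia


def degraded_during_outage(inertia, outage, step=timedelta(seconds=30)):
    """Poll an unhealthy member every `step` for `outage`; did Degraded fire?"""
    tracker = InertiaTracker(inertia)
    start = datetime(2021, 5, 7, 12, 0, 0)  # arbitrary start time
    elapsed = timedelta(0)
    fired = False
    while elapsed < outage:
        fired = tracker.observe(False, start + elapsed) or fired
        elapsed += step
    return fired


# Assumed scenario: a node takes 3 minutes to come back after a reboot.
reboot = timedelta(minutes=3)
with_default = degraded_during_outage(timedelta(minutes=2), reboot)
with_workaround = degraded_during_outage(timedelta(minutes=5), reboot)
```

Under these assumptions the 2-minute inertia reports Degraded partway through the reboot, while the 5-minute inertia never does, which matches the behavior described for the workaround.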
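
The search output quoted in this bug is easier to compare across days once it is parsed. The following is a rough sketch, assuming the line format shown above stays stable; `parse_line`, `rank_by_impact`, and the `min_runs` cutoff are hypothetical helpers for illustration and not part of any CI tooling. The cutoff exists because jobs with a single run show 100% impact and say little on their own.

```python
import re

# One summary line from the 'failures match' output quoted in this bug.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, (?P<match>\d+)% of failures match = "
    r"(?P<impact>\d+)% impact$"
)


def parse_line(line):
    """Return the fields of one summary line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "job": m.group("job"),
        "runs": int(m.group("runs")),
        "failed": int(m.group("failed")),
        "match": int(m.group("match")),
        "impact": int(m.group("impact")),
    }


def rank_by_impact(lines, min_runs=5):
    """Keep jobs with at least `min_runs` runs, highest impact first."""
    rows = [r for r in map(parse_line, lines) if r is not None]
    rows = [r for r in rows if r["runs"] >= min_runs]
    return sorted(rows, key=lambda r: (-r["impact"], -r["runs"]))


# Three lines copied from the first search output in this bug.
SAMPLE = [
    "periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact",
    "periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact",
    "rehearse-18336-periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact",
]

ranked = rank_by_impact(SAMPLE)
```

The sketch reads the impact figure as printed rather than recomputing it, because the displayed percentages are rounded and their product does not always reproduce the displayed impact.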