Bug 2040533
Summary: | Install fails to bootstrap, complaining about DefragControllerDegraded and sad members | | | |
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> | |
Component: | Etcd | Assignee: | melbeher | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | ge liu <geliu> | |
Severity: | medium | Docs Contact: | ||
Priority: | medium | |||
Version: | 4.9 | CC: | alray, aos-bugs, bleanhar, dcbw, deads, dgoodwin, ehashman, harpatil, kenzhang, melbeher, nagrawal, rphillips, tjungblu | |
Target Milestone: | --- | Keywords: | Upgrades | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2050250 (view as bug list) | Environment: | ||
Last Closed: | 2022-09-08 09:38:24 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2042501, 2050250 |
Description
W. Trevor King
2022-01-14 00:49:32 UTC
Patch from https://github.com/openshift/kubernetes/pull/1140 is in 4.10.0-0.ci-2022-01-25-065550 or later. To confirm the fix, we need to see that this error is reduced over the next 1000 runs. We have also opened a revert (https://github.com/openshift/kubernetes/pull/1142) of the PR that likely caused the bug, in case the patch above does not work.

*** Bug 2047501 has been marked as a duplicate of this bug. ***

The nightly payload was updated about 30 hours ago. Since then, the install success rate has climbed 4%, in line with our expectations.

This fix worked.

Excellent news! Thanks for confirming.

David found another failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26796/pull-ci-openshift-origin-master-e2e-aws-serial/1487126569716551680, which I confirmed as another instance of the same race. It seems we may still be hitting the race condition, although much more rarely. I discussed reverting the upstream patch that adds full names of static pods on the upstream revert PR here: https://github.com/kubernetes/kubernetes/pull/107734#issuecomment-1024768384. Seeing as that PR fixed a similar, worse issue with static-pod handling, we may want to find a patch that addresses the remaining races rather than roll back. Work is being done upstream on this PR: https://github.com/kubernetes/kubernetes/pull/107854 and possibly https://github.com/kubernetes/kubernetes/pull/107900.

*** Bug 2048756 has been marked as a duplicate of this bug. ***

*** Bug 2053255 has been marked as a duplicate of this bug. ***

Harshal added UpgradeBlocker back on 2022-02-14, which triggers an evaluation process [1]. But this bug is verified for 4.11, and bug 2050250 is verified for 4.10.0. The 4.9.z bug 2050253 is still NEW, but 4.9.z has been out for a while, and I don't get the impression that this series is fixing a recent regression within the 4.9 z-stream. So I'm going to drop UpgradeBlocker here.
If folks feel like I'm missing something, feel free to restore the keyword and fill out an impact statement [2] that sets the context, explains which A->B updates are vulnerable, and explains why blocking them is required to keep customers safe.

[1]: https://github.com/openshift/enhancements/tree/20183fad84f682c83e60386593c0eca717ee5bc9/enhancements/update/update-blocker-lifecycle#summary
[2]: https://github.com/openshift/enhancements/tree/20183fad84f682c83e60386593c0eca717ee5bc9/enhancements/update/update-blocker-lifecycle#impact-statement-request

Checking CI logs, I still see these failures:

https://search.ci.openshift.org/?maxAge=48h&type=junit&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available
https://search.ci.openshift.org/?search=Failed+to+wait+for+bootstrapping+to+complete&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job&search=EtcdMembersDegraded

Is it worth moving this back to Node, updating the description and/or doc text to explain what is fixed now, and then creating a new bug for any follow-up etcd work? I haven't been following all that closely, but things like kubelet pod-handling fixes seem like something that could be tracked and potentially backported independently of... whatever is left for etcd to do.

The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 120 days.
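The CI search links above are plain query strings against search.ci.openshift.org. As an illustrative sketch for tracking a failure signature like this one (the helper name and defaults below are my own; only the host and the `maxAge`/`type`/`search` parameters come from the links quoted above), the same query can be built programmatically:

```python
from urllib.parse import urlencode

def ci_search_url(search, max_age="48h", result_type="junit"):
    """Build a search.ci.openshift.org URL for a CI failure signature.

    Note: urlencode percent-escapes characters such as ':' that the
    hand-written links above leave literal; the queries are equivalent.
    """
    params = {"maxAge": max_age, "type": result_type, "search": search}
    return "https://search.ci.openshift.org/?" + urlencode(params)

# The DefragControllerDegraded signature from this bug:
url = ci_search_url(
    "DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available"
)
print(url)
```

Fetching that URL (in a browser or with curl) shows how many jobs matched the signature in the chosen window, which is how the "reduced over the next 1000 runs" check above was done.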