Bug 2040533

Summary: Install fails to bootstrap, complaining about DefragControllerDegraded and sad members
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Etcd
Assignee: melbeher
Status: CLOSED CURRENTRELEASE
QA Contact: ge liu <geliu>
Severity: medium
Priority: medium
Version: 4.9
CC: alray, aos-bugs, bleanhar, dcbw, deads, dgoodwin, ehashman, harpatil, kenzhang, melbeher, nagrawal, rphillips, tjungblu
Target Milestone: ---
Keywords: Upgrades
Target Release: ---
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2050250 (view as bug list)
Environment:
Last Closed: 2022-09-08 09:38:24 UTC
Type: Bug
Bug Depends On:
Bug Blocks: 2042501, 2050250

Description W. Trevor King 2022-01-14 00:49:32 UTC
Seen in a 4.10 cluster-bot job (launched with an unqualified 'launch') [1]:

level=info msg=Waiting up to 30m0s (until 12:32AM) for bootstrapping to complete...
level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
level=error msg=Cluster operator etcd Degraded is True with ClusterMemberController_Error::DefragController_Error::EtcdMembers_UnhealthyMembers::StaticPods_Error: ClusterMemberControllerDegraded: unhealthy members found during reconciling members
level=error msg=DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
level=error msg=EtcdMembersDegraded: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
level=error msg=StaticPodsDegraded: pod/etcd-ci-ln-79cqxdk-72292-sfjqr-master-2 container "etcd-health-monitor" is terminated: Error: e.go:84 +0x32
level=error msg=StaticPodsDegraded: internal/poll.(*pollDesc).waitRead(...)
level=error msg=StaticPodsDegraded: 	internal/poll/fd_poll_runtime.go:89
level=error msg=StaticPodsDegraded: internal/poll.(*FD).Accept(0xc0005af400)
level=error msg=StaticPodsDegraded: 	internal/poll/fd_unix.go:402 +0x22c
level=error msg=StaticPodsDegraded: net.(*netFD).accept(0xc0005af400)
level=info msg=Pulling debug logs from the bootstrap machine
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20220114003251.tar.gz"
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=fatal msg=Bootstrap failed to complete

So it's fairly clear that master-0 had some issue.  Or maybe it's master-2?  In any case, I'm not sure we need the whole stack trace included in the message.  The installer's stdout (which may also include its stderr) carries the same stack-trace noise [2].

By the time gather-extra collects the ClusterOperators, the message is a bit more compact:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888/artifacts/launch/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | select(.type == "Degraded").message'
ClusterMemberControllerDegraded: unhealthy members found during reconciling members
DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
EtcdMembersDegraded: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
StaticPodsDegraded: pod/etcd-ci-ln-79cqxdk-72292-sfjqr-master-2 container "etcd-health-monitor" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-ci-ln-79cqxdk-72292-sfjqr-master-2_openshift-etcd(c147ec293a1fd2923fcdbedc3337d507)
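The same jq filter works against any saved clusteroperators.json dump, not just the gather-extra artifact above.  As a minimal sketch (the sample file and its message below are made up for illustration, not taken from the failing cluster):

```shell
# Hypothetical minimal clusteroperators.json, just enough structure for the filter
cat > /tmp/clusteroperators-sample.json <<'EOF'
{
  "items": [
    {
      "metadata": {"name": "etcd"},
      "status": {
        "conditions": [
          {"type": "Available", "message": "members are available"},
          {"type": "Degraded", "message": "EtcdMembersDegraded: 3 of 4 members are available"}
        ]
      }
    }
  ]
}
EOF

# Select the etcd ClusterOperator, then print only its Degraded condition message
jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | select(.type == "Degraded").message' /tmp/clusteroperators-sample.json
```

This prints just the Degraded message, which is handy for grepping many jobs' artifacts in a loop.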

Only seen in 4.10/dev CI, and not crazy common, but also not crazy rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available' | grep 'failures match'
release-openshift-origin-installer-launch-gcp-modern (all) - 275 runs, 37% failed, 3% of failures match = 1% impact
pull-ci-kubevirt-hyperconverged-cluster-operator-main-hco-e2e-upgrade-prev-index-azure (all) - 56 runs, 75% failed, 2% of failures match = 2% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-azure-ovn (all) - 14 runs, 64% failed, 11% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
pull-ci-openshift-origin-master-e2e-agnostic-cmd (all) - 42 runs, 48% failed, 5% of failures match = 2% impact
pull-ci-openshift-origin-master-e2e-gcp-upgrade (all) - 48 runs, 54% failed, 4% of failures match = 2% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 65 runs, 71% failed, 4% of failures match = 3% impact
openshift-kubernetes-1116-nightly-4.10-e2e-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
openshift-kubernetes-1113-nightly-4.10-e2e-vsphere-upi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-serial (all) - 45 runs, 76% failed, 3% of failures match = 2% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn (all) - 16 runs, 56% failed, 11% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 90 runs, 99% failed, 1% of failures match = 1% impact
pull-ci-openshift-console-master-e2e-gcp-console (all) - 143 runs, 55% failed, 1% of failures match = 1% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888#1:build-log.txt%3A49
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888/artifacts/launch/ipi-install-install/build-log.txt

Comment 5 Elana Hashman 2022-01-25 17:15:42 UTC
Patch from https://github.com/openshift/kubernetes/pull/1140 is in 4.10.0-0.ci-2022-01-25-065550 or later. To confirm the fix, we need to see that this error is reduced over the next 1000 runs.

We also have opened a revert (https://github.com/openshift/kubernetes/pull/1142) of the PR that likely caused the bug in case the patch above does not work.

Comment 9 Ryan Phillips 2022-01-28 14:48:39 UTC
*** Bug 2047501 has been marked as a duplicate of this bug. ***

Comment 10 David Eads 2022-01-28 14:52:21 UTC
The nightly payload was updated about 30 hours ago.  Since then, the install success rate has climbed 4%, in line with our expectations.

This fix worked.

Comment 11 Elana Hashman 2022-01-28 19:38:29 UTC
Excellent news! Thanks for confirming.

Comment 12 Elana Hashman 2022-01-29 00:34:48 UTC
David found another failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26796/pull-ci-openshift-origin-master-e2e-aws-serial/1487126569716551680 which I confirmed as another instance of the same race.

It seems we may still be hitting the race condition, although much more rarely. 

I discussed reverting the upstream patch that adds full names of static pods on the upstream revert PR here: https://github.com/kubernetes/kubernetes/pull/107734#issuecomment-1024768384

Seeing as that PR fixed a similar/worse issue with static pod handling, we may want to try to determine a patch to address the remaining races rather than roll back.

Comment 13 Ryan Phillips 2022-02-01 18:44:18 UTC
There is upstream work being done in this PR: https://github.com/kubernetes/kubernetes/pull/107854 and possibly https://github.com/kubernetes/kubernetes/pull/107900

Comment 14 Ryan Phillips 2022-02-03 14:25:05 UTC
*** Bug 2048756 has been marked as a duplicate of this bug. ***

Comment 19 Harshal Patil 2022-02-11 16:17:31 UTC
*** Bug 2053255 has been marked as a duplicate of this bug. ***

Comment 23 W. Trevor King 2022-03-02 21:28:26 UTC
Harshal added UpgradeBlocker back on 2022-02-14, which is a trigger for an evaluation process [1].  But this bug is verified for 4.11 and bug 2050250 is verified for 4.10.0.  The 4.9.z bug 2050253 is still NEW, but 4.9.z has been out for a while, and I don't get the impression that this series is fixing a recent regression within the 4.9 z stream.  So I'm going to drop UpgradeBlocker here.  If folks feel like I'm missing something, feel free to restore the keyword, and fill out an impact statement [2] that sets the context, explains which A->B updates are vulnerable, and explains why blocking them is required to keep customers safe.

[1]: https://github.com/openshift/enhancements/tree/20183fad84f682c83e60386593c0eca717ee5bc9/enhancements/update/update-blocker-lifecycle#summary
[2]: https://github.com/openshift/enhancements/tree/20183fad84f682c83e60386593c0eca717ee5bc9/enhancements/update/update-blocker-lifecycle#impact-statement-request

Comment 28 W. Trevor King 2022-04-28 07:01:04 UTC
Is it worth moving this back to Node, updating the description and/or doc text to explain what is fixed now, and then creating a new bug for any follow-up etcd work?  I haven't been following all that closely, but things like kubelet pod-handling fixes seem like something that could be tracked and potentially backported independently of... whatever is left for etcd to do.

Comment 36 Red Hat Bugzilla 2023-11-18 04:25:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days