Bug 2040533

Summary: Install fails to bootstrap, complaining about DefragControllerDegraded and sad members
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Etcd
Assignee: melbeher
Status: CLOSED CURRENTRELEASE
QA Contact: ge liu <geliu>
Severity: medium
Priority: medium
Version: 4.9
CC: alray, aos-bugs, bleanhar, dcbw, deads, dgoodwin, ehashman, harpatil, kenzhang, melbeher, nagrawal, rphillips, tjungblu
Target Milestone: ---
Keywords: Upgrades
Target Release: ---
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2050250 (view as bug list)
Environment:
Last Closed: 2022-09-08 09:38:24 UTC
Type: Bug
Bug Depends On:
Bug Blocks: 2042501, 2050250

Description W. Trevor King 2022-01-14 00:49:32 UTC
Seen in a 4.10 cluster-bot job (launched with an unqualified 'launch') [1]:

level=info msg=Waiting up to 30m0s (until 12:32AM) for bootstrapping to complete...
level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
level=error msg=Cluster operator etcd Degraded is True with ClusterMemberController_Error::DefragController_Error::EtcdMembers_UnhealthyMembers::StaticPods_Error: ClusterMemberControllerDegraded: unhealthy members found during reconciling members
level=error msg=DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
level=error msg=EtcdMembersDegraded: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
level=error msg=StaticPodsDegraded: pod/etcd-ci-ln-79cqxdk-72292-sfjqr-master-2 container "etcd-health-monitor" is terminated: Error: e.go:84 +0x32
level=error msg=StaticPodsDegraded: internal/poll.(*pollDesc).waitRead(...)
level=error msg=StaticPodsDegraded: 	internal/poll/fd_poll_runtime.go:89
level=error msg=StaticPodsDegraded: internal/poll.(*FD).Accept(0xc0005af400)
level=error msg=StaticPodsDegraded: 	internal/poll/fd_unix.go:402 +0x22c
level=error msg=StaticPodsDegraded: net.(*netFD).accept(0xc0005af400)
level=info msg=Pulling debug logs from the bootstrap machine
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20220114003251.tar.gz"
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=fatal msg=Bootstrap failed to complete

So it's fairly clear that master-0 had some issue.  Or maybe it's master-2?  In any case, I'm not sure we need the whole stack trace included in the message.  The installer's stdout (which may also include its stderr) carries the same stack-trace noise [2].

By the time gather-extra collects the ClusterOperators, the message is a bit more compact:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888/artifacts/launch/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | select(.type == "Degraded").message'
ClusterMemberControllerDegraded: unhealthy members found during reconciling members
DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
EtcdMembersDegraded: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
StaticPodsDegraded: pod/etcd-ci-ln-79cqxdk-72292-sfjqr-master-2 container "etcd-health-monitor" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-ci-ln-79cqxdk-72292-sfjqr-master-2_openshift-etcd(c147ec293a1fd2923fcdbedc3337d507)
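The same jq filter works against any saved clusteroperators.json dump, not just the gather-extra artifact above.  As a minimal sketch (the sample file and its message below are made up for illustration, not taken from the failing cluster):

```shell
# Hypothetical minimal clusteroperators.json, just enough structure for the filter
cat > /tmp/clusteroperators-sample.json <<'EOF'
{
  "items": [
    {
      "metadata": {"name": "etcd"},
      "status": {
        "conditions": [
          {"type": "Available", "message": "members are available"},
          {"type": "Degraded", "message": "EtcdMembersDegraded: 3 of 4 members are available"}
        ]
      }
    }
  ]
}
EOF

# Select the etcd ClusterOperator, then print only its Degraded condition message
jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | select(.type == "Degraded").message' /tmp/clusteroperators-sample.json
```

This prints just the Degraded message, which is handy for grepping many jobs' artifacts in a loop.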

Only seen in 4.10/dev CI, and not crazy common, but also not crazy rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available' | grep 'failures match'
release-openshift-origin-installer-launch-gcp-modern (all) - 275 runs, 37% failed, 3% of failures match = 1% impact
pull-ci-kubevirt-hyperconverged-cluster-operator-main-hco-e2e-upgrade-prev-index-azure (all) - 56 runs, 75% failed, 2% of failures match = 2% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-azure-ovn (all) - 14 runs, 64% failed, 11% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
pull-ci-openshift-origin-master-e2e-agnostic-cmd (all) - 42 runs, 48% failed, 5% of failures match = 2% impact
pull-ci-openshift-origin-master-e2e-gcp-upgrade (all) - 48 runs, 54% failed, 4% of failures match = 2% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 65 runs, 71% failed, 4% of failures match = 3% impact
openshift-kubernetes-1116-nightly-4.10-e2e-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
openshift-kubernetes-1113-nightly-4.10-e2e-vsphere-upi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-serial (all) - 45 runs, 76% failed, 3% of failures match = 2% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn (all) - 16 runs, 56% failed, 11% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 90 runs, 99% failed, 1% of failures match = 1% impact
pull-ci-openshift-console-master-e2e-gcp-console (all) - 143 runs, 55% failed, 1% of failures match = 1% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888#1:build-log.txt%3A49
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888/artifacts/launch/ipi-install-install/build-log.txt

Comment 5 Elana Hashman 2022-01-25 17:15:42 UTC
Patch from https://github.com/openshift/kubernetes/pull/1140 is in 4.10.0-0.ci-2022-01-25-065550 or later. To confirm the fix, we need to see that this error is reduced over the next 1000 runs.

We also have opened a revert (https://github.com/openshift/kubernetes/pull/1142) of the PR that likely caused the bug in case the patch above does not work.

Comment 9 Ryan Phillips 2022-01-28 14:48:39 UTC
*** Bug 2047501 has been marked as a duplicate of this bug. ***

Comment 10 David Eads 2022-01-28 14:52:21 UTC
The nightly payload was updated about 30 hours ago.  Since then, the install success rate has climbed 4%, in line with our expectations.

This fix worked.

Comment 11 Elana Hashman 2022-01-28 19:38:29 UTC
Excellent news! Thanks for confirming.

Comment 12 Elana Hashman 2022-01-29 00:34:48 UTC
David found another failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26796/pull-ci-openshift-origin-master-e2e-aws-serial/1487126569716551680 which I confirmed as another instance of the same race.

It seems we may still be hitting the race condition, although much more rarely. 

I discussed reverting the upstream patch that adds full names of static pods on the upstream revert PR here: https://github.com/kubernetes/kubernetes/pull/107734#issuecomment-1024768384

Seeing as that PR fixed a similar/worse issue with static pod handling, we may want to try to determine a patch to address the remaining races rather than roll back.

Comment 13 Ryan Phillips 2022-02-01 18:44:18 UTC
There is upstream work being done in this PR: https://github.com/kubernetes/kubernetes/pull/107854 and possibly https://github.com/kubernetes/kubernetes/pull/107900

Comment 14 Ryan Phillips 2022-02-03 14:25:05 UTC
*** Bug 2048756 has been marked as a duplicate of this bug. ***

Comment 19 Harshal Patil 2022-02-11 16:17:31 UTC
*** Bug 2053255 has been marked as a duplicate of this bug. ***

Comment 23 W. Trevor King 2022-03-02 21:28:26 UTC
Harshal added UpgradeBlocker back on 2022-02-14, which is a trigger for an evaluation process [1].  But this bug is verified for 4.11 and bug 2050250 is verified for 4.10.0.  The 4.9.z bug 2050253 is still NEW, but 4.9.z has been out for a while, and I don't get the impression that this series is fixing a recent regression within the 4.9 z stream.  So I'm going to drop UpgradeBlocker here.  If folks feel like I'm missing something, feel free to restore the keyword, and fill out an impact statement [2] that sets the context, explains which A->B updates are vulnerable, and explains why blocking them is required to keep customers safe.

[1]: https://github.com/openshift/enhancements/tree/20183fad84f682c83e60386593c0eca717ee5bc9/enhancements/update/update-blocker-lifecycle#summary
[2]: https://github.com/openshift/enhancements/tree/20183fad84f682c83e60386593c0eca717ee5bc9/enhancements/update/update-blocker-lifecycle#impact-statement-request

Comment 28 W. Trevor King 2022-04-28 07:01:04 UTC
Is it worth moving this back to Node, updating the description and/or doc text to explain what is fixed now, and then creating a new bug for any follow-up etcd work?  I haven't been following all that closely, but things like kubelet pod-handling fixes seem like something that could be tracked and potentially backported independently of... whatever is left for etcd to do.

Comment 36 Red Hat Bugzilla 2023-11-18 04:25:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days