Bug 2040533 - Install fails to bootstrap, complaining about DefragControllerDegraded and sad members
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: melbeher
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 2047501 2048756 2053255
Depends On:
Blocks: 2042501 2050250
 
Reported: 2022-01-14 00:49 UTC by W. Trevor King
Modified: 2023-11-18 04:25 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2050250
Environment:
Last Closed: 2022-09-08 09:38:24 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 1140 0 None Merged Bug 2040533: UPSTREAM: 107695: kubelet: fix podstatus not containing pod full name 2022-02-11 17:52:57 UTC
Github openshift kubernetes pull 1157 0 None Merged Bug 2040533: UPSTREAM: 107900: Pods that have terminated before starting should not block startup 2022-06-17 22:07:34 UTC
Github openshift kubernetes pull 1176 0 None Merged Bug 2040533: UPSTREAM: <drop>: Ignore container notfound error while getPodstatuses 2022-06-17 22:07:31 UTC

Internal Links: 2042501

Description W. Trevor King 2022-01-14 00:49:32 UTC
Seen in a 4.10 cluster-bot job (launched with an unqualified 'launch') [1]:

level=info msg=Waiting up to 30m0s (until 12:32AM) for bootstrapping to complete...
level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
...
level=error msg=Cluster operator etcd Degraded is True with ClusterMemberController_Error::DefragController_Error::EtcdMembers_UnhealthyMembers::StaticPods_Error: ClusterMemberControllerDegraded: unhealthy members found during reconciling members
level=error msg=DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
level=error msg=EtcdMembersDegraded: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
level=error msg=StaticPodsDegraded: pod/etcd-ci-ln-79cqxdk-72292-sfjqr-master-2 container "etcd-health-monitor" is terminated: Error: e.go:84 +0x32
level=error msg=StaticPodsDegraded: internal/poll.(*pollDesc).waitRead(...)
level=error msg=StaticPodsDegraded: 	internal/poll/fd_poll_runtime.go:89
level=error msg=StaticPodsDegraded: internal/poll.(*FD).Accept(0xc0005af400)
level=error msg=StaticPodsDegraded: 	internal/poll/fd_unix.go:402 +0x22c
level=error msg=StaticPodsDegraded: net.(*netFD).accept(0xc0005af400)
...
level=info msg=Pulling debug logs from the bootstrap machine
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20220114003251.tar.gz"
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=fatal msg=Bootstrap failed to complete

So it's fairly clear that master-0 had some issue.  Or maybe it's master-2?  Either way, I'm not sure we need the whole stack trace included in the message.  The installer's stdout (which may also include its stderr) has the same stack-trace noise [2].

By the time gather-extra collects the ClusterOperators, the message is a bit more compact:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888/artifacts/launch/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | select(.type == "Degraded").message'
ClusterMemberControllerDegraded: unhealthy members found during reconciling members
DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
EtcdMembersDegraded: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
StaticPodsDegraded: pod/etcd-ci-ln-79cqxdk-72292-sfjqr-master-2 container "etcd-health-monitor" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-ci-ln-79cqxdk-72292-sfjqr-master-2_openshift-etcd(c147ec293a1fd2923fcdbedc3337d507)
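
As an aside, the same Degraded message can be pulled from a live cluster instead of the CI artifacts; a minimal sketch, assuming you have a kubeconfig for the affected cluster and jq available locally:

$ oc get clusteroperator etcd -o json \
    | jq -r '.status.conditions[] | select(.type == "Degraded").message'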

Only seen in 4.10/dev CI, and not especially common, but also not especially rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available' | grep 'failures match'
release-openshift-origin-installer-launch-gcp-modern (all) - 275 runs, 37% failed, 3% of failures match = 1% impact
pull-ci-kubevirt-hyperconverged-cluster-operator-main-hco-e2e-upgrade-prev-index-azure (all) - 56 runs, 75% failed, 2% of failures match = 2% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-azure-ovn (all) - 14 runs, 64% failed, 11% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
pull-ci-openshift-origin-master-e2e-agnostic-cmd (all) - 42 runs, 48% failed, 5% of failures match = 2% impact
pull-ci-openshift-origin-master-e2e-gcp-upgrade (all) - 48 runs, 54% failed, 4% of failures match = 2% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 65 runs, 71% failed, 4% of failures match = 3% impact
openshift-kubernetes-1116-nightly-4.10-e2e-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
openshift-kubernetes-1113-nightly-4.10-e2e-vsphere-upi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-serial (all) - 45 runs, 76% failed, 3% of failures match = 2% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn (all) - 16 runs, 56% failed, 11% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 90 runs, 99% failed, 1% of failures match = 1% impact
pull-ci-openshift-console-master-e2e-gcp-console (all) - 143 runs, 55% failed, 1% of failures match = 1% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888#1:build-log.txt%3A49
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888/artifacts/launch/ipi-install-install/build-log.txt

Comment 5 Elana Hashman 2022-01-25 17:15:42 UTC
The patch from https://github.com/openshift/kubernetes/pull/1140 is in 4.10.0-0.ci-2022-01-25-065550 or later. To confirm the fix, we need to see this error reduced over the next 1000 runs.
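
To check whether a particular CI payload actually carries that openshift/kubernetes change, one option is `oc adm release info --commits`; this is a sketch that assumes the payload is pullable from the CI release registry and that the relevant component shows up under the "hyperkube" image name:

$ oc adm release info registry.ci.openshift.org/ocp/release:4.10.0-0.ci-2022-01-25-065550 --commits \
    | grep hyperkube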

We have also opened a revert (https://github.com/openshift/kubernetes/pull/1142) of the PR that likely introduced the bug, in case the patch above does not work.

Comment 9 Ryan Phillips 2022-01-28 14:48:39 UTC
*** Bug 2047501 has been marked as a duplicate of this bug. ***

Comment 10 David Eads 2022-01-28 14:52:21 UTC
The nightly payload was updated about 30 hours ago.  Since then, the install success rate has climbed 4%, in line with our expectations.

This fix worked.

Comment 11 Elana Hashman 2022-01-28 19:38:29 UTC
Excellent news! Thanks for confirming.

Comment 12 Elana Hashman 2022-01-29 00:34:48 UTC
David found another failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26796/pull-ci-openshift-origin-master-e2e-aws-serial/1487126569716551680 which I confirmed as another instance of the same race.

It seems we may still be hitting the race condition, although much more rarely. 

I discussed reverting the upstream patch that adds full names for static pods in the upstream revert PR here: https://github.com/kubernetes/kubernetes/pull/107734#issuecomment-1024768384

Since that PR fixed a similar or worse issue with static pod handling, we may want to find a patch that addresses the remaining races rather than roll back.

Comment 13 Ryan Phillips 2022-02-01 18:44:18 UTC
There is upstream work being done in this PR: https://github.com/kubernetes/kubernetes/pull/107854, and possibly https://github.com/kubernetes/kubernetes/pull/107900

Comment 14 Ryan Phillips 2022-02-03 14:25:05 UTC
*** Bug 2048756 has been marked as a duplicate of this bug. ***

Comment 19 Harshal Patil 2022-02-11 16:17:31 UTC
*** Bug 2053255 has been marked as a duplicate of this bug. ***

Comment 23 W. Trevor King 2022-03-02 21:28:26 UTC
Harshal added UpgradeBlocker back on 2022-02-14, which is a trigger for an evaluation process [1].  But this bug is verified for 4.11 and bug 2050250 is verified for 4.10.0.  The 4.9.z bug 2050253 is still NEW, but 4.9.z has been out for a while, and I don't get the impression that this series is fixing a recent regression within the 4.9 z stream.  So I'm going to drop UpgradeBlocker here.  If folks feel like I'm missing something, feel free to restore the keyword, and fill out an impact statement [2] that sets the context, explains which A->B updates are vulnerable, and explains why blocking them is required to keep customers safe.

[1]: https://github.com/openshift/enhancements/tree/20183fad84f682c83e60386593c0eca717ee5bc9/enhancements/update/update-blocker-lifecycle#summary
[2]: https://github.com/openshift/enhancements/tree/20183fad84f682c83e60386593c0eca717ee5bc9/enhancements/update/update-blocker-lifecycle#impact-statement-request

Comment 28 W. Trevor King 2022-04-28 07:01:04 UTC
Is it worth moving this back to Node, updating the description and/or doc text to explain what is fixed now, and then creating a new bug for any follow-up etcd work?  I haven't been following all that closely, but things like kubelet pod-handling fixes seem like something that could be tracked and potentially backported independently of... whatever is left for etcd to do.

Comment 36 Red Hat Bugzilla 2023-11-18 04:25:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

