Seen in a 4.10 cluster-bot job (launched with an unqualified 'launch'):
level=info msg=Waiting up to 30m0s (until 12:32AM) for bootstrapping to complete...
level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
level=error msg=Cluster operator etcd Degraded is True with ClusterMemberController_Error::DefragController_Error::EtcdMembers_UnhealthyMembers::StaticPods_Error: ClusterMemberControllerDegraded: unhealthy members found during reconciling members
level=error msg=DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
level=error msg=EtcdMembersDegraded: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
level=error msg=StaticPodsDegraded: pod/etcd-ci-ln-79cqxdk-72292-sfjqr-master-2 container "etcd-health-monitor" is terminated: Error: e.go:84 +0x32
level=error msg=StaticPodsDegraded: internal/poll.(*pollDesc).waitRead(...)
level=error msg=StaticPodsDegraded: internal/poll/fd_poll_runtime.go:89
level=error msg=StaticPodsDegraded: internal/poll.(*FD).Accept(0xc0005af400)
level=error msg=StaticPodsDegraded: internal/poll/fd_unix.go:402 +0x22c
level=error msg=StaticPodsDegraded: net.(*netFD).accept(0xc0005af400)
level=info msg=Pulling debug logs from the bootstrap machine
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20220114003251.tar.gz"
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=fatal msg=Bootstrap failed to complete
So it's fairly clear that master-0 had some issue. Or maybe it's master-2? Anyhow, I'm not sure we need the whole stack trace included in the message. The installer's stdout (possibly also its stderr) has the same stack-trace noise.
By the time gather-extra collects the ClusterOperators, the message is a bit more compact:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1481776320365989888/artifacts/launch/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | select(.type == "Degraded").message'
ClusterMemberControllerDegraded: unhealthy members found during reconciling members
DefragControllerDegraded: cluster is unhealthy: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
EtcdMembersDegraded: 3 of 4 members are available, ci-ln-79cqxdk-72292-sfjqr-master-0 is unhealthy
StaticPodsDegraded: pod/etcd-ci-ln-79cqxdk-72292-sfjqr-master-2 container "etcd-health-monitor" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-health-monitor pod=etcd-ci-ln-79cqxdk-72292-sfjqr-master-2_openshift-etcd(c147ec293a1fd2923fcdbedc3337d507)
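For anyone reworking that jq pipeline, the same condition extraction can be exercised offline against a minimal stand-in document. This is just a sketch: the JSON below is a trimmed, hand-written approximation of the clusteroperators.json shape, not the actual gather-extra artifact.

```python
import json

# Trimmed stand-in for gather-extra's clusteroperators.json; the real file
# carries many more operators and conditions.
doc = json.loads("""
{"items": [{"metadata": {"name": "etcd"},
            "status": {"conditions": [
              {"type": "Available", "status": "True", "message": "ok"},
              {"type": "Degraded", "status": "True",
               "message": "EtcdMembersDegraded: 3 of 4 members are available"}]}}]}
""")

# Same logic as the jq pipeline: pick the etcd operator, then the message
# of its Degraded condition.
messages = [cond["message"]
            for item in doc["items"]
            if item["metadata"]["name"] == "etcd"
            for cond in item["status"]["conditions"]
            if cond["type"] == "Degraded"]
print(messages[0])  # → EtcdMembersDegraded: 3 of 4 members are available
```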
Only seen in 4.10/dev CI, and not crazy common, but also not crazy rare:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available' | grep 'failures match'
release-openshift-origin-installer-launch-gcp-modern (all) - 275 runs, 37% failed, 3% of failures match = 1% impact
pull-ci-kubevirt-hyperconverged-cluster-operator-main-hco-e2e-upgrade-prev-index-azure (all) - 56 runs, 75% failed, 2% of failures match = 2% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-azure-ovn (all) - 14 runs, 64% failed, 11% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
pull-ci-openshift-origin-master-e2e-agnostic-cmd (all) - 42 runs, 48% failed, 5% of failures match = 2% impact
pull-ci-openshift-origin-master-e2e-gcp-upgrade (all) - 48 runs, 54% failed, 4% of failures match = 2% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 65 runs, 71% failed, 4% of failures match = 3% impact
openshift-kubernetes-1116-nightly-4.10-e2e-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
openshift-kubernetes-1113-nightly-4.10-e2e-vsphere-upi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-serial (all) - 45 runs, 76% failed, 3% of failures match = 2% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn (all) - 16 runs, 56% failed, 11% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 90 runs, 99% failed, 1% of failures match = 1% impact
pull-ci-openshift-console-master-e2e-gcp-console (all) - 143 runs, 55% failed, 1% of failures match = 1% impact
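For reference, the "impact" column in those search.ci rows is just the failure rate multiplied by the share of failures that match the query. A quick sanity check of the arithmetic (my own spot-check, not a search.ci tool):

```python
def impact_pct(failed_pct: float, match_pct: float) -> float:
    """Percent of all runs affected: failure rate x share of failures matching."""
    return failed_pct / 100 * match_pct / 100 * 100

# launch-gcp-modern row: 275 runs, 37% failed, 3% of failures match
print(round(impact_pct(37, 3)))   # ~1% impact, matching the row above
# e2e-azure-ovn row: 64% failed, 11% of failures match
print(round(impact_pct(64, 11)))  # ~7% impact, matching the row above
```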
Patch from https://github.com/openshift/kubernetes/pull/1140 is in 4.10.0-0.ci-2022-01-25-065550 or later. To confirm the fix, we need to see the frequency of this error drop over the next 1000 runs.
We have also opened a revert (https://github.com/openshift/kubernetes/pull/1142) of the PR that likely introduced the bug, in case the patch above does not work.
*** Bug 2047501 has been marked as a duplicate of this bug. ***
The nightly payload was updated about 30 hours ago. Since then, the install success rate has climbed 4%, in line with our expectations.
This fix worked.
Excellent news! Thanks for confirming.
David found another failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26796/pull-ci-openshift-origin-master-e2e-aws-serial/1487126569716551680 which I confirmed as another instance of the same race.
It seems we may still be hitting the race condition, although much more rarely.
I discussed reverting the upstream patch that adds full names for static pods on the upstream revert PR here: https://github.com/kubernetes/kubernetes/pull/107734#issuecomment-1024768384
Seeing as that PR fixed a similar/worse issue with static pod handling, we may want to try to determine a patch to address the remaining races rather than roll back.
There is work being done upstream in this PR: https://github.com/kubernetes/kubernetes/pull/107854 and possibly https://github.com/kubernetes/kubernetes/pull/107900
*** Bug 2048756 has been marked as a duplicate of this bug. ***
*** Bug 2053255 has been marked as a duplicate of this bug. ***
Harshal added UpgradeBlocker back on 2022-02-14, which triggers an evaluation process. But this bug is verified for 4.11 and bug 2050250 is verified for 4.10.0. The 4.9.z bug 2050253 is still NEW, but 4.9.z has been out for a while, and I don't get the impression that this series is fixing a recent regression within the 4.9 z-stream. So I'm going to drop UpgradeBlocker here. If folks feel I'm missing something, feel free to restore the keyword and fill out an impact statement that sets the context, explains which A->B updates are vulnerable, and explains why blocking them is required to keep customers safe.
Checking CI logs, I still see these failures.
Is it worth moving this back to Node, updating the description and/or doc text to explain what is fixed now, and then creating a new bug for any follow-up etcd work? I haven't been following all that closely, but things like kubelet pod-handling fixes seem like something that could be tracked and potentially backported independently of... whatever is left for etcd to do.