Checking CI logs, I still see these failures.

https://search.ci.openshift.org/?maxAge=48h&type=junit&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available
https://search.ci.openshift.org/?search=Failed+to+wait+for+bootstrapping+to+complete&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job&search=EtcdMembersDegraded
Looking into those results a bit more closely, there are some false positives. For example, [1] failed 16h ago, but it's:

  INFO[2022-02-10T13:40:15Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.10.0-rc.1)
  INFO[2022-02-10T13:40:15Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-rc.1
  INFO[2022-02-10T13:40:15Z] Using explicitly provided pull-spec for release latest (registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-10-132907)
  INFO[2022-02-10T13:40:15Z] Resolved release latest to registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-10-132907

And #1169 landed after rc.1 [2]. Pulling out matching install versions:

  $ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available&search=Resolved+release+initial+to' | jq -r 'to_entries[].value | select(length > 1)["Resolved release initial to"][].context[]' | sed 's/.*initial to //' | sort | uniq -c
        1 quay.io/openshift-release-dev/ocp-release:4.10.0-rc.1-x86_64
        2 registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-09-054720
        1 registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-10-062935
        4 registry.ci.openshift.org/ocp/release:4.10.0-rc.1
        1 registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-02-09-142639

With comment 1 putting the first 4.10 nightly build that contains our fix (#1169) around 2022-02-07 20:35Z, that does leave a few concerning entries. Let's find that 4.10.0-0.nightly-2022-02-10-062935 run:

  $ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available&search=Resolved+release+initial+to.*4.10.0-0.nightly-2022-02-10-062935' | jq -r 'to_entries[] | select(.value | length > 1).key'
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.10/1491663600396275712

Finding the kubelet version:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.10/1491663600396275712/artifacts/e2e-gcp-serial/nodes/ci-op-tk0156gz-f5b0e-cg7fn-master-0/journal | zgrep hyperkube | head -n1
  Feb 10 06:51:36.410913 ci-op-tk0156gz-f5b0e-cg7fn-master-0 machine-config-daemon[2147]: openshift-hyperkube 4.10.0-202201230027.p0.g06791f6.assembly.stream.el8 -> 4.10.0-202202090636.p0.g759c22b.assembly.stream.el8

I guess that's the MCD pivoting from whatever kubelet was in the bootimage to our nightly's kubelet at commit 759c22b.

  $ git --no-pager log --first-parent --date=short --format='%an %h %s' -2 origin/release-4.10
  OpenShift Merge Robot 2e8bad73c83 Merge pull request #1164 from openshift-cherrypick-robot/cherry-pick-977-to-release-4.10
  OpenShift Merge Robot 759c22b674b Merge pull request #1169 from rphillips/backports/107900_4.10

Ok, so that does seem like it's seeing our symptoms with a kubelet that contains #1169.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1491768922326700032
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2022-02-11-021652?from=4.10.0-rc.1
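
For anyone who wants to repeat the version triage above without the jq/sed pipeline, here is a rough Python sketch of the same two steps: counting the initial release pull-specs, then picking out the 4.10 nightlies named at or after the 2022-02-07 20:35Z cutoff from comment 1. The function names are made up for illustration, and the JSON shape (job URL -> search string -> list of matches, each with a "context" list of log lines) is only inferred from the jq filter used above, not from any search.ci documentation. A nightly name at or after the cutoff suggests, but does not prove, that the payload contains #1169; the kubelet check above is still the real confirmation.

```python
import re
from collections import Counter
from datetime import datetime

INITIAL = "Resolved release initial to"


def initial_release_counts(results):
    """Count initial-release pull-specs for jobs where more than one search
    term matched, mirroring the jq | sed | sort | uniq -c pipeline.

    `results` is the parsed JSON from search.ci; its shape is an assumption
    inferred from the jq filter: {job_url: {search: [{"context": [line]}]}}.
    """
    counts = Counter()
    for matches_by_search in results.values():
        if len(matches_by_search) <= 1:  # jq: select(length > 1)
            continue
        for match in matches_by_search.get(INITIAL, []):
            for line in match.get("context", []):
                # sed 's/.*initial to //'
                counts[re.sub(r".*initial to ", "", line)] += 1
    return counts


def nightlies_at_or_after(pullspecs, cutoff, stream="4.10.0-0.nightly"):
    """Return pull-specs from `stream` whose name timestamp is >= cutoff.

    The timestamp is read from the nightly name (YYYY-MM-DD-HHMMSS) and
    compared as naive UTC. Other streams and rc builds are skipped.
    """
    pattern = re.compile(re.escape(stream) + r"-(\d{4}-\d{2}-\d{2}-\d{6})")
    selected = []
    for spec in pullspecs:
        found = pattern.search(spec)
        if not found:
            continue
        built = datetime.strptime(found.group(1), "%Y-%m-%d-%H%M%S")
        if built >= cutoff:
            selected.append(spec)
    return sorted(selected)


# Cutoff from comment 1: first 4.10 nightly with the fix, around 20:35Z.
FIX_CUTOFF = datetime(2022, 2, 7, 20, 35)
```

Feeding the keys of initial_release_counts(...) into nightlies_at_or_after(..., FIX_CUTOFF) gives the list of runs worth opening by hand, which is what the follow-up curl for 4.10.0-0.nightly-2022-02-10-062935 does for one of them.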
Also, bug 2053255 has been opened to track some remaining races. Maybe there's some way to confirm that this bug fixes whatever it fixed, without confusing it with whatever is going on in bug 2053255. Or maybe they're the same thing, and it's better to close 2053255 as a dup and keep this series open while we bottom out the remaining races.
Checking CI logs, I don't see any recent failures.

https://search.ci.openshift.org/?maxAge=48h&type=junit&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056