Bug 2050250 - Install fails to bootstrap, complaining about DefragControllerDegraded and sad members
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.10.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
Depends On: 2040533
Blocks: 2042501 2050253
Reported: 2022-02-03 14:36 UTC by Ryan Phillips
Modified: 2022-03-10 16:43 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2040533
Clones: 2050253
Last Closed: 2022-03-10 16:43:22 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 1169 0 None Merged [release-4.10] Bug 2050250: Upstream 107900 static pod fix 2022-02-16 13:57:32 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:43:45 UTC

Comment 5 W. Trevor King 2022-02-11 07:05:26 UTC
Looking into those results a bit more closely, I see some false positives.  For example, [1] failed 16h ago, but it's:

INFO[2022-02-10T13:40:15Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.10.0-rc.1) 
INFO[2022-02-10T13:40:15Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-rc.1 
INFO[2022-02-10T13:40:15Z] Using explicitly provided pull-spec for release latest (registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-10-132907) 
INFO[2022-02-10T13:40:15Z] Resolved release latest to registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-10-132907 

And #1169 landed after rc.1 [2].  Pulling out matching install versions:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available&search=Resolved+release+initial+to' | jq -r 'to_entries[].value | select(length > 1)["Resolved release initial to"][].context[]' | sed 's/.*initial to //' | sort | uniq -c
      1 quay.io/openshift-release-dev/ocp-release:4.10.0-rc.1-x86_64
      2 registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-09-054720
      1 registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-10-062935
      4 registry.ci.openshift.org/ocp/release:4.10.0-rc.1
      1 registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-02-09-142639
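
As a side note, the tail of that pipeline is just sed stripping the log prefix so sort | uniq -c can tally identical pull specs.  A minimal sketch on hand-written sample lines (the three lines below are stand-ins, not additional search.ci hits):

```shell
# Tally pull specs the way the pipeline above does, using hypothetical
# sample lines; sed keeps only the text after "initial to " (the pull
# spec), and uniq -c counts how many runs resolved to each one.
printf '%s\n' \
  'INFO[...] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-rc.1' \
  'INFO[...] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-rc.1' \
  'INFO[...] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-09-054720' |
  sed 's/.*initial to //' |
  sort | uniq -c
```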

With comment 1 placing the first 4.10 nightly build containing #1169, our fix, around 2022-02-07 20:35Z, that still leaves a few concerning entries.  Let's find that 4.10.0-0.nightly-2022-02-10-062935 run:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available&search=Resolved+release+initial+to.*4.10.0-0.nightly-2022-02-10-062935' | jq -r 'to_entries[] | select(.value | length > 1).key'
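
Because search.ci returns one entry per search phrase that matched, select(.value | length > 1) keeps only jobs that hit both phrases, and .key is the job URL.  A toy sketch of that selection, on a hand-written JSON shape resembling search.ci's response (job names are made up):

```shell
# Toy model of the search.ci response: only job-b matched both search
# phrases, so only its key (the job URL) survives the jq filter.
cat <<'EOF' | jq -r 'to_entries[] | select(.value | length > 1).key'
{
  "gs/origin-ci-test/logs/job-a/1": {
    "Resolved release initial to": [{"context": ["..."]}]
  },
  "gs/origin-ci-test/logs/job-b/2": {
    "DefragControllerDegraded": [{"context": ["..."]}],
    "Resolved release initial to": [{"context": ["..."]}]
  }
}
EOF
```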

Finding the kubelet version:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.10/1491663600396275712/artifacts/e2e-gcp-serial/nodes/ci-op-tk0156gz-f5b0e-cg7fn-master-0/journal | zgrep hyperkube | head -n1
Feb 10 06:51:36.410913 ci-op-tk0156gz-f5b0e-cg7fn-master-0 machine-config-daemon[2147]:   openshift-hyperkube 4.10.0-202201230027.p0.g06791f6.assembly.stream.el8 -> 4.10.0-202202090636.p0.g759c22b.assembly.stream.el8

I guess that's the MCD pivoting from whatever kubelet was in the bootimage to our nightly's kubelet at commit 759c22b.
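
For reference, the .g<sha>. segment of that RPM version string is what encodes the source commit; a quick sketch of pulling it out (the sed pattern here is my own, not something the MCD provides):

```shell
# Extract the short commit hash from an openshift-hyperkube RPM version
# string; the ".g<sha>." segment names the source commit it was built from.
nvr='4.10.0-202202090636.p0.g759c22b.assembly.stream.el8'
echo "$nvr" | sed -n 's/.*\.g\([0-9a-f]\{7,\}\)\..*/\1/p'
```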

$ git --no-pager log --first-parent --date=short --format='%an %h %s' -2 origin/release-4.10
OpenShift Merge Robot 2e8bad73c83 Merge pull request #1164 from openshift-cherrypick-robot/cherry-pick-977-to-release-4.10
OpenShift Merge Robot 759c22b674b Merge pull request #1169 from rphillips/backports/107900_4.10

Ok, so that does seem like it's seeing our symptoms with a kubelet that contains #1169.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1491768922326700032
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2022-02-11-021652?from=4.10.0-rc.1

Comment 6 W. Trevor King 2022-02-11 07:12:25 UTC
Also, bug 2053255 has been opened to track some remaining races.  Maybe there's some way to confirm that the fix here resolved what it was meant to resolve, without conflating it with whatever is going on in bug 2053255.  Or maybe they're the same thing, and it's better to close 2053255 as a duplicate and keep this series open while we bottom out the remaining races.

Comment 12 errata-xmlrpc 2022-03-10 16:43:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

