Bug 2050250

Summary: Install fails to bootstrap, complaining about DefragControllerDegraded and sad members
Product: OpenShift Container Platform
Component: Node
Sub Component: Kubelet
Reporter: Ryan Phillips <rphillips>
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: alray, aos-bugs, bleanhar, dcbw, deads, dgoodwin, ehashman, nagrawal, rphillips, schoudha, wking
Version: 4.9
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Clone Of: 2040533
Clones: 2050253 (view as bug list)
Last Closed: 2022-03-10 16:43:22 UTC
Bug Depends On: 2040533
Bug Blocks: 2042501, 2050253

Comment 5 W. Trevor King 2022-02-11 07:05:26 UTC
Looking into those results a bit more closely, there are some false positives.  For example, [1] failed 16h ago, but its build log shows:

INFO[2022-02-10T13:40:15Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.10.0-rc.1) 
INFO[2022-02-10T13:40:15Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-rc.1 
INFO[2022-02-10T13:40:15Z] Using explicitly provided pull-spec for release latest (registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-10-132907) 
INFO[2022-02-10T13:40:15Z] Resolved release latest to registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-10-132907 

And #1169 landed after rc.1 [2].  Pulling out matching install versions:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available&search=Resolved+release+initial+to' | jq -r 'to_entries[].value | select(length > 1)["Resolved release initial to"][].context[]' | sed 's/.*initial to //' | sort | uniq -c
      1 quay.io/openshift-release-dev/ocp-release:4.10.0-rc.1-x86_64
      2 registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-09-054720
      1 registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-10-062935
      4 registry.ci.openshift.org/ocp/release:4.10.0-rc.1
      1 registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-02-09-142639

Comment 1 put the first 4.10 nightly build containing #1169 (our fix) around 2022-02-07 20:35Z, so that does leave a few concerning entries.  Let's find that 4.10.0-0.nightly-2022-02-10-062935 run:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available&search=Resolved+release+initial+to.*4.10.0-0.nightly-2022-02-10-062935' | jq -r 'to_entries[] | select(.value | length > 1).key'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.10/1491663600396275712

Finding the kubelet version:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.10/1491663600396275712/artifacts/e2e-gcp-serial/nodes/ci-op-tk0156gz-f5b0e-cg7fn-master-0/journal | zgrep hyperkube | head -n1
Feb 10 06:51:36.410913 ci-op-tk0156gz-f5b0e-cg7fn-master-0 machine-config-daemon[2147]:   openshift-hyperkube 4.10.0-202201230027.p0.g06791f6.assembly.stream.el8 -> 4.10.0-202202090636.p0.g759c22b.assembly.stream.el8

I guess that's the MCD pivoting from whatever kubelet was in the bootimage to our nightly's kubelet at commit 759c22b.

$ git --no-pager log --first-parent --date=short --format='%an %h %s' -2 origin/release-4.10
OpenShift Merge Robot 2e8bad73c83 Merge pull request #1164 from openshift-cherrypick-robot/cherry-pick-977-to-release-4.10
OpenShift Merge Robot 759c22b674b Merge pull request #1169 from rphillips/backports/107900_4.10

Ok, so that does seem like it's seeing our symptoms with a kubelet that contains #1169.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1491768922326700032
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2022-02-11-021652?from=4.10.0-rc.1

Comment 6 W. Trevor King 2022-02-11 07:12:25 UTC
Also, bug 2053255 has been opened to track some remaining races.  Maybe there's some way to confirm that this bug's fix resolved what it was meant to resolve, without conflating it with whatever is going on in bug 2053255.  Or maybe they're the same thing, and it's better to close 2053255 as a dup and keep this series open while we bottom out the remaining races.

Comment 12 errata-xmlrpc 2022-03-10 16:43:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056