Bug 2050250 - Install fails to bootstrap, complaining about DefragControllerDegraded and sad members
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.10.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
Depends On: 2040533
Blocks: 2042501 2050253
Reported: 2022-02-03 14:36 UTC by Ryan Phillips
Modified: 2022-03-10 16:43 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2040533
Clones: 2050253
Last Closed: 2022-03-10 16:43:22 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 1169 0 None Merged [release-4.10] Bug 2050250: Upstream 107900 static pod fix 2022-02-16 13:57:32 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:43:45 UTC

Comment 5 W. Trevor King 2022-02-11 07:05:26 UTC
Looking into those results a bit more closely, I see some false positives.  For example, [1] failed 16h ago, but it's:

INFO[2022-02-10T13:40:15Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.10.0-rc.1) 
INFO[2022-02-10T13:40:15Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-rc.1 
INFO[2022-02-10T13:40:15Z] Using explicitly provided pull-spec for release latest (registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-10-132907) 
INFO[2022-02-10T13:40:15Z] Resolved release latest to registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-10-132907 

And #1169 landed after rc.1 [2].  Pulling out matching install versions:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available&search=Resolved+release+initial+to' | jq -r 'to_entries[].value | select(length > 1)["Resolved release initial to"][].context[]' | sed 's/.*initial to //' | sort | uniq -c
      1 quay.io/openshift-release-dev/ocp-release:4.10.0-rc.1-x86_64
      2 registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-09-054720
      1 registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-10-062935
      4 registry.ci.openshift.org/ocp/release:4.10.0-rc.1
      1 registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-02-09-142639
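
As a side note, the tail of that pipeline is just sed stripping the log prefix so sort | uniq -c can tally identical pull specs.  A minimal sketch on hand-written sample lines (the three lines below are stand-ins, not additional search.ci hits):

```shell
# Tally pull specs the way the pipeline above does, using hypothetical
# sample lines; sed keeps only the text after "initial to " (the pull
# spec), and uniq -c counts how many runs resolved to each one.
printf '%s\n' \
  'INFO[...] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-rc.1' \
  'INFO[...] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-rc.1' \
  'INFO[...] Resolved release initial to registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-09-054720' |
  sed 's/.*initial to //' |
  sort | uniq -c
```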

With comment 1 placing the first 4.10 nightly build containing #1169, our fix, around 2022-02-07 20:35Z, that still leaves a few concerning entries.  Let's find that 4.10.0-0.nightly-2022-02-10-062935 run:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=DefragControllerDegraded:+cluster+is+unhealthy:+3+of+4+members+are+available&search=Resolved+release+initial+to.*4.10.0-0.nightly-2022-02-10-062935' | jq -r 'to_entries[] | select(.value | length > 1).key'
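
Because search.ci returns one entry per search phrase that matched, select(.value | length > 1) keeps only jobs that hit both phrases, and .key is the job URL.  A toy sketch of that selection, on a hand-written JSON shape resembling search.ci's response (job names are made up):

```shell
# Toy model of the search.ci response: only job-b matched both search
# phrases, so only its key (the job URL) survives the jq filter.
cat <<'EOF' | jq -r 'to_entries[] | select(.value | length > 1).key'
{
  "gs/origin-ci-test/logs/job-a/1": {
    "Resolved release initial to": [{"context": ["..."]}]
  },
  "gs/origin-ci-test/logs/job-b/2": {
    "DefragControllerDegraded": [{"context": ["..."]}],
    "Resolved release initial to": [{"context": ["..."]}]
  }
}
EOF
```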

Finding the kubelet version:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.10/1491663600396275712/artifacts/e2e-gcp-serial/nodes/ci-op-tk0156gz-f5b0e-cg7fn-master-0/journal | zgrep hyperkube | head -n1
Feb 10 06:51:36.410913 ci-op-tk0156gz-f5b0e-cg7fn-master-0 machine-config-daemon[2147]:   openshift-hyperkube 4.10.0-202201230027.p0.g06791f6.assembly.stream.el8 -> 4.10.0-202202090636.p0.g759c22b.assembly.stream.el8

I guess that's the MCD pivoting from whatever kubelet was in the bootimage to our nightly's kubelet at commit 759c22b.
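
For reference, the .g<sha>. segment of that RPM version string is what encodes the source commit; a quick sketch of pulling it out (the sed pattern here is my own, not something the MCD provides):

```shell
# Extract the short commit hash from an openshift-hyperkube RPM version
# string; the ".g<sha>." segment names the source commit it was built from.
nvr='4.10.0-202202090636.p0.g759c22b.assembly.stream.el8'
echo "$nvr" | sed -n 's/.*\.g\([0-9a-f]\{7,\}\)\..*/\1/p'
```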

$ git --no-pager log --first-parent --date=short --format='%an %h %s' -2 origin/release-4.10
OpenShift Merge Robot 2e8bad73c83 Merge pull request #1164 from openshift-cherrypick-robot/cherry-pick-977-to-release-4.10
OpenShift Merge Robot 759c22b674b Merge pull request #1169 from rphillips/backports/107900_4.10

Ok, so that does seem like it's seeing our symptoms with a kubelet that contains #1169.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1491768922326700032
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2022-02-11-021652?from=4.10.0-rc.1

Comment 6 W. Trevor King 2022-02-11 07:12:25 UTC
Also, bug 2053255 has been opened to track some remaining races.  Maybe there's some way to confirm that the fix here resolved what it was meant to resolve, without conflating it with whatever is going on in bug 2053255.  Or maybe they're the same thing, and it's better to close 2053255 as a duplicate and keep this series open while we bottom out the remaining races.

Comment 12 errata-xmlrpc 2022-03-10 16:43:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

