Bug 1768262

Summary:	node failed to upgrade - master node not ready
Product:	OpenShift Container Platform	Reporter:	Ben Parees <bparees>
Component:	Node	Assignee:	Ryan Phillips <rphillips>
Status:	CLOSED ERRATA	QA Contact:	MinLi <minmli>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.3.0	CC:	aos-bugs, jokerman, schoudha
Target Milestone:	---
Target Release:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-01-23 11:10:26 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ben Parees 2019-11-03 18:21:59 UTC

Description of problem:
Nov  1 04:09:50.155: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-10-31-223009: the cluster operator kube-apiserver is degraded
Nov  1 04:09:50.155: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-146-62.ec2.internal" not ready


Nov  1 04:17:10.494: INFO: Condition Ready of node ip-10-0-146-62.ec2.internal is false, but Node is tainted by NodeController with [{node-role.kubernetes.io/master  NoSchedule <nil>} {node.kubernetes.io/unschedulable  NoSchedule 2019-11-01 03:26:21 +0000 UTC} {node.kubernetes.io/unreachable  NoSchedule 2019-11-01 03:27:14 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2019-11-01 03:27:20 +0000 UTC}]. Failure


in:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10319
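
For triage, the taint list in the last log line above can be broken into individual fields. The following is a minimal Python sketch, not part of the test suite. It assumes each braced entry is printed in the order key, value, effect, time added, which is inferred from the log text rather than confirmed. The names `Taint`, `parse_taints`, and `SAMPLE` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Taint(NamedTuple):
    """One taint as printed in the log (field order is an assumption)."""
    key: str
    value: str
    effect: str
    time_added: str


# Matches "{key value effect time}"; the value may be empty, which is why
# the log shows two consecutive spaces after the key.
_TAINT_RE = re.compile(r"\{(\S+) (\S*) (\S+) ([^}]*)\}")

# Taint list copied from the log line in the bug description.
SAMPLE = (
    "[{node-role.kubernetes.io/master  NoSchedule <nil>} "
    "{node.kubernetes.io/unschedulable  NoSchedule 2019-11-01 03:26:21 +0000 UTC} "
    "{node.kubernetes.io/unreachable  NoSchedule 2019-11-01 03:27:14 +0000 UTC} "
    "{node.kubernetes.io/unreachable  NoExecute 2019-11-01 03:27:20 +0000 UTC}]"
)


def parse_taints(log_text: str) -> List[Taint]:
    """Return every braced taint entry found in log_text."""
    return [Taint(*m.groups()) for m in _TAINT_RE.finditer(log_text)]


if __name__ == "__main__":
    for taint in parse_taints(SAMPLE):
        print(f"{taint.key:40} {taint.effect:12} {taint.time_added}")
```

Read this way, the node was marked unschedulable at 03:26:21 and picked up the unreachable taints about a minute later, and it still carried them when the test checked at 04:17:10.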

Comment 1 Ben Parees 2019-11-03 18:24:38 UTC

Possibly related, though probably not since this run is on Azure rather than AWS: a worker node failed to upgrade and become ready:

Nov  1 04:48:22.871: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Nov  1 04:48:22.871: INFO: Unexpected error occurred: Pools did not complete upgrade: timed out waiting for the condition

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.2/200

Feel free to split it out as a separate bug after investigation.

Comment 2 Ben Parees 2019-11-03 18:25:52 UTC

Recurrence of the AWS upgrade failure from the initial BZ description:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10333

Comment 3 Ryan Phillips 2019-11-07 18:25:40 UTC

In the 10333 upgrade test, the ip-10-0-137-155.ec2.internal node did not come back. It is hard to tell why, since the logs for that master are missing.

Comment 4 Ryan Phillips 2019-11-11 19:46:55 UTC

Build: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2648/pull-ci-openshift-installer-master-e2e-aws-upgrade/3264

Log: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2648/pull-ci-openshift-installer-master-e2e-aws-upgrade/3264/build-log.txt

1. At 11:36:29 the ip-10-0-151-180 node reboots.
2. After the reboot, a number of pods exit with 255 (or other) error codes.

I suspect a timeout needs to be bumped within the unit tests.
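
To illustrate why the timeout matters, here is a minimal Python sketch of a poll-until-ready loop. It is not the actual test code. The names `wait_for_condition` and `ConditionTimeout`, and the injectable `clock` and `sleep` parameters, are hypothetical. It shows how a node that does eventually become ready is still reported as a failure when the wait budget is shorter than the recovery time.

```python
import time
from typing import Callable


class ConditionTimeout(Exception):
    """Raised when the condition is still false once the timeout has elapsed."""


def wait_for_condition(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll check() every interval_s seconds until it returns True.

    Returns the elapsed time on success. Raises ConditionTimeout if the
    condition is still false after timeout_s seconds. clock and sleep are
    injectable so the loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            raise ConditionTimeout("timed out waiting for the condition")
        sleep(interval_s)
```

With a check that turns true only after a slow reboot, a short `timeout_s` raises while a longer one returns normally, which is the behaviour a bumped timeout would be expected to change.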

Comment 6 MinLi 2019-11-19 09:02:27 UTC

Tested several times and could not reproduce the failure; marking as verified.

Comment 8 errata-xmlrpc 2020-01-23 11:10:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062