Description of problem:

Nov  1 04:09:50.155: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-10-31-223009: the cluster operator kube-apiserver is degraded
Nov  1 04:09:50.155: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-146-62.ec2.internal" not ready
Nov  1 04:17:10.494: INFO: Condition Ready of node ip-10-0-146-62.ec2.internal is false, but Node is tainted by NodeController with [{node-role.kubernetes.io/master NoSchedule <nil>} {node.kubernetes.io/unschedulable NoSchedule 2019-11-01 03:26:21 +0000 UTC} {node.kubernetes.io/unreachable NoSchedule 2019-11-01 03:27:14 +0000 UTC} {node.kubernetes.io/unreachable NoExecute 2019-11-01 03:27:20 +0000 UTC}]

Failure in: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10319
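For anyone triaging a similar NodeControllerDegraded report, here is a minimal sketch of how to dump each master node's Ready condition and node-controller taints. It assumes a kubeconfig with access to the affected cluster and the kubernetes Python client; it is an illustration, not part of the failing test.

from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig pointing at the affected cluster
v1 = client.CoreV1Api()

# Masters carry the node-role.kubernetes.io/master label
for node in v1.list_node(label_selector="node-role.kubernetes.io/master").items:
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"),
        "Unknown",
    )
    taints = [(t.key, t.effect) for t in (node.spec.taints or [])]
    print(f"{node.metadata.name}: Ready={ready} taints={taints}")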
Possibly related, but probably not, since it is Azure rather than AWS: a worker node failed to upgrade/become ready:

Nov  1 04:48:22.871: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Nov  1 04:48:22.871: INFO: Unexpected error occurred: Pools did not complete upgrade: timed out waiting for the condition

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.2/200

Feel free to split it out as a separate bug after investigation.
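The pool status quoted above comes from the MachineConfigPool resources. A hedged sketch of how to read them with the kubernetes Python client follows; the group/version/plural are the standard OpenShift 4.x machine-config API values, stated here as an assumption rather than taken from this job's artifacts.

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# MachineConfigPool is an OpenShift CRD, so it is read via the custom-objects API
pools = api.list_cluster_custom_object(
    group="machineconfiguration.openshift.io",
    version="v1",
    plural="machineconfigpools",
)
for pool in pools["items"]:
    conds = {c["type"]: c["status"] for c in pool["status"].get("conditions", [])}
    print(pool["metadata"]["name"],
          "Updated:", conds.get("Updated"),
          "Updating:", conds.get("Updating"),
          "Degraded:", conds.get("Degraded"))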
Recurrence of the AWS upgrade failure from the initial BZ description: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10333
In the 10333 upgrade test, the ip-10-0-137-155.ec2.internal node did not come back. It is hard to tell why, since those master logs are missing.
Build: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2648/pull-ci-openshift-installer-master-e2e-aws-upgrade/3264
Log: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2648/pull-ci-openshift-installer-master-e2e-aws-upgrade/3264/build-log.txt

1. At 11:36:29 the ip-10-0-151-180 node reboots.
2. Upon reboot, a number of pods exit with 255 (or other) error codes.

I suspect a timeout needs to be bumped within the tests; see the sketch below.
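If the timeout theory holds, the fix would land in whatever wait-for-condition helper the upgrade test uses. As an illustration only (the helper name and timeout values below are hypothetical, not the actual test code), this is the shape of such a poll:

import time

# Hypothetical illustration of the kind of poll the upgrade test performs;
# raising NODE_READY_TIMEOUT is the suspected fix, not a confirmed one.
NODE_READY_TIMEOUT = 15 * 60  # seconds; e.g. bumped up from a lower default
POLL_INTERVAL = 10

def wait_for(condition, timeout=NODE_READY_TIMEOUT, interval=POLL_INTERVAL):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError("timed out waiting for the condition")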
Tested several times and could not reproduce the failure. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062