Bug 1768262

Summary: node failed to upgrade - master node not ready
Product: OpenShift Container Platform
Component: Node
Version: 4.3.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Reporter: Ben Parees <bparees>
Assignee: Ryan Phillips <rphillips>
QA Contact: MinLi <minmli>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: aos-bugs, jokerman, schoudha
Type: Bug
Last Closed: 2020-01-23 11:10:26 UTC

Description Ben Parees 2019-11-03 18:21:59 UTC
Description of problem:
Nov  1 04:09:50.155: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-10-31-223009: the cluster operator kube-apiserver is degraded
Nov  1 04:09:50.155: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-146-62.ec2.internal" not ready

Nov  1 04:17:10.494: INFO: Condition Ready of node ip-10-0-146-62.ec2.internal is false, but Node is tainted by NodeController with [{node-role.kubernetes.io/master  NoSchedule <nil>} {node.kubernetes.io/unschedulable  NoSchedule 2019-11-01 03:26:21 +0000 UTC} {node.kubernetes.io/unreachable  NoSchedule 2019-11-01 03:27:14 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2019-11-01 03:27:20 +0000 UTC}]. Failure
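For context on the log line above: the node controller marks a NotReady node with node.kubernetes.io/* taints, which is exactly the pattern logged. A minimal sketch of reading that state from a node object; the node dict below is a hypothetical reconstruction of the logged state, not actual cluster output (real data would come from something like `oc get node ip-10-0-146-62.ec2.internal -o json`):

```python
# Hypothetical node object mirroring the taints and Ready condition
# logged above (illustrative only, not captured from the cluster).
node = {
    "metadata": {"name": "ip-10-0-146-62.ec2.internal"},
    "spec": {"taints": [
        {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"},
        {"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule",
         "timeAdded": "2019-11-01T03:26:21Z"},
        {"key": "node.kubernetes.io/unreachable", "effect": "NoSchedule",
         "timeAdded": "2019-11-01T03:27:14Z"},
        {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute",
         "timeAdded": "2019-11-01T03:27:20Z"},
    ]},
    "status": {"conditions": [
        {"type": "Ready", "status": "False"},
    ]},
}

def node_ready(node):
    """True only if the Ready condition has status == "True"."""
    return any(c["type"] == "Ready" and c["status"] == "True"
               for c in node["status"]["conditions"])

def node_controller_taints(node):
    """Taints the node controller adds when a node is NotReady/unreachable."""
    return [t["key"] for t in node["spec"].get("taints", [])
            if t["key"].startswith("node.kubernetes.io/")]

print(node_ready(node))              # False: matches "Condition Ready ... is false"
print(node_controller_taints(node))  # the unschedulable/unreachable taints
```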


Comment 1 Ben Parees 2019-11-03 18:24:38 UTC
Possibly related, but probably not, since it's Azure rather than AWS: a worker node failed to upgrade/become ready:

Nov  1 04:48:22.871: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Nov  1 04:48:22.871: INFO: Unexpected error occurred: Pools did not complete upgrade: timed out waiting for the condition


Feel free to split it out as a separate bug after investigation.

Comment 2 Ben Parees 2019-11-03 18:25:52 UTC
recurrence of the AWS upgrade failure from the initial BZ description:


Comment 3 Ryan Phillips 2019-11-07 18:25:40 UTC
In the 10333 upgrade test, the ip-10-0-137-155.ec2.internal node did not come back. It is hard to tell why, since those master logs are missing.

Comment 4 Ryan Phillips 2019-11-11 19:46:55 UTC
Build: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2648/pull-ci-openshift-installer-master-e2e-aws-upgrade/3264

Log: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2648/pull-ci-openshift-installer-master-e2e-aws-upgrade/3264/build-log.txt

1. At 11:36:29 ip-10-0-151-180 node reboots
2. Upon reboot, there are a number of pods exiting with 255 (or other) error codes

I suspect a timeout needs to be bumped within the unit tests.
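A note on the 255 exit codes seen after the reboot: an exit status is truncated to its low 8 bits, so a process that returns -1 surfaces as 255. A minimal demonstration (this illustrates the POSIX truncation generally, not what those specific pods did):

```python
import subprocess
import sys

# Exit statuses are truncated to 8 bits, so sys.exit(-1) is reported
# as 255 by the parent process.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(-1)"])
print(proc.returncode)   # 255 on POSIX
print(-1 & 0xFF)         # 255: the same truncation, done by hand
```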

Comment 6 MinLi 2019-11-19 09:02:27 UTC
Tested several times and could not reproduce. Verified.

Comment 8 errata-xmlrpc 2020-01-23 11:10:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.