Bug 1778904

Summary: NodeControllerDegradedMasterNodesReady: NodeControllerDegraded: The master node(s) ".*" not ready
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: RHCOS
Assignee: Ben Howard <behoward>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: urgent
Priority: urgent
Version: 4.3.0
CC: aos-bugs, bbreard, ccoleman, dustymabe, hongli, imcleod, jchaloup, jligon, jokerman, lmohanty, mgugino, miabbott, mmasters, mpatel, nberry, nstielau, scuppett, sdodson, smilner, vrutkovs, wsun, wzheng, xxia
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Last Closed: 2020-01-23 11:14:58 UTC
Bug Depends On: 1783973
Bug Blocks: 1779125, 1775849, 1781141, 1781313, 1785307

Description W. Trevor King 2019-12-02 18:22:54 UTC
82% of the past 24 hours of 4.4 upgrade failures include this NodeControllerDegraded degradation [1].  One example is job [2], from registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-11-22-122829 to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-12-02-145545, which passed despite the degradation.  Another example is job [3], from registry.svc.ci.openshift.org/ocp/release:4.2.9 to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-12-02-145545, which failed:

Dec  2 16:31:41.298: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.ci-2019-12-02-145545: the cluster operator kube-apiserver is degraded
Dec  2 16:31:41.298: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready
...
Dec  2 16:35:01.298: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.ci-2019-12-02-145545: the cluster operator kube-apiserver is degraded
Dec  2 16:35:01.298: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready
Dec  2 16:35:11.300: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.ci-2019-12-02-145545: 13% complete
...
Dec  2 16:40:51.297: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.ci-2019-12-02-145545: 13% complete
Dec  2 16:41:01.301: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.ci-2019-12-02-145545: the cluster operator kube-apiserver is degraded
Dec  2 16:41:01.302: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready
...
Dec  2 16:44:01.297: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.ci-2019-12-02-145545: the cluster operator kube-apiserver is degraded
Dec  2 16:44:01.297: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready
Dec  2 16:44:11.297: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.ci-2019-12-02-145545: 13% complete
...


Looking at that node:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/artifacts/e2e-aws-upgrade/nodes.json | jq '.items[] | select(.metadata.name == "ip-10-0-136-232.ec2.internal").status.conditions'
[
  {
    "lastHeartbeatTime": "2019-12-02T16:17:48Z",
    "lastTransitionTime": "2019-12-02T16:19:00Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "MemoryPressure"
  },
  {
    "lastHeartbeatTime": "2019-12-02T16:17:48Z",
    "lastTransitionTime": "2019-12-02T16:19:00Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "DiskPressure"
  },
  {
    "lastHeartbeatTime": "2019-12-02T16:17:48Z",
    "lastTransitionTime": "2019-12-02T16:19:00Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "PIDPressure"
  },
  {
    "lastHeartbeatTime": "2019-12-02T16:17:48Z",
    "lastTransitionTime": "2019-12-02T16:19:00Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "Ready"
  }
]

No journal logs from that machine:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/artifacts/e2e-aws-upgrade/nodes/masters-journal | gunzip | grep ' ip-10-0-136-232 '
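
For what it's worth, a quick sanity check that the gathered masters-journal does cover the other hosts; this is just a sketch reusing the same artifact URL, and it assumes the journal hostnames follow the ip-10-0-x-y pattern seen above:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/artifacts/e2e-aws-upgrade/nodes/masters-journal | gunzip | grep -oE 'ip-10-0-[0-9]+-[0-9]+' | sort | uniq -c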

Not sure where to look next.  Might be an MCO thing, but I'm not clear on how to distinguish between MCO, Node, and machine-API issues when a node mysteriously disappears.

[1]: https://search.svc.ci.openshift.org/chart?name=^release-openshift-origin-installer-e2e-aws-upgrade&search=clusteroperator/kube-apiserver%20changed%20Degraded%20to%20True:%20NodeControllerDegradedMasterNodesReady&search=Container%20setup%20exited%20with%20code%20.,%20reason%20Error
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11907
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906

Comment 4 Michael Gugino 2019-12-02 21:36:04 UTC
Most likely not the machine-api here.  We don't do anything for upgrades.


From [1]:

Dec 02 16:18:19.920 I node/ip-10-0-136-232.ec2.internal Written pending config rendered-master-da34d17ae70ea8f01209895ce7c77282
Dec 02 16:18:19.929 I node/ip-10-0-136-232.ec2.internal Node will reboot into config rendered-master-da34d17ae70ea8f01209895ce7c77282
Dec 02 16:18:20.185 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-136-232.ec2.internal Received signal to terminate, becoming unready, but keeping serving
Dec 02 16:18:20.322 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-136-232.ec2.internal All pre-shutdown hooks have been finished
Dec 02 16:18:23.097 I openshift-apiserver OpenShift API started failing: Get https://api.ci-op-i2rxzx1t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Dec 02 16:18:30.099 - 59s   E openshift-apiserver OpenShift API is not responding to GET requests


So, we can see the MCD started to reboot the host.  Either the host never went down because something was blocking the reboot, or the host never came back up, or the host came back up and the kubelet never started.

1: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/build-log.txt
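
If the node were still reachable, a rough way to tell those three cases apart would be something like the following sketch, which checks the boot history and the kubelet unit from a debug pod.  It assumes the kubelet on that node can still schedule pods, which would only hold in the "host never went down" case, so it is of limited use for this particular CI run:

$ oc debug node/ip-10-0-136-232.ec2.internal -- chroot /host journalctl --list-boots
$ oc debug node/ip-10-0-136-232.ec2.internal -- chroot /host systemctl status kubelet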

Comment 5 W. Trevor King 2019-12-02 22:03:50 UTC
I've floated [1] to gather console logs on AWS for at least some of our tests (although not UPI, etc. in that PR), which might help narrow this down.

[1]: https://github.com/openshift/release/pull/6189
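
In the meantime, the console output can also be pulled by hand.  A minimal sketch, assuming AWS CLI credentials for the CI account and that the instance still exists:

$ INSTANCE_ID=$(aws ec2 describe-instances --filters Name=private-dns-name,Values=ip-10-0-136-232.ec2.internal --query 'Reservations[].Instances[].InstanceId' --output text)
$ aws ec2 get-console-output --instance-id "$INSTANCE_ID" --output text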

Comment 6 Vadim Rutkovsky 2019-12-04 16:31:34 UTC
Instance output: http://pastebin.test.redhat.com/819174

[    5.669591] coreos-cryptfs[1000]: coreos-cryptfs: FAILED to open /dev/disk/by-label/crypt_rootfs
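
If a node stuck in this state can be reached (e.g. from the initramfs emergency shell, or from a rescue instance with the disk attached), a quick way to confirm whether the crypt_rootfs label is visible at all would be something like this sketch (not taken from the CI artifacts):

$ lsblk -o NAME,LABEL,FSTYPE,SIZE
$ ls -l /dev/disk/by-label/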

Comment 7 Clayton Coleman 2019-12-04 23:04:56 UTC
Marking urgent because 4.2 to 4.3 upgrades are broken and this was implicated.  If this turns out not to be the cause, reopen https://bugzilla.redhat.com/show_bug.cgi?id=1778529

Comment 8 Mrunal Patel 2019-12-05 15:10:44 UTC
*** Bug 1779108 has been marked as a duplicate of this bug. ***

Comment 11 Jan Chaloupka 2019-12-10 15:22:51 UTC
Happening in the https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/153/build-log.txt series as well.  The third master never came back.

Comment 12 Ben Howard 2019-12-10 15:40:38 UTC
This should already have been fixed: the dracut date shows that the build happened on Dec 04.  This should be resolved with the next machine-os-content release.

Comment 15 Andrew McDermott 2019-12-11 13:53:35 UTC
*** Bug 1781062 has been marked as a duplicate of this bug. ***

Comment 21 Xingxing Xia 2019-12-13 09:16:34 UTC
Bug 1765219#c6 hit same issue.

Comment 25 Ben Howard 2019-12-16 20:28:33 UTC
Closing this out as completed. 

> Bug 1765219#c6 hit same issue.
The issue may appear to be the same but it is unrelated.

Comment 27 Wei Sun 2019-12-17 03:24:42 UTC
Upgrading from 4.2.2 to the current latest 4.3 payload, 4.3.0-0.nightly-2019-12-13-180405 (3 days old), still failed.  Moving the bug to MODIFIED to wait for a new good payload.

Comment 28 W. Trevor King 2019-12-17 03:42:35 UTC
If we're going to wait for a nightly build (makes sense), we probably want to move back to ASSIGNED, since there haven't been any 4.3 nightlies built in the past 3 days [1].  Looks like the fix for this issue was in 43.81.201912131630.0 with sha256:78142ad80f5b184f314351c9c48f78db357de4fbea4c6c98bb88a10a6b9118d8 -> sha256:689b8e3c9138e6526c5b01a983868215c2d355847dc0ffef25da4dbce047bc26 [2].  And the most recent accepted nightly is still running the former [3].

[1]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/1108/?log#log
[3]: https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-13-180405

Comment 30 W. Trevor King 2019-12-18 19:25:31 UTC
This needs to go to MODIFIED.  ART's sweeper will move it to ON_QA on the next nightly, and in parallel sweep this bug into the appropriate errata.

Comment 32 Wenjing Zheng 2019-12-19 06:01:28 UTC
Upgrade succeeded from 4.2.12 to 4.3.0-0.nightly-2019-12-18-224446.

Comment 33 W. Trevor King 2019-12-19 21:40:39 UTC
Testing with 4.2.12 -> 4.3.0-0.nightly-2019-12-19-131558 on AWS with the mirror config flavor (i.e. tell cluster-bot 'test upgrade 4.2.12 4.3.0-0.nightly-2019-12-19-131558 aws,mirror'), I hit this again [1].  Must-gather failed [2]

[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aad43a50042f7707abb70968c9382271304aa01811953bf1ee5431bef0ac0ae9 created
[must-gather-f7jbm] OUT gather did not start: timed out waiting for the condition

Same missing node symptoms:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38/artifacts/e2e-aws-upgrade/nodes.json | jq -r '.items[] | {metadata, conditions: ([.status.conditions[] | {key: .type, value: .}] | from_entries)} | select(.conditions.Ready.status == "Unknown") | .conditions.Ready.lastHeartbeatTime + " " + .metadata.name + " " + .metadata.annotations["machine.openshift.io/machine"]'
2019-12-19T19:24:51Z ip-10-0-129-98.ec2.internal openshift-machine-api/ci-op-kjw9pck0-881b5-26vrn-master-2
2019-12-19T19:25:01Z ip-10-0-135-25.ec2.internal openshift-machine-api/ci-op-kjw9pck0-881b5-26vrn-worker-us-east-1a-5ktbk

The test harness didn't attempt to gather console logs [3,4]; I'm not sure why yet.  Comparing with the passing nightly from [5]:

$ oc image info -o json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-18-224446) | jq -r '.config.config.Labels.version'
43.81.201912181720.0
$ oc image info -o json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-19-131558) | jq -r '.config.config.Labels.version'
43.81.201912181720.0

Other release-image changes seem unrelated (ironic, Kuryr, test suite [6]).  I don't know how the mirror test flavor would trigger this, so I checked for other recent occurrences [7], and there are a handful, including this 4.2.12 -> 4.3.0-0.nightly-2019-12-19-185024 non-mirror job [8].  So I'm going to open this back up until we have a better understanding of why these are still turning up...

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38/artifacts/e2e-aws-upgrade/must-gather/must-gather.log
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38/artifacts/e2e-aws-upgrade/nodes/
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38/artifacts/e2e-aws-upgrade/container-logs/teardown.log
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1778904#c32
[6]: https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-19-131558?from=4.3.0-0.nightly-2019-12-18-224446
[7]: https://search.svc.ci.openshift.org/?search=NodeControllerDegradedMasterNodesReady&maxAge=24h&context=-1&type=build-log
[8]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3/270

Comment 34 W. Trevor King 2019-12-20 00:10:36 UTC
A new job launched between the same targets sailed right through [1].  I'll launch some more to get better statistics after I copy [2] over into [3].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/44
[2]: https://github.com/openshift/release/pull/6189
[3]: https://github.com/openshift/ci-tools/blob/master/pkg/steps/clusterinstall/template.go

Comment 35 Steve Milner 2019-12-20 14:50:59 UTC
Interestingly enough, runs after 38 (39-44) have all passed: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade

Comment 36 Vadim Rutkovsky 2019-12-20 14:59:41 UTC
(In reply to W. Trevor King from comment #33)
> Must-gather failed [2]
> 
> [must-gather      ] OUT pod for plug-in image
> quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:
> aad43a50042f7707abb70968c9382271304aa01811953bf1ee5431bef0ac0ae9 created
> [must-gather-f7jbm] OUT gather did not start: timed out waiting for the
> condition

It did manage to gather data in some of the jobs from [7] (https://search.svc.ci.openshift.org/?search=NodeControllerDegradedMasterNodesReady&maxAge=24h&context=-1&type=build-log):

* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12920
* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12925
* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/166

Comment 37 Vadim Rutkovsky 2019-12-20 15:25:12 UTC
(In reply to W. Trevor King from comment #33)
> So I'm going to
> open this back up until we have a better understanding of why these are
> still turning up...

It seems the master services throw this error while the machine is being rebooted.

ip-10-0-154-95.ec2.internal MCD pre-pivot log: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12925/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-f2v96_machine-config-daemon_previous.log

ip-10-0-154-95.ec2.internal MCD post-reboot log: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12925/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-f2v96_machine-config-daemon_previous.log

Then master services give up and set Degraded:

Dec 19 20:40:23.976 E clusteroperator/kube-apiserver changed Degraded to True: NodeControllerDegradedMasterNodesReady: NodeControllerDegraded: The master node(s) "ip-10-0-154-95.ec2.internal" not ready
Dec 19 20:40:23.976 E clusteroperator/kube-scheduler changed Degraded to True: NodeControllerDegradedMasterNodesReady: NodeControllerDegraded: The master node(s) "ip-10-0-154-95.ec2.internal" not ready


Soon afterwards the kube-apiserver and kube-scheduler operators notice the node is back:

Dec 19 20:40:29.116 W clusteroperator/kube-scheduler changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-154-95.ec2.internal pods/openshift-kube-scheduler-ip-10-0-154-95.ec2.internal container="scheduler" is not ready\nNodeControllerDegraded: All master node(s) are ready

Dec 19 20:40:29.142 W clusteroperator/kube-apiserver changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-154-95.ec2.internal pods/kube-apiserver-ip-10-0-154-95.ec2.internal container="kube-apiserver-5" is not ready\nNodeControllerDegraded: All master node(s) are ready


It seems the reboot took a bit too long (an extra 7 seconds?), so the master operators briefly flipped the Degraded status.
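
For reference, the flip window can be read straight off the clusteroperator conditions on a live cluster; a sketch (the event timestamps above already show Degraded clearing within a few seconds):

$ oc get clusteroperator kube-apiserver -o json | jq -r '.status.conditions[] | select(.type == "Degraded") | .lastTransitionTime + " " + .status + " " + .reason'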

Comment 38 Steve Milner 2019-12-20 19:10:25 UTC
After further discussion it was decided that the issue now being seen is a new bug.  Moving this back to ON_QA; a new bug will be opened.

Comment 39 Michael Nguyen 2019-12-20 19:37:12 UTC
Since the new failure is a separate issue, I am setting this back to VERIFIED based on https://bugzilla.redhat.com/show_bug.cgi?id=1778904#c32

Comment 41 errata-xmlrpc 2020-01-23 11:14:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 42 Micah Abbott 2020-01-30 15:34:27 UTC
*** Bug 1785737 has been marked as a duplicate of this bug. ***