82% of the past 24 hours of 4.4 upgrade failures include this NodeControllerDegraded degradation [1]. One example job is [2], from registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-11-22-122829 to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-12-02-145545 (that one passed). Another example job is [3], from registry.svc.ci.openshift.org/ocp/release:4.2.9 to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2019-12-02-145545, which failed:

Dec 2 16:31:41.298: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.ci-2019-12-02-145545: the cluster operator kube-apiserver is degraded
Dec 2 16:31:41.298: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready
...
Dec 2 16:35:01.298: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.ci-2019-12-02-145545: the cluster operator kube-apiserver is degraded
Dec 2 16:35:01.298: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready
Dec 2 16:35:11.300: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.ci-2019-12-02-145545: 13% complete
...
Dec 2 16:40:51.297: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.ci-2019-12-02-145545: 13% complete
Dec 2 16:41:01.301: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.ci-2019-12-02-145545: the cluster operator kube-apiserver is degraded
Dec 2 16:41:01.302: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready
...
Dec 2 16:44:01.297: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.ci-2019-12-02-145545: the cluster operator kube-apiserver is degraded
Dec 2 16:44:01.297: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready
Dec 2 16:44:11.297: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.ci-2019-12-02-145545: 13% complete
...
Looking at that node:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/artifacts/e2e-aws-upgrade/nodes.json | jq '.items[] | select(.metadata.name == "ip-10-0-136-232.ec2.internal").status.conditions'
[
  {
    "lastHeartbeatTime": "2019-12-02T16:17:48Z",
    "lastTransitionTime": "2019-12-02T16:19:00Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "MemoryPressure"
  },
  {
    "lastHeartbeatTime": "2019-12-02T16:17:48Z",
    "lastTransitionTime": "2019-12-02T16:19:00Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "DiskPressure"
  },
  {
    "lastHeartbeatTime": "2019-12-02T16:17:48Z",
    "lastTransitionTime": "2019-12-02T16:19:00Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "PIDPressure"
  },
  {
    "lastHeartbeatTime": "2019-12-02T16:17:48Z",
    "lastTransitionTime": "2019-12-02T16:19:00Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "Ready"
  }
]

No journal logs from that machine:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/artifacts/e2e-aws-upgrade/nodes/masters-journal | gunzip | grep ' ip-10-0-136-232 '

I'm not sure where to look next. It might be an MCO thing, but I'm not clear on how to distinguish between MCO and Node (or machine-API?) issues for "node mysteriously disappeared".

[1]: https://search.svc.ci.openshift.org/chart?name=^release-openshift-origin-installer-e2e-aws-upgrade&search=clusteroperator/kube-apiserver%20changed%20Degraded%20to%20True:%20NodeControllerDegradedMasterNodesReady&search=Container%20setup%20exited%20with%20code%20.,%20reason%20Error
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11907
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906
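As a sanity check that the grep above isn't just missing a different hostname format, something like the following should list which hosts do show up in the aggregated masters journal (this assumes the usual "MMM DD HH:MM:SS host ..." journal line layout, where the hostname is the fourth field):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/artifacts/e2e-aws-upgrade/nodes/masters-journal | gunzip | awk '{print $4}' | sort | uniq -c | sort -rn  # field 4 = hostname is an assumption about the journal layout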
Most likely not the machine-api here. We don't do anything for upgrades.

From [1]:

Dec 02 16:18:19.920 I node/ip-10-0-136-232.ec2.internal Written pending config rendered-master-da34d17ae70ea8f01209895ce7c77282
Dec 02 16:18:19.929 I node/ip-10-0-136-232.ec2.internal Node will reboot into config rendered-master-da34d17ae70ea8f01209895ce7c77282
Dec 02 16:18:20.185 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-136-232.ec2.internal Received signal to terminate, becoming unready, but keeping serving
Dec 02 16:18:20.322 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-136-232.ec2.internal All pre-shutdown hooks have been finished
Dec 02 16:18:23.097 I openshift-apiserver OpenShift API started failing: Get https://api.ci-op-i2rxzx1t-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Dec 02 16:18:30.099 - 59s E openshift-apiserver OpenShift API is not responding to GET requests

So we can see the MCD started to reboot the host. Either the host never went down because something was blocking the reboot, or the host never came back up, or the host came back up and the kubelet never started.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/build-log.txt
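For anyone retracing this, a rough way to pull that node's timeline back out of the build log from [1] (the grep patterns are just a guess at which events are interesting):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11906/build-log.txt | grep 'ip-10-0-136-232' | grep -E 'config|reboot|not ready|terminate'  # patterns are illustrative, not exhaustive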
I've floated [1] to gather console logs on AWS for at least some of our tests (although not UPI, etc. in that PR), which might help narrow this down. [1]: https://github.com/openshift/release/pull/6189
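In the meantime, anyone with access to the CI job's AWS account can pull the serial console by hand; a minimal sketch with the AWS CLI (the instance ID below is a placeholder, and you'd need the job's region and credentials):

$ aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text  # placeholder instance ID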
Instance output: http://pastebin.test.redhat.com/819174

[ 5.669591] coreos-cryptfs[1000]: coreos-cryptfs: FAILED to open /dev/disk/by-label/crypt_rootfs
Marking this urgent because 4.2 to 4.3 upgrades are broken and this bug was implicated. If it turns out not to be the cause, reopen https://bugzilla.redhat.com/show_bug.cgi?id=1778529
*** Bug 1779108 has been marked as a duplicate of this bug. ***
This is happening in the https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/153/build-log.txt series as well. The third master never came back.
This should already be fixed; the dracut date shows that the build happened on Dec 04, so the fix should land with the next machine-os-content release.
*** Bug 1781062 has been marked as a duplicate of this bug. ***
Bug 1765219#c6 hit the same issue.
Closing this out as completed.

> Bug 1765219#c6 hit the same issue.

The issue may appear to be the same, but it is unrelated.
Upgrading from 4.2.2 to the current latest 4.3 payload, 4.3.0-0.nightly-2019-12-13-180405 (3 days ago), still failed. Moving the bug to MODIFIED to wait for a new good payload.
If we're going to wait for a nightly build (makes sense), we probably want to move back to ASSIGNED, since there haven't been any 4.3 nightlies built in the past 3 days [1]. Looks like the fix for this issue was in 43.81.201912131630.0 with sha256:78142ad80f5b184f314351c9c48f78db357de4fbea4c6c98bb88a10a6b9118d8 -> sha256:689b8e3c9138e6526c5b01a983868215c2d355847dc0ffef25da4dbce047bc26 [2]. And the most recent accepted nightly is still running the former [3].

[1]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/1108/?log#log
[3]: https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-13-180405
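For anyone checking other nightlies, this is a quick way to see which machine-os-content build a given release payload carries (the pullspec here is just the accepted nightly from [3]; swap in whatever payload you're curious about):

$ oc image info -o json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-13-180405) | jq -r '.config.config.Labels.version'  # pullspec assumed to follow the usual registry.svc.ci.openshift.org/ocp/release:<tag> form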
This needs to go to MODIFIED. ART's sweeper will move it to ON_QA on the next nightly and, in parallel, sweep this bug into the appropriate errata.
The upgrade succeeded from 4.2.12 to 4.3.0-0.nightly-2019-12-18-224446.
Testing 4.2.12 -> 4.3.0-0.nightly-2019-12-19-131558 on AWS with the mirror config flavor (i.e. telling cluster-bot 'test upgrade 4.2.12 4.3.0-0.nightly-2019-12-19-131558 aws,mirror'), I hit this again [1]. Must-gather failed [2]:

[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aad43a50042f7707abb70968c9382271304aa01811953bf1ee5431bef0ac0ae9 created
[must-gather-f7jbm] OUT gather did not start: timed out waiting for the condition

Same missing-node symptoms:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38/artifacts/e2e-aws-upgrade/nodes.json | jq -r '.items[] | {metadata, conditions: ([.status.conditions[] | {key: .type, value: .}] | from_entries)} | select(.conditions.Ready.status == "Unknown") | .conditions.Ready.lastHeartbeatTime + " " + .metadata.name + " " + .metadata.annotations["machine.openshift.io/machine"]'
2019-12-19T19:24:51Z ip-10-0-129-98.ec2.internal openshift-machine-api/ci-op-kjw9pck0-881b5-26vrn-master-2
2019-12-19T19:25:01Z ip-10-0-135-25.ec2.internal openshift-machine-api/ci-op-kjw9pck0-881b5-26vrn-worker-us-east-1a-5ktbk

The test harness didn't attempt to gather console logs [3,4]; not sure why yet. Comparing with the passing nightly from [5]:

$ oc image info -o json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-18-224446) | jq -r '.config.config.Labels.version'
43.81.201912181720.0
$ oc image info -o json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-19-131558) | jq -r '.config.config.Labels.version'
43.81.201912181720.0

Other release-image changes seem unrelated (ironic, Kuryr, test suite [6]). I don't see how the mirror test flavor would trigger this, so I checked for other recent occurrences [7], and there are a handful, including this 4.2.12 -> 4.3.0-0.nightly-2019-12-19-185024 non-mirror job [8]. So I'm going to open this back up until we have a better understanding of why these are still turning up...

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38/artifacts/e2e-aws-upgrade/must-gather/must-gather.log
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38/artifacts/e2e-aws-upgrade/nodes/
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/38/artifacts/e2e-aws-upgrade/container-logs/teardown.log
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1778904#c32
[6]: https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-19-131558?from=4.3.0-0.nightly-2019-12-18-224446
[7]: https://search.svc.ci.openshift.org/?search=NodeControllerDegradedMasterNodesReady&maxAge=24h&context=-1&type=build-log
[8]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3/270
A new job launched between the same targets sailed right through [1]. I'll launch some more to get better statistics after I copy [2] over into [3].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/44
[2]: https://github.com/openshift/release/pull/6189
[3]: https://github.com/openshift/ci-tools/blob/master/pkg/steps/clusterinstall/template.go
Interestingly enough, runs after 38 have all passed (39-44): https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2
(In reply to W. Trevor King from comment #33)
> Must-gather failed [2]
>
> [must-gather ] OUT pod for plug-in image
> quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:
> aad43a50042f7707abb70968c9382271304aa01811953bf1ee5431bef0ac0ae9 created
> [must-gather-f7jbm] OUT gather did not start: timed out waiting for the
> condition

It did manage to gather data in some jobs from [7] (https://search.svc.ci.openshift.org/?search=NodeControllerDegradedMasterNodesReady&maxAge=24h&context=-1&type=build-log):

* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12920
* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12925
* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/166
(In reply to W. Trevor King from comment #33)
> So I'm going to
> open this back up until we have a better understanding of why these are
> still turning up...

It seems the master operators are throwing this error while the machine is being rebooted.

ip-10-0-154-95.ec2.internal MCD pre-pivot log: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12925/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-f2v96_machine-config-daemon_previous.log
ip-10-0-154-95.ec2.internal MCD post-reboot log: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12925/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-f2v96_machine-config-daemon_previous.log

Then the master operators give up and set Degraded:

Dec 19 20:40:23.976 E clusteroperator/kube-apiserver changed Degraded to True: NodeControllerDegradedMasterNodesReady: NodeControllerDegraded: The master node(s) "ip-10-0-154-95.ec2.internal" not ready
Dec 19 20:40:23.976 E clusteroperator/kube-scheduler changed Degraded to True: NodeControllerDegradedMasterNodesReady: NodeControllerDegraded: The master node(s) "ip-10-0-154-95.ec2.internal" not ready

Soon after, the kube-apiserver and kube-scheduler operators notice the node is back:

Dec 19 20:40:29.116 W clusteroperator/kube-scheduler changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-154-95.ec2.internal pods/openshift-kube-scheduler-ip-10-0-154-95.ec2.internal container="scheduler" is not ready\nNodeControllerDegraded: All master node(s) are ready
Dec 19 20:40:29.142 W clusteroperator/kube-apiserver changed Degraded to False: AsExpected: StaticPodsDegraded: nodes/ip-10-0-154-95.ec2.internal pods/kube-apiserver-ip-10-0-154-95.ec2.internal container="kube-apiserver-5" is not ready\nNodeControllerDegraded: All master node(s) are ready

It seems the reboot took a bit too long (an extra 7 seconds?), so the master operators briefly flipped the Degraded status.
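To put a number on that window, something like this should pull all of the Degraded transitions back out of the job's build log (the build-log.txt path is assumed to follow the same layout as the earlier jobs' artifacts):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12925/build-log.txt | grep -E 'clusteroperator/kube-(apiserver|scheduler) changed Degraded'  # assumed build-log.txt path for job 12925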
After further discussion, it was decided that the issue now being seen is a new bug. Moving this back to ON_QA; a new bug will be opened.
Since the new failure is a separate issue, I am setting this back to VERIFIED based on https://bugzilla.redhat.com/show_bug.cgi?id=1778904#c32
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062
*** Bug 1785737 has been marked as a duplicate of this bug. ***