Description of problem:

Mar 18 17:14:07.969 E clusteroperator/kube-scheduler changed Failing to True: NodeInstallerFailing: NodeInstallerFailing: 0 nodes are failing on revision 4:\nNodeInstallerFailing: static pod has been installed, but is not ready while new revision is pending

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/

1) The message is confusing (the node installer is failing because 0 nodes are failing?).
2) Presumably something is failing, but this message doesn't make clear what.
3) Whatever is actually failing needs to be triaged so it doesn't fail (there should be no failures during upgrades).

https://bugzilla.redhat.com/show_bug.cgi?id=1690088 covers kube-apiserver; this bug is for kube-scheduler.
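For anyone triaging a run that hits this, the raw condition text can be pulled straight off the clusteroperator object (a minimal sketch; the jq filter is just one way to slice out the Failing condition):

$ oc get clusteroperator kube-scheduler -o yaml
$ # or, assuming jq is available, just the Failing condition:
$ oc get clusteroperator kube-scheduler -o json \
    | jq '.status.conditions[] | select(.type == "Failing")'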
Hit this issue when upgrading from 4.0.0-0.nightly-2019-03-19-004004 to 4.0.0-0.nightly-2019-03-20-153904. The upgrade failed.

    {
        "lastTransitionTime": "2019-03-21T07:12:36Z",
        "message": "Cluster operator kube-scheduler is reporting a failure: NodeInstallerFailing: 0 nodes are failing on revision 6:\nNodeInstallerFailing: pods \"installer-6-ip-10-0-131-197.us-east-2.compute.internal\" not found",
        "reason": "ClusterOperatorFailing",
        "status": "True",
        "type": "Failing"
    },
    {
        "lastTransitionTime": "2019-03-21T06:01:05Z",
        "message": "Unable to apply 4.0.0-0.nightly-2019-03-20-153904: the cluster operator kube-scheduler is failing",
        "reason": "ClusterOperatorFailing",
        "status": "True",
        "type": "Progressing"
    },
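The installer pod named in that message runs in the operand namespace, so its fate can be checked directly on a live cluster (a small sketch; openshift-kube-scheduler is the standard operand namespace for this operator, and "installer-6" matches the revision from the message above):

$ oc -n openshift-kube-scheduler get pods | grep installer
$ oc -n openshift-kube-scheduler get events --sort-by='.lastTimestamp' | grep installer-6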
Created attachment 1546777 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 15 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'clusteroperator/kube-scheduler .* NodeInstallerFailing: 0 nodes are failing on revision'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
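For a single run, the same signature can be grepped out of the job's build-log.txt without the plotting script (a sketch; the URL assumes the usual GCS layout behind the Deck link in the description, which may differ for other jobs):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/build-log.txt \
    | grep 'NodeInstallerFailing: 0 nodes are failing on revision'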
*** Bug 1691600 has been marked as a duplicate of this bug. ***
There have been some recent changes to library-go in this area:

https://github.com/openshift/library-go/pull/313
https://github.com/openshift/library-go/pull/312

Bumping library-go in the operator could fix this.
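To check whether a given nightly has actually picked up such a bump, the operator commit in the release payload can be listed and compared against the vendored library-go in the operator repo (a sketch; the pullspec is an example and the CI registry location may vary):

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-20-153904 \
    | grep kube-scheduler-operator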
I believe this is fixed by https://github.com/openshift/library-go/pull/312
Please check whether this can be verified.
Verified to be fixed.

[nathan@localhost 0410]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.11   True        False         7m4s    Cluster version is 4.0.0-0.11

(upgraded from 4.0.0-0.9)

The message is improved during the upgrade:

[nathan@localhost 0410]$ oc logs openshift-kube-scheduler-operator-5476946d7f-j8kdp | grep NodeInstallerFailing
I0410 06:59:47.830854 1 status_controller.go:156] clusteroperator/kube-scheduler diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-10T06:59:17Z","reason":"NodeInstallerFailingInstallerPodFailed","status":"True","type":"Failing"},{"lastTransitionTime":"2019-04-10T06:58:51Z","message":"Progressing: 3 nodes are at revision 7","reason":"Progressing","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-04-10T06:58:37Z","message":"Available: 3 nodes are active; 3 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-10T06:58:37Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0410 07:00:34.676275 1 status_controller.go:156] clusteroperator/kube-scheduler diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-10T06:59:17Z","reason":"NodeInstallerFailingInstallerPodFailed","status":"True","type":"Failing"},{"lastTransitionTime":"2019-04-10T06:58:51Z","message":"Progressing: 3 nodes are at revision 7","reason":"Progressing","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-04-10T06:58:37Z","message":"Available: 3 nodes are active; 3 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-10T06:58:37Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758