Bug 1690153

Summary: clusteroperator/kube-scheduler changed Failing to True: NodeInstallerFailing: NodeInstallerFailing: 0 nodes are failing on revision
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Node
Assignee: ravig <rgudimet>
Status: CLOSED ERRATA
QA Contact: Weinan Liu <weinliu>
Severity: unspecified
Priority: unspecified
Docs Contact:
Version: 4.1.0
CC: aos-bugs, gblomqui, jiajliu, jokerman, mmccomas, rcook, rgudimet, sjenning, weinliu, wsun
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:02 UTC
Type: Bug
Attachments:
  Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (flags: none)

Description Ben Parees 2019-03-18 21:35:03 UTC
Description of problem:
Mar 18 17:14:07.969 E clusteroperator/kube-scheduler changed Failing to True: NodeInstallerFailing: NodeInstallerFailing: 0 nodes are failing on revision 4:\nNodeInstallerFailing: static pod has been installed, but is not ready while new revision is pending

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/


1) The message is confusing (the NodeInstaller is failing because 0 nodes are failing?)
2) Presumably something is failing, but the message doesn't make it clear what.
3) Whatever is actually failing needs to be triaged so it doesn't fail (there should not be failures during upgrades).


https://bugzilla.redhat.com/show_bug.cgi?id=1690088 is for kube-apiserver, this one is for kube-scheduler.
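
The full condition text behind this event can also be read directly off the ClusterOperator object, which is usually the quickest way to see what the installer is actually complaining about. A minimal sketch, assuming access to the failing cluster and that jq is available (neither is part of this report):

  # dump the complete Failing condition, not just the event one-liner
  $ oc get clusteroperator kube-scheduler -o json | jq '.status.conditions[] | select(.type == "Failing")'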

Comment 2 liujia 2019-03-21 07:14:07 UTC
Hit this issue when upgrading from 4.0.0-0.nightly-2019-03-19-004004 to 4.0.0-0.nightly-2019-03-20-153904; the upgrade failed.
    {
      "lastTransitionTime": "2019-03-21T07:12:36Z",
      "message": "Cluster operator kube-scheduler is reporting a failure: NodeInstallerFailing: 0 nodes are failing on revision 6:\nNodeInstallerFailing: pods \"installer-6-ip-10-0-131-197.us-east-2.compute.internal\" not found",
      "reason": "ClusterOperatorFailing",
      "status": "True",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2019-03-21T06:01:05Z",
      "message": "Unable to apply 4.0.0-0.nightly-2019-03-20-153904: the cluster operator kube-scheduler is failing",
      "reason": "ClusterOperatorFailing",
      "status": "True",
      "type": "Progressing"
    },

Comment 3 W. Trevor King 2019-03-22 05:52:44 UTC
Created attachment 1546777 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 15 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'clusteroperator/kube-scheduler .* NodeInstallerFailing: 0 nodes are failing on revision'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
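
A rough local equivalent, assuming the relevant build-log.txt files have already been downloaded into a build-logs/ directory (a hypothetical path, not part of the tooling above):

  # count jobs whose build log contains the same error pattern
  $ grep -rlE 'clusteroperator/kube-scheduler .* NodeInstallerFailing: 0 nodes are failing on revision' build-logs/ | wc -l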

Comment 5 Xingxing Xia 2019-03-26 03:14:02 UTC
*** Bug 1691600 has been marked as a duplicate of this bug. ***

Comment 6 Seth Jennings 2019-03-28 18:15:48 UTC
There have been some recent changes to library-go in this area:
https://github.com/openshift/library-go/pull/313
https://github.com/openshift/library-go/pull/312

It could be that bumping library-go here would fix this.
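
For illustration only, a bump would land in the operator repo roughly like this, assuming it uses Go modules with vendored dependencies (the actual mechanism and pinned revision are whatever the eventual bump PR uses):

  # pull in a library-go revision that contains the fixes, then refresh vendor/
  $ go get github.com/openshift/library-go@master
  $ go mod tidy
  $ go mod vendor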

Comment 7 Seth Jennings 2019-04-01 20:45:42 UTC
I believe this is fixed by
https://github.com/openshift/library-go/pull/312

Comment 9 Wei Sun 2019-04-10 03:19:54 UTC
Please check whether this can be verified.

Comment 10 Weinan Liu 2019-04-10 07:19:38 UTC
Verified to be fixed.

[nathan@localhost 0410]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.11   True        False         7m4s      Cluster version is 4.0.0-0.11

(upgraded from 4.0.0-0.9)
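
For reference, an upgrade between payloads like this is typically driven with oc adm upgrade; a sketch with the pull spec left as a placeholder rather than the exact image used here:

  # <release-payload-pullspec> stands in for the target nightly's release image
  $ oc adm upgrade --to-image=<release-payload-pullspec>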

The message is improved during the upgrade:


[nathan@localhost 0410]$ oc logs openshift-kube-scheduler-operator-5476946d7f-j8kdp|grep NodeInstallerFailing
I0410 06:59:47.830854       1 status_controller.go:156] clusteroperator/kube-scheduler diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-10T06:59:17Z","reason":"NodeInstallerFailingInstallerPodFailed","status":"True","type":"Failing"},{"lastTransitionTime":"2019-04-10T06:58:51Z","message":"Progressing: 3 nodes are at revision 7","reason":"Progressing","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-04-10T06:58:37Z","message":"Available: 3 nodes are active; 3 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-10T06:58:37Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0410 07:00:34.676275       1 status_controller.go:156] clusteroperator/kube-scheduler diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-10T06:59:17Z","reason":"NodeInstallerFailingInstallerPodFailed","status":"True","type":"Failing"},{"lastTransitionTime":"2019-04-10T06:58:51Z","message":"Progressing: 3 nodes are at revision 7","reason":"Progressing","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-04-10T06:58:37Z","message":"Available: 3 nodes are active; 3 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-10T06:58:37Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
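
A quick way to spot-check the more specific Failing reason without grepping the operator logs, assuming the same cluster as above:

  $ oc get clusteroperator kube-scheduler -o jsonpath='{.status.conditions[?(@.type=="Failing")].reason}{"\n"}'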

Comment 12 errata-xmlrpc 2019-06-04 10:46:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758