Description of problem:

Mar 18 17:14:07.969 E clusteroperator/kube-scheduler changed Failing to True: NodeInstallerFailing: NodeInstallerFailing: 0 nodes are failing on revision 4:\nNodeInstallerFailing: static pod has been installed, but is not ready while new revision is pending

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/

1) The message is confusing (the node installer is failing because 0 nodes are failing?).
2) Presumably something is failing, but this message doesn't make clear what.
3) Whatever is actually failing needs to be triaged so it doesn't fail (there should be no failures during upgrades).

https://bugzilla.redhat.com/show_bug.cgi?id=1690088 covers kube-apiserver; this bug is for kube-scheduler.
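For anyone triaging a run that hits this, the raw condition text can be pulled straight off the clusteroperator object (a minimal sketch; the jq filter is just one way to slice out the Failing condition):

$ oc get clusteroperator kube-scheduler -o yaml
$ # or, assuming jq is available, just the Failing condition:
$ oc get clusteroperator kube-scheduler -o json \
    | jq '.status.conditions[] | select(.type == "Failing")'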
Hit this issue when upgrading from 4.0.0-0.nightly-2019-03-19-004004 to 4.0.0-0.nightly-2019-03-20-153904. The upgrade failed.

    {
        "lastTransitionTime": "2019-03-21T07:12:36Z",
        "message": "Cluster operator kube-scheduler is reporting a failure: NodeInstallerFailing: 0 nodes are failing on revision 6:\nNodeInstallerFailing: pods \"installer-6-ip-10-0-131-197.us-east-2.compute.internal\" not found",
        "reason": "ClusterOperatorFailing",
        "status": "True",
        "type": "Failing"
    },
    {
        "lastTransitionTime": "2019-03-21T06:01:05Z",
        "message": "Unable to apply 4.0.0-0.nightly-2019-03-20-153904: the cluster operator kube-scheduler is failing",
        "reason": "ClusterOperatorFailing",
        "status": "True",
        "type": "Progressing"
    },
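The installer pod named in that message runs in the operand namespace, so its fate can be checked directly on a live cluster (a small sketch; openshift-kube-scheduler is the standard operand namespace for this operator, and "installer-6" matches the revision from the message above):

$ oc -n openshift-kube-scheduler get pods | grep installer
$ oc -n openshift-kube-scheduler get events --sort-by='.lastTimestamp' | grep installer-6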
Created attachment 1546777 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 15 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'clusteroperator/kube-scheduler .* NodeInstallerFailing: 0 nodes are failing on revision'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
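For a single run, the same signature can be grepped out of the job's build-log.txt without the plotting script (a sketch; the URL assumes the usual GCS layout behind the Deck link in the description, which may differ for other jobs):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/build-log.txt \
    | grep 'NodeInstallerFailing: 0 nodes are failing on revision'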
*** Bug 1691600 has been marked as a duplicate of this bug. ***
There have been some recent changes to library-go in this area:

https://github.com/openshift/library-go/pull/313
https://github.com/openshift/library-go/pull/312

Bumping library-go in the operator could fix this.
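To check whether a given nightly has actually picked up such a bump, the operator commit in the release payload can be listed and compared against the vendored library-go in the operator repo (a sketch; the pullspec is an example and the CI registry location may vary):

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-20-153904 \
    | grep kube-scheduler-operator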
I believe this is fixed by https://github.com/openshift/library-go/pull/312
Please check whether this can be verified.
Verified to be fixed.

[nathan@localhost 0410]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.11   True        False         7m4s    Cluster version is 4.0.0-0.11

(upgraded from 4.0.0-0.9)

The message is improved during the upgrade:

[nathan@localhost 0410]$ oc logs openshift-kube-scheduler-operator-5476946d7f-j8kdp | grep NodeInstallerFailing
I0410 06:59:47.830854 1 status_controller.go:156] clusteroperator/kube-scheduler diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-10T06:59:17Z","reason":"NodeInstallerFailingInstallerPodFailed","status":"True","type":"Failing"},{"lastTransitionTime":"2019-04-10T06:58:51Z","message":"Progressing: 3 nodes are at revision 7","reason":"Progressing","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-04-10T06:58:37Z","message":"Available: 3 nodes are active; 3 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-10T06:58:37Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0410 07:00:34.676275 1 status_controller.go:156] clusteroperator/kube-scheduler diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-10T06:59:17Z","reason":"NodeInstallerFailingInstallerPodFailed","status":"True","type":"Failing"},{"lastTransitionTime":"2019-04-10T06:58:51Z","message":"Progressing: 3 nodes are at revision 7","reason":"Progressing","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-04-10T06:58:37Z","message":"Available: 3 nodes are active; 3 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-10T06:58:37Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758