Bug 1690088 - clusteroperator/kube-apiserver: NodeInstallerFailing after upgrade (0 nodes are failing on revision...)
Summary: clusteroperator/kube-apiserver: NodeInstallerFailing after upgrade (0 nodes are failing on revision...)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-18 18:53 UTC by Ben Parees
Modified: 2020-05-14 11:17 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-14 11:17:36 UTC
Target Upstream Version:
Embargoed:


Attachments
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (340.45 KB, image/svg+xml), attached 2019-03-22 05:51 UTC by W. Trevor King

Description Ben Parees 2019-03-18 18:53:07 UTC
Description of problem:
Mar 18 17:13:41.279 E clusteroperator/kube-apiserver changed Failing to True: NodeInstallerFailing: NodeInstallerFailing: 0 nodes are failing on revision 6:\nNodeInstallerFailing: static pod has been installed, but is not ready while new revision is pending
Mar 18 17:14:07.969 E clusteroperator/kube-scheduler changed Failing to True: NodeInstallerFailing: NodeInstallerFailing: 0 nodes are failing on revision 4:\nNodeInstallerFailing: static pod has been installed, but is not ready while new revision is pending

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/


1) The message is confusing (the NodeInstaller is failing because 0 nodes are failing?).
2) Presumably something is failing, but this message doesn't make it clear what; a quick way to pull the raw condition text is sketched below.
3) Whatever is actually failing needs to be triaged so it doesn't fail (we should not have failures during upgrades).
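
For reference, the raw condition text can be pulled straight off a live cluster; a minimal sketch, assuming oc access and that the condition is still named "Failing" as in this 4.0 build (later releases rename it "Degraded"):

  # Print the Failing condition of the kube-apiserver clusteroperator,
  # i.e. the exact message that ends up in the CI event log above.
  $ oc get clusteroperator kube-apiserver \
      -o jsonpath='{range .status.conditions[?(@.type=="Failing")]}{.type}={.status}: {.message}{"\n"}{end}'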

Comment 1 Ben Parees 2019-03-18 21:36:50 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1690153 opened for kube-scheduler, this one is for kube-apiserver.

Comment 5 W. Trevor King 2019-03-22 05:51:58 UTC
Created attachment 1546776 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 47 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'clusteroperator/kube-apiserver .* NodeInstallerFailing: 0 nodes are failing on revision'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
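
For anyone without that script checked out, roughly the same count can be reproduced with plain grep over locally mirrored build logs; a sketch, assuming the failed jobs' build-log.txt files have already been downloaded under ./build-logs/ (a hypothetical local path):

  # Count how many failed-job logs contain the NodeInstallerFailing pattern.
  $ grep -rlE 'clusteroperator/kube-apiserver .* NodeInstallerFailing: 0 nodes are failing on revision' ./build-logs/ | wc -l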

Comment 7 Ben Parees 2019-03-25 12:35:05 UTC
openshift-controller-manager was referenced, but if you look at the clusteroperator objects you can see that a number of operators did not move to the new version, which means the CVO is not reporting accurately what it is blocked on.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/artifacts/e2e-aws-upgrade/clusteroperators.json
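
Which operators were stuck is easy to read out of that artifact; a sketch with jq, assuming the file is a standard 'oc get clusteroperators -o json' list:

  # List each clusteroperator and the version(s) it currently reports.
  $ jq -r '.items[] | "\(.metadata.name): \([.status.versions[]? | .version] | unique | join(", "))"' clusteroperators.json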

I didn't investigate all of them, but the openshift-controller-manager itself has at least created pods at the new version:

I0318 17:30:42.273629       1 controller_manager.go:41] Starting controllers on 0.0.0.0:8443 (v4.0.0-alpha.0+f7ad8ee-1673-dirty)
I0318 17:30:42.276219       1 controller_manager.go:52] DeploymentConfig controller using images from "registry.svc.ci.openshift.org/ocp/4.0-2019-03-18-152932@sha256:60aa870f57529048199aa4691d8aa7dc9b65614dcd45c6fe550277016415ab02"


https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/artifacts/e2e-aws-upgrade/pods/openshift-controller-manager_controller-manager-b4d9n_controller-manager.log.gz


(152932 is the new version).

Unfortunately I don't think we captured the actual openshift-controller-manager daemonset, so we can't see how many replicas it was reporting at the new version, which is what the openshift-controller-manager operator uses to decide when it is safe to report the new version.  My guess is that old replicas were still running, so the operator couldn't report the new version.
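
For future runs, the rollout state the operator keys off of can be checked directly; a sketch, assuming the daemonset is still named controller-manager in the openshift-controller-manager namespace:

  # If updated < desired, old replicas are still running and the operator
  # will keep reporting the previous version.
  $ oc -n openshift-controller-manager get daemonset controller-manager \
      -o jsonpath='desired={.status.desiredNumberScheduled} updated={.status.updatedNumberScheduled} available={.status.numberAvailable}{"\n"}'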

Comment 8 Ben Parees 2019-03-25 12:43:45 UTC
I opened https://bugzilla.redhat.com/show_bug.cgi?id=1692353 for the problem that the CVO does not report all the operators it is waiting for.

Adam, I think you'll want to work with Steve to get the must-gather tool run against all the cluster operators in the case of upgrade failures; without that information it's going to be tough to debug this (and again, it wasn't just the controller manager operator, so most likely there was a cluster-wide problem that kept something from rolling out).
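
Until that is wired into the CI jobs, the same data can be grabbed by hand right after a failed upgrade; a sketch, assuming a 4.x oc client:

  # Full diagnostic dump, including operator namespaces and static pod logs.
  $ oc adm must-gather

  # Lightweight alternative: snapshot every clusteroperator's conditions and versions.
  $ oc get clusteroperators -o json > clusteroperators.json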

Comment 9 Michal Fojtik 2019-03-25 13:00:16 UTC
More debugging:

etcd-member-ip-10-0-136-114.ec2.internal                              Initialized=True, Ready=True, ContainersReady=True, PodScheduled=True
  [!] Container "etcd-member" restarted 1 times, last exit 255 caused by:
   with peer fdb8f20332e073fe (stream MsgApp v2 writer)
2019-03-18 17:22:31.740347 I | rafthttp: established a TCP streaming connection with peer 3500ccc0ee47fc1e (stream MsgApp v2 reader)
2019-03-18 17:22:31.740840 I | rafthttp: established a TCP streaming connection with peer 3500ccc0ee47fc1e (stream Message reader)
2019-03-18 17:22:31.747235 I | rafthttp: established a TCP streaming connection with peer fdb8f20332e073fe (stream Message reader)
2019-03-18 17:22:31.749906 I | rafthttp: established a TCP streaming connection with peer fdb8f20332e073fe (stream MsgApp v2 reader)
2019-03-18 17:22:31.764756 I | etcdserver: 41789addb42b0807 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
2019-03-18 17:22:32.004281 I | etcdserver: published {Name:etcd-member-ip-10-0-136-114.ec2.internal ClientURLs:[https://10.0.136.114:2379]} to cluster 30b8baee8832e2d0
2019-03-18 17:22:32.004323 I | embed: ready to serve client requests
2019-03-18 17:22:32.006218 I | embed: serving client requests on [::]:2379
2019-03-18 17:22:32.015455 I | embed: rejected connection from "127.0.0.1:56496" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
WARNING: 2019/03/18 17:22:32 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
proto: no coders for int
proto: no encoder for ValueSize int [GetProperties]
2019-03-18 17:22:34.961154 W | etcdserver: request "header:<ID:8358234032595949200 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/openshift.io/pods/openshift-marketplace/certified-operators-5866dd7865-bv8gc\" mod_revision:49038 > success:<request_put:<key:\"/openshift.io/pods/openshift-marketplace/certified-operators-5866dd7865-bv8gc\" value_size:2066 >> failure:<>>" with result "size:18" took too long (134.727982ms) to execute
2019-03-18 17:22:35.502047 N | pkg/osutil: received terminated signal, shutting down...

There are some SDN failures:

  [!] Container "sdn" restarted 8 times, last exit 255 caused by:
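
A quick way to surface every restarting container (etcd-member, sdn, or anything else) across the cluster; a sketch with jq, assuming API access to the broken cluster:

  # Print namespace/pod, container name, restart count, and last exit code.
  $ oc get pods --all-namespaces -o json | jq -r '
      .items[] | . as $p
      | .status.containerStatuses[]?
      | select(.restartCount > 0)
      | "\($p.metadata.namespace)/\($p.metadata.name) \(.name) restarts=\(.restartCount) lastExit=\(.lastState.terminated.exitCode // "n/a")"'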

Comment 14 Michal Fojtik 2020-05-12 10:53:20 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale".

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 15 Ben Parees 2020-05-12 13:15:30 UTC
I would still like to see the messaging fixed, as described in the initial report:
https://bugzilla.redhat.com/show_bug.cgi?id=1690088#c0

and clarified again here:
https://bugzilla.redhat.com/show_bug.cgi?id=1690088#c12

