Bug 1690088
| Summary: | clusteroperator/kube-apiserver: NodeInstallerFailing after upgrade (0 nodes are failing on revision...) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | kube-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED WONTFIX | QA Contact: | Xingxing Xia <xxia> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.1.0 | CC: | aos-bugs, calfonso, decarr, jokerman, mfojtik, mmccomas, nagrawal, rvokal, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-14 11:17:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Ben Parees, 2019-03-18 18:53:07 UTC)
https://bugzilla.redhat.com/show_bug.cgi?id=1690153 was opened for kube-scheduler; this one is for kube-apiserver.

Created attachment 1546776 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 47 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'clusteroperator/kube-apiserver .* NodeInstallerFailing: 0 nodes are failing on revision'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

openshift-controller-manager was referenced, but if you look at the clusteroperator objects you can see that a number of operators did not move to the new version, which means the CVO is not being accurate in reporting what it is blocked on:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/artifacts/e2e-aws-upgrade/clusteroperators.json

I didn't investigate all of them, but openshift-controller-manager itself has at least created pods at the new version:

I0318 17:30:42.273629 1 controller_manager.go:41] Starting controllers on 0.0.0.0:8443 (v4.0.0-alpha.0+f7ad8ee-1673-dirty)
I0318 17:30:42.276219 1 controller_manager.go:52] DeploymentConfig controller using images from "registry.svc.ci.openshift.org/ocp/4.0-2019-03-18-152932@sha256:60aa870f57529048199aa4691d8aa7dc9b65614dcd45c6fe550277016415ab02"

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/artifacts/e2e-aws-upgrade/pods/openshift-controller-manager_controller-manager-b4d9n_controller-manager.log.gz

(152932 is the new version.) Unfortunately I don't think we have the actual openshift-controller-manager daemonset, so we can't see what it was reporting in terms of replicas at the new version, which is how the openshift-controller-manager operator decides when it is safe to report the new version. My guess is that we still had old replicas running, so we couldn't report the new version.

I opened https://bugzilla.redhat.com/show_bug.cgi?id=1692353 for the problem that the CVO does not report all the operators it is waiting for.

Adam, I think you'll want to work with Steve to get the must-gather tool run against all the cluster operators in the case of upgrade failures; without that information it's going to be tough to debug this (and again, it wasn't just the controller-manager operator, so most likely there was a cluster-wide problem that kept something from rolling out).

More debugging:
etcd-member-ip-10-0-136-114.ec2.internal Initialized=True, Ready=True, ContainersReady=True, PodScheduled=True
[!] Container "etcd-member" restarted 1 times, last exit 255 caused by:
with peer fdb8f20332e073fe (stream MsgApp v2 writer)
2019-03-18 17:22:31.740347 I | rafthttp: established a TCP streaming connection with peer 3500ccc0ee47fc1e (stream MsgApp v2 reader)
2019-03-18 17:22:31.740840 I | rafthttp: established a TCP streaming connection with peer 3500ccc0ee47fc1e (stream Message reader)
2019-03-18 17:22:31.747235 I | rafthttp: established a TCP streaming connection with peer fdb8f20332e073fe (stream Message reader)
2019-03-18 17:22:31.749906 I | rafthttp: established a TCP streaming connection with peer fdb8f20332e073fe (stream MsgApp v2 reader)
2019-03-18 17:22:31.764756 I | etcdserver: 41789addb42b0807 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
2019-03-18 17:22:32.004281 I | etcdserver: published {Name:etcd-member-ip-10-0-136-114.ec2.internal ClientURLs:[https://10.0.136.114:2379]} to cluster 30b8baee8832e2d0
2019-03-18 17:22:32.004323 I | embed: ready to serve client requests
2019-03-18 17:22:32.006218 I | embed: serving client requests on [::]:2379
2019-03-18 17:22:32.015455 I | embed: rejected connection from "127.0.0.1:56496" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
WARNING: 2019/03/18 17:22:32 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
proto: no coders for int
proto: no encoder for ValueSize int [GetProperties]
2019-03-18 17:22:34.961154 W | etcdserver: request "header:<ID:8358234032595949200 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/openshift.io/pods/openshift-marketplace/certified-operators-5866dd7865-bv8gc\" mod_revision:49038 > success:<request_put:<key:\"/openshift.io/pods/openshift-marketplace/certified-operators-5866dd7865-bv8gc\" value_size:2066 >> failure:<>>" with result "size:18" took too long (134.727982ms) to execute
2019-03-18 17:22:35.502047 N | pkg/osutil: received terminated signal, shutting down...
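The "certificate specifies an incompatible key usage" rejection in the log above usually means the presented client certificate does not carry the client-auth extended key usage that etcd requires. A minimal Go sketch for checking that on a certificate pulled from the artifacts; the file path is a placeholder, not a path from this cluster:

```go
// Sketch only: print whether a PEM-encoded certificate allows client auth.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	// Hypothetical path; substitute the client certificate you want to inspect.
	raw, err := os.ReadFile("client.crt")
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(raw)
	if block == nil {
		log.Fatal("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	hasClientAuth := false
	for _, eku := range cert.ExtKeyUsage {
		if eku == x509.ExtKeyUsageClientAuth {
			hasClientAuth = true
		}
	}
	fmt.Printf("subject=%q extKeyUsage=%v clientAuth=%v\n",
		cert.Subject.CommonName, cert.ExtKeyUsage, hasClientAuth)
}
```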
There are some SDN failures:
[!] Container "sdn" restarted 8 times, last exit 255 caused by:
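For the earlier observation that several operators never moved to the new version, one way to scan the gathered clusteroperators.json is sketched below. This is not an official tool, just a minimal reader of the ClusterOperator list in that artifact; the local file name is an assumption.

```go
// Sketch only: print each operator's reported versions and its
// Progressing/Degraded/Failing conditions from a gathered clusteroperators.json.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type clusterOperatorList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Versions []struct {
				Name    string `json:"name"`
				Version string `json:"version"`
			} `json:"versions"`
			Conditions []struct {
				Type    string `json:"type"`
				Status  string `json:"status"`
				Message string `json:"message"`
			} `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	raw, err := os.ReadFile("clusteroperators.json") // downloaded CI artifact
	if err != nil {
		log.Fatal(err)
	}
	var list clusterOperatorList
	if err := json.Unmarshal(raw, &list); err != nil {
		log.Fatal(err)
	}
	for _, co := range list.Items {
		fmt.Printf("%s:\n", co.Metadata.Name)
		for _, v := range co.Status.Versions {
			fmt.Printf("  version %s=%s\n", v.Name, v.Version)
		}
		for _, c := range co.Status.Conditions {
			if c.Type == "Progressing" || c.Type == "Degraded" || c.Type == "Failing" {
				fmt.Printf("  %s=%s: %s\n", c.Type, c.Status, c.Message)
			}
		}
	}
}
```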
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale". If you have further information on the current state of the bug, please update it; otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

I would still like to see the messaging fixed, as described in the initial report (https://bugzilla.redhat.com/show_bug.cgi?id=1690088#c0) and clarified again here: https://bugzilla.redhat.com/show_bug.cgi?id=1690088#c12
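The messaging fix being asked for amounts to not emitting a failure message when zero nodes are failing. A minimal sketch of that shape, with illustrative names only (this is not the operator's actual code):

```go
// Sketch only: build a NodeInstallerFailing message that is suppressed when
// no nodes are actually failing, instead of reporting "0 nodes are failing".
package main

import "fmt"

// nodeInstallerMessage returns a failure message for the given revision.
// An empty string means the condition should not be reported as failing.
func nodeInstallerMessage(failingNodes []string, revision int) string {
	if len(failingNodes) == 0 {
		// This is the misleading case from the bug report; say nothing
		// rather than "0 nodes are failing on revision N".
		return ""
	}
	return fmt.Sprintf("%d nodes are failing on revision %d: %v",
		len(failingNodes), revision, failingNodes)
}

func main() {
	fmt.Printf("%q\n", nodeInstallerMessage(nil, 12))                         // ""
	fmt.Printf("%q\n", nodeInstallerMessage([]string{"ip-10-0-136-114"}, 12)) // non-empty
}
```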