Description of problem:

Mar 18 17:13:41.279 E clusteroperator/kube-apiserver changed Failing to True: NodeInstallerFailing: NodeInstallerFailing: 0 nodes are failing on revision 6:\nNodeInstallerFailing: static pod has been installed, but is not ready while new revision is pending
Mar 18 17:14:07.969 E clusteroperator/kube-scheduler changed Failing to True: NodeInstallerFailing: NodeInstallerFailing: 0 nodes are failing on revision 4:\nNodeInstallerFailing: static pod has been installed, but is not ready while new revision is pending

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/

1) The message is confusing (NodeInstaller is failing because 0 nodes are failing?).
2) Presumably something is failing, but this message doesn't make clear what.
3) Whatever is actually failing needs to be triaged so it doesn't fail (we should not have failures during upgrades).
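To illustrate point 1, here is a minimal sketch (not the actual cluster-kube-apiserver-operator code; the type and function names are hypothetical) of how a condition message like "0 nodes are failing on revision 6" can be produced: the failing-node count is formatted into the message unconditionally, so when no installer has actually failed and the real blocker is a static pod that is not yet ready on a pending revision, the count is zero but the condition still reads "failing".

```go
package main

import "fmt"

// nodeStatus mimics (hypothetically) the per-node installer state the
// operator tracks: the revision a node is on and whether its installer failed.
type nodeStatus struct {
	revision int
	failing  bool
	ready    bool
}

// failingMessage reproduces the confusing pattern: the failing-node count is
// formatted into the message even when it is zero, and the not-ready detail
// is appended separately, yielding "0 nodes are failing on revision N".
func failingMessage(nodes []nodeStatus, targetRevision int) string {
	failing := 0
	for _, n := range nodes {
		if n.failing {
			failing++
		}
	}
	msg := fmt.Sprintf("%d nodes are failing on revision %d:\n", failing, targetRevision)
	for _, n := range nodes {
		if !n.ready && n.revision < targetRevision {
			msg += "static pod has been installed, but is not ready while new revision is pending"
		}
	}
	return msg
}

func main() {
	// One node, still on revision 5, not ready, but not failing.
	nodes := []nodeStatus{{revision: 5, failing: false, ready: false}}
	fmt.Println(failingMessage(nodes, 6))
}
```

A clearer message would either suppress the count when it is zero or report the not-ready-on-pending-revision state under a different condition than "failing".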
https://bugzilla.redhat.com/show_bug.cgi?id=1690153 opened for kube-scheduler, this one is for kube-apiserver.
Created attachment 1546776 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 47 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

  $ deck-build-log-plot 'clusteroperator/kube-apiserver .* NodeInstallerFailing: 0 nodes are failing on revision'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
openshift-controller-manager was referenced, but if you look at the clusteroperator objects you can see that a number of operators did not move to the new version, which means the CVO is not being accurate in reporting what it is blocked on.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/artifacts/e2e-aws-upgrade/clusteroperators.json

I didn't investigate all of them, but openshift-controller-manager itself has at least created pods at the new version:

  I0318 17:30:42.273629       1 controller_manager.go:41] Starting controllers on 0.0.0.0:8443 (v4.0.0-alpha.0+f7ad8ee-1673-dirty)
  I0318 17:30:42.276219       1 controller_manager.go:52] DeploymentConfig controller using images from "registry.svc.ci.openshift.org/ocp/4.0-2019-03-18-152932@sha256:60aa870f57529048199aa4691d8aa7dc9b65614dcd45c6fe550277016415ab02"

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/artifacts/e2e-aws-upgrade/pods/openshift-controller-manager_controller-manager-b4d9n_controller-manager.log.gz

(152932 is the new version.)

Unfortunately I don't think we have the actual openshift-controller-manager daemonset, so we can't see what it was reporting in terms of replicas at the new version, which is how the openshift-controller-manager operator decides when it is safe to report the new version. My guess is that we still had old replicas running, so we couldn't report the new version.
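My understanding of that gate, sketched below (this is not the operator's actual code; the type and function names are hypothetical, though the status fields mirror Kubernetes' DaemonSetStatus): the operator should only claim the new version once every desired replica has been updated and is available, so old replicas still running keep it reporting the old version.

```go
package main

import "fmt"

// dsStatus holds the handful of DaemonSet status fields (mirroring
// appsv1.DaemonSetStatus) that a rollout check would look at.
type dsStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberAvailable        int32
	ObservedGeneration     int64
}

// rolledOut sketches the version gate: the controller has seen the latest
// spec, every desired replica runs the updated template, and all of them
// are available. Any old replica still running makes this false.
func rolledOut(s dsStatus, generation int64) bool {
	return s.ObservedGeneration >= generation &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberAvailable == s.DesiredNumberScheduled
}

func main() {
	// One of three replicas still on the old version: not rolled out yet.
	fmt.Println(rolledOut(dsStatus{3, 2, 3, 7}, 7))
	// All three updated and available: safe to report the new version.
	fmt.Println(rolledOut(dsStatus{3, 3, 3, 7}, 7))
}
```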
I opened https://bugzilla.redhat.com/show_bug.cgi?id=1692353 for the problem that the CVO does not report all the operators it is waiting for. Adam, I think you'll want to work with Steve to get the must-gather tool run against all the cluster operators in the case of upgrade failures; without that information it's going to be tough to debug this (and again, it wasn't just the controller-manager operator, so most likely there was a cluster-wide problem that kept something from rolling out).
More debugging:

etcd-member-ip-10-0-136-114.ec2.internal Initialized=True, Ready=True, ContainersReady=True, PodScheduled=True
[!] Container "etcd-member" restarted 1 times, last exit 255 caused by:

  with peer fdb8f20332e073fe (stream MsgApp v2 writer)
  2019-03-18 17:22:31.740347 I | rafthttp: established a TCP streaming connection with peer 3500ccc0ee47fc1e (stream MsgApp v2 reader)
  2019-03-18 17:22:31.740840 I | rafthttp: established a TCP streaming connection with peer 3500ccc0ee47fc1e (stream Message reader)
  2019-03-18 17:22:31.747235 I | rafthttp: established a TCP streaming connection with peer fdb8f20332e073fe (stream Message reader)
  2019-03-18 17:22:31.749906 I | rafthttp: established a TCP streaming connection with peer fdb8f20332e073fe (stream MsgApp v2 reader)
  2019-03-18 17:22:31.764756 I | etcdserver: 41789addb42b0807 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
  2019-03-18 17:22:32.004281 I | etcdserver: published {Name:etcd-member-ip-10-0-136-114.ec2.internal ClientURLs:[https://10.0.136.114:2379]} to cluster 30b8baee8832e2d0
  2019-03-18 17:22:32.004323 I | embed: ready to serve client requests
  2019-03-18 17:22:32.006218 I | embed: serving client requests on [::]:2379
  2019-03-18 17:22:32.015455 I | embed: rejected connection from "127.0.0.1:56496" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
  WARNING: 2019/03/18 17:22:32 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
  proto: no coders for int
  proto: no encoder for ValueSize int [GetProperties]
  2019-03-18 17:22:34.961154 W | etcdserver: request "header:<ID:8358234032595949200 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/openshift.io/pods/openshift-marketplace/certified-operators-5866dd7865-bv8gc\" mod_revision:49038 > success:<request_put:<key:\"/openshift.io/pods/openshift-marketplace/certified-operators-5866dd7865-bv8gc\" value_size:2066 >> failure:<>>" with result "size:18" took too long (134.727982ms) to execute
  2019-03-18 17:22:35.502047 N | pkg/osutil: received terminated signal, shutting down...

There are some SDN failures:

[!] Container "sdn" restarted 8 times, last exit 255 caused by:
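The "certificate specifies an incompatible key usage" rejection above usually means the client certificate lacks the "TLS Web Client Authentication" extended key usage. If the certs are gathered from the node, something like the following would confirm it (standard openssl; the throwaway cert here just demonstrates the check, and in the real case you'd inspect the certificate etcd actually rejected):

```shell
# Generate a throwaway self-signed cert with the clientAuth EKU, just to
# demonstrate the check (openssl 1.1.1+ for -addext).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-key.pem \
  -out /tmp/demo-cert.pem -days 1 -subj '/CN=demo' \
  -addext 'extendedKeyUsage=clientAuth'

# A cert etcd will accept for client connections must show
# "TLS Web Client Authentication" here:
openssl x509 -in /tmp/demo-cert.pem -noout -text | grep -A1 'Extended Key Usage'
```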
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale". If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
I would still like to see the messaging fixed, as described in the initial report (https://bugzilla.redhat.com/show_bug.cgi?id=1690088#c0) and clarified again here: https://bugzilla.redhat.com/show_bug.cgi?id=1690088#c12