Bug 1690088
| Summary: | clusteroperator/kube-apiserver: NodeInstallerFailing after upgrade (0 nodes are failing on revision...) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | kube-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED WONTFIX | QA Contact: | Xingxing Xia <xxia> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.1.0 | CC: | aos-bugs, calfonso, decarr, jokerman, mfojtik, mmccomas, nagrawal, rvokal, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-14 11:17:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Ben Parees, 2019-03-18 18:53:07 UTC)
https://bugzilla.redhat.com/show_bug.cgi?id=1690153 was opened for kube-scheduler; this one is for kube-apiserver.

Created attachment 1546776 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 47 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

$ deck-build-log-plot 'clusteroperator/kube-apiserver .* NodeInstallerFailing: 0 nodes are failing on revision'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

openshift-controller-manager was referenced, but if you look at the clusteroperator objects you can see that a number of operators did not move to the new version, which means the CVO is not being accurate in reporting what it is blocked on:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/artifacts/e2e-aws-upgrade/clusteroperators.json

I didn't investigate all of them, but openshift-controller-manager itself has at least created pods at the new version:

I0318 17:30:42.273629 1 controller_manager.go:41] Starting controllers on 0.0.0.0:8443 (v4.0.0-alpha.0+f7ad8ee-1673-dirty)
I0318 17:30:42.276219 1 controller_manager.go:52] DeploymentConfig controller using images from "registry.svc.ci.openshift.org/ocp/4.0-2019-03-18-152932@sha256:60aa870f57529048199aa4691d8aa7dc9b65614dcd45c6fe550277016415ab02"

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/259/artifacts/e2e-aws-upgrade/pods/openshift-controller-manager_controller-manager-b4d9n_controller-manager.log.gz

(152932 is the new version.) Unfortunately I don't think we have the actual openshift-controller-manager daemonset, so we can't see what it was reporting in terms of replicas at the new version, which is how the openshift-controller-manager operator decides when it is safe to report the new version. My guess is that we still had old replicas running, so we couldn't report the new version.

I opened https://bugzilla.redhat.com/show_bug.cgi?id=1692353 for the problem that the CVO does not report all the operators it is waiting for.

Adam, I think you'll want to work with Steve to get the must-gather tool run against all the cluster operators in the case of upgrade failures; without that information it's going to be tough to debug this (and again, it wasn't just the controller-manager operator, so most likely there was a cluster-wide problem that kept something from rolling out).

More debugging:
etcd-member-ip-10-0-136-114.ec2.internal Initialized=True, Ready=True, ContainersReady=True, PodScheduled=True
[!] Container "etcd-member" restarted 1 times, last exit 255 caused by:
with peer fdb8f20332e073fe (stream MsgApp v2 writer)
2019-03-18 17:22:31.740347 I | rafthttp: established a TCP streaming connection with peer 3500ccc0ee47fc1e (stream MsgApp v2 reader)
2019-03-18 17:22:31.740840 I | rafthttp: established a TCP streaming connection with peer 3500ccc0ee47fc1e (stream Message reader)
2019-03-18 17:22:31.747235 I | rafthttp: established a TCP streaming connection with peer fdb8f20332e073fe (stream Message reader)
2019-03-18 17:22:31.749906 I | rafthttp: established a TCP streaming connection with peer fdb8f20332e073fe (stream MsgApp v2 reader)
2019-03-18 17:22:31.764756 I | etcdserver: 41789addb42b0807 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
2019-03-18 17:22:32.004281 I | etcdserver: published {Name:etcd-member-ip-10-0-136-114.ec2.internal ClientURLs:[https://10.0.136.114:2379]} to cluster 30b8baee8832e2d0
2019-03-18 17:22:32.004323 I | embed: ready to serve client requests
2019-03-18 17:22:32.006218 I | embed: serving client requests on [::]:2379
2019-03-18 17:22:32.015455 I | embed: rejected connection from "127.0.0.1:56496" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
WARNING: 2019/03/18 17:22:32 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
proto: no coders for int
proto: no encoder for ValueSize int [GetProperties]
2019-03-18 17:22:34.961154 W | etcdserver: request "header:<ID:8358234032595949200 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/openshift.io/pods/openshift-marketplace/certified-operators-5866dd7865-bv8gc\" mod_revision:49038 > success:<request_put:<key:\"/openshift.io/pods/openshift-marketplace/certified-operators-5866dd7865-bv8gc\" value_size:2066 >> failure:<>>" with result "size:18" took too long (134.727982ms) to execute
2019-03-18 17:22:35.502047 N | pkg/osutil: received terminated signal, shutting down...
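The "certificate specifies an incompatible key usage" rejection in the log above usually means the presented client certificate does not carry the client-auth extended key usage that etcd requires. A minimal Go sketch for checking that on a certificate pulled from the artifacts; the file path is a placeholder, not a path from this cluster:

```go
// Sketch only: print whether a PEM-encoded certificate allows client auth.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	// Hypothetical path; substitute the client certificate you want to inspect.
	raw, err := os.ReadFile("client.crt")
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(raw)
	if block == nil {
		log.Fatal("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	hasClientAuth := false
	for _, eku := range cert.ExtKeyUsage {
		if eku == x509.ExtKeyUsageClientAuth {
			hasClientAuth = true
		}
	}
	fmt.Printf("subject=%q extKeyUsage=%v clientAuth=%v\n",
		cert.Subject.CommonName, cert.ExtKeyUsage, hasClientAuth)
}
```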
There are some SDN failures:
[!] Container "sdn" restarted 8 times, last exit 255 caused by:
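For the earlier observation that several operators never moved to the new version, one way to scan the gathered clusteroperators.json is sketched below. This is not an official tool, just a minimal reader of the ClusterOperator list in that artifact; the local file name is an assumption.

```go
// Sketch only: print each operator's reported versions and its
// Progressing/Degraded/Failing conditions from a gathered clusteroperators.json.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type clusterOperatorList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Versions []struct {
				Name    string `json:"name"`
				Version string `json:"version"`
			} `json:"versions"`
			Conditions []struct {
				Type    string `json:"type"`
				Status  string `json:"status"`
				Message string `json:"message"`
			} `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	raw, err := os.ReadFile("clusteroperators.json") // downloaded CI artifact
	if err != nil {
		log.Fatal(err)
	}
	var list clusterOperatorList
	if err := json.Unmarshal(raw, &list); err != nil {
		log.Fatal(err)
	}
	for _, co := range list.Items {
		fmt.Printf("%s:\n", co.Metadata.Name)
		for _, v := range co.Status.Versions {
			fmt.Printf("  version %s=%s\n", v.Name, v.Version)
		}
		for _, c := range co.Status.Conditions {
			if c.Type == "Progressing" || c.Type == "Degraded" || c.Type == "Failing" {
				fmt.Printf("  %s=%s: %s\n", c.Type, c.Status, c.Message)
			}
		}
	}
}
```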
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale". If you have further information on the current state of the bug, please update it; otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

I would still like to see the messaging fixed, as described in the initial report (https://bugzilla.redhat.com/show_bug.cgi?id=1690088#c0) and clarified again here: https://bugzilla.redhat.com/show_bug.cgi?id=1690088#c12
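The messaging fix being asked for amounts to not emitting a failure message when zero nodes are failing. A minimal sketch of that shape, with illustrative names only (this is not the operator's actual code):

```go
// Sketch only: build a NodeInstallerFailing message that is suppressed when
// no nodes are actually failing, instead of reporting "0 nodes are failing".
package main

import "fmt"

// nodeInstallerMessage returns a failure message for the given revision.
// An empty string means the condition should not be reported as failing.
func nodeInstallerMessage(failingNodes []string, revision int) string {
	if len(failingNodes) == 0 {
		// This is the misleading case from the bug report; say nothing
		// rather than "0 nodes are failing on revision N".
		return ""
	}
	return fmt.Sprintf("%d nodes are failing on revision %d: %v",
		len(failingNodes), revision, failingNodes)
}

func main() {
	fmt.Printf("%q\n", nodeInstallerMessage(nil, 12))                         // ""
	fmt.Printf("%q\n", nodeInstallerMessage([]string{"ip-10-0-136-114"}, 12)) // non-empty
}
```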