Bug 1858763
Summary: | Static pod installer controller deadlocks with non-existing installer pod, WAS: kube-apiserver of cluster operator always with incorrect status due to PLEG error | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> | |
Component: | kube-apiserver | Assignee: | Luis Sanchez <sanchezl> | |
Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 4.5 | CC: | aos-bugs, cblecker, cmeadors, jokerman, jupierce, mfojtik, schoudha, sttts, travi, wking, xxia | |
Target Milestone: | --- | Keywords: | ServiceDeliveryImpact, Upgrades | |
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1874597 1909600 (view as bug list) | Environment: | ||
Last Closed: | 2020-10-27 16:16:00 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1822018 | |||
Bug Blocks: | 1817419, 1822016, 1874597, 1909600, 1949370 |
Description
Ke Wang
2020-07-20 10:35:44 UTC
Something doesn't match up here. All the nodes show Ready and all the static pods show Running at revision 7. It seems like the operator is seeing something we aren't, or not seeing something that we are.

From the kubelet logs:

767:Jul 20 05:41:36.046042 control-plane-0 hyperkube[1463]: E0720 05:41:36.045882 1463 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
19751:Jul 20 05:43:46.238236 control-plane-1 hyperkube[1472]: I0720 05:43:46.237981 1472 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 05:43:46.237942637 +0000 UTC m=+7.361174164 LastTransitionTime:2020-07-20 05:43:46.237942637 +0000 UTC m=+7.361174164 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
19761:Jul 20 05:43:46.272003 control-plane-1 hyperkube[1472]: E0720 05:43:46.271855 1472 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
50484:Jul 20 05:46:49.843351 control-plane-2 hyperkube[1462]: E0720 05:46:49.843327 1462 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
172942:Jul 20 06:01:41.422914 control-plane-0 hyperkube[1482]: I0720 06:01:41.421909 1482 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:01:41.421861118 +0000 UTC m=+7.287055459 LastTransitionTime:2020-07-20 06:01:41.421861118 +0000 UTC m=+7.287055459 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
173201:Jul 20 06:01:41.510137 control-plane-0 hyperkube[1482]: E0720 06:01:41.509003 1482 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
204954:Jul 20 06:03:52.931798 control-plane-1 hyperkube[1472]: I0720 06:03:52.931676 1472 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:03:52.931638668 +0000 UTC m=+7.225051242 LastTransitionTime:2020-07-20 06:03:52.931638668 +0000 UTC m=+7.225051242 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
204963:Jul 20 06:03:52.952973 control-plane-1 hyperkube[1472]: E0720 06:03:52.951399 1472 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
239275:Jul 20 06:06:12.498371 control-plane-2 hyperkube[1471]: I0720 06:06:12.460705 1471 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:06:12.460663895 +0000 UTC m=+7.069575540 LastTransitionTime:2020-07-20 06:06:12.460663895 +0000 UTC m=+7.069575540 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
239293:Jul 20 06:06:12.500399 control-plane-2 hyperkube[1471]: E0720 06:06:12.498580 1471 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
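For anyone retracing this diagnosis, here is a minimal sketch of how the PLEG messages above could be pulled from the control-plane kubelets. It assumes a logged-in `oc` client, the default `node-role.kubernetes.io/master` label on control-plane nodes, and that `oc adm node-logs` is available; the grep pattern is simply the message text seen in the excerpts above.

```bash
#!/usr/bin/env bash
# Sketch: scan kubelet logs on all control-plane nodes for PLEG health errors.
# Assumptions: logged-in `oc` client; control-plane nodes carry the standard master role label.
set -euo pipefail

for node in $(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${node} ==="
  # `oc adm node-logs -u kubelet` streams the kubelet journal from the node.
  oc adm node-logs "${node}" -u kubelet | grep -n 'PLEG is not healthy' || echo "no PLEG errors found"
done
```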
So the operator(s) (all control-plane operators, notice that!) do not lie. The kubelet reports the PLEG error. I also don't see the PLEG error showing up in the kubeapiserver operator CR later on. So as far as I can see the condition is right (and the PLEG error in between is right as well). What does seem to be the case, though, is that the installer pods have disappeared:

- lastTransitionTime: "2020-07-20T05:39:04Z"
  message: |-
    1 nodes are failing on revision 7:
    pods "installer-7-control-plane-0" not found
  reason: InstallerPodFailed
  status: "True"
  type: NodeInstallerDegraded

*** Bug 1723966 has been marked as a duplicate of this bug. ***

There is still one PR pending.

*** Bug 1861899 has been marked as a duplicate of this bug. ***

If a cluster gets stuck in this state, the following steps can be used to allow the kube-apiserver-operator to recover (a consolidated sketch of these steps follows this comment):

- Identify the node that has the failed installation:

  oc get kubeapiservers cluster -o json | jq .status.nodeStatuses

- Zero out the currentRevision of the node in question (in this example, it was the third master):

  curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $(oc whoami --show-token)" --request PATCH --data '[{"op": "replace", "path": "/status/nodeStatuses/2/currentRevision", "value": 0}]' "$(oc whoami --show-server)/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"

- Zero out the lastFailedRevision of the node in question (in this example, it was the third master):

  curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $(oc whoami --show-token)" --request PATCH --data '[{"op": "replace", "path": "/status/nodeStatuses/2/lastFailedRevision", "value": 0}]' "$(oc whoami --show-server)/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"

- If needed, delete and restart the kube-apiserver-operator pod:

  oc delete pods -n openshift-kube-apiserver-operator -l app=kube-apiserver-operator

*** Bug 1817419 has been marked as a duplicate of this bug. ***
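The sketch below simply consolidates the manual recovery commands from the comment above into one script. It assumes a logged-in `oc` client with cluster-admin, and the `NODE_INDEX` variable is a placeholder for the index of the failing entry in `.status.nodeStatuses` (2 in the example above); as in the original commands, any extra TLS options for the API server certificate are omitted.

```bash
#!/usr/bin/env bash
# Sketch of the manual recovery procedure described in the comment above.
# Assumptions: logged-in `oc` client with cluster-admin; NODE_INDEX set to the index
# of the failing entry in .status.nodeStatuses (2 in the example in this bug).
set -euo pipefail

NODE_INDEX="${NODE_INDEX:-2}"   # placeholder: adjust to the failing node's index
API_SERVER="$(oc whoami --show-server)"
TOKEN="$(oc whoami --show-token)"
STATUS_URL="${API_SERVER}/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"

# 1. Inspect the per-node revision status to find the failing node.
oc get kubeapiservers cluster -o json | jq .status.nodeStatuses

# 2. Zero out currentRevision and lastFailedRevision for that node via a JSON Patch
#    against the /status subresource, as in the commands above.
for field in currentRevision lastFailedRevision; do
  curl --header "Content-Type: application/json-patch+json" \
       --header "Authorization: Bearer ${TOKEN}" \
       --request PATCH \
       --data "[{\"op\": \"replace\", \"path\": \"/status/nodeStatuses/${NODE_INDEX}/${field}\", \"value\": 0}]" \
       "${STATUS_URL}"
done

# 3. If needed, restart the kube-apiserver-operator pod so it picks up the cleared status.
oc delete pods -n openshift-kube-apiserver-operator -l app=kube-apiserver-operator
```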
UPI installed one disconnected cluster on vSphere 6.7 successfully; see cluster details:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-03-084733   True        False         41m     Cluster version is 4.6.0-0.nightly-2020-09-03-084733

$ oc get infrastructures.config.openshift.io -o json | jq .items[0].spec.platformSpec
{
  "type": "VSphere"
}

$ oc get nodes
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   52m   v1.19.0-rc.2+b5dc585-dirty
compute-1         Ready    worker   52m   v1.19.0-rc.2+b5dc585-dirty
control-plane-0   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty
control-plane-1   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty
control-plane-2   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      52s
cloud-credential                           4.6.0-0.nightly-2020-09-03-084733   True        False         False      68m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-03-084733   True        False         False      54m
config-operator                            4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
console                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      21m
dns                                        4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
etcd                                       4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
image-registry                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      22m
ingress                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      48m
insights                                   4.6.0-0.nightly-2020-09-03-084733   True        False         False      55m
kube-apiserver                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      59m
kube-controller-manager                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
kube-scheduler                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      59m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-03-084733   True        False         False      23m
machine-api                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      51m
machine-approver                           4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
machine-config                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
marketplace                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      23m
monitoring                                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
network                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      48m
node-tuning                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
openshift-apiserver                        4.6.0-0.nightly-2020-09-03-084733   True        False         False      18m
openshift-controller-manager               4.6.0-0.nightly-2020-09-03-084733   True        False         False      46m
openshift-samples                          4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
service-ca                                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
storage                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m

The cluster works well, so moving the bug to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
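As a follow-up to the verification above, here is a small sketch of how the originally reported symptom could be re-checked on a fixed build. It assumes a logged-in `oc` client; the condition names (Degraded on the cluster operator, NodeInstallerDegraded on the operator CR) come from the status snippet earlier in this bug.

```bash
#!/usr/bin/env bash
# Sketch: confirm the kube-apiserver operator is no longer degraded and that the
# NodeInstallerDegraded condition from this bug is not set. Assumes a logged-in `oc` client.
set -euo pipefail

# Cluster-operator level: Degraded should be False.
oc get co kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'

# Operator CR level: NodeInstallerDegraded (the condition seen in this bug) should be False.
oc get kubeapiserver cluster -o jsonpath='{.status.conditions[?(@.type=="NodeInstallerDegraded")].status}{"\n"}'

# Per-node revision status, for comparison with the nodeStatuses check described above.
oc get kubeapiserver cluster -o json | jq .status.nodeStatuses
```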