Description of problem:
UPI install of one disconnected cluster on vSphere 6.7. Checked the cluster operators:

$ oc get co
NAME                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
kube-apiserver       4.5.2     True        True          True       124m

Version-Release number of selected component (if applicable):
4.5.2-x86_64

How reproducible:
Always

Steps to Reproduce:
UPI install one disconnected cluster on vSphere 6.7.

Actual results:
$ oc get node
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   3h49m   v1.18.3+b74c5ed
compute-1         Ready    worker   3h49m   v1.18.3+b74c5ed
compute-2         Ready    worker   3h49m   v1.18.3+b74c5ed
control-plane-0   Ready    master   4h2m    v1.18.3+b74c5ed
control-plane-1   Ready    master   4h2m    v1.18.3+b74c5ed
control-plane-2   Ready    master   4h2m    v1.18.3+b74c5ed

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.2     True        False         109m    Error while reconciling 4.5.2: the cluster operator kube-apiserver is degraded

$ oc describe co kube-apiserver
...
Status:
  Conditions:
    Last Transition Time:  2020-07-20T05:41:04Z
    Message:               NodeInstallerDegraded: 1 nodes are failing on revision 7:
                           NodeInstallerDegraded: pods "installer-7-control-plane-0" not found
    Reason:                NodeInstaller_InstallerPodFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-20T05:20:51Z
    Message:               NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-07-20T05:23:46Z
    Message:               StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 7
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-07-20T05:20:44Z
...

Actually, checking the kube-apiserver pods, they are all at revision 7 as expected.
$ oc get pod -n openshift-kube-apiserver --show-labels
NAME                             READY   STATUS    RESTARTS   AGE    LABELS
kube-apiserver-control-plane-0   4/4     Running   0          120m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-control-plane-1   4/4     Running   0          123m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-control-plane-2   4/4     Running   0          124m   apiserver=true,app=openshift-kube-apiserver,revision=7

$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-56498d4959-sgprw | grep -n '1 nodes are failing on revision 7'
181:I0720 06:06:12.740042 1 status_controller.go:172] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2020-07-20T05:41:04Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"control-plane-2\" not ready since 2020-07-20 06:06:12 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])","reason":"NodeController_MasterNodesReady::NodeInstaller_InstallerPodFailed","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-07-20T05:20:51Z","message":"NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-07-20T05:23:46Z","message":"StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-07-20T05:20:44Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
182:I0720 06:06:12.801423 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"2a69b33f-6ce3-4cae-84d2-599cd26dafd5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found" to "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"control-plane-2\" not ready since 2020-07-20 06:06:12 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])"
183:I0720 06:06:12.882579 1 status_controller.go:172] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2020-07-20T05:41:04Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found","reason":"NodeInstaller_InstallerPodFailed","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-07-20T05:20:51Z","message":"NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-07-20T05:23:46Z","message":"StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-07-20T05:20:44Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
184:I0720 06:06:12.947788 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"2a69b33f-6ce3-4cae-84d2-599cd26dafd5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"control-plane-2\" not ready since 2020-07-20 06:06:12 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])" to "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found"

Expected results:
The kube-apiserver cluster operator should report the correct status.

Additional info:
Something doesn't match up here. All the nodes show Ready and all the static pods show Running at revision 7. It seems like the operator is seeing something we aren't, or not seeing something that we are.
From kubelet logs:

767:Jul 20 05:41:36.046042 control-plane-0 hyperkube[1463]: E0720 05:41:36.045882 1463 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
19751:Jul 20 05:43:46.238236 control-plane-1 hyperkube[1472]: I0720 05:43:46.237981 1472 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 05:43:46.237942637 +0000 UTC m=+7.361174164 LastTransitionTime:2020-07-20 05:43:46.237942637 +0000 UTC m=+7.361174164 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
19761:Jul 20 05:43:46.272003 control-plane-1 hyperkube[1472]: E0720 05:43:46.271855 1472 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
50484:Jul 20 05:46:49.843351 control-plane-2 hyperkube[1462]: E0720 05:46:49.843327 1462 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
172942:Jul 20 06:01:41.422914 control-plane-0 hyperkube[1482]: I0720 06:01:41.421909 1482 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:01:41.421861118 +0000 UTC m=+7.287055459 LastTransitionTime:2020-07-20 06:01:41.421861118 +0000 UTC m=+7.287055459 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
173201:Jul 20 06:01:41.510137 control-plane-0 hyperkube[1482]: E0720 06:01:41.509003 1482 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
204954:Jul 20 06:03:52.931798 control-plane-1 hyperkube[1472]: I0720 06:03:52.931676 1472 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:03:52.931638668 +0000 UTC m=+7.225051242 LastTransitionTime:2020-07-20 06:03:52.931638668 +0000 UTC m=+7.225051242 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
204963:Jul 20 06:03:52.952973 control-plane-1 hyperkube[1472]: E0720 06:03:52.951399 1472 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
239275:Jul 20 06:06:12.498371 control-plane-2 hyperkube[1471]: I0720 06:06:12.460705 1471 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:06:12.460663895 +0000 UTC m=+7.069575540 LastTransitionTime:2020-07-20 06:06:12.460663895 +0000 UTC m=+7.069575540 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
239293:Jul 20 06:06:12.500399 control-plane-2 hyperkube[1471]: E0720 06:06:12.498580 1471 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]

So the operators (note that this affects all control-plane operators!) do not lie: the kubelet really does report the PLEG error.
I also don't see the PLEG error showing up in the kube-apiserver operator CR later on. So as far as I can see, the Degraded condition is right (and the transient PLEG error in between is also right). What does seem to be the case, though, is that the installer pods have disappeared:

  - lastTransitionTime: "2020-07-20T05:39:04Z"
    message: |-
      1 nodes are failing on revision 7:
      pods "installer-7-control-plane-0" not found
    reason: InstallerPodFailed
    status: "True"
    type: NodeInstallerDegraded
*** Bug 1723966 has been marked as a duplicate of this bug. ***
There is still one PR pending.
*** Bug 1861899 has been marked as a duplicate of this bug. ***
If a cluster gets stuck in this state, the following steps can be used to allow the kube-apiserver-operator to recover:

- Identify the node that has the failed installation:

  oc get kubeapiservers cluster -o json | jq .status.nodeStatuses

- Zero out the currentRevision of the node in question (in this example, it was the third master):

  curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $(oc whoami --show-token)" --request PATCH --data '[{"op": "replace", "path": "/status/nodeStatuses/2/currentRevision", "value": 0}]' "$(oc whoami --show-server)/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"

- Zero out the lastFailedRevision of the node in question (in this example, it was the third master):

  curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $(oc whoami --show-token)" --request PATCH --data '[{"op": "replace", "path": "/status/nodeStatuses/2/lastFailedRevision", "value": 0}]' "$(oc whoami --show-server)/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"

- If needed, delete and restart the kube-apiserver-operator pod:

  oc delete pods -n openshift-kube-apiserver-operator -l app=kube-apiserver-operator
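The index used in the JSON-patch paths above (/status/nodeStatuses/2/...) must match the position of the failed node in the nodeStatuses array. A minimal offline sketch of how to find that index and build the patch payloads, in Python; the sample nodeStatuses data below is hypothetical, only mirroring the shape returned by `oc get kubeapiservers cluster -o json`:

```python
import json

# Hypothetical sample of .status.nodeStatuses from
# `oc get kubeapiservers cluster -o json` (values invented for illustration).
node_statuses = [
    {"nodeName": "control-plane-0", "currentRevision": 7},
    {"nodeName": "control-plane-1", "currentRevision": 7},
    {"nodeName": "control-plane-2", "currentRevision": 5, "lastFailedRevision": 7},
]

# The failed node is the one carrying a lastFailedRevision entry.
idx = next(i for i, s in enumerate(node_statuses) if "lastFailedRevision" in s)

# Build the two JSON-patch payloads used by the curl commands above.
patches = [
    [{"op": "replace", "path": f"/status/nodeStatuses/{idx}/currentRevision", "value": 0}],
    [{"op": "replace", "path": f"/status/nodeStatuses/{idx}/lastFailedRevision", "value": 0}],
]
for p in patches:
    print(json.dumps(p))
```

With the sample data above this selects index 2 (the third master), matching the example paths in the curl commands.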
*** Bug 1817419 has been marked as a duplicate of this bug. ***
UPI installed one disconnected cluster on vSphere 6.7 successfully. Cluster details:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-03-084733   True        False         41m     Cluster version is 4.6.0-0.nightly-2020-09-03-084733

$ oc get infrastructures.config.openshift.io -o json | jq .items[0].spec.platformSpec
{
  "type": "VSphere"
}

$ oc get nodes
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   52m   v1.19.0-rc.2+b5dc585-dirty
compute-1         Ready    worker   52m   v1.19.0-rc.2+b5dc585-dirty
control-plane-0   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty
control-plane-1   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty
control-plane-2   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      52s
cloud-credential                           4.6.0-0.nightly-2020-09-03-084733   True        False         False      68m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-03-084733   True        False         False      54m
config-operator                            4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
console                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      21m
dns                                        4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
etcd                                       4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
image-registry                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      22m
ingress                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      48m
insights                                   4.6.0-0.nightly-2020-09-03-084733   True        False         False      55m
kube-apiserver                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      59m
kube-controller-manager                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
kube-scheduler                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      59m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-03-084733   True        False         False      23m
machine-api                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      51m
machine-approver                           4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
machine-config                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
marketplace                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      23m
monitoring                                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
network                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      48m
node-tuning                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
openshift-apiserver                        4.6.0-0.nightly-2020-09-03-084733   True        False         False      18m
openshift-controller-manager               4.6.0-0.nightly-2020-09-03-084733   True        False         False      46m
openshift-samples                          4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
service-ca                                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
storage                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m

The cluster works well, so moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196