The customer performed a fresh installation of an OCP cluster, but noticed that nodes constantly go into the NotReady state without any action being taken. The current state is that the master node ocp02-hbtsw-master-0 is NotReady:

```
[openshift@bastion ~]$ oc get nodes -o wide
NAME                   STATUS     ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ocp02-hbtsw-master-0   NotReady   master   64d   v1.18.3+47c0e71   10.15.2.10    10.15.2.10    Red Hat Enterprise Linux CoreOS 45.82.202009181447-0 (Ootpa)   4.18.0-193.23.1.el8_2.x86_64   cri-o://1.18.3-15.rhaos4.5.gitae4ef7b.el8

[openshift@bastion ~]$ oc describe co machine-config
Name:         machine-config
Conditions:
  Last Transition Time:  2020-10-13T02:02:33Z
  Message:               Cluster version is 4.5.13
  Status:                False
  Type:                  Progressing
  Last Transition Time:  2020-11-29T17:18:52Z
  Message:               Failed to resync 4.5.13 because: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 10, updated: 10, ready: 9, unavailable: 1)
  Reason:                MachineConfigDaemonFailed
  Status:                True
  Type:                  Degraded
  Last Transition Time:  2020-11-29T17:18:52Z
  Message:               Cluster not available for 4.5.13
  Status:                False
  Type:                  Available
  Last Transition Time:  2020-09-28T19:58:39Z
  Reason:                AsExpected
  Status:                True
  Type:                  Upgradeable
```

The cluster-version-operator logs show the same node failure surfacing through multiple operators:

```
2020-11-29T17:21:24.297252241Z E1129 17:21:24.297181 1 task.go:81] error running apply for clusteroperator "kube-controller-manager" (98 of 586): Cluster operator kube-controller-manager is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
2020-11-29T17:21:24.297365581Z I1129 17:21:24.297338 1 task_graph.go:575] Running 10 on 1
2020-11-29T17:21:24.297437956Z I1129 17:21:24.297415 1 task_graph.go:575] Running 11 on 1
2020-11-29T17:21:24.297500247Z I1129 17:21:24.297476 1 task_graph.go:568] Canceled worker 1
2020-11-29T17:21:24.298339975Z E1129 17:21:24.297564 1 task.go:81] error running apply for clusteroperator "kube-scheduler" (107 of 586): Cluster operator kube-scheduler is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
2020-11-29T17:21:24.298526604Z I1129 17:21:24.298496 1 task_graph.go:568] Canceled worker 0
2020-11-29T17:21:24.298644952Z I1129 17:21:24.298620 1 task_graph.go:588] Workers finished
2020-11-29T17:21:24.299209076Z I1129 17:21:24.299149 1 task_graph.go:596] Result of work: [Cluster operator kube-controller-manager is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
update context deadline exceeded at 49 of 586
update context deadline exceeded at 49 of 586
Cluster operator kube-scheduler is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)]
2020-11-29T17:21:24.299322938Z I1129 17:21:24.299292 1 sync_worker.go:818] Summarizing 2 errors
2020-11-29T17:21:24.299502102Z I1129 17:21:24.299468 1 sync_worker.go:822] Update error 98 of 586: ClusterOperatorDegraded Cluster operator kube-controller-manager is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) (*errors.errorString: cluster operator kube-controller-manager is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.))
2020-11-29T17:21:24.299616247Z I1129 17:21:24.299585 1 sync_worker.go:822] Update error 107 of 586: ClusterOperatorDegraded Cluster operator kube-scheduler is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) (*errors.errorString: cluster operator kube-scheduler is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.))
2020-11-29T17:21:24.300303118Z E1129 17:21:24.299734 1 sync_worker.go:329] unable to synchronize image (waiting 43.131425612s): Multiple errors are preventing progress:
2020-11-29T17:21:24.300303118Z * Cluster operator kube-controller-manager is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
2020-11-29T17:21:24.300303118Z * Cluster operator kube-scheduler is reporting a failure: NodeControllerDegraded: The master nodes not ready: node "ocp02-hbtsw-master-0" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
```
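Note how the CVO summarizes the per-operator failures ("Summarizing 2 errors") and keeps retrying the sync. When triaging from a saved CVO log, the distinct failing clusteroperators can be pulled out with a one-liner; the sketch below runs against a hypothetical excerpt (the file name and trimmed contents are stand-ins mirroring the entries above):

```shell
# Stand-in excerpt of a saved cluster-version-operator log (hypothetical file)
cat > cvo.log <<'EOF'
E1129 17:21:24.297181 1 task.go:81] error running apply for clusteroperator "kube-controller-manager" (98 of 586): Cluster operator kube-controller-manager is reporting a failure
E1129 17:21:24.297564 1 task.go:81] error running apply for clusteroperator "kube-scheduler" (107 of 586): Cluster operator kube-scheduler is reporting a failure
EOF

# List the distinct clusteroperators the CVO failed to apply
grep -o 'clusteroperator "[^"]*"' cvo.log | sort -u
```

Against a live cluster the same filter would be fed from something like `oc logs -n openshift-cluster-version deployment/cluster-version-operator`.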
```
2020-11-29T17:21:24.300435013Z I1129 17:21:24.299407 1 task_graph.go:524] Stopped graph walker due to cancel
```

The clusteroperator status at this point:

```
NAME                           VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns                            4.5.13    True        True          False      64d
etcd                           4.5.13    True        False         True       64d
kube-apiserver                 4.5.13    True        True          True       64d
kube-controller-manager        4.5.13    True        False         True       64d
kube-scheduler                 4.5.13    True        False         True       64d
machine-config                 4.5.13    False       False         True       2d20h
monitoring                     4.5.13    False       False         True       2d20h
network                        4.5.13    True        True          True       64d
openshift-apiserver            4.5.13    True        False         True       3h21m
openshift-controller-manager   4.5.13    True        True          False      2d10h
```

The kube-controller-manager operator logs report the same degraded condition, along with request throttling:

```
2020-11-30T00:32:01.604882302Z I1130 00:32:01.604160 1 request.go:621] Throttling request took 1.150287646s, request: GET:https://10.19.0.1:443/api/v1/namespaces/openshift-kube-controller-manager
2020-11-30T00:32:06.315470534Z I1130 00:32:06.315364 1 status_controller.go:172] clusteroperator/kube-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-11-29T17:10:47Z","message":"NodeControllerDegraded: The master nodes not ready: node \"ocp02-hbtsw-master-0\" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)","reason":"NodeController_MasterNodesReady","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-10-09T20:00:01Z","message":"NodeInstallerProgressing: 3 nodes are at revision 9","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2020-09-28T20:00:30Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 9","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-09-28T19:58:36Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
2020-11-30T00:32:06.377539913Z I1130 00:32:06.377440 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"7ed3e6d2-06fd-40a7-98bc-f45e5d41b760", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"ocp02-hbtsw-master-0\" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nTargetConfigControllerDegraded: \"serviceaccount/localhost-recovery-client\": etcdserver: leader changed" to "NodeControllerDegraded: The master nodes not ready: node \"ocp02-hbtsw-master-0\" not ready since 2020-11-29 17:08:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
2020-11-30T00:32:07.404959603Z I1130 00:32:07.404889 1 request.go:621] Throttling request took 1.079196601s, request: GET:https://10.19.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/config
2020-11-30T00:32:09.404471534Z I1130 00:32:09.404369 1 request.go:621] Throttling request took 1.082331542s, request: GET:https://10.19.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods?labelSelector=app%3Dinstaller
2020-11-30T00:38:00.111436125Z I1130 00:38:00.111293 1 request.go:621] Throttling request took 1.052669199s, request: GET:https://10.19.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/config
2020-11-30T00:38:01.311209377Z I1130 00:38:01.309755 1 request.go:621] Throttling request took 1.174773078s, request: GET:https://10.19.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller-config
2020-11-29T08:00:25.299022322Z W1129 08:00:25.298524 1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="ocp02-hbtsw-master-0" does not exist
```
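With this many operators unhealthy, it helps to filter the `oc get co` output on its DEGRADED column rather than scanning by eye. The sketch below runs the filter over a few sample rows copied from the output above; on a live cluster you would pipe `oc get co --no-headers` into the same `awk` instead of the stand-in variable:

```shell
# Sample rows in 'oc get co' column order: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
co='dns 4.5.13 True True False 64d
etcd 4.5.13 True False True 64d
machine-config 4.5.13 False False True 2d20h'

# Print operators whose DEGRADED (5th) column is True
echo "$co" | awk '$5 == "True" {print $1}'
```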
I'm seeing a bunch of errors in this cluster, which shouldn't be the case for a fresh install. First, a few required operators appear to be down, as mentioned above. Digging through a bit, I see one old kubelet error in the MCD logs on ocp02-hbtsw-master-2:

```
2020-11-26T14:42:34.738281168Z W1126 14:42:34.738123 4814 daemon.go:595] Got an error from auxiliary tools: kubelet health check has failed 1 times: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused
```

The MCO says (which would be the missing MCD and node):

```
2020-11-30T14:30:24.325371923Z E1130 14:30:24.325239 1 operator.go:331] timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 10, updated: 10, ready: 9, unavailable: 1)
```

Looking through cluster-scoped-resources/core/nodes for the master-0.yaml, it seems to be having issues with a missing node:

```
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: MemoryPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: DiskPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: PIDPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: Ready
```

In the kubelet logs I also see, for another of the masters (maybe unrelated):

```
Nov 30 05:04:09.780566 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.780528 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/e183874d-26e4-49e4-9ae8-28a24c6a17d4/volumes/kubernetes.io~secret/openshift-apiserver-operator-token-qzvts
...
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788680 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/ed3fbe7d-bc37-47db-aa3f-d54656d8581b/volumes/kubernetes.io~secret/serving-cert
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788777 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/46e2147d-43e4-4ad4-8f73-57b90a77c118/volumes/kubernetes.io~secret/openshift-config-operator-token-5trz7
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788812 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /run/containers/storage/overlay-containers/3bc2c63f394e374f2bf8533dea1779e0b7839f1c03c3de4c91dc11814604ecaf/userdata/shm
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.
```

I also see a missing kube-apiserver/openshift-apiserver/kube-controller-manager, etc. I'll pass this to the Node Team to investigate further what happened to the missing node (maybe that's also related to the failed kubelet health checks on the other masters?), and they might choose to pass it on to the vSphere team, as I can't discern whether this is a vSphere-specific issue or not. But I don't think the MCO caused the failed install; it is a symptom of some other problem that needs to be untangled.
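When the kubelet stops posting status, the node controller flips every condition on the node (MemoryPressure, DiskPressure, PIDPressure, Ready) to Unknown after the node-monitor grace period, which is exactly the pattern in the master-0 conditions above. A quick sanity check over a saved node YAML from a must-gather: if all the conditions are Unknown, the kubelet heartbeat is gone entirely rather than a single probe failing. The heredoc below is a trimmed, hypothetical stand-in for the real file:

```shell
# Trimmed stand-in for cluster-scoped-resources/core/nodes/<node>.yaml
cat > node.yaml <<'EOF'
- lastTransitionTime: "2020-11-29T17:08:47Z"
  reason: NodeStatusUnknown
  status: Unknown
  type: MemoryPressure
- lastTransitionTime: "2020-11-29T17:08:47Z"
  reason: NodeStatusUnknown
  status: Unknown
  type: Ready
EOF

# Count conditions stuck in Unknown; a count equal to the number of
# conditions means the kubelet heartbeat stopped entirely
grep -c 'status: Unknown' node.yaml
```

The natural next step on the node itself, assuming it is reachable at all, would be something like `oc debug node/ocp02-hbtsw-master-0` followed by `chroot /host systemctl status kubelet`.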
This was fixed in BZ 1901208, and the fix will likely be released in the 4.6 point release this week.

*** This bug has been marked as a duplicate of bug 1901208 ***