Bug 1904248
Summary: | vSphere: nodes in NotReady state | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | mchebbi <mchebbi>
Component: | Node | Assignee: | Ryan Phillips <rphillips>
Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | aos-bugs, kgarriso, tsweeney
Version: | 4.5 | Keywords: | UpcomingSprint
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-12-07 22:19:26 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
mchebbi@redhat.com
2020-12-03 22:09:03 UTC
I'm seeing a bunch of errors in this cluster, which shouldn't be the case for a fresh install. First, a few required operators seem to be down, as mentioned above. Digging through a bit, I see one old kubelet error in the MCD logs on ocp02-hbtsw-master-2:

```
2020-11-26T14:42:34.738281168Z W1126 14:42:34.738123    4814 daemon.go:595] Got an error from auxiliary tools: kubelet health check has failed 1 times: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused
```

The MCO says (this would be the missing MCD and node):

```
2020-11-30T14:30:24.325371923Z E1130 14:30:24.325239       1 operator.go:331] timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 10, updated: 10, ready: 9, unavailable: 1)
```

Looking through cluster-scoped-resources/core/nodes for the master-0.yaml, it seems to be having issues with a missing node:

```
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: MemoryPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: DiskPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: PIDPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: Ready
```

In the kubelet logs I also see, for another of the masters (maybe unrelated):

```
Nov 30 05:04:09.780566 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.780528 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/e183874d-26e4-49e4-9ae8-28a24c6a17d4/volumes/kubernetes.io~secret/openshift-apiserver-operator-token-qzvts
...
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788680 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/ed3fbe7d-bc37-47db-aa3f-d54656d8581b/volumes/kubernetes.io~secret/serving-cert
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788777 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/46e2147d-43e4-4ad4-8f73-57b90a77c118/volumes/kubernetes.io~secret/openshift-config-operator-token-5trz7
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788812 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /run/containers/storage/overlay-containers/3bc2c63f394e374f2bf8533dea1779e0b7839f1c03c3de4c91dc11814604ecaf/userdata/shm
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.
```

I also see a missing kube-apiserver/openshift-apiserver/kube-controller-manager, etc. I'll pass this to the Node team to further investigate what happened to the missing node (and maybe that's also related to the failed kubelet health checks on the other masters?), and they might choose to pass it to the vSphere team, as I can't discern whether this is a vSphere-specific issue or not. But I don't think the MCO caused the failed install; it is a symptom of some other problem that needs to be untangled. Sketches for reproducing each of these checks follow below.
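For anyone triaging a similar report: the MCD's "kubelet health check has failed" line above is just an HTTP probe of the kubelet's healthz endpoint. Here is a minimal Go sketch of the same check, to run on the affected node; the port 10248 and path are taken from the log line itself, not verified against this cluster's kubelet config, and this is not the MCD source:

```
// Probe the kubelet healthz endpoint that the MCD auxiliary
// health check reports as failing in the log above.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://localhost:10248/healthz")
	if err != nil {
		// A "connection refused" here matches the MCD error and means
		// the kubelet process is not listening (down or crash-looping).
		fmt.Fprintf(os.Stderr, "kubelet healthz unreachable: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d body=%q\n", resp.StatusCode, body)
}
```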
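The waitForDaemonsetRollout timeout can likewise be checked by hand. Below is a client-go sketch of the readiness comparison the error text implies; the namespace openshift-machine-config-operator is assumed (it is where the machine-config-daemon daemonset normally lives), and the predicate is an approximation reconstructed from the error message, not the MCO's actual code:

```
// Fetch the machine-config-daemon daemonset and evaluate (approximately)
// the rollout condition the MCO error above is waiting on.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ds, err := cs.AppsV1().DaemonSets("openshift-machine-config-operator").
		Get(context.TODO(), "machine-config-daemon", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	s := ds.Status
	rolloutComplete := s.DesiredNumberScheduled == s.NumberReady && s.NumberUnavailable == 0
	// With the status from the log (desired: 10, ready: 9, unavailable: 1)
	// this prints false, which is why the operator times out.
	fmt.Printf("desired=%d updated=%d ready=%d unavailable=%d rolloutComplete=%v\n",
		s.DesiredNumberScheduled, s.UpdatedNumberScheduled, s.NumberReady,
		s.NumberUnavailable, rolloutComplete)
}
```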
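And the NodeStatusUnknown conditions in the node YAML above are what surfaces as NotReady in `oc get nodes`: when the kubelet stops posting heartbeats, the controller plane flips the conditions to Unknown with the "Kubelet stopped posting node status." message, rather than the kubelet reporting a failure itself. A small client-go sketch that flags such nodes (an illustration assuming a reachable kubeconfig, not tooling from this cluster):

```
// List nodes whose Ready condition is not True, matching the
// NodeStatusUnknown conditions shown in the node YAML above.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	nodes, err := cs.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
				// Status "Unknown" means heartbeats stopped (as in the
				// YAML above), not that the kubelet reported unhealthy.
				fmt.Printf("%s Ready=%s reason=%s\n", n.Name, c.Status, c.Reason)
			}
		}
	}
}
```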
This was fixed in BZ 1901208 and will likely be released into the 4.6 point release this week.

*** This bug has been marked as a duplicate of bug 1901208 ***