Bug 1904248
Summary: | vSphere: nodes in NotReady state | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | mchebbi <mchebbi>
Component: | Node | Assignee: | Ryan Phillips <rphillips>
Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | aos-bugs, kgarriso, tsweeney
Version: | 4.5 | Keywords: | UpcomingSprint
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-12-07 22:19:26 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
mchebbi@redhat.com
2020-12-03 22:09:03 UTC
I'm seeing a bunch of errors in this cluster, which shouldn't be the case for a fresh install. First, a few required operators seem to be down, as mentioned above. Digging through a bit, I see one old kubelet error in the MCD logs on ocp02-hbtsw-master-2:

```
2020-11-26T14:42:34.738281168Z W1126 14:42:34.738123    4814 daemon.go:595] Got an error from auxiliary tools: kubelet health check has failed 1 times: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused
```

The MCO says (this would be the missing MCD and node):

```
2020-11-30T14:30:24.325371923Z E1130 14:30:24.325239       1 operator.go:331] timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 10, updated: 10, ready: 9, unavailable: 1)
```

Looking through cluster-scoped-resources/core/nodes for the master-0.yaml, it seems to be having issues with a missing node:

```
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: MemoryPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: DiskPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: PIDPressure
- lastHeartbeatTime: "2020-11-29T17:07:06Z"
  lastTransitionTime: "2020-11-29T17:08:47Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: Ready
```

In the kubelet logs I also see, for another of the masters (maybe unrelated):

```
Nov 30 05:04:09.780566 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.780528 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/e183874d-26e4-49e4-9ae8-28a24c6a17d4/volumes/kubernetes.io~secret/openshift-apiserver-operator-token-qzvts
...
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788680 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/ed3fbe7d-bc37-47db-aa3f-d54656d8581b/volumes/kubernetes.io~secret/serving-cert
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788777 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/kubelet/pods/46e2147d-43e4-4ad4-8f73-57b90a77c118/volumes/kubernetes.io~secret/openshift-config-operator-token-5trz7
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.788812 2956766 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /run/containers/storage/overlay-containers/3bc2c63f394e374f2bf8533dea1779e0b7839f1c03c3de4c91dc11814604ecaf/userdata/shm
Nov 30 05:04:09.789122 ocp02-hbtsw-master-2 hyperkube[2956766]: I1130 05:04:09.
```

I also see a missing kube-apiserver/openshift-apiserver/kube-controller-manager, etc. I'll pass this to the Node team to further investigate what happened to the missing node (and maybe that's also related to the failed kubelet health checks on the other masters?), and they might choose to pass it to the vSphere team, as I can't discern whether this is a vSphere-specific issue or not. But I don't think the MCO caused the failed install; it is a symptom of some other problem that needs to be untangled. Sketches for reproducing each of these checks follow below.
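For anyone triaging a similar report: the MCD's "kubelet health check has failed" line above is just an HTTP probe of the kubelet's healthz endpoint. Here is a minimal Go sketch of the same check, to run on the affected node; the port 10248 and path are taken from the log line itself, not verified against this cluster's kubelet config, and this is not the MCD source:

```
// Probe the kubelet healthz endpoint that the MCD auxiliary
// health check reports as failing in the log above.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://localhost:10248/healthz")
	if err != nil {
		// A "connection refused" here matches the MCD error and means
		// the kubelet process is not listening (down or crash-looping).
		fmt.Fprintf(os.Stderr, "kubelet healthz unreachable: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d body=%q\n", resp.StatusCode, body)
}
```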
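The waitForDaemonsetRollout timeout can likewise be checked by hand. Below is a client-go sketch of the readiness comparison the error text implies; the namespace openshift-machine-config-operator is assumed (it is where the machine-config-daemon daemonset normally lives), and the predicate is an approximation reconstructed from the error message, not the MCO's actual code:

```
// Fetch the machine-config-daemon daemonset and evaluate (approximately)
// the rollout condition the MCO error above is waiting on.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ds, err := cs.AppsV1().DaemonSets("openshift-machine-config-operator").
		Get(context.TODO(), "machine-config-daemon", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	s := ds.Status
	rolloutComplete := s.DesiredNumberScheduled == s.NumberReady && s.NumberUnavailable == 0
	// With the status from the log (desired: 10, ready: 9, unavailable: 1)
	// this prints false, which is why the operator times out.
	fmt.Printf("desired=%d updated=%d ready=%d unavailable=%d rolloutComplete=%v\n",
		s.DesiredNumberScheduled, s.UpdatedNumberScheduled, s.NumberReady,
		s.NumberUnavailable, rolloutComplete)
}
```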
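And the NodeStatusUnknown conditions in the node YAML above are what surfaces as NotReady in `oc get nodes`: when the kubelet stops posting heartbeats, the controller plane flips the conditions to Unknown with the "Kubelet stopped posting node status." message, rather than the kubelet reporting a failure itself. A small client-go sketch that flags such nodes (an illustration assuming a reachable kubeconfig, not tooling from this cluster):

```
// List nodes whose Ready condition is not True, matching the
// NodeStatusUnknown conditions shown in the node YAML above.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	nodes, err := cs.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
				// Status "Unknown" means heartbeats stopped (as in the
				// YAML above), not that the kubelet reported unhealthy.
				fmt.Printf("%s Ready=%s reason=%s\n", n.Name, c.Status, c.Reason)
			}
		}
	}
}
```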
This was fixed in BZ 1901208 and will likely be released into the 4.6 point release this week.

*** This bug has been marked as a duplicate of bug 1901208 ***