Bug 2008745
Summary: | Cluster unable to come back in healthy state after a manual reboot | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Praveen Kumar <prkumar>
Component: | Installer | Assignee: | Rom Freiman <rfreiman>
Installer sub component: | Single Node OpenShift | QA Contact: | Omri Hochman <ohochman>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | high | |
Priority: | unspecified | CC: | aos-bugs, athomas, cfergeau, dhellmann, dhughes, ehashman, eparis, ercohen, jokerman, mmasters, sasha, tsze
Version: | 4.9 | Flags: | ehashman: needinfo-
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-10-08 13:55:13 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description Praveen Kumar 2021-09-29 04:45:42 UTC
I didn't manage to download the logs before the node was lost, but I did get some snippets. It seems that after the reboot the kubelet hung, and it took about an hour until it actually started. It wrote 6 log lines per minute from 07:36:47 to 08:34:47, e.g.:

Sep 29 08:13:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:47.623931 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->
Sep 29 08:13:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:47.623973 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:13:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:47.623982 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:13:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:47.623991 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:13:51 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:51.965530 1789 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" >
Sep 29 08:13:51 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:51.965748 1789 projected.go:293] Couldn't get configMap openshift-kube-controller-manager/kube-root-ca.c>
Sep 29 08:13:51 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:51.965780 1789 projected.go:199] Error preparing data for projected volume kube-api-access for pod opens>
Sep 29 08:13:51 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:51.965859 1789 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/54a382>
Sep 29 08:13:52 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:52.471471 1789 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" >
Sep 29 08:13:52 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:52.471654 1789 projected.go:293] Couldn't get configMap openshift-kube-controller-manager/kube-root-ca.c>
Sep 29 08:13:52 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:52.471682 1789 projected.go:199] Error preparing data for projected volume kube-api-access for pod opens>
Sep 29 08:13:52 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:52.471759 1789 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/23c6af>
Sep 29 08:14:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:14:47.624935 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:14:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:14:47.624973 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:14:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:14:47.624983 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:14:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:14:47.624997 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->
Sep 29 08:15:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:47.625764 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->
Sep 29 08:15:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:47.625809 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:15:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:47.625820 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:15:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:47.625828 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:15:53 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:53.983566 1789 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" >
Sep 29 08:15:53 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:53.983741 1789 projected.go:293] Couldn't get configMap openshift-kube-controller-manager/kube-root-ca.c>
Sep 29 08:15:53 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:53.983762 1789 projected.go:199] Error preparing data for projected volume kube-api-access for pod opens>
Sep 29 08:15:53 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:53.983822 1789 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/54a382>
Sep 29 08:15:54 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:54.488263 1789 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" >
Sep 29 08:15:54 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:54.488450 1789 projected.go:293] Couldn't get configMap openshift-kube-controller-manager/kube-root-ca.c>
Sep 29 08:15:54 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:54.488486 1789 projected.go:199] Error preparing data for projected volume kube-api-access for pod opens>
Sep 29 08:15:54 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:54.488557 1789 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/23c6af>
Sep 29 08:16:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:16:47.626655 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:16:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:16:47.626690 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:16:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:16:47.626701 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:16:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:16:47.626715 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->
Sep 29 08:17:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:17:47.627079 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:17:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:17:47.627113 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:17:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:17:47.627122 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:17:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:17:47.627156 1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->

It seems that things started working after these logs in the kubelet:

Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821190 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821244 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/blkio/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821274 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/memory/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821295 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/devices/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821318 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/pids/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821348 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821366 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/blkio/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821384 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/memory/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821399 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/devices/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.823778 1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/pids/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: E0929 08:35:36.994469 1789 cadvisor_stats_provider.go:415] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/system.slice/systemd-tmpfiles-clean.service\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0f338783_b47a_4e2a_b518_c266972ab57c.slice/crio-conmon-03eb266abdd67a>
Sep 29 08:35:37 ip-10-0-158-214 hyperkube[1789]: I0929 08:35:37.120833 1789 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-ingress_router-default-65cdc4f6d9-nb9pk_0f338783-b47a-4e2a-b518-c266972ab57c/router/1.log"
Sep 29 08:35:37 ip-10-0-158-214 hyperkube[1789]: I0929 08:35:37.121018 1789 preemption.go:117] "Pod evicted successfully" pod="openshift-ingress/router-default-65cdc4f6d9-nb9pk"
Sep 29 08:35:37 ip-10-0-158-214 hyperkube[1789]: I0929 08:35:37.121114 1789 topology_manager.go:200] "Topology Admit Handler"

Possibly a duplicate of / related to https://bugzilla.redhat.com/show_bug.cgi?id=1997478
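An aside on gathering this kind of evidence: the journald snippets above can be saved off the node with `oc adm node-logs` (the same command used in the reproduction attempt below), so they survive even if the node is later lost. A minimal sketch, with <node-name> as a placeholder for the single node's name:

# Save the kubelet and crio journals off the node.
oc adm node-logs <node-name> -u kubelet > kubelet.journal
oc adm node-logs <node-name> -u crio > crio.journal

# Or read them from a host shell via a debug pod, the same
# `oc debug node/...` + `chroot /host` pattern used further down.
oc debug node/<node-name> -- chroot /host journalctl -u kubelet -u crio --no-pager > node.journal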
I tried to reproduce the dns failure. I don't think I reproduced the same failure, but I did produce failures of some kind.

First, I launched a 4.9.0-rc.4 cluster on AWS, single-node, and rebooted as described in comment 0. All clusteroperators except for "storage" came up fine after the first reboot.

I then did a second reboot, after which many pods (including SDN and DNS) were stuck in "ContainerCreating". The pods that were stuck included operators, which meant operators were not updating their respective clusteroperator statuses to report the problem. Eventually (an hour after the second reboot), DNS, SDN, and most other pods in the cluster (including all operators) became "Ready", except for the following, which remained in error states: aws-ebs-csi-driver-controller-5b977d447d-2h7d7, router-default-5599559769-r2xxl, installer-5-ip-10-0-139-92.ec2.internal, machine-config-controller-65485b89f-8bmmb, and machine-config-server-d8xqh; and all clusteroperators except for the "storage" clusteroperator reported all-good.

The second reboot happened at 16:14. In the crio and kubelet logs from `oc adm node-logs`, the kubelet logged that it did a "SyncLoop ADD" after reboot (I abbreviated the list of pods in the log message because it is very long):

Sep 29 16:15:20.487481 ip-10-0-139-92 hyperkube[1779]: I0929 16:15:20.487411 1779 kubelet.go:2076] "SyncLoop ADD" source="api" pods=[... openshift-dns/dns-default-7wqpg ...]

Then crio tore down a bunch of pods, including the dns pod:

Sep 29 16:16:00.685614 ip-10-0-139-92 crio[1736]: time="2021-09-29 16:16:00.685475216Z" level=info msg="Got pod network &{Name:dns-default-7wqpg Namespace:openshift-dns ID:223af3fcd717cae3e2bae6f0472249af831ff59708ad38dc80c0067573e5fd79 UID:94d48e4a-6065-4e2d-ac04-29b66bc979dd NetNS: Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Sep 29 16:16:00.685746 ip-10-0-139-92 crio[1736]: time="2021-09-29 16:16:00.685715577Z" level=info msg="Deleting pod openshift-dns_dns-default-7wqpg from CNI network \"multus-cni-network\" (type=multus)"
Sep 29 16:16:00.742024 ip-10-0-139-92 ovs-vswitchd[1156]: ovs|00688|connmgr|INFO|br0<->unix#538: 2 flow_mods in the last 0 s (2 deletes)
Sep 29 16:16:00.753710 ip-10-0-139-92 crio[1736]: 2021-09-29T16:16:00Z [verbose] Del: openshift-dns:dns-default-7wqpg:94d48e4a-6065-4e2d-ac04-29b66bc979dd:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Sep 29 16:16:00.753862 ip-10-0-139-92 crio[1736]: time="2021-09-29 16:16:00.753789723Z" level=info msg="Successfully cleaned up network for pod 223af3fcd717cae3e2bae6f0472249af831ff59708ad38dc80c0067573e5fd79"
Sep 29 16:16:00.754504 ip-10-0-139-92 crio[1736]: time="2021-09-29 16:16:00.754482438Z" level=warning msg="Ignoring error tearing down loopback interface: failed to Statfs \"\": no such file or directory"

For an hour, no "SyncLoop" events are logged.
Then this is logged:

Sep 29 17:15:54.673224 ip-10-0-139-92 crio[1736]: time="2021-09-29 17:15:54.673173976Z" level=warning msg="Stopping container 0ca45f76530ab3b8d03d8f2ae6054f219dc34f9d7ba769484fad6e507316dcc6 with stop signal timed out: timeout reached after 3600 seconds waiting for container process to exit" id=164a044d-4b45-490d-8e3a-f498db046bcf name=/runtime.v1alpha2.RuntimeService/StopContainer
[...]
Sep 29 17:15:54.937155 ip-10-0-139-92 crio[1736]: time="2021-09-29 17:15:54.937000503Z" level=info msg="Stopped container 0ca45f76530ab3b8d03d8f2ae6054f219dc34f9d7ba769484fad6e507316dcc6: openshift-ingress/router-default-5599559769-r2xxl/router" id=164a044d-4b45-490d-8e3a-f498db046bcf name=/runtime.v1alpha2.RuntimeService/StopContainer

Immediately after that, the kubelet started doing "SyncLoop" events again, whereupon pods became "Ready". Could the router pod's grace period somehow block the kubelet from syncing other pods?
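The 3600-second timeout above appears to line up with the router pod's termination grace period. A hedged way to check that on an affected cluster (a sketch; the pod name is the one from this comment and will differ per deployment):

# Grace period actually set on the stuck router pod.
oc get pod router-default-5599559769-r2xxl -n openshift-ingress -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'

# Grace period in the router deployment's pod template.
oc get deployment router-default -n openshift-ingress -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'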
Tested on 4.9.0-0.nightly-2021-09-27-105859. Tried the following 10 times.

[kni@r640-u01 ~]$ oc debug node/openshift-master-0.qe1.kni.lab.eng.bos.redhat.com
Starting pod/openshift-master-0qe1knilabengbosredhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 2620:52:0:1386::34
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# shutdown -r 1
Shutdown scheduled for Wed 2021-09-29 20:02:57 UTC, use 'shutdown -c' to cancel.
sh-4.4#
Removing debug pod ...
error: unable to delete the debug pod "openshift-master-0qe1knilabengbosredhatcom-debug": Delete "https://api.qe1.kni.lab.eng.bos.redhat.com:6443/api/v1/namespaces/sasha/pods/openshift-master-0qe1knilabengbosredhatcom-debug": dial tcp [2620:52:0:1386::3a]:6443: connect: connection refused

oc get clusterversion;
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-27-105859   True        False         6h46m   Cluster version is 4.9.0-0.nightly-2021-09-27-105859

oc get co;
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      69m
baremetal                                  4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m
cloud-controller-manager                   4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m
cloud-credential                           4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m
cluster-autoscaler                         4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m
config-operator                            4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h57m
console                                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      78m
csi-snapshot-controller                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m
dns                                        4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h25m
etcd                                       4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h51m
image-registry                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h45m
ingress                                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h50m
insights                                   4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m
kube-apiserver                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h49m
kube-controller-manager                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h50m
kube-scheduler                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h48m
kube-storage-version-migrator              4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h57m
machine-api                                4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m
machine-approver                           4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m
machine-config                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m
marketplace                                4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m
monitoring                                 4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h47m
network                                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h57m
node-tuning                                4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h35m
openshift-apiserver                        4.9.0-0.nightly-2021-09-27-105859   True        False         False      156m
openshift-controller-manager               4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h50m
openshift-samples                          4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h50m
operator-lifecycle-manager                 4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-09-27-105859   True        False         False      79m
service-ca                                 4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h57m
storage                                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m

oc get pod -A | grep -v Run | grep -v Comple;
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE

curl sasha-sasha.apps.qe1.kni.lab.eng.bos.redhat.com
<html>
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Hello World</title>
 </head>
 <body>
  IP address is: fd01:0:0:1::36
 </body>
</html>

*** This bug has been marked as a duplicate of bug 1997478 ***
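For anyone repeating the reboot test above, a small polling loop makes it easier to tell when the node has settled after `shutdown -r 1`. This is only a sketch under the same assumptions as the transcript (a logged-in `oc` session and a reachable API server); the 30-second interval is arbitrary:

# Wait until every clusteroperator is Available and no pod is stuck
# outside Running/Succeeded, mirroring the manual checks above.
until oc wait clusteroperator --all --for=condition=Available=True --timeout=30s >/dev/null 2>&1 && \
      [ -z "$(oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded -o name 2>/dev/null)" ]; do
  date
  sleep 30
done
echo "cluster settled"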