Bug 2008745 - Cluster unable to come back in healthy state after a manual reboot
Summary: Cluster unable to come back in healthy state after a manual reboot
Keywords:
Status: CLOSED DUPLICATE of bug 1997478
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Rom Freiman
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-29 04:45 UTC by Praveen Kumar
Modified: 2021-11-01 20:08 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-08 13:55:13 UTC
Target Upstream Version:
Embargoed:
ehashman: needinfo-



Description Praveen Kumar 2021-09-29 04:45:42 UTC
Description of problem: I deployed a single-node cluster (4.9.0-rc.4) using cluster bot and then tried to reboot the cluster node manually, but the cluster does not come back up as expected after the reboot.

Version-Release number of selected component (if applicable):
4.9.0-rc.4


Steps to Reproduce:
1. Create a single-node cluster (I tried on AWS using cluster bot) using version 4.9.0-rc.4
2. Reboot it manually (`oc debug node; chroot /host; shutdown -r 1`); the commands are expanded in the session sketch below
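
For clarity, the reboot step expands to roughly the following session (the node name is a placeholder; comment 5 below shows a complete run of the same sequence):

```
$ oc debug node/<node-name>     # open a debug pod on the single node
sh-4.4# chroot /host            # switch into the host filesystem
sh-4.4# shutdown -r 1           # schedule a reboot in 1 minute; `shutdown -c` cancels it
```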

Actual results:
```
$ oc get co
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.9.0-rc.4   False       False         True       8m52s   APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node.
APIServicesAvailable: PreconditionNotReady
OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://10.217.4.32:443/healthz": dial tcp 10.217.4.32:443: connect: connection refused
config-operator   4.9.0-rc.4   True        False         False      24h     
console           4.9.0-rc.4   True        False         False      24h     
dns               4.9.0-rc.4   False       True          True       9m5s    DNS "default" is unavailable.
etcd              4.9.0-rc.4   True        False         False      24h     
image-registry    4.9.0-rc.4   False       False         False      9m20s   NodeCADaemonAvailable: The daemon set node-ca does not have available replicas
Available: The registry is ready
ImagePrunerAvailable: Pruner CronJob has been created
ingress                   4.9.0-rc.4   True        False         False      24h     
kube-apiserver            4.9.0-rc.4   True        True          False      24h     NodeInstallerProgressing: 1 nodes are at revision 8; 0 nodes have achieved new revision 10
kube-controller-manager   4.9.0-rc.4   True        False         False      24h     
kube-scheduler            4.9.0-rc.4   True        False         False      24h     
machine-api               4.9.0-rc.4   True        False         True       24h     Failed to resync for operator: 4.9.0-rc.4 because deployment machine-api-controllers is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)
machine-approver          4.9.0-rc.4   True        False         False      24h     
machine-config            4.9.0-rc.4   True        False         False      24h     
marketplace               4.9.0-rc.4   True        False         False      24h     
network                   4.9.0-rc.4   True        True          False      24h     DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
DaemonSet "openshift-multus/multus-additional-cni-plugins" is not available (awaiting 1 nodes)
DaemonSet "openshift-multus/network-metrics-daemon" is waiting for other operators to become ready
DaemonSet "openshift-multus/multus-admission-controller" is waiting for other operators to become ready
DaemonSet "openshift-sdn/sdn-controller" is not available (awaiting 1 nodes)
DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)
DaemonSet "openshift-network-diagnostics/network-check-target" is waiting for other operators to become ready
Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
node-tuning               4.9.0-rc.4   False       False         False      6m50s   DaemonSet "tuned" has no available Pod(s)
openshift-apiserver       4.9.0-rc.4   False       False         True       8m54s   APIServerDeploymentAvailable: no apiserver.openshift-apiserver pods available on any node.
APIServicesAvailable: PreconditionNotReady
openshift-controller-manager               4.9.0-rc.4   False       True          False      6m24s   Available: no daemon pods available on any node.
openshift-samples                          4.9.0-rc.4   True        False         False      24h     
operator-lifecycle-manager                 4.9.0-rc.4   True        False         False      24h     
operator-lifecycle-manager-catalog         4.9.0-rc.4   True        False         False      24h     
operator-lifecycle-manager-packageserver   4.9.0-rc.4   False       True          False      9m14s   ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install timeout
service-ca                                 4.9.0-rc.4   True        True          False      24h     Progressing: 
Progressing: service-ca does not have available replicas
```


Expected results:
The cluster should be able to get back to a healthy state within 4-5 minutes, as checked below.
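
One hedged way to verify recovery (a sketch, assuming the API server is answering; the timeout value is illustrative) is to wait for all cluster operators to report Available again:

```
# Wait up to 10 minutes for every clusteroperator to report Available=True
$ oc wait clusteroperators --all --for=condition=Available --timeout=600s

# Or simply watch the operator table settle
$ watch oc get co
```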

Additional info: Not able to perform a must-gather because of the unstable cluster.

```
$ oc adm must-gather .
[must-gather      ] OUT the server is currently unable to handle the request (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT 
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-zhxbn created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-mssmm created
[must-gather      ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-6jzlj] OUT gather did not start: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-mssmm deleted
[must-gather      ] OUT namespace/openshift-must-gather-zhxbn deleted
error: gather did not start for pod must-gather-6jzlj: timed out waiting for the condition
```
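
As a fallback when must-gather cannot start, one option (a sketch, not what was done here; node name is a placeholder) is to pull the journals straight from the node and inspect individual operators:

```
# Kubelet and crio journals fetched via the API server's node-logs proxy
$ oc adm node-logs <node-name> -u kubelet > kubelet.log
$ oc adm node-logs <node-name> -u crio > crio.log

# Targeted inspection of a failing operator
$ oc adm inspect clusteroperator/dns --dest-dir=./inspect-dns
```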

This is not the case with 4.9.0-rc.3, which I have tried, so some change that landed in rc.4 might have introduced the issue.

Also, we filed https://bugzilla.redhat.com/show_bug.cgi?id=1997905#c1 during 4.9 nightly testing, but it got resolved later, and up to rc.3 things were working without issue.

Comment 1 Eran Cohen 2021-09-29 13:21:37 UTC
I didn't manage to download the logs before the node was lost, but I did get some snippets.
It seems that after the reboot the kubelet hung, and it took about an hour until it actually started.
It wrote 6 log lines per minute from 07:36:47 to 08:34:47:

e.g.

Sep 29 08:13:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:47.623931    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->
Sep 29 08:13:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:47.623973    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:13:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:47.623982    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:13:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:47.623991    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:13:51 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:51.965530    1789 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" >
Sep 29 08:13:51 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:51.965748    1789 projected.go:293] Couldn't get configMap openshift-kube-controller-manager/kube-root-ca.c>
Sep 29 08:13:51 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:51.965780    1789 projected.go:199] Error preparing data for projected volume kube-api-access for pod opens>
Sep 29 08:13:51 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:51.965859    1789 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/54a382>
Sep 29 08:13:52 ip-10-0-158-214 hyperkube[1789]: I0929 08:13:52.471471    1789 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" >
Sep 29 08:13:52 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:52.471654    1789 projected.go:293] Couldn't get configMap openshift-kube-controller-manager/kube-root-ca.c>
Sep 29 08:13:52 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:52.471682    1789 projected.go:199] Error preparing data for projected volume kube-api-access for pod opens>
Sep 29 08:13:52 ip-10-0-158-214 hyperkube[1789]: E0929 08:13:52.471759    1789 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/23c6af>
Sep 29 08:14:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:14:47.624935    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:14:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:14:47.624973    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:14:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:14:47.624983    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:14:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:14:47.624997    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->
Sep 29 08:15:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:47.625764    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->
Sep 29 08:15:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:47.625809    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:15:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:47.625820    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:15:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:47.625828    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:15:53 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:53.983566    1789 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" >
Sep 29 08:15:53 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:53.983741    1789 projected.go:293] Couldn't get configMap openshift-kube-controller-manager/kube-root-ca.c>
Sep 29 08:15:53 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:53.983762    1789 projected.go:199] Error preparing data for projected volume kube-api-access for pod opens>
Sep 29 08:15:53 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:53.983822    1789 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/54a382>
Sep 29 08:15:54 ip-10-0-158-214 hyperkube[1789]: I0929 08:15:54.488263    1789 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" >
Sep 29 08:15:54 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:54.488450    1789 projected.go:293] Couldn't get configMap openshift-kube-controller-manager/kube-root-ca.c>
Sep 29 08:15:54 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:54.488486    1789 projected.go:199] Error preparing data for projected volume kube-api-access for pod opens>
Sep 29 08:15:54 ip-10-0-158-214 hyperkube[1789]: E0929 08:15:54.488557    1789 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/projected/23c6af>
Sep 29 08:16:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:16:47.626655    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:16:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:16:47.626690    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:16:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:16:47.626701    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:16:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:16:47.626715    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->
Sep 29 08:17:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:17:47.627079    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube>
Sep 29 08:17:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:17:47.627113    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-158-214.us->
Sep 29 08:17:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:17:47.627122    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver>
Sep 29 08:17:47 ip-10-0-158-214 hyperkube[1789]: I0929 08:17:47.627156    1789 kubelet_getters.go:176] "Pod status updated" pod="openshift-kube-controller-manager/kube->


It seems that things started working after these kubelet log lines:

Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821190    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821244    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/blkio/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821274    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/memory/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821295    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/devices/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821318    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/pids/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821348    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821366    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/blkio/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821384    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/memory/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.821399    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/devices/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: W0929 08:35:36.823778    1789 watcher.go:95] Error while processing event ("/sys/fs/cgroup/pids/system.slice/NetworkManager-dispatcher.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/system.slice/NetworkManager-dispatcher.service: no such file or directory
Sep 29 08:35:36 ip-10-0-158-214 hyperkube[1789]: E0929 08:35:36.994469    1789 cadvisor_stats_provider.go:415] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/system.slice/systemd-tmpfiles-clean.service\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0f338783_b47a_4e2a_b518_c266972ab57c.slice/crio-conmon-03eb266abdd67a>
Sep 29 08:35:37 ip-10-0-158-214 hyperkube[1789]: I0929 08:35:37.120833    1789 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-ingress_router-default-65cdc4f6d9-nb9pk_0f338783-b47a-4e2a-b518-c266972ab57c/router/1.log"
Sep 29 08:35:37 ip-10-0-158-214 hyperkube[1789]: I0929 08:35:37.121018    1789 preemption.go:117] "Pod evicted successfully" pod="openshift-ingress/router-default-65cdc4f6d9-nb9pk"
Sep 29 08:35:37 ip-10-0-158-214 hyperkube[1789]: I0929 08:35:37.121114    1789 topology_manager.go:200] "Topology Admit Handler"

Comment 2 To Hung Sze 2021-09-29 14:20:01 UTC
Possibly duplicate of / related to https://bugzilla.redhat.com/show_bug.cgi?id=1997478

Comment 4 Miciah Dashiel Butler Masters 2021-09-29 18:00:28 UTC
I tried to reproduce the dns failure.  I don't think I reproduced the same failure, but I did produce some failures.  First, I launched a 4.9.0-rc.4 cluster on AWS, single-node, and rebooted as described in comment 0.  All clusteroperators except for "storage" came up fine after the first reboot.  

I then did a second reboot, after which many pods (including SDN and DNS) were stuck in "ContainerCreating".  The pods that were stuck included operators, which meant operators were not updating their respective clusteroperator statuses to report the problem.  

Eventually (an hour after the second reboot), DNS, SDN, and most other pods in the cluster (including all operators) became "Ready", except for the following, which remained in error states: aws-ebs-csi-driver-controller-5b977d447d-2h7d7, router-default-5599559769-r2xxl, installer-5-ip-10-0-139-92.ec2.internal, machine-config-controller-65485b89f-8bmmb, and machine-config-server-d8xqh; and all clusteroperators except for the "storage" clusteroperator reported all-good.  

The second reboot happened at 16:14.  In the crio and kubelet logs from `oc adm node-logs`, the kubelet logged that it did a "SyncLoop ADD" after reboot (I abbreviated the list of pods in the log message because it is very long):

    Sep 29 16:15:20.487481 ip-10-0-139-92 hyperkube[1779]: I0929 16:15:20.487411    1779 kubelet.go:2076] "SyncLoop ADD" source="api" pods=[... openshift-dns/dns-default-7wqpg ...]

Then crio tore down a bunch of pods, including the dns pod:

    Sep 29 16:16:00.685614 ip-10-0-139-92 crio[1736]: time="2021-09-29 16:16:00.685475216Z" level=info msg="Got pod network &{Name:dns-default-7wqpg Namespace:openshift-dns ID:223af3fcd717cae3e2bae6f0472249af831ff59708ad38dc80c0067573e5fd79 UID:94d48e4a-6065-4e2d-ac04-29b66bc979dd NetNS: Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
    Sep 29 16:16:00.685746 ip-10-0-139-92 crio[1736]: time="2021-09-29 16:16:00.685715577Z" level=info msg="Deleting pod openshift-dns_dns-default-7wqpg from CNI network \"multus-cni-network\" (type=multus)"
    Sep 29 16:16:00.742024 ip-10-0-139-92 ovs-vswitchd[1156]: ovs|00688|connmgr|INFO|br0<->unix#538: 2 flow_mods in the last 0 s (2 deletes)
    Sep 29 16:16:00.753710 ip-10-0-139-92 crio[1736]: 2021-09-29T16:16:00Z [verbose] Del: openshift-dns:dns-default-7wqpg:94d48e4a-6065-4e2d-ac04-29b66bc979dd:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
    Sep 29 16:16:00.753862 ip-10-0-139-92 crio[1736]: time="2021-09-29 16:16:00.753789723Z" level=info msg="Successfully cleaned up network for pod 223af3fcd717cae3e2bae6f0472249af831ff59708ad38dc80c0067573e5fd79"
    Sep 29 16:16:00.754504 ip-10-0-139-92 crio[1736]: time="2021-09-29 16:16:00.754482438Z" level=warning msg="Ignoring error tearing down loopback interface: failed to Statfs \"\": no such file or directory"

For an hour, no "SyncLoop" events are logged.  Then this is logged: 

    Sep 29 17:15:54.673224 ip-10-0-139-92 crio[1736]: time="2021-09-29 17:15:54.673173976Z" level=warning msg="Stopping container 0ca45f76530ab3b8d03d8f2ae6054f219dc34f9d7ba769484fad6e507316dcc6 with stop signal timed out: timeout reached after 3600 seconds waiting for container process to exit" id=164a044d-4b45-490d-8e3a-f498db046bcf name=/runtime.v1alpha2.RuntimeService/StopContainer
    [...]
    Sep 29 17:15:54.937155 ip-10-0-139-92 crio[1736]: time="2021-09-29 17:15:54.937000503Z" level=info msg="Stopped container 0ca45f76530ab3b8d03d8f2ae6054f219dc34f9d7ba769484fad6e507316dcc6: openshift-ingress/router-default-5599559769-r2xxl/router" id=164a044d-4b45-490d-8e3a-f498db046bcf name=/runtime.v1alpha2.RuntimeService/StopContainer

Immediately after that, the kubelet started doing "SyncLoop" events again whereupon pods became "Ready".  

Could the router pod's grace period somehow block the kubelet from syncing other pods?
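
For reference, the 3600-second stop timeout in the crio log matches the router's termination grace period. A hedged way to confirm it (a sketch, assuming the deployment name router-default seen in the pod names above):

    # Show the grace period on the router deployment's pod template;
    # given the "timeout reached after 3600 seconds" message above, this is expected to print 3600
    $ oc get deployment -n openshift-ingress router-default \
        -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'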

Comment 5 Alexander Chuzhoy 2021-09-29 21:25:14 UTC
Tested on 4.9.0-0.nightly-2021-09-27-105859

Tried the following 10 times.

[kni@r640-u01 ~]$ oc debug node/openshift-master-0.qe1.kni.lab.eng.bos.redhat.com
Starting pod/openshift-master-0qe1knilabengbosredhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 2620:52:0:1386::34
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# shutdown -r 1
Shutdown scheduled for Wed 2021-09-29 20:02:57 UTC, use 'shutdown -c' to cancel.
sh-4.4# 
Removing debug pod ...
error: unable to delete the debug pod "openshift-master-0qe1knilabengbosredhatcom-debug": Delete "https://api.qe1.kni.lab.eng.bos.redhat.com:6443/api/v1/namespaces/sasha/pods/openshift-master-0qe1knilabengbosredhatcom-debug": dial tcp [2620:52:0:1386::3a]:6443: connect: connection refused


oc get clusterversion;
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-27-105859   True        False         6h46m   Cluster version is 4.9.0-0.nightly-2021-09-27-105859


oc get co; 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      69m     
baremetal                                  4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m   
cloud-controller-manager                   4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m   
cloud-credential                           4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m   
cluster-autoscaler                         4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m   
config-operator                            4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h57m   
console                                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      78m     
csi-snapshot-controller                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m   
dns                                        4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h25m   
etcd                                       4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h51m   
image-registry                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h45m   
ingress                                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h50m   
insights                                   4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m   
kube-apiserver                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h49m   
kube-controller-manager                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h50m   
kube-scheduler                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h48m   
kube-storage-version-migrator              4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h57m   
machine-api                                4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m   
machine-approver                           4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m   
machine-config                             4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m   
marketplace                                4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h56m   
monitoring                                 4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h47m   
network                                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h57m   
node-tuning                                4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h35m   
openshift-apiserver                        4.9.0-0.nightly-2021-09-27-105859   True        False         False      156m    
openshift-controller-manager               4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h50m   
openshift-samples                          4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h50m   
operator-lifecycle-manager                 4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m   
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m   
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-09-27-105859   True        False         False      79m     
service-ca                                 4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h57m   
storage                                    4.9.0-0.nightly-2021-09-27-105859   True        False         False      6h52m   


oc get pod -A|grep -v Run|grep -v Comple;
NAMESPACE                                          NAME                                                                         READY   STATUS      RESTARTS       AGE


curl sasha-sasha.apps.qe1.kni.lab.eng.bos.redhat.com
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Hello World</title>
</head>
<body>
IP address is: fd01:0:0:1::36 
</body>
</html>

Comment 6 Angus Thomas 2021-10-08 13:55:13 UTC

*** This bug has been marked as a duplicate of bug 1997478 ***

