Description of problem:
Rebooting a node in the cluster (OCP 4.7 on Power) causes the pods on that node to remain in ContainerCreating state, and the crio and kubelet services remain inactive.

Version-Release number of selected component (if applicable):
The installation was done using the 4.7 nightly build on Power (ppc64le) systems.

# oc version
Client Version: 4.7.0-0.nightly-ppc64le-2021-01-27-164047
Server Version: 4.7.0-0.nightly-ppc64le-2021-01-27-164047
Kubernetes Version: v1.20.0+614d551

This issue was also seen on another 4.7 nightly build and is reported at https://bugzilla.redhat.com/show_bug.cgi?id=1917667. A similar issue was also seen at https://bugzilla.redhat.com/show_bug.cgi?id=1704410.

How reproducible:
Occurs randomly on the nodes.

Steps to Reproduce:
1. Reboot (normal) any of the master or worker nodes.

Actual results:
Upon node reboot, the crio and kubelet services fail to start.

[root@worker-0 core]# systemctl status crio
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-env.conf, 20-nodenet.conf
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o

[root@worker-0 core]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-mco-default-env.conf, 20-logging.conf, 20-nodenet.conf
   Active: inactive (dead)

# oc get nodes
NAME       STATUS     ROLES    AGE   VERSION
master-0   Ready      master   40h   v1.20.0+614d551
master-1   Ready      master   40h   v1.20.0+614d551
master-2   Ready      master   40h   v1.20.0+614d551
worker-0   NotReady   worker   40h   v1.20.0+614d551
worker-1   Ready      worker   40h   v1.20.0+614d551

The pods on the node are stuck in ContainerCreating or Terminating state.
# oc get pods -A -owide | grep worker-0
nfs-provisioner   nfs-client-provisioner-5b7c886bcb-2ntr6   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-5bzkg   0/1   ContainerCreating   0   39h   9.114.97.30   worker-0   <none>   <none>
openshift-dns   dns-default-fhkkc   0/3   ContainerCreating   0   39h   <none>   worker-0   <none>   <none>
openshift-image-registry   image-registry-6c76bc5c4b-b5pdq   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-image-registry   node-ca-7htvg   0/1   ContainerCreating   0   39h   9.114.97.30   worker-0   <none>   <none>
openshift-ingress-canary   ingress-canary-pghpc   0/1   ContainerCreating   0   39h   <none>   worker-0   <none>   <none>
openshift-ingress   router-default-c9bd8c7f6-vnk78   0/1   Terminating   0   21h   9.114.97.30   worker-0   <none>   <none>
openshift-kube-storage-version-migrator   migrator-77bd4d6c89-9nfgm   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-machine-config-operator   machine-config-daemon-khs9n   0/2   ContainerCreating   0   39h   9.114.97.30   worker-0   <none>   <none>
openshift-marketplace   certified-operators-9r2jm   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-marketplace   certified-operators-fq9x4   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-marketplace   community-operators-57v2g   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-marketplace   community-operators-8mnrm   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-marketplace   redhat-marketplace-4h29b   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-marketplace   redhat-marketplace-xst7v   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-marketplace   redhat-operators-mftlv   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-marketplace   redhat-operators-wdrr4   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   alertmanager-main-0   0/5   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   alertmanager-main-1   0/5   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   alertmanager-main-2   0/5   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   grafana-7c7bbf9f79-x8bfh   0/2   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   kube-state-metrics-7cdc649949-j4b7p   0/3   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   node-exporter-47jhh   0/2   PodInitializing   0   39h   9.114.97.30   worker-0   <none>   <none>
openshift-monitoring   openshift-state-metrics-8696d7fbd5-jn4d5   0/3   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   prometheus-k8s-0   0/7   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   prometheus-k8s-1   0/7   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   telemeter-client-5f578d7946-ddgjz   0/3   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-monitoring   thanos-querier-6c6d74c94b-2xbcx   0/5   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-multus   multus-g47rn   0/1   PodInitializing   0   39h   9.114.97.30   worker-0   <none>   <none>
openshift-multus   network-metrics-daemon-b4524   0/2   ContainerCreating   0   39h   <none>   worker-0   <none>   <none>
openshift-network-diagnostics   network-check-source-789c4ffcd9-mdmk4   0/1   Terminating   0   21h   <none>   worker-0   <none>   <none>
openshift-network-diagnostics   network-check-target-hdd6s   0/1   ContainerCreating   0   39h   <none>   worker-0   <none>   <none>
openshift-sdn   ovs-88fvc   0/1   ContainerCreating   0   39h   9.114.97.30   worker-0   <none>   <none>
openshift-sdn   sdn-d6p4d   0/2   ContainerCreating   0   39h   9.114.97.30   worker-0   <none>   <none>

The pod events show the following errors:
Warning  FailedCreatePodSandBox  53m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to mount container k8s_POD_ovs-88fvc_openshift-sdn_d8a3e040-4779-40c6-86f1-07200dcc723d_0 in pod sandbox k8s_ovs-88fvc_openshift-sdn_d8a3e040-4779-40c6-86f1-07200dcc723d_0(4313bf70e600cd27eae89c9aeb269a89ecdf4f6dc3e6e07ae870abd0e8467004): error recreating the missing symlinks: error reading name of symlink for &{"73c2a5de576aad28e8cf93c2826f26c39ebdeca4046e5378a27e9f62abf15e1a" '\x14' %!q(os.FileMode=2147484096) {%!q(uint64=67063710) %!q(int64=63747370128) %!q(*time.Location=&{Local [{UTC 0 false}] [{-576460752303423488 0 false false}] UTC0 9223372036854775807 9223372036854775807 0xc000481f80})} {'ﴃ' %!q(uint64=142606522) '\x03' '䇀' '\x00' '\x00' '\x00' '\x00' '\x14' '' '\x00' {%!q(int64=1611773328) %!q(int64=167060853)} {%!q(int64=1611773328) %!q(int64=67063710)} {%!q(int64=1611773328) %!q(int64=67063710)} '\x00' '\x00' '\x00'}}: open /var/lib/containers/storage/overlay/73c2a5de576aad28e8cf93c2826f26c39ebdeca4046e5378a27e9f62abf15e1a/link: no such file or directory

Warning  FailedCreatePodSandBox  11s (x239 over 53m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to mount container k8s_POD_ovs-88fvc_openshift-sdn_d8a3e040-4779-40c6-86f1-07200dcc723d_0 in pod sandbox k8s_ovs-88fvc_openshift-sdn_d8a3e040-4779-40c6-86f1-07200dcc723d_0(498ccd0ff60790cbdc0ea339cb444d163bebea1051fc5920cac8ec3a0c0e30a1): error recreating the missing symlinks: error reading name of symlink for &{"73c2a5de576aad28e8cf93c2826f26c39ebdeca4046e5378a27e9f62abf15e1a" '\x14' %!q(os.FileMode=2147484096) {%!q(uint64=67063710) %!q(int64=63747370128) %!q(*time.Location=&{Local [{UTC 0 false}] [{-576460752303423488 0 false false}] UTC0 9223372036854775807 9223372036854775807 0xc000481f80})} {'ﴃ' %!q(uint64=142606522) '\x03' '䇀' '\x00' '\x00' '\x00' '\x00' '\x14' '' '\x00' {%!q(int64=1611773328) %!q(int64=167060853)} {%!q(int64=1611773328) %!q(int64=67063710)} {%!q(int64=1611773328) %!q(int64=67063710)} '\x00' '\x00' '\x00'}}: open /var/lib/containers/storage/overlay/73c2a5de576aad28e8cf93c2826f26c39ebdeca4046e5378a27e9f62abf15e1a/link: no such file or directory

The journalctl logs show the following error on the node:

-- Logs begin at Wed 2021-01-27 18:48:31 UTC. --
Jan 29 09:44:34 worker-0 crio[663704]: time="2021-01-29 09:44:34.448761351Z" level=info msg="Version file /var/run/crio/version not found: open /var/run/crio/version: no such file or directory: triggering wipe of containers"
Jan 29 09:44:34 worker-0 systemd[1]: crio-wipe.service: Succeeded.
Jan 29 09:44:34 worker-0 systemd[1]: Started CRI-O Auto Update Script.
Jan 29 09:44:34 worker-0 systemd[1]: crio-wipe.service: Consumed 104ms CPU time
Jan 29 09:44:35 worker-0 bash[1618]: Error: readlink /var/lib/containers/storage/overlay/l/DNI7KNQPHAGOICJHSF7SHRKUPQ: no such file or directory
Jan 29 09:44:40 worker-0 bash[1618]: Error: readlink /var/lib/containers/storage/overlay/l/DNI7KNQPHAGOICJHSF7SHRKUPQ: no such file or directory

The symlink /var/lib/containers/storage/overlay/l/DNI7KNQPHAGOICJHSF7SHRKUPQ is not present:

# ls /var/lib/containers/storage/overlay/l/DNI7KNQPHAGOICJHSF7SHRKUPQ
ls: cannot access '/var/lib/containers/storage/overlay/l/DNI7KNQPHAGOICJHSF7SHRKUPQ': No such file or directory

# ls /var/lib/containers/storage/overlay/l/
4VPW36PPPQK6VMOVWQ6UYMX4WB  CH5CFIAWO2DIRX7FQHFGPWWDGO  KLYCXOQTJTADTAYS5MTXFGAEQX  OSFCZDSSPLIQKUQKYBKDFHJR6X  QEJ7XY22QDTUEK7TDQ3IMCHBV2  W3UXWWGA4PWPVUUFSMSWXHOOQP
6GE5N5FB2ZWVGWZ4LIGH6ULLXC  DSTLZVJSOO34XFRAB2GNTL3WCZ  LAXCJHZLYSIRMH5TYSEEDVUVBD  OSPWEBJT65XO3LUHVIGFHDMC3M  TNXMA2P2LZD3XC4APMWRJ5MJTH  WNWHWN7UBIHAZDSNZAZ5NH7RJY
ADG3MNMTZCZRLERY4UMPIVXR2Z  F3RBQOU33QYZAVSBOYOUHOGFXF  LI6ZJSPWE7XUTLBWF3GFDUNNLL  OVW66ZCQ44J2R2P6UGEDW34VZE  UAMEZAZ7Y6KUYSPVKDXNEOCXZ4  WVY3XV22YULR4NVON7Z2I5KFXS
ATTE5PHKYVA3UPSEYELJ7AOJLC  GOK3YJG5FEHAWXCFIPQIODHEKM  MW5VU2HD3KAEZXSQ3GIXPZWFXB  OYQ26CMLSMB4B6DWGBAPINO6BL  UDKC5LJFIAWGTDOPLPJA7LPXZV  XME2FKZVZD7JGTZ2S2FCFXL7WM
BW2ZD5H2PZJEJFBEOUIJV7SVXZ  IJUJEIQX33FWN5SE5KODWRN3GX  NBQI6Z4CV7PRV7WNBGSY7UYHDM  P742FV32PIVXMJQESJOXYJKE2S  UQOHL7ZOO6LK3JR2UTJACIJJPA  YLJ27FU7FIQUK2IS54YUPYYQDP
CD5AQG5OLSSZUMERGFE6F327XC  JMUE4Z6FU3ATCKFEECAMXVDPYB  NIS3TIBFBAD4AEJ2JCR75HXCRZ  PZIJOKX7URYCKI425HSTZR4H3K  VTN56IUJPBS4ZJGXSK6MZZGD7L  YM4Q5BXGEBZPSQQQKNEPI55PMD

# journalctl -f -u kubelet
-- Logs begin at Wed 2021-01-27 18:48:31 UTC. --
Jan 28 13:58:40 worker-0 hyperkube[1846]: I0128 13:58:40.599664 1846 kuberuntime_manager.go:439] No sandbox for pod "ingress-canary-pghpc_openshift-ingress-canary(7daaaa5a-61ec-4a57-b132-8ef3eb84350b)" can be found. Need to start a new one
Jan 28 13:58:41 worker-0 hyperkube[1846]: E0128 13:58:41.029922 1846 remote_runtime.go:116] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_certified-operators-9r2jm_openshift-marketplace_81b2ae09-661c-46bc-b114-b461da8c7eae_0(dffd057d8ca16676a67cfe02b23107fa6cbdd9414ec94054b4e9c695bf7a486e): Multus: [openshift-marketplace/certified-operators-9r2jm]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Jan 28 13:58:41 worker-0 hyperkube[1846]: E0128 13:58:41.030024 1846 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "certified-operators-9r2jm_openshift-marketplace(81b2ae09-661c-46bc-b114-b461da8c7eae)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_certified-operators-9r2jm_openshift-marketplace_81b2ae09-661c-46bc-b114-b461da8c7eae_0(dffd057d8ca16676a67cfe02b23107fa6cbdd9414ec94054b4e9c695bf7a486e): Multus: [openshift-marketplace/certified-operators-9r2jm]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Jan 28 13:58:41 worker-0 hyperkube[1846]: E0128 13:58:41.030065 1846 kuberuntime_manager.go:755] createPodSandbox for pod "certified-operators-9r2jm_openshift-marketplace(81b2ae09-661c-46bc-b114-b461da8c7eae)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_certified-operators-9r2jm_openshift-marketplace_81b2ae09-661c-46bc-b114-b461da8c7eae_0(dffd057d8ca16676a67cfe02b23107fa6cbdd9414ec94054b4e9c695bf7a486e): Multus: [openshift-marketplace/certified-operators-9r2jm]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Jan 28 13:58:41 worker-0 hyperkube[1846]: E0128 13:58:41.030192 1846 pod_workers.go:191] Error syncing pod 81b2ae09-661c-46bc-b114-b461da8c7eae ("certified-operators-9r2jm_openshift-marketplace(81b2ae09-661c-46bc-b114-b461da8c7eae)"), skipping: failed to "CreatePodSandbox" for "certified-operators-9r2jm_openshift-marketplace(81b2ae09-661c-46bc-b114-b461da8c7eae)" with CreatePodSandboxError: "CreatePodSandbox for pod \"certified-operators-9r2jm_openshift-marketplace(81b2ae09-661c-46bc-b114-b461da8c7eae)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_certified-operators-9r2jm_openshift-marketplace_81b2ae09-661c-46bc-b114-b461da8c7eae_0(dffd057d8ca16676a67cfe02b23107fa6cbdd9414ec94054b4e9c695bf7a486e): Multus: [openshift-marketplace/certified-operators-9r2jm]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition"
Jan 28 13:58:41 worker-0 hyperkube[1846]: I0128 13:58:41.033777 1846 dynamic_cafile_content.go:182] Shutting down client-ca-bundle::/etc/kubernetes/kubelet-ca.crt
Jan 28 13:58:41 worker-0 systemd[1]: Stopping Kubernetes Kubelet...
Jan 28 13:58:41 worker-0 systemd[1]: kubelet.service: Succeeded.
Jan 28 13:58:41 worker-0 systemd[1]: Stopped Kubernetes Kubelet.
Jan 28 13:58:41 worker-0 systemd[1]: kubelet.service: Consumed 4min 18.362s CPU time

# journalctl -f -u crio.service
-- Logs begin at Wed 2021-01-27 18:48:31 UTC. --
Jan 28 13:59:26 worker-0 crio[1805]: time="2021-01-28 13:59:26.766410873Z" level=info msg="RunSandbox: releasing pod sandbox name: k8s_dns-default-fhkkc_openshift-dns_7084e5e8-ca89-4ab9-b008-ed117ecae376_0" id=41254314-9803-4f84-8042-46f3ce037061 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 28 13:59:26 worker-0 crio[1805]: time="2021-01-28 13:59:26.778879949Z" level=info msg="RunSandbox: removing pod sandbox from storage: 659c39749d06436812e0696e37a53243709fe5950aef4656fd872e46e26f8c68" id=189678d3-9440-4f1e-9c57-b3947192d2e9 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 28 13:59:26 worker-0 crio[1805]: time="2021-01-28 13:59:26.783828722Z" level=info msg="RunSandbox: releasing container name: k8s_POD_image-registry-6c76bc5c4b-b5pdq_openshift-image-registry_64f0b20b-19cc-466f-913e-18af31b87562_0" id=189678d3-9440-4f1e-9c57-b3947192d2e9 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 28 13:59:26 worker-0 crio[1805]: time="2021-01-28 13:59:26.783991186Z" level=info msg="RunSandbox: releasing pod sandbox name: k8s_image-registry-6c76bc5c4b-b5pdq_openshift-image-registry_64f0b20b-19cc-466f-913e-18af31b87562_0" id=189678d3-9440-4f1e-9c57-b3947192d2e9 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 28 13:59:26 worker-0 crio[1805]: time="2021-01-28 13:59:26.868902965Z" level=info msg="RunSandbox: removing pod sandbox from storage: 9460390d93d0e37245fa462e7ce0ee2421cf6f30b986d19d3d4b03859dc66985" id=b9fbc4d8-fa30-4493-908b-ee5e575673a2 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 28 13:59:26 worker-0 crio[1805]: time="2021-01-28 13:59:26.876717932Z" level=info msg="RunSandbox: releasing container name: k8s_POD_alertmanager-main-0_openshift-monitoring_d82f5260-e620-4e47-9060-61199b721245_0" id=b9fbc4d8-fa30-4493-908b-ee5e575673a2 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 28 13:59:26 worker-0 crio[1805]: time="2021-01-28 13:59:26.876750990Z" level=info msg="RunSandbox: releasing pod sandbox name: k8s_alertmanager-main-0_openshift-monitoring_d82f5260-e620-4e47-9060-61199b721245_0" id=b9fbc4d8-fa30-4493-908b-ee5e575673a2 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jan 28 13:59:27 worker-0 systemd[1]: crio.service: Succeeded.
Jan 28 13:59:27 worker-0 systemd[1]: Stopped Open Container Initiative Daemon.
Jan 28 13:59:27 worker-0 systemd[1]: crio.service: Consumed 3min 29.911s CPU time

Expected results:
The crio and kubelet services should start and pods should be Running.

Additional info:

# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="47.83.202101271010-0"
VERSION_ID="4.7"
OPENSHIFT_VERSION="4.7"
RHEL_VERSION="8.3"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 47.83.202101271010-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.7"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.7"
OSTREE_VERSION='47.83.202101271010-0'

[root@worker-0 core]# crio version
INFO[0000] Starting CRI-O, version: 1.20.0-0.rhaos4.7.gita1ab08a.el8.43, git: ()
Version:    1.20.0-0.rhaos4.7.gita1ab08a.el8.43
GoVersion:  go1.15.5
Compiler:   gc
Platform:   linux/ppc64le
Linkmode:   dynamic

# kubelet --version
Kubernetes v1.20.0+614d551
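To help correlate the two failures above, here is a minimal diagnostic sketch (an assumption on my part, not something from the original report: it presumes root access on the affected node and the default containers/storage overlay layout, where each layer directory carries a "link" file naming a symlink under overlay/l/ that points back at that layer's diff directory). Using the layer ID from the FailedCreatePodSandBox event:

# cat /var/lib/containers/storage/overlay/73c2a5de576aad28e8cf93c2826f26c39ebdeca4046e5378a27e9f62abf15e1a/link
# ls -l /var/lib/containers/storage/overlay/l/$(cat /var/lib/containers/storage/overlay/73c2a5de576aad28e8cf93c2826f26c39ebdeca4046e5378a27e9f62abf15e1a/link)

On a healthy layer both commands succeed. If either fails with "No such file or directory" (as the crio-wipe readlink errors for DNI7KNQPHAGOICJHSF7SHRKUPQ suggest here), the overlay metadata is inconsistent and CRI-O's attempt to recreate the missing symlinks fails, which matches the "error recreating the missing symlinks" error in the pod events.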
Hi Peter and Node team, I just wanted to note that this behavior is only randomly reproducible, but when it occurs it renders the node unusable. It is also worth noting that there is a similar GitHub issue where the same thing happened with podman: https://github.com/containers/podman/issues/5986. I wanted to pass along this update and leave the "Blocker?" flag evaluation up to your team, as this is currently at "High" severity.
I think https://github.com/cri-o/cri-o/pull/3999 is a potential fix for this kind of storage issue.
We just hit this while trying to add a remote worker to a cluster. The RHEL node rebooted, and kubelet and crio were dead afterwards. We are using 4.6.12.
Hi Giuseppe and Node team, could your team let us know whether this fix will be in 4.7? If not, Archana and her team would like to notify the Power doc writers so this bug can be included in the release notes. Thank you!
@Dan Li, it seems unlikely that the fix will hit 4.7
@Giuseppe Could you suggest a workaround that would help get the services back to a running state on the node? Thanks
the workaround is to `rm -rf /var/lib/containers/storage` and reboot the node
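Spelling that out for anyone else who lands here (a sketch based on the comment above, not verified by the original reporters; it assumes root access on the affected node, and note that it discards all local images and container state, which get re-pulled and re-created after the reboot — the stop step is only a precaution, since crio and kubelet are typically already inactive in this state):

# systemctl stop kubelet crio
# rm -rf /var/lib/containers/storage
# systemctl reboot

After the reboot, crio recreates /var/lib/containers/storage from scratch, so the inconsistent overlay metadata that blocked sandbox creation is gone.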
We have a potential fix upstream, but we are not backporting it to 4.7, so closing the issue for 4.7.