Bug 1922154
Summary: | Upon node reboot, crio and kubelet service fail to start | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | pdsilva
Component: | Node | Assignee: | Giuseppe Scrivano <gscrivan>
Node sub component: | CRI-O | QA Contact: | MinLi <minmli>
Status: | CLOSED UPSTREAM | Docs Contact: |
Severity: | high | |
Priority: | unspecified | CC: | aos-bugs, aprabhak, danili, mdunnett, nagrawal
Version: | 4.7 | |
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | ppc64le | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-03-19 11:29:15 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
pdsilva
2021-01-29 11:18:24 UTC
Hi Peter and Node team, I just wanted to state that this behavior is randomly reproducible, but when it occurs, it renders the node unusable. It is also worth noting that there is a similar GitHub issue where the same thing happened with podman: https://github.com/containers/podman/issues/5986. Therefore, I wanted to give you this update and leave the "Blocker?" flag evaluation up to your team, as this is currently a "High" severity.

I think https://github.com/cri-o/cri-o/pull/3999 is a potential fix for such issues with the storage.

We just hit this trying to add a remote worker to a cluster. The RHEL node rebooted and kubelet and crio were dead. We are using 4.6.12.

Hi Giuseppe and node team, could your team let us know whether this fix will be in 4.7? If not, Archana and her team hope to notify the Power doc writers to include this bug in the release notes. Thank you!

@Dan Li, it seems unlikely that the fix will hit 4.7.

@Giuseppe Could you suggest any workaround that would help get the services back to a running state on the node? Thanks.

The workaround is to `rm -rf /var/lib/containers/storage` and reboot the node.

We have a potential fix upstream, but we are not backporting it to 4.7, so I am closing the issue for 4.7.
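The workaround mentioned at the end of the thread (`rm -rf /var/lib/containers/storage`, then reboot) can be sketched as a small script. This is a simulation, not the exact commands run on a node: `STORAGE_DIR` points at a scratch directory here so the sketch is safe to execute anywhere; on a real node the path is `/var/lib/containers/storage` and a reboot must follow so crio and kubelet recreate the storage from scratch.

```shell
# Simulated version of the workaround from the bug comments.
# Stand-in for /var/lib/containers/storage (assumption: scratch dir for safety).
STORAGE_DIR="$(mktemp -d)/storage"
mkdir -p "${STORAGE_DIR}/overlay" "${STORAGE_DIR}/overlay-containers"

# Wipe the (possibly corrupted) container storage tree, as suggested in the bug.
rm -rf "${STORAGE_DIR}"

# On an actual node, follow up with a reboot so the services start cleanly:
#   systemctl reboot
[ -d "${STORAGE_DIR}" ] || echo "storage removed"
```

After the reboot, CRI-O rebuilds its storage on first start, so previously pulled images are lost and will be re-pulled as pods are scheduled.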