Description of problem:
I have a customer with the following issue: from time to time, all pods stop running on a worker node, including the ovs and sdn pods. This prevents any pod from running on that node. The issue appears to be related to CRI-O storage, which has to be wiped as described in https://access.redhat.com/solutions/5350721 (the CRI-O service cannot run because of it, and the node ends up in NotReady state). However, we would like to understand what is causing this in the first place and prevent our clusters from this behaviour.

Version-Release number of selected component (if applicable):
OCP 4.6

How reproducible:
Randomly
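For triage, a node hitting this symptom typically shows dangling short-name symlinks under /var/lib/containers/storage/overlay/l (the links cri-o later complains about). A minimal, non-destructive detection sketch — the `count_dangling_links` helper name is an assumption for illustration, not a cri-o or KCS-documented tool:

```shell
# count_dangling_links STORAGE_ROOT
# Scans the overlay short-name link directory and prints the number of
# symlinks whose target layer "diff" directory no longer exists.
# Assumes the standard containers/storage overlay layout:
#   STORAGE_ROOT/overlay/l/<short-name> -> ../<layer-id>/diff
count_dangling_links() {
    storage="$1"
    broken=0
    for l in "$storage"/overlay/l/*; do
        # -L: entry is a symlink; ! -e: its target no longer resolves
        if [ -L "$l" ] && [ ! -e "$l" ]; then
            echo "dangling layer link: $l" >&2
            broken=$((broken + 1))
        fi
    done
    echo "$broken"
}
```

Run as root on the affected node, e.g. `count_dangling_links /var/lib/containers/storage`; a non-zero count suggests the storage state that previously required the full wipe from the KCS article.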
It looks like the following PR (once vendored into cri-o 1.19) will help.
cri-o PR attached
PR merged!
Verified on version: 4.6.0-0.nightly-2021-08-22-084748

sh-4.4# grep "pause_image" /etc/crio/crio.conf.d/00-default
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:651d8b6291100887526f0d25b234f19e053de728d95db3efe13c8b87d9f26aaa"
sh-4.4# podman inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8a166dcfd8f85ca9aa9f5c9307a68b4293a882546dad0c4ba631a2a7b50fd19b

From the inspect output, GraphDriver.Data.UpperDir is:
/var/lib/containers/storage/overlay/bced96a354be948e9935c9369e231fe67b56a08d96341125eebb3fb640f817d8/diff

sh-4.4# cat /var/lib/containers/storage/overlay/bced96a354be948e9935c9369e231fe67b56a08d96341125eebb3fb640f817d8/lower
l/ADMOZBCMF7NQ4GNXCIETZHJJYF:l/YUMJB3LZKFZOP5NLQDVIAEZNCD:l/FYT6MBNACCKCVDL7LGXGCI34BZ:l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
sh-4.4# mkdir /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000
sh-4.4# chmod 700 /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000
sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
lrwxrwxrwx. 1 root root 72 Aug 25 03:12 /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y -> ../962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/diff
sh-4.4# cat /var/lib/containers/storage/overlay/962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/link
C43RDCK7OQ23S7JWUZQNIG6Q3Y
sh-4.4# rm /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
ls: cannot access '/var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y': No such file or directory
sh-4.4# ls -l /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000
total 0

$ oc create -f hello-pod.yaml
$ oc describe pod hello-pod-1
Events:
  Type     Reason                  Age        From      Message
  ----     ------                  ---        ----      -------
  Normal   Scheduled               <unknown>            Successfully assigned default/hello-pod-1 to minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal
  Normal   AddedInterface          22s        multus    Add eth0 [10.131.0.194/23]
  Warning  FailedCreatePodSandBox  21s        kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to mount container k8s_POD_hello-pod-1_default_3b5ce75d-40d2-4432-901f-7576e7322b55_0 in pod sandbox k8s_hello-pod-1_default_3b5ce75d-40d2-4432-901f-7576e7322b55_0(69d07e1c47af4f7e7a0af5c6f569ede721e922a365a4a52df48fc10ab2e6a5bc): error recreating the missing symlinks: 1 error occurred: * error reading name of symlink for &{"0000000000000000000000000000000000000000000000000000000000000000" '\x06' %!q(os.FileMode=2147484096) {%!q(uint64=13184607) %!q(int64=63765457541) %!q(*time.Location=&{Local [{UTC 0 false}] [{-576460752303423488 0 false false}] UTC0 9223372036854775807 9223372036854775807 0xc000343140})} {'ﴀ' %!q(uint64=104889930) '\x02' '䇀' '\x00' '\x00' '\x00' '\x00' '\x06' 'က' '\x00' {%!q(int64=1629860974) %!q(int64=78988704)} {%!q(int64=1629860741) %!q(int64=13184607)} {%!q(int64=1629860760) %!q(int64=542425085)} ['\x00' '\x00' '\x00']}}: open /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000/link: no such file or directory
  Normal   AddedInterface          19s        multus    Add eth0 [10.131.0.195/23]
  Normal   Pulling                 18s        kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Pulling image "httpd:latest"
  Normal   Pulled                  12s        kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Successfully pulled image "httpd:latest" in 6.165010522s
  Normal   Created                 12s        kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Created container hello-pod
  Normal   Started                 12s        kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Started container hello-pod

sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
lrwxrwxrwx. 1 root root 72 Aug 25 03:32 /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y -> ../962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/diff

(We can see that cri-o recreated the missing symlink.)
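The recovery step the transcript exercises ("error recreating the missing symlinks") can be approximated manually: each layer directory records its short link name in a `link` file, so a missing entry under `overlay/l` can be rebuilt from it. A sketch assuming the overlay layout shown above — `restore_layer_links` is a hypothetical helper for illustration, not a cri-o command, and cri-o's actual implementation does this internally in Go:

```shell
# restore_layer_links STORAGE_ROOT
# For every layer that records its short name in a "link" file, recreate
# the overlay/l/<short-name> -> ../<layer-id>/diff symlink if missing.
restore_layer_links() {
    storage="$1"
    mkdir -p "$storage/overlay/l"
    for linkfile in "$storage"/overlay/*/link; do
        [ -f "$linkfile" ] || continue
        layer=$(dirname "$linkfile")           # .../overlay/<layer-id>
        name=$(cat "$linkfile")                # short name, e.g. C43RDCK7OQ23S7JWUZQNIG6Q3Y
        target="../$(basename "$layer")/diff"  # relative, matching the ls -l output above
        if [ ! -L "$storage/overlay/l/$name" ]; then
            ln -s "$target" "$storage/overlay/l/$name"
            echo "restored $name -> $target"
        fi
    done
}
```

The function is idempotent: re-running it leaves already-present links untouched, mirroring how the fixed cri-o only recreates links it finds missing.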
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.44 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3395