Bug 1984756

Summary: Node can't run pods due to cri-o ephemeral-storage
Product: OpenShift Container Platform
Reporter: Andy Bartlett <andbartl>
Component: Node
Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O
QA Contact: MinLi <minmli>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: aivaras.laimikis, aos-bugs, jnordell, minmli, pehunt
Version: 4.6.z
Target Milestone: ---
Target Release: 4.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-09 01:52:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1986038
Bug Blocks:

Description Andy Bartlett 2021-07-22 07:43:18 UTC
Description of problem:

I have a customer with the following issue:

From time to time we experience an issue where all pods stop running on a worker node, including the OVS and SDN pods. This prevents any pod from running on the node. The issue seems to be related to CRI-O storage, which has to be wiped as described in:

https://access.redhat.com/solutions/5350721 (the CRI-O service cannot run because of this, and the node is now in the NotReady state).

However, we would like to understand what causes it in the first place so that we can prevent it from recurring on our clusters.


Version-Release number of selected component (if applicable):

OCP 4.6


How reproducible:

Randomly



Comment 3 Peter Hunt 2021-07-23 15:15:40 UTC
looks like the following PR (once vendored into cri-o 1.19) will help

Comment 4 Peter Hunt 2021-07-26 15:20:08 UTC
cri-o PR attached

Comment 5 Peter Hunt 2021-07-27 13:08:10 UTC
PR merged!

Comment 11 MinLi 2021-08-25 04:04:18 UTC
Verified on version: 4.6.0-0.nightly-2021-08-22-084748


sh-4.4# grep "pause_image" /etc/crio/crio.conf.d/00-default
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:651d8b6291100887526f0d25b234f19e053de728d95db3efe13c8b87d9f26aaa"

sh-4.4# podman inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8a166dcfd8f85ca9aa9f5c9307a68b4293a882546dad0c4ba631a2a7b50fd19b 
In the output, GraphDriver.Data.UpperDir is:
/var/lib/containers/storage/overlay/bced96a354be948e9935c9369e231fe67b56a08d96341125eebb3fb640f817d8/diff

sh-4.4# cat /var/lib/containers/storage/overlay/bced96a354be948e9935c9369e231fe67b56a08d96341125eebb3fb640f817d8/lower
l/ADMOZBCMF7NQ4GNXCIETZHJJYF:l/YUMJB3LZKFZOP5NLQDVIAEZNCD:l/FYT6MBNACCKCVDL7LGXGCI34BZ:l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
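
For readers who want to check a layer's whole chain at once: the `lower` file above is a single line of colon-separated entries, each naming a short symlink under the overlay directory's `l/` subdirectory. The following is a minimal sketch, not part of CRI-O or podman, that prints every entry and whether its symlink still exists. The function name `check_lower_links` is hypothetical, and the script assumes the on-disk layout matches what the transcript shows.

```shell
#!/usr/bin/env bash
# Sketch only: check_lower_links is a hypothetical helper, not a CRI-O tool.
# Usage: check_lower_links <overlay_root> <layer_id>
check_lower_links() {
    local overlay_root=$1 layer_id=$2
    local lower_file="$overlay_root/$layer_id/lower"
    local entries entry

    if [ ! -r "$lower_file" ]; then
        echo "no lower file for $layer_id" >&2
        return 1
    fi

    # The file holds one line such as "l/AAA:l/BBB"; split it on ':'.
    IFS=':' read -r -a entries < "$lower_file"

    for entry in "${entries[@]}"; do
        if [ -L "$overlay_root/$entry" ]; then
            # Symlink present: show where it points.
            echo "ok      $entry -> $(readlink "$overlay_root/$entry")"
        else
            echo "missing $entry"
        fi
    done
}
```

On the node in this comment it could be called as `check_lower_links /var/lib/containers/storage/overlay bced96a354be948e9935c9369e231fe67b56a08d96341125eebb3fb640f817d8`. It only reads from the storage directory and changes nothing.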

sh-4.4# mkdir /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000
sh-4.4# chmod 700 /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000


sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
lrwxrwxrwx. 1 root root 72 Aug 25 03:12 /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y -> ../962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/diff

sh-4.4# cat /var/lib/containers/storage/overlay/962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/link
C43RDCK7OQ23S7JWUZQNIG6Q3Y

sh-4.4# rm /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
ls: cannot access '/var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y': No such file or directory

sh-4.4# ls -l /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000
total 0

$ oc create -f hello-pod.yaml
$ oc describe pod hello-pod-1
Events:
  Type     Reason                  Age        From                                                                  Message
  ----     ------                  ----       ----                                                                  -------
  Normal   Scheduled               <unknown>                                                                        Successfully assigned default/hello-pod-1 to minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal
  Normal   AddedInterface          22s        multus                                                                Add eth0 [10.131.0.194/23]
  Warning  FailedCreatePodSandBox  21s        kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to mount container k8s_POD_hello-pod-1_default_3b5ce75d-40d2-4432-901f-7576e7322b55_0 in pod sandbox k8s_hello-pod-1_default_3b5ce75d-40d2-4432-901f-7576e7322b55_0(69d07e1c47af4f7e7a0af5c6f569ede721e922a365a4a52df48fc10ab2e6a5bc): error recreating the missing symlinks: 1 error occurred:
           * error reading name of symlink for &{"0000000000000000000000000000000000000000000000000000000000000000" '\x06' %!q(os.FileMode=2147484096) {%!q(uint64=13184607) %!q(int64=63765457541) %!q(*time.Location=&{Local [{UTC 0 false}] [{-576460752303423488 0 false false}] UTC0 9223372036854775807 9223372036854775807 0xc000343140})} {'ﴀ' %!q(uint64=104889930) '\x02' '䇀' '\x00' '\x00' '\x00' '\x00' '\x06' 'က' '\x00' {%!q(int64=1629860974) %!q(int64=78988704)} {%!q(int64=1629860741) %!q(int64=13184607)} {%!q(int64=1629860760) %!q(int64=542425085)} ['\x00' '\x00' '\x00']}}: open /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000/link: no such file or directory
  Normal   AddedInterface  19s  multus                                                                Add eth0 [10.131.0.195/23]
  Normal   Pulling         18s  kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Pulling image "httpd:latest"
  Normal   Pulled          12s  kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Successfully pulled image "httpd:latest" in 6.165010522s
  Normal   Created         12s  kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Created container hello-pod
  Normal   Started         12s  kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Started container hello-pod

sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y 
lrwxrwxrwx. 1 root root 72 Aug 25 03:32 /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y -> ../962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/diff
(We can see that CRI-O recreated the missing symlink.)
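
The two conditions exercised in this verification (a layer directory with no `link` file, and a `link` name whose symlink under `l/` is gone) can also be looked for across the whole overlay directory. The sketch below is an illustration under the assumption that every subdirectory other than `l` is a layer directory. `find_broken_layers` is a hypothetical name, and the script only reports problems; it does not repair anything and does not reproduce CRI-O's own logic.

```shell
#!/usr/bin/env bash
# Sketch only: find_broken_layers is a hypothetical helper, not a CRI-O tool.
# Usage: find_broken_layers <overlay_root>
find_broken_layers() {
    local overlay_root=$1
    local dir layer_id name

    for dir in "$overlay_root"/*/; do
        [ -d "$dir" ] || continue            # glob matched nothing
        layer_id=$(basename "$dir")
        [ "$layer_id" = "l" ] && continue    # skip the symlink directory itself

        if [ ! -f "$dir/link" ]; then
            # Same condition as the all-zero directory created in the test.
            echo "no-link-file $layer_id"
            continue
        fi

        name=$(cat "$dir/link")
        if [ ! -L "$overlay_root/l/$name" ]; then
            # Same condition as the symlink removed with rm in the test.
            echo "missing-symlink $layer_id l/$name"
        fi
    done
}
```

Run as `find_broken_layers /var/lib/containers/storage/overlay`. An empty result means neither condition was found.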

Comment 14 errata-xmlrpc 2021-09-09 01:52:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.44 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3395