Bug 1984756 - Node can't run pods due to cri-o ephemeral-storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.z
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Depends On: 1986038
Blocks:
 
Reported: 2021-07-22 07:43 UTC by Andy Bartlett
Modified: 2021-09-09 01:53 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-09 01:52:52 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github containers storage pull 971 0 None closed [1.20-stable] overlay.recreateSymlinks: handle missing "link" files, add a test 2021-07-27 13:08:02 UTC
Github cri-o cri-o pull 5128 0 None closed [1.19] vendor: bump c/storage to latest v1.20-stable 2021-07-27 13:08:05 UTC
Red Hat Product Errata RHBA-2021:3395 0 None None None 2021-09-09 01:53:14 UTC

Description Andy Bartlett 2021-07-22 07:43:18 UTC
Description of problem:

I have a customer with the following issue:

From time to time we experience an issue where all pods stop running on a worker node, including the ovs and sdn pods. This prevents any pod from running on the node. The issue seems to be related to cri-o storage, which needs to be wiped as described in:

https://access.redhat.com/solutions/5350721 (the CRI-O service cannot run because of this, and the node is left in NotReady state).

However, we would like to understand what is causing this in the first place and prevent our clusters from this behaviour.
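The solution linked above amounts to stopping kubelet and cri-o, wiping the local container storage, and restarting both. A minimal sketch of that procedure follows; a DRY_RUN guard is added here so the destructive commands are only printed by default, and the exact steps in KCS 5350721 may differ in detail, so treat this as illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the storage-wipe workaround for a node stuck in NotReady.
# DRY_RUN=1 (the default here) only prints the commands; the steps in
# the linked Red Hat solution may differ in detail.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

wipe_node_storage() {
  run systemctl stop kubelet
  run systemctl stop crio
  # "crio wipe" clears cri-o's record of containers; removing the whole
  # storage directory is the heavier-handed variant (destroys local images).
  run crio wipe
  run rm -rf /var/lib/containers/storage
  run systemctl start crio
  run systemctl start kubelet
}

wipe_node_storage
```

On a real node this would run as root from a debug shell with DRY_RUN=0; everything in local container storage is lost and images are re-pulled afterwards.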


Version-Release number of selected component (if applicable):

OCP 4.6


How reproducible:

Randomly


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Peter Hunt 2021-07-23 15:15:40 UTC
looks like the following PR (once vendored into cri-o 1.19) will help

Comment 4 Peter Hunt 2021-07-26 15:20:08 UTC
cri-o PR attached

Comment 5 Peter Hunt 2021-07-27 13:08:10 UTC
PR merged!

Comment 11 MinLi 2021-08-25 04:04:18 UTC
Verified on version: 4.6.0-0.nightly-2021-08-22-084748


sh-4.4# grep "pause_image" /etc/crio/crio.conf.d/00-default
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:651d8b6291100887526f0d25b234f19e053de728d95db3efe13c8b87d9f26aaa"

sh-4.4# podman inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8a166dcfd8f85ca9aa9f5c9307a68b4293a882546dad0c4ba631a2a7b50fd19b 
We find that GraphDriver.Data.UpperDir is:
/var/lib/containers/storage/overlay/bced96a354be948e9935c9369e231fe67b56a08d96341125eebb3fb640f817d8/diff

sh-4.4# cat /var/lib/containers/storage/overlay/bced96a354be948e9935c9369e231fe67b56a08d96341125eebb3fb640f817d8/lower
l/ADMOZBCMF7NQ4GNXCIETZHJJYF:l/YUMJB3LZKFZOP5NLQDVIAEZNCD:l/FYT6MBNACCKCVDL7LGXGCI34BZ:l/C43RDCK7OQ23S7JWUZQNIG6Q3Y

sh-4.4# mkdir /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000
sh-4.4# chmod 700 /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000


sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
lrwxrwxrwx. 1 root root 72 Aug 25 03:12 /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y -> ../962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/diff

sh-4.4# cat /var/lib/containers/storage/overlay/962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/link
C43RDCK7OQ23S7JWUZQNIG6Q3Y

sh-4.4# rm /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y
ls: cannot access '/var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y': No such file or directory

sh-4.4# ls -l /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000
total 0
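The fault injection above — a layer directory with no "link" file plus a deleted shortname symlink — can be rehearsed in a scratch directory without a cluster. The layer ID and shortname below are copied from the transcript and are otherwise arbitrary:

```shell
#!/usr/bin/env bash
# Rehearse the fault injection from the verification steps above in a
# scratch directory (no cluster needed). IDs copied from the transcript.
set -euo pipefail

root="$(mktemp -d)/overlay"
mkdir -p "$root/l"

# A healthy layer: a diff dir, a "link" file naming the shortname, and
# the l/ shortname symlink pointing back at the diff dir.
layer=962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10
short=C43RDCK7OQ23S7JWUZQNIG6Q3Y
mkdir -p "$root/$layer/diff"
echo "$short" > "$root/$layer/link"
ln -s "../$layer/diff" "$root/l/$short"

# Fault 1: a layer directory with no "link" file at all (the case the
# containers/storage fix handles).
bogus=0000000000000000000000000000000000000000000000000000000000000000
mkdir "$root/$bogus"
chmod 700 "$root/$bogus"

# Fault 2: delete the healthy layer's shortname symlink.
rm "$root/l/$short"

echo "scratch overlay prepared at $root"
```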

$ oc create -f hello-pod.yaml
$ oc describe pod hello-pod-1
Events:
  Type     Reason                  Age        From                                                                  Message
  ----     ------                  ----       ----                                                                  -------
  Normal   Scheduled               <unknown>                                                                        Successfully assigned default/hello-pod-1 to minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal
  Normal   AddedInterface          22s        multus                                                                Add eth0 [10.131.0.194/23]
  Warning  FailedCreatePodSandBox  21s        kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to mount container k8s_POD_hello-pod-1_default_3b5ce75d-40d2-4432-901f-7576e7322b55_0 in pod sandbox k8s_hello-pod-1_default_3b5ce75d-40d2-4432-901f-7576e7322b55_0(69d07e1c47af4f7e7a0af5c6f569ede721e922a365a4a52df48fc10ab2e6a5bc): error recreating the missing symlinks: 1 error occurred:
           * error reading name of symlink for &{"0000000000000000000000000000000000000000000000000000000000000000" '\x06' %!q(os.FileMode=2147484096) {%!q(uint64=13184607) %!q(int64=63765457541) %!q(*time.Location=&{Local [{UTC 0 false}] [{-576460752303423488 0 false false}] UTC0 9223372036854775807 9223372036854775807 0xc000343140})} {'ﴀ' %!q(uint64=104889930) '\x02' '䇀' '\x00' '\x00' '\x00' '\x00' '\x06' 'က' '\x00' {%!q(int64=1629860974) %!q(int64=78988704)} {%!q(int64=1629860741) %!q(int64=13184607)} {%!q(int64=1629860760) %!q(int64=542425085)} ['\x00' '\x00' '\x00']}}: open /var/lib/containers/storage/overlay/0000000000000000000000000000000000000000000000000000000000000000/link: no such file or directory
  Normal   AddedInterface  19s  multus                                                                Add eth0 [10.131.0.195/23]
  Normal   Pulling         18s  kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Pulling image "httpd:latest"
  Normal   Pulled          12s  kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Successfully pulled image "httpd:latest" in 6.165010522s
  Normal   Created         12s  kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Created container hello-pod
  Normal   Started         12s  kubelet, minmli08244602-znd2s-worker-a-bk969.c.openshift-qe.internal  Started container hello-pod

sh-4.4# ls -l /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y 
lrwxrwxrwx. 1 root root 72 Aug 25 03:32 /var/lib/containers/storage/overlay/l/C43RDCK7OQ23S7JWUZQNIG6Q3Y -> ../962650a5a3906dfa582dc56db3a4f60756b7d77d47e0309a60585cbff60eae10/diff
(we can see that cri-o recreated the missing symlink)
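The repair observed above can be approximated in shell: walk each layer directory, regenerate a missing "link" file, and repoint the l/ shortname symlink at the layer's diff directory. The actual fix is Go code inside containers/storage and generates random base32-style shortnames; here the shortname is derived from the layer ID purely to keep the sketch deterministic:

```shell
#!/usr/bin/env bash
# Shell approximation of the fixed overlay.recreateSymlinks behaviour.
# The real fix is Go code in containers/storage; this only illustrates
# the repair: regenerate missing "link" files and l/ shortname symlinks.
set -euo pipefail

recreate_symlinks() {
  local root="$1" dir layer short
  mkdir -p "$root/l"
  for dir in "$root"/*/; do
    layer="$(basename "$dir")"
    [ "$layer" = "l" ] && continue
    if [ ! -f "$root/$layer/link" ]; then
      # Missing "link" file: mint a shortname (derived from the layer ID
      # here; the real code generates a random one) and persist it.
      short="$(printf '%s' "$layer" | tr 'a-f' 'A-F' | cut -c1-26)"
      printf '%s' "$short" > "$root/$layer/link"
    else
      short="$(cat "$root/$layer/link")"
    fi
    # Ensure the shortname symlink points at this layer's diff directory.
    ln -sfn "../$layer/diff" "$root/l/$short"
  done
}
```

Given a scratch directory laid out like /var/lib/containers/storage/overlay, running `recreate_symlinks "$dir"` restores both the "link" files and the l/ symlinks for every layer.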

Comment 14 errata-xmlrpc 2021-09-09 01:52:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.44 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3395

