Bug 2091214

Summary: Pods are unable to start due to error "Failed to set quota limit for projid xxxxx on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory"
Product: OpenShift Container Platform
Reporter: Dhruv Gautam <dgautam>
Component: Containers
Assignee: Peter Hunt <pehunt>
Status: CLOSED ERRATA
QA Contact: Sunil Choudhary <schoudha>
Severity: medium
Priority: unspecified
Version: 4.9
CC: dseals, dwalsh, harpatil, jnovy, mfuruta, nalin, pmagotra, rubrodri, sbelmasg, schoudha, tsweeney, vlours
Target Milestone: ---
Target Release: 4.12.z
Hardware: Unspecified
OS: Unspecified
Fixed In Version: cri-o-1.25.2-7.rhaos4.12.git1a6bb9c.el9
Last Closed: 2023-02-20 18:30:53 UTC
Type: Bug

Description Dhruv Gautam 2022-05-27 21:11:04 UTC
Description of problem:
Pods are unable to start due to the error "Failed to set quota limit for projid xxxxx on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory".

The following event is streamed:
14m        Normal   OperatorStatusChanged              deployment/openshift-kube-scheduler-operator              Status for clusteroperator/kube-scheduler changed: Degraded message changed from "StaticPodsDegraded: pod/openshift-kube-scheduler-nonprod2-openshift-cr5cj-master-0 container \"kube-scheduler\" is waiting: CreateContainerError: error creating read-write layer with ID \"ad27e79312dd780a6b2a266996000490596b5a2df8f8de152de0118f9e2c6d26\": Failed to set quota limit for projid 28032 on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory" to "StaticPodsDegraded: pod/openshift-kube-scheduler-nonprod2-openshift-cr5cj-master-0 container \"kube-scheduler\" is waiting: CreateContainerError: error creating read-write layer with ID \"45a7ead5e0e626f65591b437c062855c7bfe25efa00abc3dd409f9f14cc2f876\": Failed to set quota limit for projid 28035 on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory"

Version-Release number of selected component (if applicable):
4.8+

How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
Some pods are crashing and cannot start.

Expected results:
The pods should start without any issues.

Additional info:
- Cleaning up crio storage or rebooting the node is known to work around the issue for now; a rough sketch of the storage cleanup follows.
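
For reference, a minimal sketch of what "cleaning up crio storage" typically involves, assuming direct root access to the affected node; the exact procedure may differ, and this removes all local container data on that node:

~~~
# Hedged sketch: wipe CRI-O's local storage on the affected node (root shell on the node).
# WARNING: this removes all containers and images cached on the node.
systemctl stop kubelet crio
crio wipe -f          # force-wipe CRI-O's containers and images
systemctl start crio kubelet
~~~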

Comment 12 rubrodri 2022-08-31 12:47:26 UTC
Hi,

I'm also facing this issue with a customer. Here is the workaround we use to make the node work again without rebooting it or restarting CRI-O:

1. Use SSH to connect to the affected node and run:

> sudo podman info

This will recreate the `backingFsBlockDev` file and allow pods to start on the node again.

2. Automate the remediation using a daemonset that watches the CRI-O logs and executes the `sudo podman info` command when necessary.
The daemonset must run with the `privileged` SCC and mount the host filesystem. A commented sketch of the remediation loop is included after the quoted command below.

The command that the daemonset executes:
> chroot /host /bin/sh -ec journalctl -u crio.service -f | sed -u -nr 's#.*error creating pod sandbox with name ("[a-zA-Z0-9_-]*").*Failed to set quota limit for file or directory#\1#p' | while read -r line ; do echo "$(date) -Error creating the following sandbox : $line" ; sudo podman info 2>&1 1>/dev/null ; done
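
For readability, here is a minimal sketch of the same remediation loop with the quoting made explicit. It is not the exact daemonset payload: it assumes it runs as root directly on the node (or via `chroot /host` from a privileged daemonset pod that mounts the host filesystem), and it matches the quota error message with `grep` instead of the `sed` expression above:

~~~
#!/bin/bash
# Hedged sketch of the remediation loop described above.
# Assumes root on the node, or `chroot /host` from a privileged pod with the host fs mounted.
set -euo pipefail

journalctl -u crio.service -f --no-pager |
  grep --line-buffered 'Failed to set quota limit' |
  while read -r line; do
    echo "$(date) - quota error seen, running podman info to recreate backingFsBlockDev: ${line}"
    # `podman info` re-initializes the overlay storage driver, which recreates
    # /var/lib/containers/storage/overlay/backingFsBlockDev when it is missing.
    podman info > /dev/null 2>&1 || true
  done
~~~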

Comment 13 Vincent Lours 2022-09-01 06:48:43 UTC
I have created the Jira ticket https://issues.redhat.com/browse/OCPBUGS-788 to track this issue in the new ticketing system.

In addition, I've completed a review of the different cases attached to the KCS:
- most of them are related to RHOCP 4.8, but some are on 4.9.
- most of them are related to vSphere, but some are on unknown platforms, and one is on AWS.

Further, in order to reply to the questions from comment #7:
> If the device node is being removed sometime after CRI-O is started, though, then it's rather concerning.

==> In the latest and current case (https://access.redhat.com/support/cases/03303118), 2 nodes have been impacted. CRI-O had been running for weeks or months:
~~~
[vlours@supportshell-1 ~]$ grep active 03303118/0040-sosreport-sindceapocpd55-03303118-2022-09-01-hjeugbu.tar.xz/sosreport-sindceapocpd55-03303118-2022-09-01-hjeugbu/sos_commands/crio/systemctl_status_crio
   Active: active (running) since Fri 2022-07-01 10:10:55 UTC; 2 months 0 days ago
~~~

~~~
[vlours@supportshell-1 ~]$ grep active 03303118/0030-sosreport-sindceapocpd51-03303118-2022-09-01-etxiaok.tar.xz/sosreport-sindceapocpd51-03303118-2022-09-01-etxiaok/sos_commands/crio/systemctl_status_crio
   Active: active (running) since Fri 2022-08-05 05:57:54 UTC; 3 weeks 5 days ago
~~~

> I may be misreading this, but the type=SYSCALL entry suggests that it's a podman process.  podman shares CRI-O's startup behavior, so I'd expect it to also recreate the device node almost immediately after removing it.  That actually makes running `podman images` or a similar command a viable method of recreating the device node.

==> The KCS now provides a simple solution using `sudo podman info`.

> If CRI-O or podman or whichever process removes the device node isn't subsequently recreating the device node, then we likely have our answer.

==> The file is not automatically recreated. We have asked a few customers to enable an audit rule to capture the process that removes the file.
Awaiting feedback from the customers.

Would it be a permanent solution to have the code recreate the file when it is detected as missing, after warning about its absence?
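
For illustration only (this is not the shipped fix, which was delivered in the cri-o package listed in "Fixed In Version" above): recreating the missing file by hand amounts to creating a block-device node whose major:minor matches the block device backing /var/lib/containers/storage, roughly:

~~~
# Rough, hedged illustration of recreating the missing device node by hand.
# The major:minor must match the block device backing the storage directory.
STORAGE_DIR=/var/lib/containers/storage
MAJMIN=$(findmnt -rn -o MAJ:MIN --target "$STORAGE_DIR")    # e.g. "8:4"
mknod "$STORAGE_DIR/overlay/backingFsBlockDev" b "${MAJMIN%:*}" "${MAJMIN#*:}"
chmod 600 "$STORAGE_DIR/overlay/backingFsBlockDev"
~~~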

Comment 25 errata-xmlrpc 2023-02-20 18:30:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.4 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0769

Comment 27 Red Hat Bugzilla 2023-09-18 04:37:59 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.