Bug 2091214 - Pods are unable to start due to error "Failed to set quota limit for projid xxxxx on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.12.z
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-27 21:11 UTC by Dhruv Gautam
Modified: 2023-09-18 04:37 UTC
CC List: 12 users

Fixed In Version: cri-o-1.25.2-7.rhaos4.12.git1a6bb9c.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-20 18:30:53 UTC
Target Upstream Version:
Embargoed:




Links:
- Red Hat Issue Tracker OCPBUGS-773 (last updated 2022-09-19 00:11:18 UTC)
- Red Hat Issue Tracker OCPBUGS-788 (last updated 2022-09-19 00:11:18 UTC)
- Red Hat Knowledge Base (Solution) 6844091 (last updated 2022-10-13 17:25:14 UTC)
- Red Hat Product Errata RHSA-2023:0769 (last updated 2023-02-20 18:30:55 UTC)

Description Dhruv Gautam 2022-05-27 21:11:04 UTC
Description of problem:
Pods are unable to start due to the error "Failed to set quota limit for projid xxxxx on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory".

Below event is streamed:
14m        Normal   OperatorStatusChanged              deployment/openshift-kube-scheduler-operator              Status for clusteroperator/kube-scheduler changed: Degraded message changed from "StaticPodsDegraded: pod/openshift-kube-scheduler-nonprod2-openshift-cr5cj-master-0 container \"kube-scheduler\" is waiting: CreateContainerError: error creating read-write layer with ID \"ad27e79312dd780a6b2a266996000490596b5a2df8f8de152de0118f9e2c6d26\": Failed to set quota limit for projid 28032 on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory" to "StaticPodsDegraded: pod/openshift-kube-scheduler-nonprod2-openshift-cr5cj-master-0 container \"kube-scheduler\" is waiting: CreateContainerError: error creating read-write layer with ID \"45a7ead5e0e626f65591b437c062855c7bfe25efa00abc3dd409f9f14cc2f876\": Failed to set quota limit for projid 28035 on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory"
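
On an affected node, the missing device node can be confirmed directly (the path is taken from the error message above; the output line shown is the expected failure mode, not output captured from this cluster):

~~~
$ ls -l /var/lib/containers/storage/overlay/backingFsBlockDev
ls: cannot access '/var/lib/containers/storage/overlay/backingFsBlockDev': No such file or directory
~~~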

Version-Release number of selected component (if applicable):
4.8+

How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
Some pods crash and fail to start.

Expected results:
The pods should start without any issues.

Additional info:
- Cleaning up the CRI-O storage or rebooting the node is known to work around the issue for now (see the sketch below).
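
A hedged sketch of the storage cleanup (destructive: it removes all local container images and containers on the node; confirm the supported procedure for your cluster version before using it):

~~~
# Quiesce the container runtime before wiping its storage:
systemctl stop kubelet
systemctl stop crio

# Wipe CRI-O's local storage, then reboot so everything is recreated:
crio wipe -f
systemctl reboot
~~~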

Comment 12 rubrodri 2022-08-31 12:47:26 UTC
Hi,

I'm also facing this issue with a customer. Here is the workaround we use to make the node work again without rebooting it or restarting CRI-O:

1. Use ssh to connect to the node having the issue, and use:

> sudo podman info

This will recreate the `backingFsBlockDev` file and allow pods to be started on the node.
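
For context, a rough sketch of what that recreation amounts to, assuming (as containers/storage does for XFS project quotas) that `backingFsBlockDev` is a block-device node pointing at the filesystem backing the overlay directory; running `podman info` remains the straightforward way to trigger it:

~~~
# Resolve the major:minor numbers of the filesystem containing the overlay
# directory (-T makes findmnt look up the filesystem that holds the path):
devnums=$(findmnt -n -o MAJ:MIN -T /var/lib/containers/storage/overlay)

# Recreate the block-device node that the quota code opens in order to issue
# project-quota calls against the backing filesystem:
mknod -m 600 /var/lib/containers/storage/overlay/backingFsBlockDev b "${devnums%%:*}" "${devnums##*:}"
~~~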

2. Automate the remediation using a daemonset that analyses the logs and executes the `sudo podman info` command when necessary.
The daemonset must run with the `privileged` SCC and mount the host filesystem.

The command that the daemonset executes:

> chroot /host /bin/sh -ec 'journalctl -u crio.service -f | sed -u -nr "s#.*error creating pod sandbox with name (\"[a-zA-Z0-9_-]*\").*Failed to set quota limit for projid.*no such file or directory#\1#p" | while read -r line ; do echo "$(date) - Error creating the following sandbox: $line" ; sudo podman info 2>&1 1>/dev/null ; done'
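
A hedged manifest sketch for such a daemonset (the names and image are hypothetical; only the privileged security context and the host filesystem mount are essential, and the namespace's service account must first be granted the `privileged` SCC, e.g. `oc adm policy add-scc-to-user privileged -z default -n <namespace>`):

~~~
oc apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: backingfsblockdev-remediation   # hypothetical name
  namespace: backingfsblockdev-fix      # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: backingfsblockdev-remediation
  template:
    metadata:
      labels:
        app: backingfsblockdev-remediation
    spec:
      containers:
      - name: remediate
        image: registry.redhat.io/ubi8/ubi   # any image with a shell; an assumption
        securityContext:
          privileged: true                   # needed to chroot into the host
        command:
        - chroot
        - /host
        - /bin/sh
        - -ec
        - |
          journalctl -u crio.service -f |
            sed -u -nr "s#.*error creating pod sandbox with name (\"[a-zA-Z0-9_-]*\").*Failed to set quota limit for projid.*no such file or directory#\1#p" |
            while read -r line; do
              echo "$(date) - Error creating the following sandbox: $line"
              sudo podman info 2>&1 1>/dev/null
            done
        volumeMounts:
        - name: host
          mountPath: /host
      volumes:
      - name: host
        hostPath:
          path: /                            # the host filesystem
EOF
~~~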

Comment 13 Vincent Lours 2022-09-01 06:48:43 UTC
I have created the Jira ticket https://issues.redhat.com/browse/OCPBUGS-788 to track this issue in the new ticketing system.

In addition, I've completed a review of the different cases attached to the KCS:
- most of them are related to RHOCP 4.8, but some to 4.9.
- most of them are related to vSphere, but some are unknown, and one is on AWS.

Further, to reply to the questions from comment #7:
> If the device node is being removed sometime after CRI-O is started, though, then it's rather concerning.

==> From the latest and current case (https://access.redhat.com/support/cases/03303118), 2 nodes have been impacted. CRI-O had been running for weeks or months:
~~~
[vlours@supportshell-1 ~]$ grep active 03303118/0040-sosreport-sindceapocpd55-03303118-2022-09-01-hjeugbu.tar.xz/sosreport-sindceapocpd55-03303118-2022-09-01-hjeugbu/sos_commands/crio/systemctl_status_crio
   Active: active (running) since Fri 2022-07-01 10:10:55 UTC; 2 months 0 days ago
~~~

~~~
[vlours@supportshell-1 ~]$ grep active 03303118/0030-sosreport-sindceapocpd51-03303118-2022-09-01-etxiaok.tar.xz/sosreport-sindceapocpd51-03303118-2022-09-01-etxiaok/sos_commands/crio/systemctl_status_crio
   Active: active (running) since Fri 2022-08-05 05:57:54 UTC; 3 weeks 5 days ago
~~~

> I may be misreading this, but the type=SYSCALL entry suggests that it's a podman process.  podman shares CRI-O's startup behavior, so I'd expect it to also recreate the device node almost immediately after removing it.  That actually makes running `podman images` or a similar command a viable method of recreating the device node.

==> The KCS now provides a simple solution using `sudo podman info`.

> If CRI-O or podman or whichever process removes the device node isn't subsequently recreating the device node, then we likely have our answer.
==> The file is not automatically recreated. We have asked a few customers to enable an audit rule to capture the process removing the file (an example follows below).
Awaiting feedback from the customers.
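
A hedged example of such an audit rule (standard auditd syntax; the key name is arbitrary):

~~~
# Watch the device node for writes and attribute changes, tagged with a key:
auditctl -w /var/lib/containers/storage/overlay/backingFsBlockDev -p wa -k backingfsblockdev

# Once the file disappears again, look up which process touched it:
ausearch -k backingfsblockdev -i
~~~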

Would a permanent solution be to have the code recreate the file when it is flagged as missing, after warning about its absence?

Comment 25 errata-xmlrpc 2023-02-20 18:30:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.4 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0769

Comment 27 Red Hat Bugzilla 2023-09-18 04:37:59 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

