Description of problem:
Pods are unable to start due to the error "Failed to set quota limit for projid xxxxx on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory".

The following event is streamed:

~~~
14m Normal OperatorStatusChanged deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "StaticPodsDegraded: pod/openshift-kube-scheduler-nonprod2-openshift-cr5cj-master-0 container \"kube-scheduler\" is waiting: CreateContainerError: error creating read-write layer with ID \"ad27e79312dd780a6b2a266996000490596b5a2df8f8de152de0118f9e2c6d26\": Failed to set quota limit for projid 28032 on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory" to "StaticPodsDegraded: pod/openshift-kube-scheduler-nonprod2-openshift-cr5cj-master-0 container \"kube-scheduler\" is waiting: CreateContainerError: error creating read-write layer with ID \"45a7ead5e0e626f65591b437c062855c7bfe25efa00abc3dd409f9f14cc2f876\": Failed to set quota limit for projid 28035 on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory"
~~~

Version-Release number of selected component (if applicable):
4.8+

How reproducible:
NA

Steps to Reproduce:
1.
2.
3.

Actual results:
Some pods are crashing.

Expected results:
The pods should start without any issues.

Additional info:
- Cleaning up CRI-O storage or rebooting the node is known to work around the issue for now.
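For quick triage on an affected node, the following check (a minimal sketch, using the path from the error message above) confirms whether the device node is actually missing before any workaround is applied:

~~~
# Minimal diagnostic sketch: verify whether the backingFsBlockDev device node exists.
# The path is taken from the error message above; run this on the affected node.
stat /var/lib/containers/storage/overlay/backingFsBlockDev \
  || echo "backingFsBlockDev is missing - new containers will fail with the quota error"
~~~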
Hi, I'm also facing this issue with a customer. Here is the workaround we use to make the node work again without rebooting it or restarting CRI-O:

1. Use ssh to connect to the node having the issue, and run:

> sudo podman info

This recreates the `backingFsBlockDev` file and allows pods to be started on the node again.

2. Automate the remediation using a daemonset that watches the logs and executes the `sudo podman info` command when necessary (see the manifest sketch below). The daemonset must run with the `privileged` SCC and mount the host filesystem. The command that the daemonset executes (when run from an interactive shell instead, the script after `-ec` needs to be quoted):

> chroot /host /bin/sh -ec journalctl -u crio.service -f | sed -u -nr 's#.*error creating pod sandbox with name ("[a-zA-Z0-9_-]*").*Failed to set quota limit for file or directory#\1#p' | while read -r line ; do echo "$(date) - Error creating the following sandbox : $line" ; sudo podman info 2>&1 1>/dev/null ; done
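For reference, here is a rough sketch of how such a daemonset could be deployed. All object names, the namespace, the service account and the image are placeholders I chose for illustration (not taken from this bug); the service account is assumed to already be allowed to use the `privileged` SCC, and the log-matching pattern is adapted from the command above and may need tuning:

~~~
# Hypothetical manifest: names, namespace, service account and image are placeholders.
oc apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: backingfsblockdev-remediation
  namespace: quota-remediation            # placeholder namespace
spec:
  selector:
    matchLabels:
      app: backingfsblockdev-remediation
  template:
    metadata:
      labels:
        app: backingfsblockdev-remediation
    spec:
      serviceAccountName: remediation-sa  # placeholder; must be granted the privileged SCC
      containers:
      - name: watcher
        image: registry.redhat.io/ubi8/ubi  # placeholder; any image providing sh/chroot
        securityContext:
          privileged: true
        # Equivalent of: chroot /host /bin/sh -ec '<script>'
        command: ["chroot", "/host", "/bin/sh", "-ec"]
        args:
        - |
          journalctl -u crio.service -f \
            | sed -u -nr 's#.*error creating pod sandbox with name ("[a-zA-Z0-9_-]*").*Failed to set quota limit.*#\1#p' \
            | while read -r line ; do
                echo "$(date) - Error creating the following sandbox : $line"
                podman info > /dev/null 2>&1   # recreates backingFsBlockDev
              done
        volumeMounts:
        - name: host
          mountPath: /host
      volumes:
      - name: host
        hostPath:
          path: /
EOF
~~~

Chrooting into the host mount lets the container reuse the host's own journalctl and podman binaries instead of shipping its own, which keeps the remediation identical to running `sudo podman info` directly on the node.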
I have created the Jira ticket https://issues.redhat.com/browse/OCPBUGS-788 to track this issue in the new ticketing system.

In addition, I've completed a review of the different cases attached to the KCS:
- most of them are related to RHOCP 4.8, but some to 4.9.
- most of them are related to vSphere, some are unknown, and 1 is AWS.

Further, to reply to the questions from comment #7:

> If the device node is being removed sometime after CRI-O is started, though, then it's rather concerning.

==> In the latest and current case (https://access.redhat.com/support/cases/03303118), 2 nodes have been impacted. CRI-O had been running for weeks or months:

~~~
[vlours@supportshell-1 ~]$ grep active 03303118/0040-sosreport-sindceapocpd55-03303118-2022-09-01-hjeugbu.tar.xz/sosreport-sindceapocpd55-03303118-2022-09-01-hjeugbu/sos_commands/crio/systemctl_status_crio
Active: active (running) since Fri 2022-07-01 10:10:55 UTC; 2 months 0 days ago
~~~

~~~
[vlours@supportshell-1 ~]$ grep active 03303118/0030-sosreport-sindceapocpd51-03303118-2022-09-01-etxiaok.tar.xz/sosreport-sindceapocpd51-03303118-2022-09-01-etxiaok/sos_commands/crio/systemctl_status_crio
Active: active (running) since Fri 2022-08-05 05:57:54 UTC; 3 weeks 5 days ago
~~~

> I may be misreading this, but the type=SYSCALL entry suggests that it's a podman process. podman shares CRI-O's startup behavior, so I'd expect it to also recreate the device node almost immediately after removing it. That actually makes running `podman images` or a similar command a viable method of recreating the device node.

==> The KCS now provides a simple solution using `sudo podman info`.

> If CRI-O or podman or whichever process removes the device node isn't subsequently recreating the device node, then we likely have our answer.

==> The file is not automatically recreated. We have asked a few customers to enable an audit rule to capture the process removing the file (see the sketch below). Awaiting feedback from the customers.

Would it be a permanent solution to have the code recreate the file when it is detected as missing, after warning about its absence?
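One possible audit rule for this is sketched below (a minimal example, not taken from the bug; the key name is arbitrary). It watches the device node for write and attribute-change access, which should record the process that removes the file:

~~~
# Minimal sketch: watch the device node so its removal shows up in the audit log.
auditctl -w /var/lib/containers/storage/overlay/backingFsBlockDev -p wa -k backingfsblockdev-del

# After the file goes missing again, identify the offending process:
ausearch -k backingfsblockdev-del -i
~~~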
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.4 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:0769
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.