Bug 2091214
Summary: | Pods are unable to start due to error "Failed to set quota limit for projid xxxxx on /var/lib/containers/storage/overlay/backingFsBlockDev: no such file or directory" | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Dhruv Gautam <dgautam> |
Component: | Containers | Assignee: | Peter Hunt <pehunt> |
Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | | |
Version: | 4.9 | CC: | dseals, dwalsh, harpatil, jnovy, mfuruta, nalin, pmagotra, rubrodri, sbelmasg, schoudha, tsweeney, vlours |
Target Milestone: | --- | | |
Target Release: | 4.12.z | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | cri-o-1.25.2-7.rhaos4.12.git1a6bb9c.el9 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2023-02-20 18:30:53 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Dhruv Gautam
2022-05-27 21:11:04 UTC
Hi, I'm also facing this issue with a customer. Here is the workaround we use to make the node work again without rebooting it or restarting CRI-O:

1. Use ssh to connect to the node having the issue, and run:

> sudo podman info

This recreates the `backingFsBlockDev` file and allows pods to be started on the node again.

2. Automate the remediation using a daemonset which watches the logs and executes `sudo podman info` when necessary (see the manifest sketch appended below). The daemonset must run with the `privileged` SCC and mount the host filesystem. The command that the daemonset executes:

~~~
chroot /host /bin/sh -ec '
  journalctl -u crio.service -f \
    | sed -u -nr "s#.*error creating pod sandbox with name (\"[a-zA-Z0-9_-]*\").*Failed to set quota limit for.*no such file or directory#\1#p" \
    | while read -r line; do
        echo "$(date) - Error creating the following sandbox: $line"
        sudo podman info 2>&1 1>/dev/null
      done
'
~~~

I have created the Jira ticket https://issues.redhat.com/browse/OCPBUGS-788 to track this issue in the new ticketing system.

In addition, I've completed a review of the different cases attached to the KCS:

- most of them are related to RHOCP 4.8, but some to 4.9.
- most of them are related to vSphere, some to unknown platforms, and one to AWS.

Further, in order to reply to the questions from comment #7:

> If the device node is being removed sometime after CRI-O is started, though, then it's rather concerning.

From the last and current case (https://access.redhat.com/support/cases/03303118), two nodes have been impacted. CRI-O had been running for months or weeks:

~~~
[vlours@supportshell-1 ~]$ grep active 03303118/0040-sosreport-sindceapocpd55-03303118-2022-09-01-hjeugbu.tar.xz/sosreport-sindceapocpd55-03303118-2022-09-01-hjeugbu/sos_commands/crio/systemctl_status_crio
Active: active (running) since Fri 2022-07-01 10:10:55 UTC; 2 months 0 days ago
~~~

~~~
[vlours@supportshell-1 ~]$ grep active 03303118/0030-sosreport-sindceapocpd51-03303118-2022-09-01-etxiaok.tar.xz/sosreport-sindceapocpd51-03303118-2022-09-01-etxiaok/sos_commands/crio/systemctl_status_crio
Active: active (running) since Fri 2022-08-05 05:57:54 UTC; 3 weeks 5 days ago
~~~

> I may be misreading this, but the type=SYSCALL entry suggests that it's a podman process. podman shares CRI-O's startup behavior, so I'd expect it to also recreate the device node almost immediately after removing it. That actually makes running `podman images` or a similar command a viable method of recreating the device node.

The KCS now provides a simple solution using `sudo podman info`.

> If CRI-O or podman or whichever process removes the device node isn't subsequently recreating the device node, then we likely have our answer.

The file is not automatically recreated. We have asked a few customers to enable the audit rule to capture the process removing the file (a sample rule is sketched below), and we are awaiting feedback from the customers.

Would a permanent solution be to have the code recreate the file when it is flagged as missing, after warning about its absence?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.4 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0769

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
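As a one-off check mirroring the workaround in step 1 above, something along these lines could be run as root on an affected node to confirm the device node is missing and have podman recreate it (a minimal sketch, assuming the default graph root path):

~~~
#!/bin/sh
# Minimal sketch: recreate a missing backingFsBlockDev and verify the result.
DEV=/var/lib/containers/storage/overlay/backingFsBlockDev

if [ ! -e "$DEV" ]; then
    echo "$DEV is missing; re-initializing containers/storage via podman"
    # Per the comments above, any podman command that initializes storage
    # (podman info, podman images, ...) recreates the device node.
    podman info > /dev/null
fi

# Confirm the device node now exists.
stat "$DEV"
~~~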
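For the daemonset described in step 2, a manifest could look roughly like the following. This is a sketch only: the namespace, object names, and image are illustrative, and the pod's service account must be allowed to use the `privileged` SCC.

~~~
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: backingfsblockdev-watcher      # illustrative name
  namespace: backingfsblockdev-fix     # illustrative namespace
spec:
  selector:
    matchLabels:
      app: backingfsblockdev-watcher
  template:
    metadata:
      labels:
        app: backingfsblockdev-watcher
    spec:
      containers:
      - name: watcher
        image: registry.redhat.io/rhel8/support-tools  # any image with /bin/sh and chroot
        securityContext:
          privileged: true             # requires the privileged SCC
        command:
        - chroot
        - /host
        - /bin/sh
        - -ec
        - |
          journalctl -u crio.service -f \
            | sed -u -nr 's#.*error creating pod sandbox with name ("[a-zA-Z0-9_-]*").*Failed to set quota limit for.*no such file or directory#\1#p' \
            | while read -r line; do
                echo "$(date) - Error creating the following sandbox: $line"
                podman info 2>&1 1>/dev/null
              done
        volumeMounts:
        - name: host
          mountPath: /host
      volumes:
      - name: host
        hostPath:
          path: /
~~~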
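And for the audit rule mentioned above, a watch on the device node along these lines should record the process that deletes it (a sketch; the key name is arbitrary):

~~~
# Watch the device node for writes, attribute changes, and deletion.
auditctl -w /var/lib/containers/storage/overlay/backingFsBlockDev -p wa -k backingfsblockdev

# After the file disappears again, identify the offending process:
ausearch -k backingfsblockdev -i
~~~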