Bug 1972209
Summary: | Under load, container failed to be created due to missing cgroup scope | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Damien Ciabrini <dciabrin> | |
Component: | runc | Assignee: | Jindrich Novy <jnovy> | |
Status: | CLOSED ERRATA | QA Contact: | Alex Jia <ajia> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 8.4 | CC: | bbaude, bdobreli, dornelas, dwalsh, ekuris, gfidente, jligon, jnovy, kir, leiwang, lfriedma, lmiccini, lsm5, mheon, michele, mpatel, pthomas, sewagner, snanda, tsweeney, umohnani, ypu | |
Target Milestone: | beta | Keywords: | ZStream | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | runc-1.0.0-72.rc92.el8_4 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1990406 2000570 2019335 2021325 | Environment: | ||
Last Closed: | 2021-11-09 17:38:22 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1982460, 1990406, 2000570, 2019335, 2021325 |
Description
Damien Ciabrini
2021-06-15 12:34:47 UTC
Any chance you can try with `crun` instead of `runc` on a fresh system? Podman isn't responsible for creating that cgroup, so I suspect this is a race somewhere in runc, and testing with crun will reveal that.

(In reply to Matthew Heon from comment #1)
> Any chance you can try with `crun` instead of `runc` on a fresh system?
> Podman isn't responsible for creating that cgroup, so I suspect this is a
> race somewhere in runc, and testing with crun will reveal that.

I will run a couple of tests with crun and report if I see any occurrence of it. Unfortunately each test takes about 2h30 to 3h, so it might take some time to report back.

Meanwhile, I couldn't spot in the source who is responsible for creating the cgroup, but the error message is reported by runc itself, so that would tend to validate your initial suspicion.

After some config changes on the node under test, all the containers have been recreated to use crun instead of runc. That equates to 47 containers on the host, among which 8 are re-created after each reboot. I did 100 reboots with this new setup, under the same load as originally reported, and I couldn't replicate my issue when podman targets the crun runtime.

Reassigning to runc, as comment #3 proves it is a runc race, as Matt mentions in comment #1.

Kir, can you take a look at this, please?

Dan or Mrunal, if someone else should take a look, please let me know.

FYI, this issue also affects Ceph: https://tracker.ceph.com/issues/49287 . This *might* also affect RHCS 5, but I haven't seen this race yet in downstream.

(In reply to Tom Sweeney from comment #5)
> Kir, can you take a look at this, please?
>
> Dan or Mrunal, if someone else should take a look, please let me know.

I'd appreciate it if you could provide an update, as it impacts both RHOSP 16.2 and Ceph (potentially RHCS 5.0), both of which are to be released soon.

I would guess we would ask you to test with the latest runc 1.0.1, which was recently released.
Of course maybe transitioning to crun is the best idea.

This is indeed a race in runc, which was fixed by https://github.com/opencontainers/runc/pull/2614, which is part of runc v1.0.0-rc93. So any recent runc should be fine (1.0.1 is recommended though). I can't find at the moment which runc is available via rhel8 container-tools, but I hope it's recent.

Ok, let's just say that this is fixed in runc 1.0.1.

Jindrich, I think this one is in your purview, please reroute if not. Setting to Post for any further BZ or packaging needs.

@jnovy It is too late to make any changes for 8.4.0.2. The final compose is already done. But you could make the change in 8.4.0.3 in 6 weeks.

Proposed this for zstream in bug 1990406 then. Thanks.

I can't hit this issue on runc-1.0.1-5.module+el8.5.0+12157+04f1d6be w/ podman-3.3.0-2.module+el8.5.0+12157+04f1d6be.

Seems that CentOS's container-tools 3.0 is also affected by this: https://pulpito.ceph.com/swagner-2021-08-20_11:35:16-rados:cephadm-wip-swagner2-testing-2021-08-18-1238-pacific-distro-basic-smithi/6349346/

Is there a plan to get it into container-tools 3.0 as well?

@jnovy do you know the answer to Sebastian's question: https://bugzilla.redhat.com/show_bug.cgi?id=1972209#c22? Is it possible to update 3.0, or has the window closed?

Sebastian, please file a separate bug for 3.0 stream if you believe a backport is required there too. Thanks.

Hey Jindrich, Tom and Sebastien, I just cloned this bz into https://bugzilla.redhat.com/show_bug.cgi?id=2000570, to track the backport of this fix for container-tools 3.0 in rhel 8.4, as it is what we're consuming in RHOSP 16.2. Thanks

Sorry for my ignorance here, but we're still seeing this bug multiple times a day in upstream Ceph using CentOS's container-tools:3.0. That's why I cloned this into 2019335.

*** Bug 2019335 has been marked as a duplicate of this bug. ***

Manual cloning will not work, this needs to follow the zstream cloning process.
Laurie, Derrick, can you please z+ this so I can update runc in 3.0-8.4.0?

relates to https://github.com/ceph/ceph/pull/43813

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: container-tools:rhel8 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4154

The patch mentioned in comment #9 in bug 1972209 is already applied in runc-1.0.0-72.rc92.el8_4, which was already released in 3.0-8.4.0 via https://access.redhat.com/errata/RHBA-2021:4093 - so no need for cloning/updates in 3.0-8.4.0.
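
The thread states that the race was fixed upstream in runc v1.0.0-rc93 and that 1.0.1 is recommended. As a rough aid for anyone triaging a host, the comparison can be sketched in Python as below. This is only a sketch under assumptions: the function names (`parse_upstream_version`, `has_upstream_fix`) are hypothetical, and it accepts only plain upstream version strings such as `1.0.0-rc92` or `1.0.1`. It deliberately rejects RPM package strings, because (as noted above) runc-1.0.0-72.rc92.el8_4 carries the patch as a backport even though its upstream base is rc92.

```python
import re

# Assumption: upstream-style versions only, e.g. "1.0.0-rc93" or "1.0.1".
# RPM strings such as "1.0.0-72.rc92.el8_4" are rejected on purpose,
# because downstream builds may carry the fix as a backport.
_UPSTREAM_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:-rc(\d+))?")


def parse_upstream_version(text):
    """Turn an upstream runc version into a sortable tuple."""
    match = _UPSTREAM_RE.fullmatch(text.strip().lstrip("v"))
    if match is None:
        raise ValueError(f"not an upstream runc version: {text!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    # A final release sorts after every release candidate of the same version.
    rc = int(match.group(4)) if match.group(4) else float("inf")
    return (major, minor, patch, rc)


# First upstream release containing the fix, per the thread.
FIRST_FIXED_UPSTREAM = parse_upstream_version("1.0.0-rc93")


def has_upstream_fix(version):
    """Return True if this upstream version is at or after v1.0.0-rc93."""
    return parse_upstream_version(version) >= FIRST_FIXED_UPSTREAM
```

For example, `has_upstream_fix("1.0.0-rc92")` is false and `has_upstream_fix("1.0.1")` is true. For RHEL or CentOS packages, the advisories linked above remain the authoritative source, since the package version alone does not reveal backported patches.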