Bug 2173697
| Summary: | Fails to run containers with kernel-rt and cgroups v1 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Colin Walters <walters> |
| Component: | conmon | Assignee: | Jindrich Novy <jnovy> |
| Status: | CLOSED ERRATA | QA Contact: | Joy Pu <ypu> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | CentOS Stream | CC: | bbaude, bhu, blitton, bstinson, bzvonar, ccardeno, dgonyier, dphillip, dwalsh, jhopper, jlelli, jmario, jnovy, jwboyer, kcarcia, lsm5, mboddu, mcornea, mheon, mpatel, npache, pehunt, pthomas, pvlasin, rphillips, sdodson, tsweeney, umohnani, vschneid, ypu |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | conmon-2.1.7-1.el9_2 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2174178 2174381 (view as bug list) | Environment: | |
| Last Closed: | 2023-05-09 07:36:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2174178, 2174381, 2178712 | | |
| Deadline: | 2023-03-14 | | |
Description
Colin Walters
2023-02-27 17:40:46 UTC
My reproducer setup is a qemu instance:

[root@cosa-devsh ~]# uname -a
Linux cosa-devsh 5.14.0-283.rt14.283.el9.x86_64 #1 SMP PREEMPT_RT Fri Feb 24 12:53:58 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

A bit more info:

$ rpm -q podman systemd conmon
podman-4.4.0-1.el9.x86_64
systemd-252-6.el9.x86_64
conmon-2.1.6-1.el9.x86_64

Replying to https://issues.redhat.com/browse/OCPBUGS-7286?focusedCommentId=21786169&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21786169

> Further: I'm still quite certain it has something to do with the container's conmon process being killed. ECONNREFUSED on a Unix socket indicates that the other end of the socket is no longer listening. For the attach socket (used by Podman to communicate with Conmon, which holds the container's standard streams open) this other end is always Conmon. Given that there is a message in the logs about killing a Libpod cgroup, and Conmon lives in said Libpod cgroup, it seems to me that something is tearing down Conmon immediately after we start it, before Podman can connect to it, and that something is probably not targeting Conmon specifically, but the whole container cgroup.

Yes, but...ok, some digging in here shows that it's actually crun that's invoking StopUnit.

CC: ccardeno

[root@cosa-devsh ~]# cat /usr/local/bin/crun.trace
#!/bin/sh
exec crun --debug "$@"
[root@cosa-devsh ~]# podman run --runtime /usr/local/bin/crun.trace --log-level debug --rm --net=none -ti busybox sleep 5
...

Gives me this in the journal:

Feb 27 18:16:30 cosa-devsh conmon[34894]: conmon 63a50f9eb11148000b48 <error>: Failed to write to cgroup.event_control Operation not supported

Ah yes, there's the problem =)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memcontrol.c#n4861

static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
                                         char *buf, size_t nbytes, loff_t off)
{
...
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
                return -EOPNOTSUPP;

I'm no cgroups expert, but it looks to me like there's no way to get per-cgroup OOM notifications in this setup - CONFIG_PREEMPT_RT + cgroup v1? That seems somewhat problematic for the current OCP plan to use RHEL 9 + cgroups v1.

We want to end support for cgroups V1, what possible reason would we want to continue to support it on RHEL9 in OpenShift?

IIUC, cgroupv1 relies on setting up an eventfd and linking that to memory.oom_control via cgroup.event_control. cgroupv2 instead has inotify support, which allows tracking memory.events in a more traditional way. v1 of the patch mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2173697#c7 describes failed tests and lock order issues, however I don't have a clear picture of the specific issues with the cgroupv1 file. I expect that cgroupv1 requiring a write into a cgroup file to use the eventfd is an issue, whereas cgroupv2 allows transparently inotify'ing memory.events which is itself updated as OOM events happen. I've started looking into the cgroupv1 issues to get a better understanding, but given the cgroupv1 approach is deprecated upstream, I'm not confident about hacking a "fix" for it.

> what possible reason would we want to continue to support it on RHEL9 in OpenShift?

This came in with https://github.com/openshift/machine-config-operator/pull/3486 and really should have had more information there...

Anyways can someone who owns conmon please take ownership of queuing a build for C9S with the patch?

There's an IMO larger picture question here as to whether the plan to use RHEL9 with cgroups v1 for OCP 4.13 is really the right call. However, we will only be able to better formulate answers to that question when things like this bug are fixed, so we can better compare bug-for-bug across more of the OCP use cases which include a lot of kernel-rt.
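For readers following the cgroup v1 vs v2 discussion above, here is a rough sketch of the two OOM notification paths. It is illustrative only and is not conmon's code or the change in the linked PR: the function names are hypothetical, the v1 path assumes the usual "<event_fd> <fd of memory.oom_control>" registration string written to cgroup.event_control, and the v2 part only parses the text of memory.events (watching the file for changes is left out). It needs Python 3.10+ on Linux for os.eventfd.

```python
import errno
import os
from typing import Dict, Optional, Tuple


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse cgroup v2 memory.events content ("key value" per line)."""
    events: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def v1_registration_line(event_fd: int, oom_control_fd: int) -> bytes:
    """Build the string written to cgroup.event_control on cgroup v1."""
    return f"{event_fd} {oom_control_fd}".encode("ascii")


def register_v1_oom_eventfd(cgroup_dir: str) -> Optional[Tuple[int, int]]:
    """Try to register an eventfd for OOM notifications on cgroup v1.

    Returns (event_fd, oom_control_fd) on success. Returns None when the
    kernel refuses the write with EOPNOTSUPP, which is what the journal
    message in this bug shows on a PREEMPT_RT kernel.
    """
    oom_fd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    event_fd = -1
    try:
        event_fd = os.eventfd(0)
        ctl_fd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
        try:
            os.write(ctl_fd, v1_registration_line(event_fd, oom_fd))
        finally:
            os.close(ctl_fd)
    except OSError as exc:
        # Clean up whatever was opened before deciding how to report.
        if event_fd >= 0:
            os.close(event_fd)
        os.close(oom_fd)
        if exc.errno == errno.EOPNOTSUPP:
            return None  # no per-cgroup OOM notifications available
        raise
    return event_fd, oom_fd
```

A caller could treat a None result as "run without OOM notifications" rather than as a fatal error; whether that is acceptable is exactly the workaround-versus-fix question raised in this bug.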
For now I am fast tracking this into OCP 4.13/RHEL9 channel: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=51051298

@walters it might be first day back after PTO brain talking here, but I'm confused as to what the fix is, and secondarily, why we want to use cgroups v1 on 4.13/RHEL 9 as the default? The PR noted here in this BZ: https://github.com/openshift/machine-config-operator/pull/3486 looks to be setting cgroups to v1 if a version of cgroup is not specified. But that's not the fix for this issue is it? Rather, that change is what's causing the issue now I think? For a fix to this issue that you ran into, are you proposing rolling that PR back and/or a change to conmon? If a change to conmon is needed, do you have a suggested change for that? Thanks in advance for straightening my noggin out.

> Rather, [the MCO change] is what's causing the issue now I think?

Right. The change to conmon is linked in the "links" section, it's https://github.com/containers/conmon/pull/385 which is what I want to pull into RHEL 9.2. But, if in the end we don't care about this case for RHEL (kernel-rt + cgroups v1) then we could just carry forward with shipping a patched conmon in OCP 4.13.

> and secondarily, why we want to use cgroups v1 on 4.13/RHEL 9 as the default?

FTR I cannot answer this question, it's a question for the OCP node team. See https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-1448150138

Thanks, Colin, I missed that link earlier, the change makes sense to me now. As far as cgroup v1 in OCP 4.13 goes, sounds like it might be best to have a little chat sometime in the near future with the interested parties to sort that out.

(In reply to Colin Walters from comment #17)
> > and secondarily, why we want to use cgroups v1 on 4.13/RHEL 9 as the default?
>
> FTR I cannot answer this question, it's a question for the OCP node team.
> See
> https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-1448150138

@pehunt or @rphillips can either of you fill folks in on the reasoning for why we'd need cgroups v1?

> @pehunt or @rphillips can either of you fill folks in on the reasoning for why we'd need cgroups v1?
AFAIU the choice to use cgroupv1 by default even on RHEL 9 comes largely from cgroupv2 availability in OpenShift. In 4.12 it's only tech preview, and we wanted to give customers some time to prepare for the change before we default to it. It will also allow us to catch any potential remaining issues by moving in-house OpenShift deployments (like managed services) to cgroupv2 for a cycle before deploying to the entire fleet.
I could see an argument for keeping upgrades on cgroupv1 and having new clusters be cgroupv2, but it's so close to dev freeze I think we can keep it simple.
Ok, but my opinion is <sandbox> it is time to put cgroup V1 out of its misery. I think it is being deprecated/dropped from Fedora 38. It should definitely not be supported in RHEL10. We need to get OpenShift totally to CGroup V2 ASAP. Cgroup V2 is better and has been around for years. Cgroupv1 is basically broken. </sandbox>

yeah I personally would like to see cgroupv2 default for 4.14 though that may not represent the node team's position

Just to clarify, what's happening in this bug is that conmon is being made to work around the absence of oom kill events. This is a workaround, not a fix. We also need to pursue, in parallel, https://bugzilla.redhat.com/show_bug.cgi?id=2174178 which was copied from this bug.

My 2c:
- Let's retarget this bug for 9.3
- We keep the override build in 4.14/4.13

Alternatively, we could do the build for 9.2 since we have an E+ and drop our override build. I'm good with either.

There's already a 9.3 clone of this bug and it seems like the status quo is that we've got rhaos4.x builds of conmon in all releases. I'd propose we let @jnovy decide as it seems he's the primary maintainer.

Based on Dan's comments we'd like to keep RHEL cgroups v2 only. I see Colin already committed the v1 cherrypick for 4.13. Do I understand correctly that there is nothing to do on the conmon side for RHEL?

Conmon 2.1.7 includes the fixes. I'd suggest RHEL in 9.3 and 9.2 updates to conmon 2.1.7, however I guess it doesn't matter too much because OCP builds and ships conmon on their own and has already included the fix in relevant versions.

Ok, I just updated conmon to 2.1.7 for 9.2.

Seems there is a newly introduced memleak in 2.1.7, can you PTAL Peter? https://cov01.lab.eng.brq2.redhat.com/covscanhub/task/277715//log/added.html

https://github.com/containers/conmon/pull/387/commits/54a0c9c11db7f2b1b8401822306be4e6b658e082 just merged earlier today.
yeah it can be safely ignored and will be fixed in 2.1.8

Can reproduce with cgroupv1, kernel-rt-5.14.0-283.rt14.283.el9.x86_64 and conmon-2.1.6-1.el9.x86_64:

# podman run --net=none --rm -ti busybox echo hello world
Resolved "busybox" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/busybox:latest...
Getting image source signatures
Copying blob 4b35f584bb4f done
Copying config 7cfbbec896 done
Writing manifest to image destination
Storing signatures
Error: failed to connect to container's attach socket: /var/lib/containers/storage/overlay-containers/c23f908269dee65cab49d5b4d6dd9b489420bbed9754b9def964a2d54f13dafc/userdata/attach: dial unixpacket /proc/self/fd/3: connect: connection refused

And test with conmon-2:2.1.7-1.el9_2.x86_64, the problem is fixed:

# podman run --net=none --rm -ti busybox echo hello world
hello world

So add the Tested flags.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: conmon security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2222