Bug 2173697

Summary:	Fails to run containers with kernel-rt and cgroups v1
Product:	Red Hat Enterprise Linux 9	Reporter:	Colin Walters <walters>
Component:	conmon	Assignee:	Jindrich Novy <jnovy>
Status:	CLOSED ERRATA	QA Contact:	Joy Pu <ypu>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	CentOS Stream	CC:	bbaude, bhu, blitton, bstinson, bzvonar, ccardeno, dgonyier, dphillip, dwalsh, jhopper, jlelli, jmario, jnovy, jwboyer, kcarcia, lsm5, mboddu, mcornea, mheon, mpatel, npache, pehunt, pthomas, pvlasin, rphillips, sdodson, tsweeney, umohnani, vschneid, ypu
Target Milestone:	rc	Keywords:	Triaged
Target Release:	---	Flags:	pm-rhel: mirror+
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	conmon-2.1.7-1.el9_2	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2174178 2174381		Environment:
Last Closed:	2023-05-09 07:36:19 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2174178, 2174381, 2178712
Deadline:	2023-03-14

Description Colin Walters 2023-02-27 17:40:46 UTC

It's not clear to me whether this is definitely a podman bug or a kernel-rt bug (or whether systemd is involved, etc.).

Moving this from https://issues.redhat.com/browse/OCPBUGS-7286

I started from a RHEL CoreOS 9.2 system, but I believe this will reproduce elsewhere too, basically:

- switch to kernel-rt
- inject systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller kernel arguments
- reboot

Then:

```
[root@cosa-devsh ~]# podman run --net=none --rm -ti busybox echo hello world
WARN[0000] Error loading CNI config file /etc/cni/net.d/200-loopback.conflist: error parsing configuration list: no name 
Error: failed to connect to container's attach socket: /var/lib/containers/storage/overlay-containers/7ba26ff20b5d26f21c6e969b705886266dae843bc66996bb803e56c28945cf02/userdata/attach: dial unixpacket /proc/self/fd/7: connect: connection refused
```

Comment 1 Colin Walters 2023-02-27 17:54:30 UTC

My reproducer setup is a qemu instance:

[root@cosa-devsh ~]# uname -a
Linux cosa-devsh 5.14.0-283.rt14.283.el9.x86_64 #1 SMP PREEMPT_RT Fri Feb 24 12:53:58 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
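For anyone trying to confirm their host matches this setup, the two preconditions (a PREEMPT_RT kernel and the cgroup v1 kernel argument) can be checked from the kernel version string and the kernel command line. This is only a minimal Python sketch. The function names are illustrative and not part of podman or conmon, and it recognizes only the exact `systemd.unified_cgroup_hierarchy=0` spelling used in the description.

```python
import os


def is_preempt_rt(version_string: str) -> bool:
    """True if a uname-style version string advertises PREEMPT_RT."""
    return "PREEMPT_RT" in version_string.split()


def forces_cgroup_v1(cmdline: str) -> bool:
    """True if the kernel command line carries the cgroup v1 switch."""
    return "systemd.unified_cgroup_hierarchy=0" in cmdline.split()


def host_matches_reproducer() -> bool:
    """Check the running host (Linux only; reads /proc/cmdline)."""
    with open("/proc/cmdline") as f:
        cmdline = f.read()
    return is_preempt_rt(os.uname().version) and forces_cgroup_v1(cmdline)
```

Feeding the `uname -a` line above to `is_preempt_rt` should report a match, since the string contains the `PREEMPT_RT` token.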

Comment 2 Colin Walters 2023-02-27 18:08:34 UTC

A bit more info:

$ rpm -q podman systemd conmon
podman-4.4.0-1.el9.x86_64
systemd-252-6.el9.x86_64
conmon-2.1.6-1.el9.x86_64


Replying to https://issues.redhat.com/browse/OCPBUGS-7286?focusedCommentId=21786169&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21786169

> Further: I'm still quite certain it has something to do with the container's conmon process being killed. ECONNREFUSED on a Unix socket indicates that the other end of the socket is no longer listening. For the attach socket (used by Podman to communicate with Conmon, which holds the container's standard streams open) this other end is always Conmon. Given that there is a message in the logs about killing a Libpod cgroup, and Conmon lives in said Libpod cgroup, it seems to me that something is tearing down Conmon immediately after we start it, before Podman can connect to it, and that something is probably not targeting Conmon specifically, but the whole container cgroup.

Yes, but...ok, some digging in here shows that it's actually crun that's invoking StopUnit.
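The ECONNREFUSED behaviour described in the quoted comment can be reproduced in isolation. The sketch below (Python, Linux only) binds a SOCK_SEQPACKET Unix socket, which is what Go calls `unixpacket`, closes the listener while leaving the socket file behind, and then tries to connect. The path name `attach` and the function name are made up for illustration; this is not how podman or conmon are implemented.

```python
import errno
import os
import socket
import tempfile


def connect_after_listener_exit() -> int:
    """Return the errno seen when connecting to a socket nobody listens on."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "attach")  # hypothetical attach socket path
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET)
        listener.bind(path)
        listener.listen(1)
        listener.close()  # the "conmon" side goes away; the file remains

        client = socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET)
        try:
            client.connect(path)
        except OSError as exc:
            return exc.errno
        finally:
            client.close()
    return 0
```

On Linux this is expected to return `errno.ECONNREFUSED`, matching the error podman reports when the process behind the attach socket has been killed.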

Comment 3 Carlos Cardeñosa 2023-02-27 18:14:19 UTC

CC: ccardeno

Comment 4 Colin Walters 2023-02-27 18:17:53 UTC

[root@cosa-devsh ~]# cat /usr/local/bin/crun.trace 
#!/bin/sh
exec crun --debug "$@"
[root@cosa-devsh ~]# podman run --runtime /usr/local/bin/crun.trace --log-level debug --rm --net=none -ti busybox sleep 5
...

Gives me this in the journal:

Feb 27 18:16:30 cosa-devsh conmon[34894]: conmon 63a50f9eb11148000b48 <error>: Failed to write to cgroup.event_control Operation not supported

Comment 5 Colin Walters 2023-02-27 18:23:39 UTC

Ah yes, there's the problem =)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memcontrol.c#n4861

static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
					 char *buf, size_t nbytes, loff_t off)
{
...
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		return -EOPNOTSUPP;

Comment 6 Colin Walters 2023-02-27 18:56:06 UTC

I'm no cgroups expert, but it looks to me like there's no way to get per-cgroup OOM notifications in this setup - CONFIG_PREEMPT_RT + cgroup v1?

That seems somewhat problematic for the current OCP plan to use RHEL 9 + cgroups v1.

Comment 7 Colin Walters 2023-02-27 19:01:46 UTC

This code came in https://github.com/torvalds/linux/commit/2343e88d238f5de973d609d861c505890f94f22e

Comment 8 Daniel Walsh 2023-02-28 02:31:39 UTC

We want to end support for cgroups v1. What possible reason would we have to continue supporting it on RHEL 9 in OpenShift?

Comment 9 Valentin Schneider 2023-02-28 12:52:06 UTC

IIUC, cgroupv1 relies on setting up an eventfd and linking that to
memory.oom_control via memory.event_control.
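As a rough illustration of that v1 registration, here is a Python sketch (it needs Python 3.10+ for `os.eventfd`). The kernel's cgroup v1 memory documentation describes writing `<event_fd> <fd of memory.oom_control>` to `cgroup.event_control`. The function names are hypothetical and this is not conmon's code. On a PREEMPT_RT kernel the write is refused with EOPNOTSUPP, as the kernel snippet in comment 5 shows, and the sketch treats that as "no OOM notifications available" rather than as a fatal error.

```python
import errno
import os
from typing import Optional


def event_control_line(event_fd: int, target_fd: int) -> str:
    """Format the registration string expected by cgroup.event_control."""
    return f"{event_fd} {target_fd}"


def register_oom_eventfd(cgroup_dir: str) -> Optional[int]:
    """Register an eventfd for OOM events; None if the kernel refuses."""
    oom_fd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    event_fd = os.eventfd(0)
    try:
        ctl_fd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
        try:
            os.write(ctl_fd, event_control_line(event_fd, oom_fd).encode())
        finally:
            os.close(ctl_fd)
    except OSError as exc:
        os.close(event_fd)
        if exc.errno == errno.EOPNOTSUPP:
            return None  # e.g. CONFIG_PREEMPT_RT: carry on without OOM events
        raise
    finally:
        os.close(oom_fd)
    return event_fd
```

A caller would then block on reading `event_fd` and treat each wakeup as an OOM event in that cgroup.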

cgroupv2 instead has inotify support, which allows tracking
memory.events in a more traditional way.

v1 of the patch mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2173697#c7
describes failed tests and lock order issues, however I don't
have a clear picture of the specific issues with the cgroupv1 file.

I expect that cgroupv1 requiring a write into a cgroup file to use
the eventfd is the issue, whereas cgroupv2 allows transparently
inotify'ing memory.events, which is itself updated as OOM events
happen.
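To make the cgroupv2 side concrete: memory.events is a flat file of `key value` lines, so a watcher that is woken by a file-change notification (or that simply polls) only has to re-read it and compare the `oom_kill` counter. A minimal Python sketch of that comparison follows. The helper names and the sample counter values are invented for illustration, and the notification wiring itself is left out.

```python
from typing import Dict


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse the flat 'key value' lines of a cgroup v2 memory.events file."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            counters[key] = int(value)
    return counters


def new_oom_kills(previous: Dict[str, int], current: Dict[str, int]) -> int:
    """Number of OOM kills that happened between two snapshots."""
    return current.get("oom_kill", 0) - previous.get("oom_kill", 0)


# Made-up snapshots taken before and after a container was OOM-killed.
before = parse_memory_events("low 0\nhigh 0\nmax 3\noom 0\noom_kill 0\n")
after = parse_memory_events("low 0\nhigh 0\nmax 9\noom 1\noom_kill 1\n")
print(new_oom_kills(before, after))
```

Nothing has to be written into the cgroup to subscribe, which is the difference from the v1 `cgroup.event_control` path.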

I've started looking into the cgroupv1 issues to get a better
understanding, but given the cgroupv1 approach is deprecated
upstream, I'm not confident about hacking a "fix" for it.

Comment 10 Colin Walters 2023-02-28 13:07:40 UTC

> What possible reason would we have to continue supporting it on RHEL 9 in OpenShift?

This came in with https://github.com/openshift/machine-config-operator/pull/3486 and really should have had more information there...

Anyways can someone who owns conmon please take ownership of queuing a build for C9S with the patch?

Comment 13 Colin Walters 2023-02-28 20:06:34 UTC

There's an IMO larger picture question here as to whether the plan to use RHEL9 with cgroups v1 for OCP 4.13 is really the right call.  However, we will only be able to better formulate answers to that question when things like this bug are fixed, so we can better compare bug-for-bug across more of the OCP use cases which include a lot of kernel-rt.

Comment 14 Colin Walters 2023-02-28 20:32:47 UTC

For now I am fast tracking this into OCP 4.13/RHEL9 channel: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=51051298

Comment 15 Tom Sweeney 2023-02-28 21:01:11 UTC

@walters it might be first-day-back-after-PTO brain talking here, but I'm confused as to what the fix is, and secondarily, why we want to use cgroups v1 on 4.13/RHEL 9 as the default?

The PR noted here in this BZ, https://github.com/openshift/machine-config-operator/pull/3486, looks to be setting cgroups to v1 if a cgroup version is not specified.  But that's not the fix for this issue, is it?  Rather, that change is what's causing the issue now, I think?

For a fix to this issue that you ran into, are you proposing rolling that PR back and/or a change to conmon?  If a change to conmon is needed, do you have a suggested change for that?

Thanks in advance for straightening my noggin out.

Comment 16 Colin Walters 2023-02-28 21:33:47 UTC

> Rather, [the MCO change] is what's causing the issue now I think?

Right.

The change to conmon is linked in the "links" section, it's https://github.com/containers/conmon/pull/385
which is what I want to pull into RHEL 9.2.

But, if in the end we don't care about this case for RHEL (kernel-rt + cgroups v1) then we could just carry forward with shipping a patched conmon in OCP 4.13.

Comment 17 Colin Walters 2023-03-01 00:59:02 UTC

> and secondarily, why we want to use cgroups v1 on 4.13/RHEL 9 as the default?

FTR I cannot answer this question, it's a question for the OCP node team.  See https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-1448150138

Comment 18 Tom Sweeney 2023-03-01 01:13:11 UTC

Thanks, Colin, I missed that link earlier; the change makes sense to me now.  As far as cgroup v1 in OCP 4.13 goes, it sounds like it might be best to have a little chat sometime in the near future with the interested parties to sort that out.

Comment 19 Scott Dodson 2023-03-01 02:15:21 UTC

(In reply to Colin Walters from comment #17)
> > and secondarily, why we want to use cgroups v1 on 4.13/RHEL 9 as the default?
> 
> FTR I cannot answer this question, it's a question for the OCP node team. See
> https://github.com/openshift/machine-config-operator/pull/3486#issuecomment-1448150138

@pehunt or @rphillips can either of you fill folks in on the reasoning for why we'd need cgroups v1?

Comment 22 Peter Hunt 2023-03-01 14:51:50 UTC

> @pehunt or @rphillips can either of you fill folks in on the reasoning for why we'd need cgroups v1?

AFAIU the choice to use cgroupv1 by default even on RHEL 9 comes largely from cgroupv2 availability in OpenShift. In 4.12 it's only tech preview, and we wanted to give customers some time to prepare for the change before we default to it. It will also allow us to catch any potential remaining issues by moving in-house OpenShift deployments (like managed services) to cgroupv2 for a cycle before deploying to the entire fleet.

I could see an argument for keeping upgraded clusters on cgroupv1 and having new clusters use cgroupv2, but it's so close to dev freeze that I think we can keep it simple.

Comment 23 Daniel Walsh 2023-03-01 16:26:57 UTC

Ok, but my opinion is 
<sandbox>
it is time to put cgroup V1 out of its misery.  I think it is being deprecated or dropped in Fedora 38. It should definitely not be supported in RHEL 10.

We need to get OpenShift totally to CGroup V2 ASAP. 

Cgroup V2 is better and has been around for years.  Cgroupv1 is basically broken.
</sandbox>

Comment 24 Peter Hunt 2023-03-01 16:57:06 UTC

yeah I personally would like to see cgroupv2 default for 4.14 though that may not represent the node team's position

Comment 25 Scott Dodson 2023-03-02 16:29:52 UTC

Just to clarify: what's happening in this bug is that conmon is being made to work around the absence of OOM kill events. This is a workaround, not a fix.

We also need to pursue, in parallel, https://bugzilla.redhat.com/show_bug.cgi?id=2174178 which was copied from this bug.

Comment 29 Colin Walters 2023-03-09 20:47:20 UTC

My 2c:

- Let's retarget this bug for 9.3
- We keep the override build in 4.14/4.13

Alternatively, we could do the build for 9.2 since we have an E+ and drop our override build.  I'm good with either.

Comment 30 Scott Dodson 2023-03-09 21:08:03 UTC

There's already a 9.3 clone of this bug and it seems like the status quo is that we've got rhaos4.x builds of conmon in all releases.

I'd propose we let @jnovy decide as it seems he's the primary maintainer.

Comment 31 Jindrich Novy 2023-03-21 11:40:04 UTC

Based on Dan's comments we'd like to keep RHEL cgroups v2 only. I see Colin already committed the v1 cherry-pick for 4.13. Do I understand correctly that there is nothing to do on the conmon side for RHEL?

Comment 32 Scott Dodson 2023-03-21 12:45:15 UTC

Conmon 2.1.7 includes the fixes. I'd suggest that RHEL 9.3 and 9.2 update to conmon 2.1.7; however, I guess it doesn't matter too much because OCP builds and ships conmon on its own and has already included the fix in the relevant versions.

Comment 33 Jindrich Novy 2023-03-21 14:58:30 UTC

Ok, I just updated conmon to 2.1.7 for 9.2.

Comment 34 Jindrich Novy 2023-03-21 15:43:26 UTC

Seems there is a newly introduced memleak in 2.1.7, can you PTAL Peter?

https://cov01.lab.eng.brq2.redhat.com/covscanhub/task/277715//log/added.html

Comment 35 Colin Walters 2023-03-21 17:09:54 UTC

https://github.com/containers/conmon/pull/387/commits/54a0c9c11db7f2b1b8401822306be4e6b658e082
just merged earlier today.

Comment 36 Peter Hunt 2023-03-21 17:29:41 UTC

yeah it can be safely ignored and will be fixed in 2.1.8

Comment 38 Joy Pu 2023-03-23 08:14:46 UTC

Can reproduce with cgroupv1, kernel-rt-5.14.0-283.rt14.283.el9.x86_64 and conmon-2.1.6-1.el9.x86_64:
# podman run --net=none --rm -ti busybox echo hello world
Resolved "busybox" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/busybox:latest...
Getting image source signatures
Copying blob 4b35f584bb4f done  
Copying config 7cfbbec896 done  
Writing manifest to image destination
Storing signatures
Error: failed to connect to container's attach socket: /var/lib/containers/storage/overlay-containers/c23f908269dee65cab49d5b4d6dd9b489420bbed9754b9def964a2d54f13dafc/userdata/attach: dial unixpacket /proc/self/fd/3: connect: connection refused

Testing with conmon-2:2.1.7-1.el9_2.x86_64, the problem is fixed:
# podman run --net=none --rm -ti busybox echo hello world
hello world


So I am adding the Tested flag.
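For scripting the same check across hosts, the package names quoted in this bug can be compared against 2.1.7, the version listed in "Fixed In Version". This is a Python sketch with hypothetical helper names. It handles the optional epoch prefix (`2:`) seen above, and it is only a heuristic: the OCP override builds mentioned earlier in this bug may carry the patch under a different version.

```python
import re
from typing import Tuple

FIXED_VERSION = (2, 1, 7)  # from "Fixed In Version: conmon-2.1.7-1.el9_2"


def conmon_version(package: str) -> Tuple[int, ...]:
    """Extract the upstream version from a conmon package name."""
    match = re.match(r"conmon-(?:\d+:)?(\d+(?:\.\d+)*)-", package)
    if match is None:
        raise ValueError(f"not a conmon package name: {package!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def has_fixed_conmon(package: str) -> bool:
    """True if the package version is at least the fixed version."""
    return conmon_version(package) >= FIXED_VERSION


for name in ("conmon-2.1.6-1.el9.x86_64", "conmon-2:2.1.7-1.el9_2.x86_64"):
    print(name, has_fixed_conmon(name))
```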

Comment 46 errata-xmlrpc 2023-05-09 07:36:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: conmon security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2222