Description

Running a CentOS 7 container with systemd fails to run on Fedora 31.

Steps to reproduce the issue:

podman run -ti centos7-with-systemd /sbin/init

Output shows the following:

Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.

This occurs when running as root and rootless.

Describe the results you received:

Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.

Describe the results you expected:

Expect systemd to run.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

podman version
Version:            1.6.1
RemoteAPI Version:  1
Go Version:         go1.13
OS/Arch:            linux/amd64

Output of podman info --debug:

podman info --debug
debug:
  compiler: gc
  git commit: ""
  go version: go1.13
  podman version: 1.6.1
host:
  BuildahVersion: 1.11.2
  CgroupVersion: v2
  Conmon:
    package: conmon-2.0.1-1.fc31.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.1, commit: 5e0eadedda9508810235ab878174dca1183f4013'
  Distribution:
    distribution: fedora
    version: "31"
  MemFree: 18496446464
  MemTotal: 67443789824
  OCIRuntime:
    package: crun-0.10.1-1.fc31.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.10.1
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 16
  eventlogger: journald
  hostname: kubhost
  kernel: 5.3.0-1.fc31.x86_64
  os: linux
  rootless: true
  slirp4netns:
    Executable: /usr/bin/slirp4netns
    Package: slirp4netns-0.4.0-20.1.dev.gitbbd6f25.fc31.x86_64
    Version: |-
      slirp4netns version 0.4.0-beta.3+dev
      commit: bbd6f25c70d5db2a1cd3bfb0416a8db99a75ed7e
  uptime: 186h 43m 44.97s (Approximately 7.75 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  ConfigFile: /home/greg/.config/containers/storage.conf
  ContainerStore:
    number: 13
  GraphDriverName: vfs
  GraphOptions: {}
  GraphRoot: /home/greg/.local/share/containers/storage
  GraphStatus: {}
  ImageStore:
    number: 21
  RunRoot: /run/user/1000
  VolumePath: /home/greg/.local/share/containers/storage/volumes

Additional environment details (AWS, VirtualBox, physical, etc.):

Running ZFS.
Systemd guys, is there anything that could be done to make systemd on Centos work on a cgroupv2 file system?
At least with Docker, when using a centos7/systemd container one needs to mount the /sys/fs/cgroup volume:

docker run -d -v /sys/fs/cgroup:/sys/fs/cgroup:ro centos7-with-systemd /sbin/init

Is it the same with podman?
No. Podman will automatically detect that a container was run with systemd as the entrypoint and add the volume (among other changes necessary to make systemd run well in a container).
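For anyone comparing with the Docker recipe above, here is a rough sketch of what systemd mode amounts to on a cgroups v1 host. This is an approximation for illustration only, not Podman's exact implementation; the image name is taken from the reporter's command and flag spellings may vary slightly between Podman versions.

# Approximation of systemd mode: tmpfs on /run and /tmp, a read-only cgroup
# mount, and a systemd-friendly stop signal. Podman applies the equivalent
# automatically when the entrypoint looks like systemd/init, so none of this
# is needed by hand.
$ podman run -d \
    --tmpfs /run --tmpfs /tmp \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    --stop-signal SIGRTMIN+3 \
    centos7-with-systemd /sbin/init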
I have the same problem on fc31. Is there anything that can be done? It is quite a blocker for me. Thank you.
Data point: F31 (upgraded from F30) cannot run a systemd-based F31 container (even though the latter is supposed to be cgroupv2-aware). Actually I get a different situation from the reporter, and it is the same with CentOS 7 / CentOS 8. I don't see any output from the container itself; instead, conmon stalls at

INFO[0000] Running conmon under slice machine.slice and unitName libpod-conmon-807c8d059ee7276a163930b6de94e90d456398003bc5c8311f001dfa3a1b0f07.scope

After a couple of minutes:

DEBU[0241] ExitCode msg: "container creation timeout: internal libpod error"
Error: container creation timeout: internal libpod error

Unlike the reporter, systemd inside the container doesn't even get a chance to complain.
It would be useful to be precise about exactly what versions of relevant packages are on both ends there (on the host system and in the container), as there's been quite a lot of change in this area late in F31. There is a known bug https://bugzilla.redhat.com/show_bug.cgi?id=1763868 .
Brand new container.

F31 Host:
podman-1.6.1-5.fc31.x86_64 / podman-1.6.2-2.fc31.x86_64
conmon-2.0.1-1.fc31.x86_64
crun-0.10.2-1.fc31.x86_64
systemd-243-3.gitef67743.fc31.x86_64

F31 Container (--entrypoint /bin/bash):
systemd-243-2.gitfab6f01.fc31.x86_64
What is the exact command you are using to run the F31 container with systemd in it?
podman --log-level DEBUG run --rm -it --entrypoint /sbin/init fedora:31

Reaches:

DEBU[0000] /usr/bin/conmon messages will be logged to syslog
DEBU[0000] running conmon: /usr/bin/conmon args="[--api-version 1 -s -c a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531 -u a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531/userdata -p /var/run/containers/storage/overlay-containers/a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531/userdata/pidfile -l k8s-file:/var/lib/containers/storage/overlay-containers/a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531/userdata/ctr.log --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket --log-level debug --syslog -t --conmon-pidfile /var/run/containers/storage/overlay-containers/a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531]"
INFO[0000] Running conmon under slice machine.slice and unitName libpod-conmon-a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531.scope

...stall for a few minutes...

DEBU[0240] Cleaning up container a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531
DEBU[0240] Tearing down network namespace at /var/run/netns/cni-a5ee8c98-8d56-9658-cc99-086c88757c71 for container a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531
INFO[0240] Got pod network &{Name:nostalgic_haslett Namespace:nostalgic_haslett ID:a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531 NetNS:/var/run/netns/cni-a5ee8c98-8d56-9658-cc99-086c88757c71 Networks:[] RuntimeConfig:map[podman:{IP: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}
INFO[0240] About to del CNI network podman (type=bridge)
DEBU[0240] unmounted container "a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531"
DEBU[0240] unable to remove container a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531 after failing to start and attach to it
DEBU[0240] ExitCode msg: "container creation timeout: internal libpod error"
DEBU[0240] [graphdriver] trying provided driver "overlay"
DEBU[0240] cached value indicated that overlay is supported
DEBU[0240] cached value indicated that metacopy is being used
DEBU[0240] backingFs=xfs, projectQuotaSupported=false, useNativeDiff=false, usingMetacopy=true
Error: container creation timeout: internal libpod error
Are you running this rootless or rootful? If rootless, does it work when running as root?
Works for me.

$ rpm -q podman
podman-1.6.2-2.fc31.x86_64
$ podman run --rm -it --entrypoint /sbin/init fedora:31
Trying to pull docker.io/library/fedora:31...
Getting image source signatures
Copying blob 619d35b2bf84 done
Copying config 98c519110e done
Writing manifest to image destination
Storing signatures
systemd v243-2.gitfab6f01.fc31 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.
Welcome to Fedora 31 (Container Image)!
Set hostname to <08a30fbf62e3>.
Initializing machine ID from random generator.
[ OK ] Started Dispatch Password…ts to Console Directory Watch.
[ OK ] Started Forward Password …uests to Wall Directory Watch.
[ OK ] Reached target Local File Systems.
[ OK ] Reached target Paths.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Slices.
[ OK ] Reached target Swap.
[ OK ] Listening on Process Core Dump Socket.
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
Starting Rebuild Dynamic Linker Cache...
Starting Journal Service...
Starting Create System Users...
[ OK ] Started Create System Users.
[ OK ] Started Rebuild Dynamic Linker Cache.
[ OK ] Started Journal Service.
Starting Flush Journal to Persistent Storage...
[ OK ] Started Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ OK ] Started Create Volatile Files and Directories.
Starting Rebuild Journal Catalog...
Starting Update UTMP about System Boot/Shutdown...
[ OK ] Started Update UTMP about System Boot/Shutdown.
[ OK ] Started Rebuild Journal Catalog.
Starting Update is Completed...
[ OK ] Started Update is Completed.
[ OK ] Reached target System Initialization.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Reached target Basic System.
Starting Permit User Sessions...
[ OK ] Started Permit User Sessions.
[ OK ] Reached target Multi-User System.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Started Update UTMP about System Runlevel Changes.
You say this system was upgraded from F30 - was this container preexisting, or created after the upgrade?
Completely new container, running as root, at the stall, strace'ing I see 120414 futex(0x5649d6ac0d30, FUTEX_WAIT_PRIVATE, 0, {tv_sec=60, tv_nsec=0} <unfinished ...> 120342 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) 120342 futex(0x56437ad58d30, FUTEX_WAKE_PRIVATE, 1) = 1 120309 <... futex resumed>) = 0 120342 futex(0xc0003152c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120309 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...> 120682 <... futex resumed>) = 0 120342 <... futex resumed>) = 1 120682 nanosleep({tv_sec=0, tv_nsec=3000}, <unfinished ...> 120342 futex(0x56437ad659e8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120309 <... nanosleep resumed>NULL) = 0 120308 <... futex resumed>) = 0 120342 <... futex resumed>) = 1 120309 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...> 120308 futex(0x56437ad659e8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120682 <... nanosleep resumed>NULL) = 0 120342 futex(0x56437ad6a920, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=999010290} <unfinished ...> 120682 futex(0xc0003152c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120309 <... nanosleep resumed>NULL) = 0 120309 futex(0x56437ad58d30, FUTEX_WAIT_PRIVATE, 0, {tv_sec=60, tv_nsec=0} <unfinished ...> 120422 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) 120422 futex(0x5649d6ac0d30, FUTEX_WAKE_PRIVATE, 1) = 1 120414 <... futex resumed>) = 0 120422 futex(0xc00006a848, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120414 sched_yield( <unfinished ...> 120422 <... futex resumed>) = 1 120415 <... futex resumed>) = 0 120414 <... sched_yield resumed>) = 0 120422 futex(0xc00006b9c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120415 futex(0xc00009a4c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120414 futex(0x5649d6ac0c30, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120423 <... futex resumed>) = 0 120422 <... futex resumed>) = 1 120419 <... futex resumed>) = 0 120415 <... futex resumed>) = 1 120414 <... futex resumed>) = 0 120423 futex(0xc00006b9c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120422 futex(0x5649d6ad29a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=998981396} <unfinished ...> 120419 futex(0xc00009a4c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120415 futex(0xc00006a848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
Also something interesting here... 120398 execve("/usr/bin/conmon", ["/usr/bin/conmon", "--api-version", "1", "-s", "-c", "cf865286f1ae24faa69fd5371ad757fa"..., "-u", "cf865286f1ae24faa69fd5371ad757fa"..., "-r", "/usr/bin/crun", "-b", "/var/lib/containers/storage/over"..., "-p", "/var/run/containers/storage/over"..., "-l", "k8s-file:/var/lib/containers/sto"..., "--exit-dir", "/var/run/libpod/exits", "--socket-dir-path", "/var/run/libpod/socket", "--log-level", "debug", "--syslog", "-t", "--conmon-pidfile", "/var/run/containers/storage/over"..., "--exit-command", "/usr/bin/podman", "--exit-command-arg", "--root", "--exit-command-arg", "/var/lib/containers/storage", ...], 0xc000106480 /* 7 vars */ <unfinished ...> 120318 <... clone resumed>) = 120398 120318 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 120318 close(17) = 0 120318 read(16, "", 8) = 0 120318 close(16) = 0 120318 epoll_ctl(4, EPOLL_CTL_DEL, 15, 0xc00074cd2c) = 0 120318 close(15) = 0 120318 futex(0xc000314bc8, FUTEX_WAKE_PRIVATE, 1) = 1 120315 <... futex resumed>) = 0 120318 gettid() = 120318 120318 openat(AT_FDCWD, "/proc/self/task/120318/attr/exec", O_WRONLY|O_CLOEXEC <unfinished ...> 120315 nanosleep({tv_sec=0, tv_nsec=3000}, <unfinished ...> 120318 <... openat resumed>) = 15 120318 epoll_ctl(4, EPOLL_CTL_ADD, 15, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=884056568, u64=139686105424376}}) = -1 EPERM (Operation not permitted) 120315 <... nanosleep resumed>NULL) = 0 120318 epoll_ctl(4, EPOLL_CTL_DEL, 15, 0xc00074cd44 <unfinished ...> 120315 futex(0xc0004a6f48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120318 <... epoll_ctl resumed>) = -1 EPERM (Operation not permitted) 120396 <... futex resumed>) = 0 120315 <... futex resumed>) = 1 120396 futex(0xc0004a6f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120318 write(15, "", 0) = 0 120315 read(14, <unfinished ...> 120318 close(15 <unfinished ...> 120315 <... read resumed>0xc0001cc000, 512) = -1 EAGAIN (Resource temporarily unavailable) 120318 <... close resumed>) = 0 120315 futex(0xc000314bc8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120318 close(10) = 0 120318 close(13) = 0 120318 ioctl(2, TCGETS, {B38400 -opost -isig -icanon -echo ...}) = 0 120318 write(2, "\33[36mINFO\33[0m[0000] Running conm"..., 161) = 161 120318 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 10 120318 setsockopt(10, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0 120318 connect(10, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34 <unfinished ...> 120342 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) 120342 futex(0xc000314bc8, FUTEX_WAKE_PRIVATE, 1) = 1 120315 <... futex resumed>) = 0 120342 madvise(0xc000600000, 2097152, MADV_NOHUGEPAGE <unfinished ...> 120318 <... connect resumed>) = 0 120342 <... madvise resumed>) = 0 120315 futex(0xc000314bc8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120342 madvise(0xc000698000, 8192, MADV_FREE <unfinished ...> 120318 epoll_ctl(4, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=884056568, u64=139686105424376}} <unfinished ...> 120342 <... madvise resumed>) = 0 120308 <... epoll_pwait resumed>[{EPOLLOUT, {u32=884056568, u64=139686105424376}}], 128, -1, NULL, 3) = 1 120318 <... epoll_ctl resumed>) = 0 120308 epoll_pwait(4, <unfinished ...> 120342 futex(0x56437ad6a920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120318 getsockname(10, <unfinished ...> 120342 <... futex resumed>) = 1 120313 <... futex resumed>) = 0 120342 futex(0xc0004fa148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120318 <... 
getsockname resumed>{sa_family=AF_UNIX}, [112->2]) = 0 120313 futex(0x56437ad6a920, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=1742780} <unfinished ...> 120318 getpeername(10, <unfinished ...> 120398 <... execve resumed>) = 0 120318 <... getpeername resumed>{sa_family=AF_UNIX, sun_path="/run/dbus/system_bus_socket"}, [112->30]) = 0 120398 brk(NULL <unfinished ...> 120318 getuid( <unfinished ...> 120398 <... brk resumed>) = 0x9bc000 120318 <... getuid resumed>) = 0 120398 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffece7f960 <unfinished ...> 120318 getpid( <unfinished ...> 120398 <... arch_prctl resumed>) = -1 EINVAL (Invalid argument) 120318 <... getpid resumed>) = 120308 120318 getuid( <unfinished ...>
Are you sure you are fully up to date on all packages?

rpm -q podman crun conmon fuse-overlayfs
podman-1.6.2-2.fc31.x86_64
crun-0.10.2-1.fc31.x86_64
conmon-2.0.1-1.fc31.x86_64
fuse-overlayfs-0.6.5-2.fc31.x86_64
Yes - I am matching these versions of the RPMs. I am getting this on 3 separate F30 -> F31 upgraded machines. On one of the machines I started with a new /var/lib/containers and I am seeing the same stall.
So this works on a new F31 virtual machine (running a fedora:31 systemd container as root). I get the same result as @Daniel Walsh - now to figure out why the upgraded system fails to launch this container.
@Daniel Walsh I know the cause: I must remove oci-systemd-hook, oci-register-machine, and oci-umount. During the upgrade from F30 to F31 these packages were upgraded; the new F31 VM does not have these packages installed. These packages seem to interfere with the proper functioning of podman/conmon. For reference, these packages had to be removed:

oci-register-machine-0-11.git66fa845.fc31.x86_64
oci-systemd-hook-0.2.0-2.git05e6923.fc31.x86_64
oci-umount-2.5-3.gitc3cda1f.fc31.x86_64
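For completeness, the removal amounts to something like the following (package names as listed above):

$ sudo dnf remove oci-systemd-hook oci-register-machine oci-umount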
(In reply to Daniel Walsh from comment #1)
> Systemd guys, is there anything that could be done to make systemd on Centos
> work on a cgroupv2 file system?

cgroupv2 support was added in systemd-230. centos7 has systemd-219, so no support. What we do in systemd-nspawn is check the guest to guess if it supports cgroupsv2. If it has systemd >= 230, it does. See https://github.com/systemd/systemd/blob/master/src/nspawn/nspawn.c#L445-L480. If the guest has no support for v2, we try to mount v1. This is not very elegant, but we couldn't come up with an approach that would allow us to use cgroupv2, but not break old images.
@space88man I thought oci-register-machine-0-11.git66fa845.fc31.x86_64 was removed years ago. We no longer support it. oci-systemd-hook is not used by podman and would not be used by docker-ce or moby-engine, so it really serves no purpose. oci-umount is really only for use with the devicemapper backend, but I am not sure why that would cause issues with cgroup v2. Are you sure you needed to remove it?

Zbigniew Jędrzejewski-Szmek, maybe we should do the same for podman.
@Daniel Walsh oci-umount / oci-register-machine are red herrings - they don't affect fedora:31 centos:8 containers. The blocker seems to be oci-systemd-hook: after reinstalling just that package I am seeing the stall in conmon as previously described. The problematic version is oci-systemd-hook-0.2.0-2.git05e6923.fc31.x86_64.rpm.
There was another bug related to Conmon hanging if an OCI hook exited non-0 - that could be what's going on here. I think it was resolved, but I forget which project the patch went into?
`conmon-2.0.1-1.fc31.x86_64` This appears to be the problem. Conmon 2.0.2 is released upstream with a fix. It's pending in Bodhi at https://bodhi.fedoraproject.org/updates/FEDORA-2019-6353777bbd Can you try the command: `podman --runtime /bin/false run --rm alpine true` If that fails, you have the same issue as the one I'm describing, and 2.0.2 will fix it once it makes it to stable.
@Matthew Heon: I can confirm that what I was seeing with oci-systemd-hook was in fact due to the conmon issue. Thanks.
Experiencing the same thing on F32 (fresh install):

$ sudo podman run --log-level debug -t --name centos --rm centos:7.2.1511 /sbin/init
...
DEBU[0000] Started container f573efe8280c33bf35380b34bb45c5c98c892b3834d0356582e09c44006e436d
Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.

The problem doesn't appear to be linked to the presence of stale oci-* packages:

$ rpm -qa | rg oci | wc -l
0
$ rpm -q systemd podman crun conmon fuse-overlayfs
systemd-245.4-1.fc32.x86_64
podman-1.9.1-1.fc32.x86_64
crun-0.13-2.fc32.x86_64
conmon-2.0.15-1.fc32.x86_64
fuse-overlayfs-1.0.0-1.fc32.x86_64
We've investigated whether it is possible to support older versions of systemd requiring cgroups v1 (the versions shipping in CentOS/RHEL 7 being most notable here) on cgroups v2 hosts, and it doesn't seem to be reasonable. The complexity involved is quite significant, and adding support for this is not on our priorities list at present. It seems like the best option to resolve this is to either swap the base image to RHEL/Cent 8.x (which have a new enough systemd to support cgroups v2) or disable cgroups v2 on the host (which does have the disadvantage of removing ability to set resource limits for rootless containers, among a few other things).
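For anyone choosing the second option, reverting a Fedora 31+ host to the legacy (v1) hierarchy is typically done with a kernel argument. A sketch, assuming grubby is installed; this affects the whole host and requires a reboot:

# Put the host back on cgroups v1 (the "disable cgroups v2 on the host" workaround).
$ sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
$ sudo reboot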
What about doing what is described in https://bugzilla.redhat.com/show_bug.cgi?id=1760645#c19 ?
We investigated that, but discarded it as prohibitively difficult. Giuseppe, who appears to already be on CC, was the one who looked into this, and might have more details.
It seems like a big risk to have Podman examine the first process inside of a container and attempt to figure out a specific version. systemd-nspawn might expect the container to be running systemd, but we have no such assumption, and figuring this out for different distributions would be very difficult.
> Seems like a big risk to have podman examining the first process inside of a container and attempting to figure out a specific version In general, I'd agree. But systemd is easier in this regard, because it never allowed support for cgroupsv2 to be compiled out or otherwise disabled. Additionally, systemd is always installed in the same locations. And the functionality in question is too big to be backported. Effectively, this means that a very simple check for existence of libsystemd-shared-nnn. in one location is enough to cover this case. Anyway, I'm not saying that this is appropriate for podman... just that the check is relatively simple in case of systemd in the container. Feel free to ignore my comment.
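A minimal sketch of that kind of check, assuming the guest root filesystem is available at $ROOTFS (e.g. from podman mount) and using the systemd >= 230 threshold from comment above. The library path is the typical Fedora/RHEL location and the parsing is illustrative, not what nspawn literally does:

# Guess whether the guest systemd can handle cgroups v2 from the version
# encoded in libsystemd-shared-<nnn>.so; absence or a version below 230
# means the guest needs the legacy name=systemd hierarchy.
ver=$(ls "$ROOTFS"/usr/lib/systemd/libsystemd-shared-*.so 2>/dev/null \
      | sed -n 's/.*libsystemd-shared-\([0-9]\+\).*/\1/p' | sort -n | tail -n1)
if [ -n "$ver" ] && [ "$ver" -ge 230 ]; then
    echo "guest systemd $ver: cgroups v2 should work"
else
    echo "guest systemd too old or not found: needs the cgroup v1 name=systemd hierarchy"
fi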
(In reply to Matthew Heon from comment #28)
> We investigated that, but discarded it as prohibitively difficult. Giuseppe,
> who appears to already be on CC, was the one who looked into this, and might
> have more details.

To give more details: I think it is difficult to set this up using the OCI config file, and it would require changes in the specs. We can easily implement it in crun using an annotation. It seems only the name=systemd hierarchy is needed to run an old version of systemd on a cgroup v2 system: https://github.com/containers/crun/pull/357

I am also not sure whether we should add any auto-detection logic into Podman, or whether it is enough to document the annotation:

podman run --annotation=run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup --rm -ti centos:7 /usr/lib/systemd/systemd

Another limitation is that it seems to work only for root, as rootless cannot mount a new hierarchy.
I agree that explicit annotations might be a more stable approach than automagically guessing based on the ENTRYPOINT. In https://github.com/freeipa/freeipa-container and in the rhel7/ipa-server container image, the ENTRYPOINT is a bash script which populates the data volume and only then execs systemd.
I am fine with doing this via an annotation, although a label might be better, since the Docker V2 spec does not support annotations. Then you could embed the label in the container/image and the user would not need to do anything special to get this to work. If the image/container has a systemd V1 label, then crun does the right thing.
Sure, labels would work too.
The label must be handled by Podman; the OCI runtime doesn't have access to this information, so we'll need both. I think we still need an annotation to force this behaviour when the image doesn't specify the label, or when using --rootfs.
I am not sure what the difference is? I can still do

podman run --label run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup ...

or

podman run --label run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup --rootfs ...

But if the centos7-init and rhel7-init containers have this label baked into the image, then podman run ... will just work.
I think Giuseppe is saying that labels are not passed to the OCI runtime, only annotations. Images contain labels, not annotations. This is a tricky/annoying distinction. Podman (and Docker before it) allows labels as arbitrary key-value metadata. The OCI runtime spec separately allows annotations as arbitrary key-value metadata. Unfortunately, these are entirely distinct - Podman uses labels largely for filtering in `ps`/`images` and annotations to trigger specific OCI runtime behaviour (like this). Labels are inherited from images, but annotations are not (images don't have annotations metadata).
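A quick way to see the distinction on a container (sketch; the exact inspect layout differs between Podman versions, so grep is used here rather than a --format path):

# Labels and annotations are set with different flags and stored separately;
# only the annotation is passed down to the OCI runtime.
$ podman create --name demo \
    --label purpose=test \
    --annotation run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup \
    centos:7 /usr/lib/systemd/systemd
$ podman inspect demo | grep -A 3 -E '"Labels"|"Annotations"'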
OCI Images support annotations, but I don't think Dockerfile does at this time. So you can do a buildah commit --annotation x=y container.
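A sketch of baking the annotation from comment above into an OCI image with buildah, using buildah config --annotation plus commit (flag spellings may differ across buildah versions, and Podman would still need to forward image annotations to the runtime, which is the open question here):

# Derive an image that carries the crun annotation in its OCI metadata.
ctr=$(buildah from centos:7)
buildah config --annotation run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup "$ctr"
buildah commit --format oci "$ctr" centos7-systemd-cgv1
buildah rm "$ctr"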
Is it reasonable to make an annotations <-> labels mapping, or would that break things?
I shied away from that when initially writing Podman because of the differing uses (I didn't want a chance of someone setting a label to identify their container and then accidentally triggering weird behavior in the OCI runtime) but we could potentially look into mapping some labels that we know are valid, well-supported OCI runtime annotations.
Matt and Giuseppe, I say we go forward with this for Fedora. We could make up a hacky label LABEL ANNOTATION:X=y, which tells podman to set the annotation X=y when running the OCI runtime.
This message is a reminder that Fedora 31 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '31'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 31 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed, as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Discussion above seems to indicate this isn't obsolete, taking a guess that Rawhide is appropriate.
OK, the question is: how important is this? RHEL 8.4 will have crun, and people might be encouraged to move to cgroups v2 (at least we have been talking to customers about it). RHEL 9 will be cgroups v2 by default. Will RHEL 7 systemd-based init containers still be important, or will everyone have moved on to RHEL 8 and RHEL 9 images?
EL7 still has many years of support lifetime left, thus lots of people are still using EL7 container images, including with systemd. At least I am :)
This bug appears to have been reported against 'rawhide' during the Fedora 34 development cycle. Changing version to 34.
Giuseppe, can you make the change to crun, and we will just pass down the annotation? We can specify the annotation in an OCI image, and/or the user would have to specify it. We just need crun to support it, and then to document it in the man pages.
this is implemented in crun: https://github.com/containers/crun/pull/357
This message is a reminder that Fedora Linux 34 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '34'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 34 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Looks like it got fixed long ago.