Description

Running a CentOS 7 container with systemd fails to run on Fedora 31.

Steps to reproduce the issue:

podman run -ti centos7-with-systemd /sbin/init

Output shows the following:

Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.

This occurs when running as root and rootless.

Describe the results you received:

Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.

Describe the results you expected:

Expect systemd to run.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

podman version
Version:            1.6.1
RemoteAPI Version:  1
Go Version:         go1.13
OS/Arch:            linux/amd64

Output of podman info --debug:

podman info --debug
debug:
  compiler: gc
  git commit: ""
  go version: go1.13
  podman version: 1.6.1
host:
  BuildahVersion: 1.11.2
  CgroupVersion: v2
  Conmon:
    package: conmon-2.0.1-1.fc31.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.1, commit: 5e0eadedda9508810235ab878174dca1183f4013'
  Distribution:
    distribution: fedora
    version: "31"
  MemFree: 18496446464
  MemTotal: 67443789824
  OCIRuntime:
    package: crun-0.10.1-1.fc31.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.10.1
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 16
  eventlogger: journald
  hostname: kubhost
  kernel: 5.3.0-1.fc31.x86_64
  os: linux
  rootless: true
  slirp4netns:
    Executable: /usr/bin/slirp4netns
    Package: slirp4netns-0.4.0-20.1.dev.gitbbd6f25.fc31.x86_64
    Version: |-
      slirp4netns version 0.4.0-beta.3+dev
      commit: bbd6f25c70d5db2a1cd3bfb0416a8db99a75ed7e
  uptime: 186h 43m 44.97s (Approximately 7.75 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  ConfigFile: /home/greg/.config/containers/storage.conf
  ContainerStore:
    number: 13
  GraphDriverName: vfs
  GraphOptions: {}
  GraphRoot: /home/greg/.local/share/containers/storage
  GraphStatus: {}
  ImageStore:
    number: 21
  RunRoot: /run/user/1000
  VolumePath: /home/greg/.local/share/containers/storage/volumes

Additional environment details (AWS, VirtualBox, physical, etc.):

Running ZFS.
Systemd guys, is there anything that could be done to make systemd on Centos work on a cgroupv2 file system?
At least with Docker, when using a centos7/systemd container one needs to mount the /sys/fs/cgroup volume:

docker run -d -v /sys/fs/cgroup:/sys/fs/cgroup:ro centos7-with-systemd /sbin/init

Is it the same with podman?
No. Podman will automatically detect that a container was run with systemd as the entrypoint and add the volume (among other changes necessary to make systemd run well in a container).
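For anyone comparing with the Docker recipe above, here is a rough sketch of what systemd mode amounts to on a cgroups v1 host. This is an approximation for illustration only, not Podman's exact implementation; the image name is taken from the reporter's command and flag spellings may vary slightly between Podman versions.

# Approximation of systemd mode: tmpfs on /run and /tmp, a read-only cgroup
# mount, and a systemd-friendly stop signal. Podman applies the equivalent
# automatically when the entrypoint looks like systemd/init, so none of this
# is needed by hand.
$ podman run -d \
    --tmpfs /run --tmpfs /tmp \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    --stop-signal SIGRTMIN+3 \
    centos7-with-systemd /sbin/init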
I have the same problem on fc31. Is there anything that can be done? It is quite a blocker for me. Thank you.
Data point: F31 (upgraded from F30) cannot run a systemd-based F31 container (even though the latter is supposed to be cgroupv2-aware). Actually I get a different situation from the reporter, and it is the same with CentOS 7 / CentOS 8. I don't see any output from the container itself; instead, conmon stalls at

INFO[0000] Running conmon under slice machine.slice and unitName libpod-conmon-807c8d059ee7276a163930b6de94e90d456398003bc5c8311f001dfa3a1b0f07.scope

After a couple of minutes:

DEBU[0241] ExitCode msg: "container creation timeout: internal libpod error"
Error: container creation timeout: internal libpod error

Unlike the reporter, systemd inside the container doesn't even get a chance to complain.
It would be useful to be precise about exactly what versions of relevant packages are on both ends there (on the host system and in the container), as there's been quite a lot of change in this area late in F31. There is a known bug https://bugzilla.redhat.com/show_bug.cgi?id=1763868 .
Brand new container.

F31 Host:
podman-1.6.1-5.fc31.x86_64 / podman-1.6.2-2.fc31.x86_64
conmon-2.0.1-1.fc31.x86_64
crun-0.10.2-1.fc31.x86_64
systemd-243-3.gitef67743.fc31.x86_64

F31 Container (--entrypoint /bin/bash):
systemd-243-2.gitfab6f01.fc31.x86_64
What is the exact command you are using to run the F31 container with systemd in it?
podman --log-level DEBUG run --rm -it --entrypoint /sbin/init fedora:31

Reaches:

DEBU[0000] /usr/bin/conmon messages will be logged to syslog
DEBU[0000] running conmon: /usr/bin/conmon args="[--api-version 1 -s -c a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531 -u a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531/userdata -p /var/run/containers/storage/overlay-containers/a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531/userdata/pidfile -l k8s-file:/var/lib/containers/storage/overlay-containers/a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531/userdata/ctr.log --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket --log-level debug --syslog -t --conmon-pidfile /var/run/containers/storage/overlay-containers/a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531]"
INFO[0000] Running conmon under slice machine.slice and unitName libpod-conmon-a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531.scope

...stall for a few minutes...

DEBU[0240] Cleaning up container a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531
DEBU[0240] Tearing down network namespace at /var/run/netns/cni-a5ee8c98-8d56-9658-cc99-086c88757c71 for container a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531
INFO[0240] Got pod network &{Name:nostalgic_haslett Namespace:nostalgic_haslett ID:a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531 NetNS:/var/run/netns/cni-a5ee8c98-8d56-9658-cc99-086c88757c71 Networks:[] RuntimeConfig:map[podman:{IP: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}
INFO[0240] About to del CNI network podman (type=bridge)
DEBU[0240] unmounted container "a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531"
DEBU[0240] unable to remove container a60d3436aca8e3c8633db2dfa60f186679eb6ed61a0a38ea4a2970ccaa10c531 after failing to start and attach to it
DEBU[0240] ExitCode msg: "container creation timeout: internal libpod error"
DEBU[0240] [graphdriver] trying provided driver "overlay"
DEBU[0240] cached value indicated that overlay is supported
DEBU[0240] cached value indicated that metacopy is being used
DEBU[0240] backingFs=xfs, projectQuotaSupported=false, useNativeDiff=false, usingMetacopy=true
Error: container creation timeout: internal libpod error
Are you running this rootless or rootful? If rootless, does it work when running as root?
Works for me.

$ rpm -q podman
podman-1.6.2-2.fc31.x86_64
$ podman run --rm -it --entrypoint /sbin/init fedora:31
Trying to pull docker.io/library/fedora:31...
Getting image source signatures
Copying blob 619d35b2bf84 done
Copying config 98c519110e done
Writing manifest to image destination
Storing signatures
systemd v243-2.gitfab6f01.fc31 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.
Welcome to Fedora 31 (Container Image)!
Set hostname to <08a30fbf62e3>.
Initializing machine ID from random generator.
[ OK ] Started Dispatch Password…ts to Console Directory Watch.
[ OK ] Started Forward Password …uests to Wall Directory Watch.
[ OK ] Reached target Local File Systems.
[ OK ] Reached target Paths.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Slices.
[ OK ] Reached target Swap.
[ OK ] Listening on Process Core Dump Socket.
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
Starting Rebuild Dynamic Linker Cache...
Starting Journal Service...
Starting Create System Users...
[ OK ] Started Create System Users.
[ OK ] Started Rebuild Dynamic Linker Cache.
[ OK ] Started Journal Service.
Starting Flush Journal to Persistent Storage...
[ OK ] Started Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ OK ] Started Create Volatile Files and Directories.
Starting Rebuild Journal Catalog...
Starting Update UTMP about System Boot/Shutdown...
[ OK ] Started Update UTMP about System Boot/Shutdown.
[ OK ] Started Rebuild Journal Catalog.
Starting Update is Completed...
[ OK ] Started Update is Completed.
[ OK ] Reached target System Initialization.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Reached target Basic System.
Starting Permit User Sessions...
[ OK ] Started Permit User Sessions.
[ OK ] Reached target Multi-User System.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Started Update UTMP about System Runlevel Changes.
You say this system was upgraded from F30 - was this container preexisting, or created after the upgrade?
Completely new container, running as root, at the stall, strace'ing I see 120414 futex(0x5649d6ac0d30, FUTEX_WAIT_PRIVATE, 0, {tv_sec=60, tv_nsec=0} <unfinished ...> 120342 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) 120342 futex(0x56437ad58d30, FUTEX_WAKE_PRIVATE, 1) = 1 120309 <... futex resumed>) = 0 120342 futex(0xc0003152c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120309 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...> 120682 <... futex resumed>) = 0 120342 <... futex resumed>) = 1 120682 nanosleep({tv_sec=0, tv_nsec=3000}, <unfinished ...> 120342 futex(0x56437ad659e8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120309 <... nanosleep resumed>NULL) = 0 120308 <... futex resumed>) = 0 120342 <... futex resumed>) = 1 120309 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...> 120308 futex(0x56437ad659e8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120682 <... nanosleep resumed>NULL) = 0 120342 futex(0x56437ad6a920, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=999010290} <unfinished ...> 120682 futex(0xc0003152c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120309 <... nanosleep resumed>NULL) = 0 120309 futex(0x56437ad58d30, FUTEX_WAIT_PRIVATE, 0, {tv_sec=60, tv_nsec=0} <unfinished ...> 120422 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) 120422 futex(0x5649d6ac0d30, FUTEX_WAKE_PRIVATE, 1) = 1 120414 <... futex resumed>) = 0 120422 futex(0xc00006a848, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120414 sched_yield( <unfinished ...> 120422 <... futex resumed>) = 1 120415 <... futex resumed>) = 0 120414 <... sched_yield resumed>) = 0 120422 futex(0xc00006b9c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120415 futex(0xc00009a4c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120414 futex(0x5649d6ac0c30, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120423 <... futex resumed>) = 0 120422 <... futex resumed>) = 1 120419 <... futex resumed>) = 0 120415 <... futex resumed>) = 1 120414 <... futex resumed>) = 0 120423 futex(0xc00006b9c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120422 futex(0x5649d6ad29a0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=998981396} <unfinished ...> 120419 futex(0xc00009a4c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120415 futex(0xc00006a848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
Also something interesting here... 120398 execve("/usr/bin/conmon", ["/usr/bin/conmon", "--api-version", "1", "-s", "-c", "cf865286f1ae24faa69fd5371ad757fa"..., "-u", "cf865286f1ae24faa69fd5371ad757fa"..., "-r", "/usr/bin/crun", "-b", "/var/lib/containers/storage/over"..., "-p", "/var/run/containers/storage/over"..., "-l", "k8s-file:/var/lib/containers/sto"..., "--exit-dir", "/var/run/libpod/exits", "--socket-dir-path", "/var/run/libpod/socket", "--log-level", "debug", "--syslog", "-t", "--conmon-pidfile", "/var/run/containers/storage/over"..., "--exit-command", "/usr/bin/podman", "--exit-command-arg", "--root", "--exit-command-arg", "/var/lib/containers/storage", ...], 0xc000106480 /* 7 vars */ <unfinished ...> 120318 <... clone resumed>) = 120398 120318 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 120318 close(17) = 0 120318 read(16, "", 8) = 0 120318 close(16) = 0 120318 epoll_ctl(4, EPOLL_CTL_DEL, 15, 0xc00074cd2c) = 0 120318 close(15) = 0 120318 futex(0xc000314bc8, FUTEX_WAKE_PRIVATE, 1) = 1 120315 <... futex resumed>) = 0 120318 gettid() = 120318 120318 openat(AT_FDCWD, "/proc/self/task/120318/attr/exec", O_WRONLY|O_CLOEXEC <unfinished ...> 120315 nanosleep({tv_sec=0, tv_nsec=3000}, <unfinished ...> 120318 <... openat resumed>) = 15 120318 epoll_ctl(4, EPOLL_CTL_ADD, 15, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=884056568, u64=139686105424376}}) = -1 EPERM (Operation not permitted) 120315 <... nanosleep resumed>NULL) = 0 120318 epoll_ctl(4, EPOLL_CTL_DEL, 15, 0xc00074cd44 <unfinished ...> 120315 futex(0xc0004a6f48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120318 <... epoll_ctl resumed>) = -1 EPERM (Operation not permitted) 120396 <... futex resumed>) = 0 120315 <... futex resumed>) = 1 120396 futex(0xc0004a6f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120318 write(15, "", 0) = 0 120315 read(14, <unfinished ...> 120318 close(15 <unfinished ...> 120315 <... read resumed>0xc0001cc000, 512) = -1 EAGAIN (Resource temporarily unavailable) 120318 <... close resumed>) = 0 120315 futex(0xc000314bc8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120318 close(10) = 0 120318 close(13) = 0 120318 ioctl(2, TCGETS, {B38400 -opost -isig -icanon -echo ...}) = 0 120318 write(2, "\33[36mINFO\33[0m[0000] Running conm"..., 161) = 161 120318 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 10 120318 setsockopt(10, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0 120318 connect(10, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34 <unfinished ...> 120342 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out) 120342 futex(0xc000314bc8, FUTEX_WAKE_PRIVATE, 1) = 1 120315 <... futex resumed>) = 0 120342 madvise(0xc000600000, 2097152, MADV_NOHUGEPAGE <unfinished ...> 120318 <... connect resumed>) = 0 120342 <... madvise resumed>) = 0 120315 futex(0xc000314bc8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120342 madvise(0xc000698000, 8192, MADV_FREE <unfinished ...> 120318 epoll_ctl(4, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=884056568, u64=139686105424376}} <unfinished ...> 120342 <... madvise resumed>) = 0 120308 <... epoll_pwait resumed>[{EPOLLOUT, {u32=884056568, u64=139686105424376}}], 128, -1, NULL, 3) = 1 120318 <... epoll_ctl resumed>) = 0 120308 epoll_pwait(4, <unfinished ...> 120342 futex(0x56437ad6a920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> 120318 getsockname(10, <unfinished ...> 120342 <... futex resumed>) = 1 120313 <... futex resumed>) = 0 120342 futex(0xc0004fa148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> 120318 <... 
getsockname resumed>{sa_family=AF_UNIX}, [112->2]) = 0 120313 futex(0x56437ad6a920, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=1742780} <unfinished ...> 120318 getpeername(10, <unfinished ...> 120398 <... execve resumed>) = 0 120318 <... getpeername resumed>{sa_family=AF_UNIX, sun_path="/run/dbus/system_bus_socket"}, [112->30]) = 0 120398 brk(NULL <unfinished ...> 120318 getuid( <unfinished ...> 120398 <... brk resumed>) = 0x9bc000 120318 <... getuid resumed>) = 0 120398 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffece7f960 <unfinished ...> 120318 getpid( <unfinished ...> 120398 <... arch_prctl resumed>) = -1 EINVAL (Invalid argument) 120318 <... getpid resumed>) = 120308 120318 getuid( <unfinished ...>
Are you sure you are fully up to date on all packages?

rpm -q podman crun conmon fuse-overlayfs
podman-1.6.2-2.fc31.x86_64
crun-0.10.2-1.fc31.x86_64
conmon-2.0.1-1.fc31.x86_64
fuse-overlayfs-0.6.5-2.fc31.x86_64
Yes - I am matching these versions of the RPMs. I am getting this on 3 separate F30 -> F31 upgraded machines. On one of the machines I started with a new /var/lib/containers and I am seeing the same stall.
So this works on a new F31 virtual machine (running a fedora:31 systemd container as root). I get the same result as @Daniel Walsh - now to figure out why the upgraded system fails to launch this container.
@Daniel Walsh I know the cause: I must remove oci-systemd-hook, oci-register-machine, and oci-umount. During the upgrade from F30 to F31 these packages were upgraded; the new F31 VM does not have these packages installed. These packages seem to interfere with the proper functioning of podman/conmon. For reference, these packages had to be removed:

oci-register-machine-0-11.git66fa845.fc31.x86_64
oci-systemd-hook-0.2.0-2.git05e6923.fc31.x86_64
oci-umount-2.5-3.gitc3cda1f.fc31.x86_64
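For completeness, the removal amounts to something like the following (package names as listed above):

$ sudo dnf remove oci-systemd-hook oci-register-machine oci-umount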
(In reply to Daniel Walsh from comment #1)
> Systemd guys, is there anything that could be done to make systemd on Centos
> work on a cgroupv2 file system?

cgroupv2 support was added in systemd-230. centos7 has systemd-219, so no support. What we do in systemd-nspawn is check the guest to guess if it supports cgroupsv2. If it has systemd >= 230, it does. See https://github.com/systemd/systemd/blob/master/src/nspawn/nspawn.c#L445-L480. If the guest has no support for v2, we try to mount v1. This is not very elegant, but we couldn't come up with an approach that would allow us to use cgroupv2, but not break old images.
@space88man I thought oci-register-machine-0-11.git66fa845.fc31.x86_64 was removed years ago. We no longer support it. oci-systemd-hook is not used by podman and would not be used by docker-ce or moby-engine, so it really serves no purpose. oci-umount is really only for use with the devicemapper backend, but I am not sure why that would cause issues with cgroup v2. Are you sure you needed to remove it?

Zbigniew Jędrzejewski-Szmek, maybe we should do the same for podman.
@Daniel Walsh oci-umount / oci-register-machine are red herrings - they don't affect fedora:31 centos:8 containers. The blocker seems to be oci-systemd-hook: after reinstalling just that package I am seeing the stall in conmon as previously described. The problematic version is oci-systemd-hook-0.2.0-2.git05e6923.fc31.x86_64.rpm.
There was another bug related to Conmon hanging if an OCI hook exited non-0 - that could be what's going on here. I think it was resolved, but I forget which project the patch went into?
`conmon-2.0.1-1.fc31.x86_64` This appears to be the problem. Conmon 2.0.2 is released upstream with a fix. It's pending in Bodhi at https://bodhi.fedoraproject.org/updates/FEDORA-2019-6353777bbd Can you try the command: `podman --runtime /bin/false run --rm alpine true` If that fails, you have the same issue as the one I'm describing, and 2.0.2 will fix it once it makes it to stable.
@Matthew Heon: I can confirm that what I was seeing with oci-systemd-hook was in fact due to the conmon issue. Thanks.
Experiencing the same thing on F32 (fresh install):

$ sudo podman run --log-level debug -t --name centos --rm centos:7.2.1511 /sbin/init
...
DEBU[0000] Started container f573efe8280c33bf35380b34bb45c5c98c892b3834d0356582e09c44006e436d
Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.

The problem doesn't appear to be linked to the presence of stale oci-* packages:

$ rpm -qa | rg oci | wc -l
0
$ rpm -q systemd podman crun conmon fuse-overlayfs
systemd-245.4-1.fc32.x86_64
podman-1.9.1-1.fc32.x86_64
crun-0.13-2.fc32.x86_64
conmon-2.0.15-1.fc32.x86_64
fuse-overlayfs-1.0.0-1.fc32.x86_64
We've investigated whether it is possible to support older versions of systemd requiring cgroups v1 (the versions shipping in CentOS/RHEL 7 being most notable here) on cgroups v2 hosts, and it doesn't seem to be reasonable. The complexity involved is quite significant, and adding support for this is not on our priorities list at present. It seems like the best option to resolve this is to either swap the base image to RHEL/Cent 8.x (which have a new enough systemd to support cgroups v2) or disable cgroups v2 on the host (which does have the disadvantage of removing ability to set resource limits for rootless containers, among a few other things).
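For anyone choosing the second option, reverting a Fedora 31+ host to the legacy (v1) hierarchy is typically done with a kernel argument. A sketch, assuming grubby is installed; this affects the whole host and requires a reboot:

# Put the host back on cgroups v1 (the "disable cgroups v2 on the host" workaround).
$ sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
$ sudo reboot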
What about doing what is described in https://bugzilla.redhat.com/show_bug.cgi?id=1760645#c19 ?
We investigated that, but discarded it as prohibitively difficult. Giuseppe, who appears to already be on CC, was the one who looked into this, and might have more details.
It seems like a big risk to have Podman examine the first process inside of a container and attempt to figure out a specific version. systemd-nspawn might expect the container to be running systemd, but we have no such assumption, and figuring this out for different distributions would be very difficult.
> Seems like a big risk to have podman examining the first process inside of a container and attempting to figure out a specific version In general, I'd agree. But systemd is easier in this regard, because it never allowed support for cgroupsv2 to be compiled out or otherwise disabled. Additionally, systemd is always installed in the same locations. And the functionality in question is too big to be backported. Effectively, this means that a very simple check for existence of libsystemd-shared-nnn. in one location is enough to cover this case. Anyway, I'm not saying that this is appropriate for podman... just that the check is relatively simple in case of systemd in the container. Feel free to ignore my comment.
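A minimal sketch of that kind of check, assuming the guest root filesystem is available at $ROOTFS (e.g. from podman mount) and using the systemd >= 230 threshold from comment above. The library path is the typical Fedora/RHEL location and the parsing is illustrative, not what nspawn literally does:

# Guess whether the guest systemd can handle cgroups v2 from the version
# encoded in libsystemd-shared-<nnn>.so; absence or a version below 230
# means the guest needs the legacy name=systemd hierarchy.
ver=$(ls "$ROOTFS"/usr/lib/systemd/libsystemd-shared-*.so 2>/dev/null \
      | sed -n 's/.*libsystemd-shared-\([0-9]\+\).*/\1/p' | sort -n | tail -n1)
if [ -n "$ver" ] && [ "$ver" -ge 230 ]; then
    echo "guest systemd $ver: cgroups v2 should work"
else
    echo "guest systemd too old or not found: needs the cgroup v1 name=systemd hierarchy"
fi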
(In reply to Matthew Heon from comment #28)
> We investigated that, but discarded it as prohibitively difficult. Giuseppe,
> who appears to already be on CC, was the one who looked into this, and might
> have more details.

To give more details: I think it is difficult to set this up using the OCI config file, and it would require changes in the specs. We can easily implement it in crun using an annotation. It seems only the name=systemd hierarchy is needed to run an old version of systemd on a cgroup v2 system: https://github.com/containers/crun/pull/357

I am also not sure whether we should add any auto-detection logic into Podman, or whether it is enough to document the annotation:

podman run --annotation=run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup --rm -ti centos:7 /usr/lib/systemd/systemd

Another limitation is that it seems to work only for root, as rootless cannot mount a new hierarchy.
I agree that explicit annotations might be a more stable approach than automagically guessing based on the ENTRYPOINT. In https://github.com/freeipa/freeipa-container and in the rhel7/ipa-server container image, the ENTRYPOINT is a bash script which populates the data volume and only then execs systemd.
I am fine with doing this via an annotation, although a label might be better, since the Docker V2 spec does not support annotations. Then you could embed the label in the container/image and the user would not need to do anything special to get this to work. If the image/container has a systemd V1 label, then crun does the right thing.
Sure, labels would work too.
The label must be handled by Podman; the OCI runtime doesn't have access to this information, so we'll need both. I think we still need an annotation to force this behaviour when the image doesn't specify the label, or when using --rootfs.
I am not sure what the difference is? I can still do

podman run --label run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup ...

or

podman run --label run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup --rootfs ...

But if the centos7-init and rhel7-init containers have this label baked into the image, then podman run ... will just work.
I think Giuseppe is saying that labels are not passed to the OCI runtime, only annotations. Images contain labels, not annotations. This is a tricky/annoying distinction. Podman (and Docker before it) allows labels as arbitrary key-value metadata. The OCI runtime spec separately allows annotations as arbitrary key-value metadata. Unfortunately, these are entirely distinct - Podman uses labels largely for filtering in `ps`/`images` and annotations to trigger specific OCI runtime behaviour (like this). Labels are inherited from images, but annotations are not (images don't have annotations metadata).
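A quick way to see the distinction on a container (sketch; the exact inspect layout differs between Podman versions, so grep is used here rather than a --format path):

# Labels and annotations are set with different flags and stored separately;
# only the annotation is passed down to the OCI runtime.
$ podman create --name demo \
    --label purpose=test \
    --annotation run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup \
    centos:7 /usr/lib/systemd/systemd
$ podman inspect demo | grep -A 3 -E '"Labels"|"Annotations"'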
OCI Images support annotations, but I don't think Dockerfile does at this time. So you can do a buildah commit --annotation x=y container.
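A sketch of baking the annotation from comment above into an OCI image with buildah, using buildah config --annotation plus commit (flag spellings may differ across buildah versions, and Podman would still need to forward image annotations to the runtime, which is the open question here):

# Derive an image that carries the crun annotation in its OCI metadata.
ctr=$(buildah from centos:7)
buildah config --annotation run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup "$ctr"
buildah commit --format oci "$ctr" centos7-systemd-cgv1
buildah rm "$ctr"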
Is it reasonable to make an annotations <-> labels mapping, or would that break things?
I shied away from that when initially writing Podman because of the differing uses (I didn't want a chance of someone setting a label to identify their container and then accidentally triggering weird behavior in the OCI runtime) but we could potentially look into mapping some labels that we know are valid, well-supported OCI runtime annotations.
Matt and Giuseppe, I say we go forward with this for Fedora. We could make up a hacky label LABEL ANNOTATION:X=y, which tells podman to set the annotation X=y when running the OCI runtime.
This message is a reminder that Fedora 31 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '31'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 31 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed, as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Discussion above seems to indicate this isn't obsolete, taking a guess that Rawhide is appropriate.
OK, the question is: how important is this? RHEL 8.4 will have crun, and people might be encouraged to move to cgroups v2 (at least we have been talking to customers about it). RHEL 9 will be cgroups v2 by default. Will RHEL 7 systemd-based init containers still be important, or will everyone have moved on to RHEL 8 and RHEL 9 images?
EL7 still has many years of support lifetime left, thus lots of people are still using EL7 container images, including with systemd. At least I am :)
This bug appears to have been reported against 'rawhide' during the Fedora 34 development cycle. Changing version to 34.
Giuseppe, can you make the change to crun, and we will just pass down the annotation? We can specify the annotation in an OCI image, and/or the user would have to specify it. We just need crun to support it, and then to document it in the man pages.
this is implemented in crun: https://github.com/containers/crun/pull/357
This message is a reminder that Fedora Linux 34 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '34'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 34 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Looks like it got fixed long ago.