The Koji builders were recently upgraded to F38 and this started showing a regression in the glibc testsuite.

The test in question is misc/tst-ttyname.

You can see the failure with a straightforward mock build of glibc in F38.

It looks like this:

info: entering chroot 1
info: testcase: basic smoketest
info: ttyname: PASS {name="/dev/pts/0", errno=0}
info: ttyname_r: PASS {name="/dev/pts/0", ret=0, errno=0}
info: testcase: no conflict, no match
info: ttyname: PASS {name=NULL, errno=19}
info: ttyname_r: PASS {name=NULL, ret=19, errno=19}
info: testcase: no conflict, console
info: ttyname: PASS {name="/dev/console", errno=0}
info: ttyname_r: PASS {name="/dev/console", ret=0, errno=0}
info: testcase: conflict, no match
info: ttyname: PASS {name=NULL, errno=19}
info: ttyname_r: PASS {name=NULL, ret=19, errno=19}
info: testcase: conflict, console
info: ttyname: PASS {name="/dev/console", errno=0}
info: ttyname_r: PASS {name="/dev/console", ret=0, errno=0}
info: testcase: with readlink target
info: ttyname: PASS {name="/dev/pts/0", errno=0}
info: ttyname_r: PASS {name="/dev/pts/0", ret=0, errno=0}
info: testcase: with readlink trap; fallback
info: ttyname: PASS {name="/dev/console", errno=0}
info: ttyname_r: PASS {name="/dev/console", ret=0, errno=0}
info: testcase: with readlink trap; no fallback
info: ttyname: PASS {name=NULL, errno=19}
info: ttyname_r: PASS {name=NULL, ret=19, errno=19}
info: testcase: with search-path trap
info: ttyname: PASS {name="/dev/console2", errno=0}
info: ttyname_r: PASS {name="/dev/console2", ret=0, errno=0}
info: entering chroot 2
info: testcase: basic smoketest
info: ttyname: PASS {name="/dev/pts/0", errno=0}
info: ttyname_r: PASS {name="/dev/pts/0", ret=0, errno=0}
error: ../sysdeps/unix/sysv/linux/tst-ttyname.c:414: mount ("proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, NULL) == 0: Operation not permitted

The test is slightly complicated in that the glibc testsuite is exercising some of the complex aspects of nested namespaces, particularly the mount namespace. We expect to be able to mount /proc and that has changed with the upgrade.

With the new systemd in F38 we are unable to run the entire tst-ttyname test.

I don't think this is a kernel change since both my F37 and F38 testers are basically running almost the same kernel, but the systemd version is different which will show up as systemd-nspawn differences.

In summary:
- F37 systemd-251.14-2, kernel-6.2.15-200, tst-ttyname PASS.
- F38 systemd-253.4-1, kernel-6.2.15-300, tst-ttyname FAIL.

With the recent koji builder upgrade we've regressed the misc/tst-ttyname testing.

Reproducible: Always

Steps to Reproduce:
1. mock -r fedora-rawhide-x86_64 --init
2. fedpkg clone glibc; cd glibc; fedpkg srpm
3. mock -r fedora-rawhide-x86_64 --no-clean --rebuild <path to glibc srpm>

Actual Results:
Everything builds but during rpm %check we get a failure in tst-ttyname test.

Expected Results:
The tst-ttyname test passes.
It'd help a lot if you could narrow this down to a much smaller reproducer. The glibc build and test system is a huge beast, and it's hard for me to know what exactly the test is doing and why it's failing…

Looking at the changes in systemd between v251 and current git, I see some changes to the seccomp syscall filters, but generally just additions of new syscalls as they were added, so this should not be relevant here. There are some changes to how nspawn sets up mount propagation (propagation of host mounts to the container was disabled). This doesn't seem directly relevant either.

Without knowing what the test does and what exactly fails, it's hard to know what to look for.
The build works for me here (amd64, fedora 38, selinux=disabled).
(In reply to Zbigniew Jędrzejewski-Szmek from comment #1)
> Without knowing what the test does and what exactly fails, it's hard to know
> what to look for.

Thank you for having a quick look!

I'll keep trying to reduce the test case to something we can talk about more concretely.

I filed the bug just to start the conversation and have something to reference when we waive the Fedora CI failures.
Created attachment 1968304 [details]
tst-ttyname

Please find attached a *static* version (static pie) of the test 'tst-ttyname', which should allow much easier bisect for the point at which systemd-nspawn changed to disallow mounting /proc in the test case. The test passes cleanly with 251 and fails with 253.
If you trace this inside the container we get:

Initial test phase mounts proc and umounts it just fine, but there we're using a bind mount:

1685620773.584779 mount("/proc", "proc", NULL, MS_BIND|MS_REC, NULL) = 0
1685620773.596947 umount2("/proc", MNT_DETACH) = 0

Second phase mounts proc directly and fails:

1685620773.601470 mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)

The second mount works in the earlier setup with 251:

1685625735.238957 mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = 0
1685625735.250642 umount2("/proc", MNT_DETACH) = 0

It seems an artificial restriction to limit mounting /proc inside the systemd-nspawn container?
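When comparing traces like the ones above, it can help to pull only the failing mount() calls out of an strace log. The following is a minimal sketch, not part of the original report: the function name failed_mounts and the default log path /log are assumptions chosen for illustration.

```shell
#!/bin/sh
# Print every mount() call in an strace log that returned an error.
# The pattern requires a non-identifier character (or start of line) before
# "mount(" so that umount()/umount2() lines are not matched.
failed_mounts() {
    # $1: path to the strace log (hypothetical default: /log)
    grep -E '(^|[^A-Za-z0-9_])mount\(.*\) = -1 ' "${1:-/log}"
}
```

Run against a log that contains the lines quoted above, this would print only the EPERM line from the second phase.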
Thanks. The good news is that the test also fails here, exactly as originally reported. I tried using systemd-nspawn-251 on a F38 host, and the issue still reproduces. So it's not something in systemd-nspawn itself. With systemd-252 installed, the test passes. I'll try to do a bisect.
(In reply to Zbigniew Jędrzejewski-Szmek from comment #6)
> Thanks. The good news is that the test also fails here, exactly as
> originally reported. I tried using systemd-nspawn-251 on a F38 host, and the
> issue still reproduces. So it's not something in systemd-nspawn itself. With
> systemd-252 installed, the test passes. I'll try to a bisect.

It's interesting that 251 on F38 reproduces the issue, that's not a configuration I could test easily. Thank you!

Could this point to a libseccomp or related Seccomp issue?
Bisect says the culprit is: https://github.com/systemd/systemd/commit/57c10a5650f6bb7180f3bec31a3f24239a81be39
Created attachment 1968625 [details] sudo systemd-nspawn -D ~/f37 strace -f -o /log -y /tst-ttyname
(In reply to Zbigniew Jędrzejewski-Szmek from comment #8)
> Bisect says the culprit is:
> https://github.com/systemd/systemd/commit/57c10a5650f6bb7180f3bec31a3f24239a81be39

Yes, this is the change that causes this. I can provide some background and a recommendation for a fix.

I summarized the background for this change in commit 2e776ed6c864 ("shared: use move_pivot_root() for services"):

Currently, services use mount_move_root() in order to setup the root directory of services using a mount namespace. This relies on MS_MOVE and chroot().

However, this has serious drawbacks even for relatively simple mount propagation scenarios. What systemd currently does is roughly equivalent to the following shell code:

    unshare --mount --propagation=shared
    cd /
    mount --make-rslave /
    mkdir /new-root
    mount --rbind / /new-root
    cd /new-root
    mount --move /new-root /
    chroot .

This looks simple enough but has the consequence that two separate mount trees exist for the lifetime of the service. The first one was created when the mount namespace was created, and the second one when a new mount for the rootfs was created. The first mount tree sticks around as a shadow mount tree. Both mount trees are dependent mounts with the host rootfs as their dominating mount.

Now, when mount propagation is triggered by the host by e.g.,

    mount --bind /opt /mnt

it means that two propagation events are generated. I'm skipping over the exact kernel details as they aren't that important. The gist is that for every propagation event that is generated a second one is generated for the shadow mount tree. In other words, the kernel creates two copies for each mount that is propagated instead of one.

This isn't necessary. We can simply change the sequence above to:

    unshare --mount --propagation=shared
    cd /
    mount --make-rslave /
    mkdir /new-root
    # stash fd to old rootfs
    # stash fd to new rootfs
    mount --rbind / /new-root
    mkdir /new-root
    cd /new-root
    pivot_root . .
    # new root is tucked under old root
    # chdir into old rootfs via stashed fd
    umount -l /old-root

The pivot_root allows us to get rid of the old mount tree that was created when the mount namespace was created. So after this sequence only one mount tree is alive. Plus, it's safer and nicer.

However, this causes the following semantical requirement to be hit which I explained in commit b71a0192c040 ("nspawn: mount temporary visible procfs and sysfs instance")

In order to mount procfs and sysfs in an unprivileged container the kernel requires that a fully visible instance is already present in the target mount namespace.

[...]

So far nspawn didn't run into this issue because it used MS_MOVE which meant that the shadow mount tree pinned a procfs and sysfs instance which the kernel would find. The shadow mount tree is gone with proper pivot_root() semantics.

IOW, if you start a systemd-nspawn container:

    systemd-nspawn -U -D /var/lib/machines/my-container

you will notice via findmnt that various procfs files and directories are overmounted to prevent the user from getting access to various files or directories such as /proc/kmsg. In other words, procfs isn't fully visible.

So let's say you do:

    unshare --mount --pid --fork --user --map-root mount -t proc proc /mnt

You will fail to mount procfs because the kernel sees that various files and directories have mounts in procfs on top of them and that the mounts have become locked when you created the new mount namespace.

Here, a mount being locked means that you cannot unmount it even if you technically own the mount since you've been provided a copy at the time the mount namespace was provided. That locking happens whenever you create an unprivileged mount namespace - read when you create a mount namespace in a user namespace.
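The overmounting described above can be inspected without findmnt by reading mountinfo directly. This is a rough sketch under stated assumptions: the function name proc_overmounts is made up for illustration, and the check is only a heuristic. It lists mounts stacked below /proc but cannot tell whether any of them is locked, which is what the explanation above says actually matters.

```shell
#!/bin/sh
# List mount points that sit strictly below /proc, i.e. mounts stacked on
# top of procfs entries. Field 5 of a mountinfo line is the mount point.
proc_overmounts() {
    # $1: mountinfo file to inspect (defaults to the caller's own namespace)
    awk '$5 ~ /^\/proc\// { print $5 }' "${1:-/proc/self/mountinfo}"
}
```

Empty output suggests nothing is mounted over procfs in the current namespace. Any output is a hint that a fresh procfs mount from a nested unprivileged namespace may be refused.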
To fix the test you should simply:

    mkdir /run/host/proc
    mount -t proc proc /run/host/proc

Afterwards you will be able to create unprivileged mount namespaces and they will be able to mount procfs, since they will have received a fully visible copy of procfs at the time they created an unprivileged mount namespace.
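To confirm that such a pinned procfs instance is actually in place before running the test, one could look it up in mountinfo. A minimal sketch, assuming a hypothetical helper named proc_mounted_at. It only checks that a filesystem of type proc is mounted at the given path, not that it is fully visible.

```shell
#!/bin/sh
# Succeed if a filesystem of type "proc" is mounted exactly at the given path.
proc_mounted_at() {
    # $1: mount point to look for; $2: mountinfo file (defaults to our own)
    awk -v target="$1" '
        $5 == target {
            # optional fields end at the "-" separator; the fstype follows it
            for (i = 7; i <= NF; i++)
                if ($i == "-") { if ($(i + 1) == "proc") found = 1; break }
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/self/mountinfo}"
}
```

For example, `proc_mounted_at /run/host/proc && echo pinned` would print "pinned" once the two commands above have been run.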
Yep. In the nspawn container, when I do:

    mount -m -t proc proc /run/proc

then tst-ttyname passes.

Based on Christian's explanation above, we want to keep systemd-nspawn as it is. The problem with the test can be fixed with a simple workaround in the glibc spec.
(In reply to Zbigniew Jędrzejewski-Szmek from comment #11)
> Yep. In the nspawn container, when I do:
> mount -m -t proc proc /run/proc
> then tst-ttyname passes.
>
> Based on Christian's explanation above, we want to keep systemd-nspawn
> as it is. The problem with the test can be fixed with a simple workaround
> in the glibc spec.

This can't be done through the glibc spec because the executing user (mockbuilder) won't have the privileges to move mounts.

Let me see if there's another way to do this that doesn't require root.
This was "fixed" in the last rawhide sync by bailing out. There needs to be a better long term solution for this though.
I don't expect systemd code to change: in general, the change that was done makes things cleaner and more efficient. The previous state was just an accident of implementation. Ideally, the glibc testsuite would be adjusted to recognize this situation and skip the test.
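The "recognize this situation and skip" idea can be sketched as a probe that runs before the test. This is an illustration only, not the glibc change: the real test is written in C, and the names can_mount_proc, run_or_skip and the PROBE override are hypothetical. The value 77 is the exit status the glibc test harness reports as UNSUPPORTED.

```shell
#!/bin/sh
EXIT_UNSUPPORTED=77   # exit status the glibc test harness treats as UNSUPPORTED

# Return 0 if a fresh procfs can be mounted in a new unprivileged namespace.
can_mount_proc() {
    # PROBE is a hypothetical override hook so the probe can be swapped out.
    ${PROBE:-unshare --user --map-root-user --mount --pid --fork mount -t proc proc /proc} >/dev/null 2>&1
}

# Run the given command only when the probe succeeds; otherwise report skip.
run_or_skip() {
    # "$@": the test command to run when procfs can be mounted
    if can_mount_proc; then
        "$@"
    else
        echo "warning: cannot mount procfs here; skipping" >&2
        return "$EXIT_UNSUPPORTED"
    fi
}
```

A wrapper could then call `run_or_skip ./tst-ttyname` and pass the return status through, so that the restricted container yields a skip instead of a failure.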
(In reply to Zbigniew Jędrzejewski-Szmek from comment #14)
> I don't expect systemd code to change: in general, the change that was done
> makes things cleaner and more efficient. The previous state was just an
> accident of implementation. Ideally, the glibc testsuite would be adjusted
> to recognize this situation and skip the test.

It removes useful functionality: unprivileged file system sandboxing. Based on what has been discussed so far here, there is just no way to bring this back when running under systemd-nspawn *and* have a valid /proc. This is a bit sad because I generally tell people they should use chroot instead of trying to implement constrained pathname lookup in userspace.