2210335 – pivot_root service mounts in systemd-253 cause regression in glibc testsuite (tst-ttyname)

Bug 2210335 - pivot_root service mounts in systemd-253 cause regression in glibc testsuite (tst-ttyname)

Summary: pivot_root service mounts in systemd-253 cause regression in glibc testsuite ...

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	38
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Siddhesh Poyarekar
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2023-05-26 16:45 UTC by Carlos O'Donell
Modified:	2023-07-10 16:07 UTC (History)
CC List:	21 users (show)
Fixed In Version:	glibc-2.37.9000-15.fc39
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-07-05 14:46:30 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
tst-ttyname (4.52 MB, application/octet-stream) 2023-06-01 11:57 UTC, Carlos O'Donell	no flags	Details
sudo systemd-nspawn -D ~/f37 strace -f -o /log -y /tst-ttyname (55.82 KB, text/plain) 2023-06-02 16:28 UTC, Zbigniew Jędrzejewski-Szmek	no flags	Details
View All

Description Carlos O'Donell 2023-05-26 16:45:36 UTC

The Koji builders were recently upgraded to F38 and this started showing a regression in the glibc testsuite. The test in question is misc/tst-ttyname.

You can see the failure with a straight forward mock build of glibc in F38.

It looks like this:
info:  entering chroot 1
info:    testcase: basic smoketest
info:      ttyname: PASS {name="/dev/pts/0", errno=0}
info:      ttyname_r: PASS {name="/dev/pts/0", ret=0, errno=0}
info:    testcase: no conflict, no match
info:      ttyname: PASS {name=NULL, errno=19}
info:      ttyname_r: PASS {name=NULL, ret=19, errno=19}
info:    testcase: no conflict, console
info:      ttyname: PASS {name="/dev/console", errno=0}
info:      ttyname_r: PASS {name="/dev/console", ret=0, errno=0}
info:    testcase: conflict, no match
info:      ttyname: PASS {name=NULL, errno=19}
info:      ttyname_r: PASS {name=NULL, ret=19, errno=19}
info:    testcase: conflict, console
info:      ttyname: PASS {name="/dev/console", errno=0}
info:      ttyname_r: PASS {name="/dev/console", ret=0, errno=0}
info:    testcase: with readlink target
info:      ttyname: PASS {name="/dev/pts/0", errno=0}
info:      ttyname_r: PASS {name="/dev/pts/0", ret=0, errno=0}
info:    testcase: with readlink trap; fallback
info:      ttyname: PASS {name="/dev/console", errno=0}
info:      ttyname_r: PASS {name="/dev/console", ret=0, errno=0}
info:    testcase: with readlink trap; no fallback
info:      ttyname: PASS {name=NULL, errno=19}
info:      ttyname_r: PASS {name=NULL, ret=19, errno=19}
info:    testcase: with search-path trap
info:      ttyname: PASS {name="/dev/console2", errno=0}
info:      ttyname_r: PASS {name="/dev/console2", ret=0, errno=0}
info:  entering chroot 2
info:    testcase: basic smoketest
info:      ttyname: PASS {name="/dev/pts/0", errno=0}
info:      ttyname_r: PASS {name="/dev/pts/0", ret=0, errno=0}
error: ../sysdeps/unix/sysv/linux/tst-ttyname.c:414: mount ("proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, NULL) == 0: Operation not permitted

The test is slightly complicated in that the glibc testsuite is exercising some of the complex aspects of nested namespaces, particularly the mount namespace. We expect to be able to mount /proc and that has changed with the upgrade.

With the new systemd in F38 we are unable to run the entire tst-ttyname test. I don't think this is a kernel change since both my F37 and F38 testers are basically running almost the same kernel, but the systemd version is different which will show up as systemd-nspawn differences.

In summary:
- F37 systemd-251.14-2, kernel-6.2.15-200, tst-ttyname PASS.
- F38 systemd-253.4-1, kernel-6.2.15-300, tst-ttyname FAIL.

With the recent koji builder upgrade we've regressed the misc/tst-ttyname testing.


Reproducible: Always

Steps to Reproduce:
1. mock -r fedora-rawhide-x86_64 --init
2. fedpkg clone glibc; cd glibc; fedpkg srpm
3. mock -r fedora-rawhide-x86_64 --no-clean --rebuild <path to glibc srpm>
Actual Results:  
Everything builds but during rpm %check we get a failure in tst-ttyname test.

Expected Results:  
The tst-ttyname test passes.

Comment 1 Zbigniew Jędrzejewski-Szmek 2023-05-26 17:27:56 UTC

I'd help a lot if you could narrow this down to a much smaller reproducer.
The glibc build and test system is a huge beast, and it's hard for me to know what
exactly the test is doing and why its failing…

Looking at the changes in systemd between v251 and current git, I see 
some changes to the seccomp syscall filters, but generally just additions of
new syscalls as they were added, so this should not be relevant here.
There are some changes to how nspawn sets up mount propagation (propagation
of host mounts to the container was disabled). This doesn't seem directly relevant
either.

Without knowing what the test does and what exactly fails, it's hard to know what
to look for.

Comment 2 Zbigniew Jędrzejewski-Szmek 2023-05-26 17:48:35 UTC

The build works for me here (amd64, fedora 38, selinux=disabled).

Comment 3 Carlos O'Donell 2023-05-29 11:04:35 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #1)
> Without knowing what the test does and what exactly fails, it's hard to know
> what to look for.

Thank you for having a quick look!

I'll keep trying to reduce the test case to something we can talk about more concretely.

I filed the bug just to start the conversation and have something to reference when we waive the Fedora CI failures.

Comment 4 Carlos O'Donell 2023-06-01 11:57:15 UTC

Created attachment 1968304 [details]
tst-ttyname

Please find attached a *static* version (static pie) of the test 'tst-ttyname', which should allow much easier bisect for the point at which systemd-nspawn changed to disallow mounting /proc in the test case.

The test passes cleanly with 251 and fails with 253.

Comment 5 Carlos O'Donell 2023-06-01 15:13:13 UTC

If you trace this inside the container we get:

Initial test phase mounts proc and umounts it just fine, but there we're using a bind mount:
1685620773.584779 mount("/proc", "proc", NULL, MS_BIND|MS_REC, NULL) = 0
1685620773.596947 umount2("/proc", MNT_DETACH) = 0

Second phase mounts proc directly and fails:
1685620773.601470 mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)

The second mount works in the earlier setup with 251:
1685625735.238957 mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = 0
1685625735.250642 umount2("/proc", MNT_DETACH) = 0

It seems an artificial restriction to limit mounting /proc inside the systemd-nspawn container?

Comment 6 Zbigniew Jędrzejewski-Szmek 2023-06-02 06:40:03 UTC

Thanks. The good news is that the test also fails here, exactly as originally reported. I tried using systemd-nspawn-251 on a F38 host, and the issue still reproduces. So it's not something in systemd-nspawn itself. With systemd-252 installed, the test passes. I'll try to a bisect.

Comment 7 Carlos O'Donell 2023-06-02 11:17:40 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #6)
> Thanks. The good news is that the test also fails here, exactly as
> originally reported. I tried using systemd-nspawn-251 on a F38 host, and the
> issue still reproduces. So it's not something in systemd-nspawn itself. With
> systemd-252 installed, the test passes. I'll try to a bisect.

It's interesting that 251 on F38 reproduces the issue, that's not a configuration I could test easily. Thank you!

Could this point to a libseccomp or related Seccomp issue?

Comment 8 Zbigniew Jędrzejewski-Szmek 2023-06-02 15:43:26 UTC

Bisect says the culprit is:
https://github.com/systemd/systemd/commit/57c10a5650f6bb7180f3bec31a3f24239a81be39

Comment 9 Zbigniew Jędrzejewski-Szmek 2023-06-02 16:28:53 UTC

Created attachment 1968625 [details]
sudo systemd-nspawn -D ~/f37 strace -f -o /log -y /tst-ttyname

Comment 10 Christian Brauner 2023-06-03 17:40:45 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #8)
> Bisect says the culprit is:
> https://github.com/systemd/systemd/commit/
> 57c10a5650f6bb7180f3bec31a3f24239a81be39

Yes, this is the change that causes this.
I can provide some background and a recommendation for a fix.
I summarized the background for this change in
commit 2e776ed6c864 ("shared: use move_pivot_root() for services"):

    Currently, services use mount_move_root() in order to setup the root
    directory of services using a mount namespace. This relies on MS_MOVE
    and chroot(). However, this has serious drawbacks even for relatively
    simple mount propagation scenarios.

    What systemd currently does is roughly equivalent to the following shell
    code:

      unshare --mount --propagation=shared
      cd /
      mount --make-rslave /
      mkdir /new-root
      mount --rbind / /new-root
      cd /new-root
      mount --move /new-root /
      chroot .

    This looks simple enough but has the consequence that two separate mount
    trees exist for the lifetime of the service. The first one was created
    when the mount namespace was created, and the second one when a new
    mount for the rootfs was created. The first mount tree sticks around as
    a shadow mount tree. Both mount trees are dependent mounts with the host
    rootfs as their dominating mount.

    Now, when mount propagation is triggered by the host by e.g.,

       mount --bind /opt /mnt

    it means that two propagation events are generated. I'm skipping over
    the exact kernel details as they aren't that important. The gist is that
    for every propagation event that is generated a second one is generated
    for the shadow mount tree. In other words, the kernel creates two copies
    for each mount that is propagated instead of one.

    This isn't necessary. We can simply change the sequence above to:

      unshare --mount --propagation=shared
      cd /
      mount --make-rslave /
      mkdir /new-root
      # stash fd to old rootfs
      # stash fd to new rootfs
      mount --rbind / /new-root
      mkdir /new-root
      cd /new-root
      pivot_root . .
      # new root is tucked under old root
      # chdir into old rootfs via stashed fd
      umount -l /old-root

    The pivot_root allows us to get rid of the old mount tree that was
    created when the mount namespace was created. So after this sequence
    only one mount tree is alive. Plus, it's safer and nicer.

However, this causes the following semantical requirement to be hit which I explained in
commit b71a0192c040 ("nspawn: mount temporary visible procfs and sysfs instance")

    In order to mount procfs and sysfs in an unprivileged container the
    kernel requires that a fully visible instance is already present in the
    target mount namespace. [...]

    So far nspawn didn't run into this issue because it used MS_MOVE which
    meant that the shadow mount tree pinned a procfs and sysfs instance
    which the kernel would find. The shadow mount tree is gone with proper
    pivot_root() semantics.

IOW, if you start a systemd-nspawn container:

    systemd-nspawn -U -D /var/lib/machines/my-container

you will notice via findmnt that various procfs files and directories
are overmounted to prevent the user from getting access to various files
or directories such as /proc/kmsg. In other words, procfs isn't fully visible.

So let's say you do:

    unshare --mount --pid --fork --user --map-root
    mount -t proc proc /mnt

You will fail to mount procfs because the kernel sees that various files
and directories have mounts in procfs on top of them and that the mounts
have become locked when you created the new mount namespace. Here, a mount
being locked means that you cannot unmount it even if you technically own
the mount since you've been provided a copy at the time the mount namespace
was provided. That locking happens whenever you create an unprivileged mount
namespace - read when you create a mount namespace in a user namespace.

To fix the test you should simply:

    mkdir /run/host/proc
    mount -t proc proc /run/host/proc

afterwards you will be able to create unprivileged mount namespaces and they
will be able to mount procfs since they will have a received a fully visible
copy of procfs at the time they created an unprivileged mount namespace.

Comment 11 Zbigniew Jędrzejewski-Szmek 2023-06-05 13:28:51 UTC

Yep. In the nspawn container, when I do:
  mount -m -t proc proc /run/proc
then tst-ttyname passes.

Based on Christian's explanation above, we want to keep systemd-nspawn 
as it is. The problem with the test can be fixed with a simple workaround
in the glibc spec.

Comment 12 Siddhesh Poyarekar 2023-06-22 02:42:00 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #11)
> Yep. In the nspawn container, when I do:
>   mount -m -t proc proc /run/proc
> then tst-ttyname passes.
> 
> Based on Christian's explanation above, we want to keep systemd-nspawn 
> as it is. The problem with the test can be fixed with a simple workaround
> in the glibc spec.

This can't be done through the glibc spec because the executing user (mockbuilder) won't have the privileges to move mounts.  Let me see if there's another way to do this that doesn't require root.

Comment 13 Siddhesh Poyarekar 2023-07-05 14:46:30 UTC

This was "fixed" in the last rawhide sync by bailing out.  There needs to be a better long term solution for this though.

Comment 14 Zbigniew Jędrzejewski-Szmek 2023-07-10 13:10:57 UTC

I don't expect systemd code to change: in general, the change that was done makes things cleaner and more efficient. The previous state was just an accident of implementation. Ideally, the glibc testsuite would be adjusted to recognize this situation and skip the test.

Comment 15 Florian Weimer 2023-07-10 16:07:44 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #14)
> I don't expect systemd code to change: in general, the change that was done
> makes things cleaner and more efficient. The previous state was just an
> accident of implementation. Ideally, the glibc testsuite would be adjusted
> to recognize this situation and skip the test.

It removes useful functionality: unprivileged file system sandboxing. Based on what has been discussed so far here, there is just no way to bring this back when running under systemd-nspawn *and* have a valid /proc. This is a bit sad because I generally tell people they should use chroot instead of trying to implement constrained pathname lookup in userspace.

Note You need to log in before you can comment on or make changes to this bug.

arjun.is
christianvanbrauner
codonell
dj
fedoraproject
filbranden
fweimer
jlaw
lnykryn
mcermak
mcoufal
mfabian
msekleta
pfrankli
ryncsn
sipoyare
skolosov
systemd-maint
yuwatana
zbyszek
zjedrzej