Bug 1472439 - docker + overlay2 + systemd: /etc/machine-id is not on a temporary file system
Status: ON_QA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.4
Hardware: Unspecified Unspecified
Priority: unspecified  Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Jan Synacek
QA Contact: qe-baseos-daemons
Depends On:
Blocks: 1466365
Reported: 2017-07-18 13:54 EDT by Ed Santiago
Modified: 2017-11-20 10:27 EST

See Also:
Fixed In Version: systemd-219-46.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Ed Santiago 2017-07-18 13:54:25 EDT
docker configured to use overlay2. Running systemd:

   # docker run --rm -ti rhel7 /usr/sbin/init
   Digest: sha256:582cb940a6e730dbdffee7cc5e1983522fdeeb3c40bea7373b255a209124cc02
    systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
    Detected virtualization other.
    Detected architecture x86-64.

    Welcome to Red Hat Enterprise Linux Server 7.3 (Maipo)!

    Set hostname to <481990dbf7f9>.
    Cannot add dependency job for unit sys-fs-fuse-connections.mount, ignoring: Unit is masked.
    Cannot add dependency job for unit getty.target, ignoring: Unit is masked.
    Cannot add dependency job for unit systemd-logind.service, ignoring: Unit is masked.
    [  OK  ] Reached target Swap.
    [  OK  ] Reached target Encrypted Volumes.
    [  OK  ] Reached target Remote File Systems.
    [  OK  ] Created slice Root Slice.
    [  OK  ] Created slice System Slice.
    [  OK  ] Listening on Journal Socket.
    [  OK  ] Listening on Delayed Shutdown Socket.
    [  OK  ] Reached target Slices.
             Starting Journal Service...
             Starting Load/Save Random Seed...
    [  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
    [  OK  ] Reached target Local File Systems (Pre).
    [  OK  ] Reached target Local File Systems.
             Starting Rebuild Journal Catalog...
             Starting Commit a transient machine-id on disk...
             Starting Rebuild Hardware Database...
    [  OK  ] Reached target Paths.
    systemd-machine-id-commit.service: main process exited, code=exited, status=1/FAILURE
    [FAILED] Failed to start Commit a transient machine-id on disk.
    See 'systemctl status systemd-machine-id-commit.service' for details.
    Unit systemd-machine-id-commit.service entered failed state.
    systemd-machine-id-commit.service failed.
    [  OK  ] Started Load/Save Random Seed.

In a separate shell:

    # docker exec 481 journalctl -b
    ....
    Jul 18 16:23:52 481990dbf7f9 systemd-machine-id-commit[18]: /etc/machine-id is not on a temporary file system.
    # docker exec 481 ls -lZ /etc/machine-id
    -r--r--r--. root root system_u:object_r:svirt_sandbox_file_t:s0:c606,c768 /etc/machine-id
    # docker exec 481 cat /etc/machine-id                                                             
    481990dbf7f98124281c8ef1c0f2a567


docker-1.12.6-45.git1680dd8.el7 on RHEL 7.4, kernel 3.10.0-693.el7
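
Since the failure depends on the storage driver, it helps to confirm which one the daemon is using before trying to reproduce. A minimal sketch, assuming `docker info` prints a "Storage Driver:" line (the helper name `storage_driver` is made up for illustration):

```shell
# storage_driver: read `docker info` output on stdin and print only the
# storage driver name. Leading spaces are tolerated because the indentation
# of this line may differ between docker versions.
storage_driver() {
    sed -n 's/^ *Storage Driver: *//p'
}

# Typical use on the host (not run here, needs a docker daemon):
#   docker info 2>/dev/null | storage_driver
```

On the affected host this would be expected to report overlay2; the devicemapper setup from the later comments does not show the error.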
Comment 2 Daniel Walsh 2017-07-18 14:02:11 EDT
Systemd guys, why is systemd complaining?
Comment 3 Lukáš Nykrýn 2017-07-24 04:50:58 EDT
I would guess that your /run is not tmpfs.
Comment 4 Daniel Walsh 2017-07-24 08:11:01 EDT
If /run was not a tmpfs this would have blown up a lot earlier, because systemd would have attempted to mount /run.  

But Ed, if you can still reproduce this error, could you exec into the container and check if /run is on a tmpfs?
Comment 5 Tom Sweeney 2017-07-24 09:04:25 EDT
Hopping in for Ed.  This is the issue that Ed was able to reproduce on his test machine and I was not.  I noticed Ed's note about overlay2 being configured on his system.  It was not configured on mine, and the issue doesn't occur there.  Once I configured overlay2 on my machine, I ran into this error too.

FWIW, it looks like /run is on a tmpfs.  Here's a df from the container with the issue.

[root@909302277eef /]# df
Filesystem                   1K-blocks    Used Available Use% Mounted on
overlay                       23049220 3607868  19441352  16% /
tmpfs                           941748       0    941748   0% /dev
tmpfs                           941748       0    941748   0% /sys/fs/cgroup
/dev/mapper/rhel_rhelbz-root  23049220 3607868  19441352  16% /etc/hosts
shm                              65536       0     65536   0% /dev/shm
devtmpfs                        930880       0    930880   0% /dev/tty
tmpfs                            65536      12     65524   1% /run
tmpfs                            65536       0     65536   0% /run/lock
tmpfs                            65536       0     65536   0% /var/log/journal
tmpfs                           941748       0    941748   0% /tmp
Comment 6 Ed Santiago 2017-07-24 09:09:05 EDT
What Tom said. Plus the 'mount' output in case it helps:

    # docker exec 8cf mount |grep run
    /dev/vda1 on /run/secrets type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
    tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c139,c956",size=65536k,mode=755)
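
The same check can be made per path instead of reading df or mount output. A minimal sketch, assuming GNU coreutils `stat` is available in the container; the helper names `fs_type` and `is_tmpfs` are made up for illustration:

```shell
# fs_type PATH: print the type name of the file system that contains PATH.
# `stat -f` queries the file system rather than the file, and %T is the
# human-readable type name.
fs_type() {
    stat -f -c %T -- "$1"
}

# is_tmpfs PATH: succeed only if PATH lives on a tmpfs.
is_tmpfs() {
    [ "$(fs_type "$1")" = "tmpfs" ]
}
```

Inside the container, `is_tmpfs /run && echo yes` answers the question from comment 4 directly, and `fs_type /etc/machine-id` shows what the file itself sits on.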
Comment 7 Tom Sweeney 2017-07-24 09:14:30 EDT
In case it adds value, here's what I see in a container running devicemapper (default config).  This type of container does not exhibit this error.

[root@215011feb6d9 /]# df
Filesystem                                                                                        1K-blocks    Used Available Use% Mounted on
/dev/mapper/docker-253:0-1671585-63de00b1ab775a125c3f89a0196efc8446045b38f1a687419ee28ed939d917e8  10467328  252856  10214472   3% /
tmpfs                                                                                                941748       0    941748   0% /dev
tmpfs                                                                                                941748       0    941748   0% /sys/fs/cgroup
/dev/mapper/rhel_rhelbz-root                                                                       23049220 3397616  19651604  15% /etc/hosts
shm                                                                                                   65536       0     65536   0% /dev/shm
devtmpfs                                                                                             930880       0    930880   0% /dev/tty
tmpfs                                                                                                 65536      12     65524   1% /run
tmpfs                                                                                                 65536       0     65536   0% /run/lock
tmpfs                                                                                                 65536       0     65536   0% /var/log/journal
tmpfs                                                                                                941748       0    941748   0% /tmp

# docker exec 215011feb6d9 mount | grep run
/dev/mapper/rhel_rhelbz-root on /run/secrets type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c685,c876",size=65536k,mode=755)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c685,c876",size=65536k,mode=755)
Comment 8 Daniel Walsh 2017-07-24 09:20:52 EDT
Reading systemd code.

int machine_id_commit(const char *root) {
        _cleanup_close_ int fd = -1, initial_mntns_fd = -1;
        const char *etc_machine_id;
        sd_id128_t id;
        int r;

        /* Replaces a tmpfs bind mount of /etc/machine-id by a proper file, atomically. For this, the umount is removed
         * in a mount namespace, a new file is created at the right place. Afterwards the mount is also removed in the
         * original mount namespace, thus revealing the file that was just created. */

        etc_machine_id = prefix_roota(root, "/etc/machine-id");

        r = path_is_mount_point(etc_machine_id, NULL, 0);
        if (r < 0)
                return log_error_errno(r, "Failed to determine whether %s is a mount point: %m", etc_machine_id);
        if (r == 0) {
                log_debug("%s is not a mount point. Nothing to do.", etc_machine_id);
DAN> I believe we should have hit this, since /etc/machine-id is not a mount point.  
                return 0;
        }

        /* Read existing machine-id */
        fd = open(etc_machine_id, O_RDONLY|O_CLOEXEC|O_NOCTTY);
        if (fd < 0)
                return log_error_errno(errno, "Cannot open %s: %m", etc_machine_id);

        r = fd_is_temporary_fs(fd);
        if (r < 0)
                return log_error_errno(r, "Failed to determine whether %s is on a temporary file system: %m", etc_machine_id);
        if (r == 0) {
                log_error("%s is not on a temporary file system.", etc_machine_id);
DAN> This is where we hit the error.
                return -EROFS;
        }
Comment 9 Daniel Walsh 2017-07-24 09:29:07 EDT
I think int fd_is_mount_point(int fd, const char *filename, int flags) is broken when called on top of an overlay file system inside a container.


        r = name_to_handle_at(fd, filename, &h.handle, &mount_id, flags);

is going to fail, since this syscall is blocked by seccomp.
Comment 10 Lukáš Nykrýn 2017-07-24 10:25:00 EDT
Just a small note: that code is completely different in RHEL; we use fstatfs there:
https://github.com/lnykryn/systemd-rhel/blob/staging/src/shared/util.c#L3282
Comment 11 Lukáš Nykrýn 2017-07-24 12:11:29 EDT
So the problem is here:
https://github.com/lnykryn/systemd-rhel/blob/staging/src/shared/path-util.c#L532

In my case we skipped the name_to_handle_at calls, because we got EOPNOTSUPP and fell back to the stupid stat calls.

Even though /etc/machine-id is not a mount point, systemd-machine-id-commit thinks that it is, because the device reported for /etc and for /etc/machine-id is different:

[root@f789684b175e /]# stat --printf "%D\n" /etc/machine-id 
fd01
[root@f789684b175e /]# stat --printf "%D\n" /etc/           
2a

I guess this is caused by overlay2.
Right now I have no idea how to fix this, but the workaround is simple: just mask systemd-machine-id-commit.service, because it is not needed in a container (systemd can write directly to /etc/machine-id and does not have to do the magic with /run/machine-id and a bind mount).
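
The stat fallback described in this comment can be approximated from the shell to see the misdetection on any path. This is a simplified sketch, not the systemd code: it only compares device numbers of a path and its parent (the real fallback has extra handling, for example for the root directory), and the function name `looks_like_mount_point` is made up for illustration:

```shell
# looks_like_mount_point PATH
#   returns 0 if PATH and its parent directory report different st_dev values,
#           which the stat heuristic interprets as "PATH is a mount point"
#   returns 1 if both report the same device
#   returns 2 if either stat call fails
looks_like_mount_point() {
    path=$1
    parent=$(dirname -- "$path")
    dev_self=$(stat -c %d -- "$path") || return 2
    dev_parent=$(stat -c %d -- "$parent") || return 2
    [ "$dev_self" != "$dev_parent" ]
}
```

In the affected container, `looks_like_mount_point /etc/machine-id` would be expected to succeed even though nothing is mounted there, matching the differing `%D` values shown above.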
Comment 12 Ben Breard 2017-07-26 21:27:24 EDT
We do need to get this fixed. Does it make more sense to disable this unit in the rhel7 & rhel7-init images or make the change in systemd?

Masking it is certainly easier. Fero?
Comment 13 Lukáš Nykrýn 2017-07-27 04:29:38 EDT
Currently I am a bit afraid of where this issue could bite us next. This is not the only place where we do these checks. The upstream code has one additional way to detect bind mounts, so we are currently investigating whether backporting it would help.
Comment 15 Frantisek Kluknavsky 2017-07-27 05:25:16 EDT
systemd-machine-id-commit.service is now masked. This can be reverted at any time once the issue is fixed in systemd.
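
For anyone applying the same workaround to their own image: masking a unit amounts to a symlink to /dev/null under /etc/systemd/system, which is what `systemctl mask` creates. A minimal sketch that does this for an arbitrary root directory, so it can be used while preparing an image tree; the helper name `mask_unit` is made up for illustration and is not how the rhel7 images were actually changed:

```shell
# mask_unit ROOT UNIT: mask UNIT in the file system tree rooted at ROOT by
# pointing /etc/systemd/system/UNIT at /dev/null.
mask_unit() {
    root=$1
    unit=$2
    mkdir -p -- "$root/etc/systemd/system" &&
        ln -sfn -- /dev/null "$root/etc/systemd/system/$unit"
}

# Example (ROOT may be empty when run inside the image itself):
#   mask_unit "" systemd-machine-id-commit.service
```

Removing the symlink reverts the mask.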
Comment 16 Jan Synacek 2017-07-27 10:48:57 EDT
From the systemd point of view, if stat() doesn't work with overlayfs and name_to_handle_at() is blocked by seccomp, our remaining chance is to try /proc/self/fdinfo/, is that right?

We would have to "backport" https://github.com/systemd/systemd/commit/3f72b427b44f39a1aec6806dad6f6b57103ae9ed and some previous commits related to fd_is_mount_point() and its use.

Dan, will /proc/self/fdinfo work inside containers?
Comment 17 Jan Synacek 2017-07-27 10:52:41 EDT
It would be also nice to find out why stat() doesn't work.
Comment 18 Lukáš Nykrýn 2017-07-27 11:26:08 EDT
(In reply to Jan Synacek from comment #17)
> It would be also nice to find out why stat() doesn't work.

That is easy, /etc and /etc/machine-id are in different "layers" of overlayfs and stat call returns different device for both of those.
Comment 19 Daniel Walsh 2017-07-27 15:25:56 EDT
The /etc would be in the top level directory while the machine-id is on the lower, I believe.
Comment 20 Jan Synacek 2017-07-28 11:24:53 EDT
I couldn't reproduce this on my machine with docker-1.12.6-32.git88a4867.el7.x86_64 (currently in prod?). Anyway, patched systemd incoming. I've prepared a patch which now additionally uses the fdinfo.
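
To illustrate the fdinfo idea outside of systemd: the kernel exposes a `mnt_id:` line for every open file descriptor under /proc/<pid>/fdinfo/, so a path and its parent can be compared by mount ID instead of by device number. A rough shell sketch, assuming the running kernel provides the `mnt_id` field; the function names are made up for illustration and this is not the patch itself:

```shell
# mount_id_of PATH: print the mount ID of PATH.
# PATH is opened on fd 9 for the awk process only; awk inherits the
# descriptor, so /proc/self/fdinfo/9 (resolved inside awk) describes it.
mount_id_of() {
    awk '$1 == "mnt_id:" { print $2 }' /proc/self/fdinfo/9 9<"$1"
}

# on_different_mount_than_parent PATH
#   returns 0 if PATH and its parent directory have different mount IDs
#   returns 1 if they share a mount ID
#   returns 2 if a mount ID could not be read
on_different_mount_than_parent() {
    id_self=$(mount_id_of "$1")
    id_parent=$(mount_id_of "$(dirname -- "$1")")
    [ -n "$id_self" ] && [ -n "$id_parent" ] || return 2
    [ "$id_self" != "$id_parent" ]
}
```

Unlike the device-number comparison, this does not depend on which overlay layer a file comes from, since both layers are presented through the same mount.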
Comment 24 Lukáš Nykrýn 2017-07-31 05:52:28 EDT
Jan's build seems to be working for me.

But I have a question.
I was trying to test it with the following Dockerfile.

FROM rhel7-init
ADD systemd-libs-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm /systemd-libs-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm
ADD systemd-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm /systemd-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm
RUN rpm -U /systemd-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm /systemd-libs-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm
CMD ["/sbin/init"]

And in this case systemd got stuck immediately after I started the container.

I had to run it like in the old days:
docker run -v /run -v /sys/fs/cgroup:/sys/fs/cgroup  -ti 749b2db1437c

Dan, maybe you can help me. What did I do wrong?
Comment 25 Daniel Walsh 2017-07-31 09:08:16 EDT
If you have oci-systemd-hook package installed, your Dockerfile should work and you should not have to do
docker run -v /run -v /sys/fs/cgroup:/sys/fs/cgroup  -ti 749b2db1437c
Comment 26 Lukáš Nykrýn 2017-07-31 09:14:40 EDT
(In reply to Daniel Walsh from comment #25)
> If you have oci-systemd-hook package installed, your Dockerfile should work
> and you should not have to do
> docker run -v /run -v /sys/fs/cgroup:/sys/fs/cgroup  -ti 749b2db1437c

I know that :-). With rhel7 image everything works just fine. But it does work for my image and I have no idea how to debug it.
Comment 27 Lukáš Nykrýn 2017-07-31 09:15:21 EDT
sorry for typo...  "But it does *not* work for my image"
Comment 28 Daniel Walsh 2017-07-31 09:25:39 EDT
Anything in the journal about oci-systemd-hook failing?
Comment 32 Ed Santiago 2017-07-31 10:05:44 EDT
(In reply to Lukáš Nykrýn from comment #27)
> sorry for typo...  "But it does *not* work for my image"

Results confirmed: 'docker run' on new image hangs unless I add '-v /sys/fs/cgroup:/sys/fs/cgroup'
Comment 33 Lukáš Nykrýn 2017-07-31 10:14:04 EDT
(In reply to Daniel Walsh from comment #28)
> Anything in the journal about oci-systemd-hook failing?
A couple of complaints about SELinux, like:
čec 31 10:10:30 qeos-50.lab.eng.rdu2.redhat.com oci-systemd-hook[11334]: systemdhook <error>: Failed to set context system_u:object_r:svirt_sandbox_file_t:s0:c594,c632 on /sys/fs/cgroup/systemd/system.slice/docker
čec 31 10:10:30 qeos-50.lab.eng.rdu2.redhat.com oci-systemd-hook[11334]: systemdhook <error>: Failed to set context system_u:object_r:svirt_sandbox_file_t:s0:c594,c632 on /var/lib/docker/overlay2/690d435b00ee50817

But switching selinux to permissive does not help and there seems to be nothing related in /var/log/audit/audit.log
Comment 34 Daniel Walsh 2017-07-31 10:16:05 EDT
Those SELinux errors are ignored since they return ENOTSUP.  Is this on Fedora or RHEL?  It is not supported on RHEL, but should be on the latest Fedora kernels.
Comment 35 Lukáš Nykrýn 2017-07-31 10:24:25 EDT
This is latest build of RHEL-7.5 + 7.4 extras. Just for curiosity what is exactly not supported?
Comment 36 Daniel Walsh 2017-07-31 15:49:09 EDT
The kernel in RHEL does not support labeling cgroups.  We need to get an updated kernel and updated policy to allow for support.
Comment 38 Jan Synacek 2017-08-09 07:44:54 EDT
https://github.com/lnykryn/systemd-rhel/pull/127
Comment 39 Daniel Walsh 2017-08-09 07:48:19 EDT
Jan, does this fix the issue where systemd thinks /etc/machine-id is not on a temporary file system?
Comment 40 Jan Synacek 2017-08-09 10:01:41 EDT
Yes.
Comment 41 Lukáš Nykrýn 2017-09-12 09:20:19 EDT
fix merged to upstream staging branch -> https://github.com/lnykryn/systemd-rhel/pull/127 -> post
Comment 45 Lukáš Nykrýn 2017-10-06 10:20:23 EDT
fix merged to staging branch -> https://github.com/lnykryn/systemd-rhel/pull/148 -> post
Comment 46 Lukáš Nykrýn 2017-10-13 07:40:12 EDT
fix merged to staging branch -> https://github.com/lnykryn/systemd-rhel/pull/160 -> post