Bug 1472439
Summary: | docker + overlay2 + systemd: /etc/machine-id is not on a temporary file system | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Ed Santiago <santiago> |
Component: | systemd | Assignee: | Jan Synacek <jsynacek> |
Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.4 | CC: | amurdaca, bbreard, dwalsh, fedoraproject, fkluknav, fsumsal, jsynacek, lsm5, pasik, systemd-maint-list, systemd-maint |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | systemd-219-46.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-04-10 11:21:21 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1466365 |
Description
Ed Santiago
2017-07-18 17:54:25 UTC
Systemd guys why is systemd complaining?

I would guess that your /run is not tmpfs.

If /run was not a tmpfs this would have blown up a lot earlier, because systemd would have attempted to mount /run. But Ed if you can still get this error, could you exec into the container and check if /run is on a tmpfs?

Hopping in for Ed. This is the issue that Ed was able to reproduce on his test machine and I was not. I noticed Ed's note on overlay2 being configured on his system. It was not on mine and this issue doesn't exist there. When I configured overlay2fs on my machine, then I too have run into this error. FWIW, it looks like run is on a tmpfs. Here's a df from the container with the issue.

    [root@909302277eef /]# df
    Filesystem                   1K-blocks    Used Available Use% Mounted on
    overlay                       23049220 3607868  19441352  16% /
    tmpfs                           941748       0    941748   0% /dev
    tmpfs                           941748       0    941748   0% /sys/fs/cgroup
    /dev/mapper/rhel_rhelbz-root  23049220 3607868  19441352  16% /etc/hosts
    shm                              65536       0     65536   0% /dev/shm
    devtmpfs                        930880       0    930880   0% /dev/tty
    tmpfs                            65536      12     65524   1% /run
    tmpfs                            65536       0     65536   0% /run/lock
    tmpfs                            65536       0     65536   0% /var/log/journal
    tmpfs                           941748       0    941748   0% /tmp

What Tom said. Plus the 'mount' output in case it helps:

    # docker exec 8cf mount |grep run
    /dev/vda1 on /run/secrets type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
    tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c139,c956",size=65536k,mode=755)

In case it adds value, here's what I see in a container running devicemapper (default config). This type of container does not exhibit this error.

    [root@215011feb6d9 /]# df
    Filesystem                   1K-blocks    Used Available Use% Mounted on
    /dev/mapper/docker-253:0-1671585-63de00b1ab775a125c3f89a0196efc8446045b38f1a687419ee28ed939d917e8 10467328 252856 10214472 3% /
    tmpfs                           941748       0    941748   0% /dev
    tmpfs                           941748       0    941748   0% /sys/fs/cgroup
    /dev/mapper/rhel_rhelbz-root  23049220 3397616  19651604  15% /etc/hosts
    shm                              65536       0     65536   0% /dev/shm
    devtmpfs                        930880       0    930880   0% /dev/tty
    tmpfs                            65536      12     65524   1% /run
    tmpfs                            65536       0     65536   0% /run/lock
    tmpfs                            65536       0     65536   0% /var/log/journal
    tmpfs                           941748       0    941748   0% /tmp

    # docker exec 215011feb6d9 mount | grep run
    /dev/mapper/rhel_rhelbz-root on /run/secrets type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
    tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c685,c876",size=65536k,mode=755)
    tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c685,c876",size=65536k,mode=755)

Reading systemd code.

    int machine_id_commit(const char *root) {
            _cleanup_close_ int fd = -1, initial_mntns_fd = -1;
            const char *etc_machine_id;
            sd_id128_t id;
            int r;

            /* Replaces a tmpfs bind mount of /etc/machine-id by a proper file, atomically. For this, the umount is removed
             * in a mount namespace, a new file is created at the right place. Afterwards the mount is also removed in the
             * original mount namespace, thus revealing the file that was just created. */

            etc_machine_id = prefix_roota(root, "/etc/machine-id");

            r = path_is_mount_point(etc_machine_id, NULL, 0);
            if (r < 0)
                    return log_error_errno(r, "Failed to determine whether %s is a mount point: %m", etc_machine_id);
            if (r == 0) {
                    log_debug("%s is not a mount point. Nothing to do.", etc_machine_id);
    DAN> I believe we should have hit this, since /etc/machine-id is not a mount point.
                    return 0;
            }

            /* Read existing machine-id */
            fd = open(etc_machine_id, O_RDONLY|O_CLOEXEC|O_NOCTTY);
            if (fd < 0)
                    return log_error_errno(errno, "Cannot open %s: %m", etc_machine_id);

            r = fd_is_temporary_fs(fd);
            if (r < 0)
                    return log_error_errno(r, "Failed to determine whether %s is on a temporary file system: %m", etc_machine_id);
            if (r == 0) {
                    log_error("%s is not on a temporary file system.", etc_machine_id);
    DAN> This is where we hit the error.
                    return -EROFS;
            }

I think

    int fd_is_mount_point(int fd, const char *filename, int flags)

is broken when called on top of an overlay file system, inside of a container.

    r = name_to_handle_at(fd, filename, &h.handle, &mount_id, flags);

is going to fail, since this syscall is blocked by seccomp.

Just a small note: that code is completely different in RHEL, we use fstatfs there:
https://github.com/lnykryn/systemd-rhel/blob/staging/src/shared/util.c#L3282

So the problem is here:
https://github.com/lnykryn/systemd-rhel/blob/staging/src/shared/path-util.c#L532

In my case we skipped the name_to_handle_at calls, because we got EOPNOTSUPP, and fell back to the stupid stat calls.

Even though /etc/machine-id is not a mountpoint, systemd-machine-id-commit thinks that it is, because the device for /etc and /etc/machine-id is different:

    [root@f789684b175e /]# stat --printf "%D\n" /etc/machine-id
    fd01
    [root@f789684b175e /]# stat --printf "%D\n" /etc/
    2a

I guess this is caused by the overlayfs2.

Right now I have no idea how to fix this, but the workaround is simple. Just mask systemd-machine-id-commit.service, because it is not needed in a container (systemd can write directly to /etc/machine-id and does not have to do the magic with /run/machine-id and a bind mount).

We do need to get this fixed. Does it make more sense to disable this unit in the rhel7 & rhel7-init images or make the change in systemd? Masking it is certainly easier. Fero?

Currently I am a bit afraid where this issue could bite us next. This is not the only place where we do these checks.
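
To make the failure mode above concrete, here is a minimal sketch in C of the two checks as the thread describes them: a stat()-only mount-point heuristic that compares the device of a path with the device of its parent directory, followed by a statfs()-based "is this on a temporary file system" test. This is not the systemd or systemd-rhel source. The function names looks_like_mount_point, is_on_temporary_fs and check_machine_id are hypothetical, and treating only tmpfs and ramfs as "temporary" is an assumption made for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <libgen.h>       /* dirname() */
#include <linux/magic.h>  /* TMPFS_MAGIC, RAMFS_MAGIC */
#include <string.h>
#include <sys/stat.h>
#include <sys/vfs.h>      /* statfs() */

/* Hypothetical stat()-only heuristic: report a mount point whenever the path
 * and its parent directory sit on different devices. On overlay2 a plain file
 * from a lower layer can show a different st_dev than its directory, which is
 * the false positive described in the thread.
 * Returns 1 or 0, or -errno on failure. */
static int looks_like_mount_point(const char *path) {
        char parent[4096];
        struct stat a, b;

        if (strlen(path) >= sizeof(parent))
                return -ENAMETOOLONG;
        strcpy(parent, path);   /* dirname() may modify its argument */

        if (lstat(path, &a) < 0)
                return -errno;
        if (stat(dirname(parent), &b) < 0)
                return -errno;

        return a.st_dev != b.st_dev;
}

/* Hypothetical temporary-fs test (assumption: only tmpfs and ramfs count).
 * Returns 1 or 0, or -errno on failure. */
static int is_on_temporary_fs(const char *path) {
        struct statfs s;

        if (statfs(path, &s) < 0)
                return -errno;

        return (unsigned long) s.f_type == (unsigned long) TMPFS_MAGIC ||
               (unsigned long) s.f_type == (unsigned long) RAMFS_MAGIC;
}

/* Same decision order as the quoted excerpt: not a mount point means nothing
 * to do (0); a mount point that is not on a temporary fs yields -EROFS. */
static int check_machine_id(const char *path) {
        int r;

        r = looks_like_mount_point(path);
        if (r <= 0)
                return r;

        r = is_on_temporary_fs(path);
        if (r < 0)
                return r;

        return r ? 0 : -EROFS;
}
```

Calling check_machine_id("/etc/machine-id") inside an affected overlay2 container would be expected to take the -EROFS branch: the device numbers differ, so the heuristic reports a mount point, yet the file is not on tmpfs. The heuristic is deliberately naive; for instance it reports "/" as not being a mount point, because "/" is its own parent.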

The upstream code has one additional way to detect bind mounts, so we are currently investigating whether backporting it would help.

systemd-machine-id-commit.service masked. Can revert it any time if it gets fixed in systemd.

From the systemd point of view, if stat() doesn't work with overlayfs and name_to_handle_at() is blocked by seccomp, our remaining chance is to try /proc/self/fdinfo/, is that right? We would have to "backport" https://github.com/systemd/systemd/commit/3f72b427b44f39a1aec6806dad6f6b57103ae9ed and some previous commits related to fd_is_mount_point() and its use. Dan, will /proc/self/fdinfo work inside containers?

It would also be nice to find out why stat() doesn't work.

(In reply to Jan Synacek from comment #17)
> It would be also nice to find out why stat() doesn't work.

That is easy: /etc and /etc/machine-id are in different "layers" of overlayfs, and the stat call returns a different device for each of them. /etc would be in the top-level directory while machine-id is in the lower one, I believe.

I couldn't reproduce this on my machine with docker-1.12.6-32.git88a4867.el7.x86_64 (currently in prod?). Anyway, patched systemd incoming.

I've prepared a patch which now additionally uses the fdinfo.

Jan's build seems to be working for me. But I have a question. I was trying to test it with the following Dockerfile.

    FROM rhel7-init
    ADD systemd-libs-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm /systemd-libs-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm
    ADD systemd-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm /systemd-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm
    RUN rpm -U /systemd-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm /systemd-libs-219-42.20170728.el7_4.bz1472439_dkr_overlayfs.0.x86_64.rpm
    CMD ["/sbin/init"]

And in this case systemd became stuck immediately after I started the container.
I had to run it like in the old days:

    docker run -v /run -v /sys/fs/cgroup:/sys/fs/cgroup -ti 749b2db1437c

Dan, maybe you can help me. What did I do wrong?

If you have the oci-systemd-hook package installed, your Dockerfile should work and you should not have to do

    docker run -v /run -v /sys/fs/cgroup:/sys/fs/cgroup -ti 749b2db1437c

(In reply to Daniel Walsh from comment #25)
> If you have oci-systemd-hook package installed, your Dockerfile should work
> and you should not have to do
> docker run -v /run -v /sys/fs/cgroup:/sys/fs/cgroup -ti 749b2db1437c

I know that :-). With rhel7 image everything works just fine. But it does work for my image and I have no idea how to debug it.

sorry for typo... "But it does *not* work for my image"

Anything in the journal about oci-systemd-hook failing?

(In reply to Lukáš Nykrýn from comment #27)
> sorry for typo... "But it does *not* work for my image"

Results confirmed: 'docker run' on new image hangs unless I add '-v /sys/fs/cgroup:/sys/fs/cgroup'

(In reply to Daniel Walsh from comment #28)
> Anything in the journal about oci-systemd-hook failing?

A couple of complaints about SELinux, like:

    čec 31 10:10:30 qeos-50.lab.eng.rdu2.redhat.com oci-systemd-hook[11334]: systemdhook <error>: Failed to set context system_u:object_r:svirt_sandbox_file_t:s0:c594,c632 on /sys/fs/cgroup/systemd/system.slice/docker
    čec 31 10:10:30 qeos-50.lab.eng.rdu2.redhat.com oci-systemd-hook[11334]: systemdhook <error>: Failed to set context system_u:object_r:svirt_sandbox_file_t:s0:c594,c632 on /var/lib/docker/overlay2/690d435b00ee50817

But switching SELinux to permissive does not help, and there seems to be nothing related in /var/log/audit/audit.log

Those SELinux errors are ignored since they return ENOTSUP. Is this on Fedora or RHEL? It is not supported on RHEL, but should be on the latest Fedora kernels.

This is the latest build of RHEL-7.5 + 7.4 extras. Just out of curiosity, what exactly is not supported?

The kernel in RHEL does not support labeling cgroups.
We need to get an updated kernel and updated policy to allow for support.

Jan does this fix the issue where systemd thinks /etc/machine-id is not on the file system?

Yes.

fix merged to upstream staging branch -> https://github.com/lnykryn/systemd-rhel/pull/127 -> post

fix merged to staging branch -> https://github.com/lnykryn/systemd-rhel/pull/148 -> post

fix merged to staging branch -> https://github.com/lnykryn/systemd-rhel/pull/160 -> post

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0711
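
For readers following the fix: the thread settles on additionally consulting /proc/self/fdinfo. The sketch below, in C, illustrates that idea under stated assumptions and is not the code from the merged pull requests. It opens a path and its parent directory with O_PATH, reads the mnt_id field from /proc/self/fdinfo/<fd> for each, and reports a mount point only when the two mount IDs differ. The function names fd_mount_id, path_mount_id and is_mount_point_by_mnt_id are hypothetical, and the sketch assumes a kernel that exposes mnt_id in fdinfo.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>   /* dirname() */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Reads the "mnt_id:" field from /proc/self/fdinfo/<fd>.
 * Returns 0 on success, -errno on failure, -ENOENT if the field is absent. */
static int fd_mount_id(int fd, int *ret) {
        char p[64], line[256];
        int r = -ENOENT;
        FILE *f;

        snprintf(p, sizeof(p), "/proc/self/fdinfo/%d", fd);
        f = fopen(p, "r");
        if (!f)
                return -errno;

        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "mnt_id: %d", ret) == 1) {
                        r = 0;
                        break;
                }

        fclose(f);
        return r;
}

/* Opens path without reading it (O_PATH) and returns its mount ID via ret. */
static int path_mount_id(const char *path, int *ret) {
        int fd, r;

        fd = open(path, O_PATH | O_CLOEXEC);
        if (fd < 0)
                return -errno;

        r = fd_mount_id(fd, ret);
        close(fd);
        return r;
}

/* Hypothetical check: a path is a mount point if its mount ID differs from
 * that of its parent directory. Returns 1 or 0, or -errno on failure. */
static int is_mount_point_by_mnt_id(const char *path) {
        char parent[4096];
        int a, b, r;

        if (strlen(path) >= sizeof(parent))
                return -ENAMETOOLONG;
        strcpy(parent, path);   /* dirname() may modify its argument */

        r = path_mount_id(path, &a);
        if (r < 0)
                return r;
        r = path_mount_id(dirname(parent), &b);
        if (r < 0)
                return r;

        return a != b;
}
```

Unlike a device-number comparison, this looks at which mount an open file belongs to, so files from different overlayfs layers under the same overlay mount should not be mistaken for separate mounts. Like any parent-comparison heuristic, it cannot tell that "/" is a mount point, since "/" is its own parent.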