Bug 1406684
| Summary: | docker daemon with --userns-remap=default fails to run systemd in container | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jan Pazdziora (Red Hat) <jpazdziora> |
| Component: | oci-systemd-hook | Assignee: | Tom Sweeney <tsweeney> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.3 | CC: | amurdaca, dwalsh, jpazdziora, lsm5, qcai, sauchter, systemd-maint |
| Target Milestone: | rc | Keywords: | Extras |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-10-19 15:19:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jan Pazdziora (Red Hat)
2016-12-21 09:03:14 UTC
When I remove /usr/libexec/oci/hooks.d/oci-systemd-hook and run the container as

    docker run -e container=docker -v /sys/fs/cgroup:/sys/fs/cgroup:ro --tmpfs /run --tmpfs /tmp --rm -ti rhel7 /usr/sbin/init

I get

    systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
    Detected virtualization docker.
    Detected architecture x86-64.

    Welcome to Red Hat Enterprise Linux Server 7.3 (Maipo)!

    Set hostname to <9416b9a4f846>.
    Initializing machine ID from random generator.
    Failed to read AF_UNIX datagram queue length, ignoring: No such file or directory
    Failed to create root cgroup hierarchy: Permission denied
    Failed to allocate manager object: Permission denied
    [!!!!!!] Failed to allocate manager object, freezing.

Not sure if this is the same case or not, and whether this bugzilla should be about oci-systemd-hook rather than docker, and perhaps a different bugzilla should be filed for this explicit --tmpfs and -v execution failure.

I highly doubt you're supposed to run systemd without the oci-systemd-hook, that's simply not supported. As for the component, this is an oci-systemd-hook issue, but can also probably be tied to runc.

(In reply to Antonio Murdaca from comment #2)
> I highly doubt you're supposed to run systemd without the oci-systemd-hook,
> that's simply not supported.

Well, even without the hook, without --userns-remap=default, running

    docker run -e container=docker -v /sys/fs/cgroup:/sys/fs/cgroup:ro --tmpfs /run --tmpfs /tmp --rm -ti rhel7 /usr/sbin/init

works. That's why I tried it as well in the --userns-remap=default situation, to simply gather more data about the behaviour.

(In reply to Jan Pazdziora from comment #3)
> (In reply to Antonio Murdaca from comment #2)
> > I highly doubt you're supposed to run systemd without the oci-systemd-hook,
> > that's simply not supported.
>
> Well, even without the hook, without --userns-remap=default, running
>
> docker run -e container=docker -v /sys/fs/cgroup:/sys/fs/cgroup:ro --tmpfs
> /run --tmpfs /tmp --rm -ti rhel7 /usr/sbin/init
>
> works. That's why I tried it as well in the --userns-remap=default
> situation, to simply gather more data about the behaviour.

Sure, it's interesting it's working without the hook though.

Systemd works without the hook. The hook does what the docker run CLI options above do, but automatically and in a better way. But you can run systemd without the hook.

The problem with user namespace and systemd, I believe, is that /sys/fs/cgroup/systemd needs to be written to by systemd, but the user namespace is preventing it, which means this file system is not user-namespace aware.

We might have to work with the systemd guys on a fix for this after the break. On Fedora we see other issues with changes to the kernel, which we are working to fix.

Great job Jan on testing these issues.

Thanks for analysing and triaging the results, and driving it further. If there is anything else you can think of that I should try before the break, please let me know.

I would guess you could easily prove my point by running a container with bash and then attempting to write to /sys/fs/cgroup/systemd, and see who owns the files in that directory while running with userns.

    # docker run --rm -ti rhel7 ls -la /sys/fs/cgroup/systemd
    total 0
    drwxr-xr-x.  2 65534 65534   0 Dec 21 14:30 .
    drwxr-xr-x. 13 root  root  340 Dec 21 14:30 ..
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:30 cgroup.clone_children
    --w--w--w-.  1 65534 65534   0 Dec 21 14:30 cgroup.event_control
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:30 cgroup.procs
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:30 notify_on_release
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:30 tasks

    # docker run -v /sys/fs/cgroup:/sys/fs/cgroup --rm -ti rhel7 ls -la /sys/fs/cgroup/systemd
    total 0
    drwxr-xr-x.  4 65534 65534   0 Dec 21 14:17 .
    drwxr-xr-x. 13 65534 65534 340 Dec 21 14:17 ..
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:17 cgroup.clone_children
    --w--w--w-.  1 65534 65534   0 Dec 21 14:17 cgroup.event_control
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:17 cgroup.procs
    -r--r--r--.  1 65534 65534   0 Dec 21 14:17 cgroup.sane_behavior
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:17 notify_on_release
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:17 release_agent
    drwxr-xr-x. 68 65534 65534   0 Dec 21 14:30 system.slice
    -rw-r--r--.  1 65534 65534   0 Dec 21 14:17 tasks
    drwxr-xr-x.  3 65534 65534   0 Dec 21  2016 user.slice

Running bash in the container and attempting to write to that directory gives me

    # echo 1 > /sys/fs/cgroup/systemd/cgroup.clone_children
    bash: /sys/fs/cgroup/systemd/cgroup.clone_children: Permission denied
    # echo 1 > /sys/fs/cgroup/systemd/test
    bash: /sys/fs/cgroup/systemd/test: Permission denied

So to fix this, these UIDs should follow the user namespace in order for systemd to be able to run in a user namespace container.

It looks like the RHEL 7.4 nightly kernel made some progress with the uid namespacing support and configuration: bug 1406026 and bug 1340238.

However, an attempt to start /usr/sbin/init in the container still fails with

    # docker run --rm -ti -e container=docker rhel7 /usr/sbin/init
    /usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:334: running prestart hook 2 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\"\n".

This will probably never work with user namespace

    # docker run -v /sys/fs/cgroup:/sys/fs/cgroup --rm -ti rhel7 ls -la /sys/fs/cgroup/systemd

since /sys/fs/cgroup will be owned by real root, and root inside of the container will not be able to read or write content in these directories.

What is the correct component to track this against? Kernel, to present /sys/fs/cgroup in a uid-namespaced manner (if I understand comment 10 correctly), or systemd, to avoid using cgroups in containers altogether? Or are we here hitting a blocker which will forever prevent us from running systemd in uid-namespaced containers?

I think oci-systemd-hook should prepare the cgroup file system in a way that systemd can use it. It understands what systemd wants to write. It needs to change ownership of this directory to be the "dockerroot", not real root, and then systemd would be able to work normally. (At least I think this will work.)

Fix submitted here: https://github.com/projectatomic/oci-systemd-hook/pull/59

As noted by Lokesh Mandvekar (lsm5): "built here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13367868 I'm also including a "rhel74" in the release tag to more easily tell it's a 7.4 build."

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2963
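
For anyone reproducing the ownership check from the comments above, here is a minimal sketch of why the cgroup files show up as 65534 inside the container. It translates a host UID through a uid_map file in the format the kernel exposes at /proc/self/uid_map. The helper name map_host_uid and the 100000..165535 range are assumptions made for illustration; the real range depends on the host's subordinate UID configuration, and 65534 is the kernel's default overflow UID.

```shell
#!/bin/sh
# map_host_uid HOST_UID [UID_MAP_FILE]
# Hypothetical helper (not part of docker or oci-systemd-hook).
# Prints the UID that HOST_UID appears as inside the user namespace described
# by UID_MAP_FILE (default: /proc/self/uid_map). UIDs with no mapping are
# shown as the overflow UID, assumed here to be the kernel default 65534.
map_host_uid() {
    host_uid=$1
    map_file=${2:-/proc/self/uid_map}
    awk -v uid="$host_uid" '
        # Each line: <first UID inside ns> <first UID outside ns> <range length>
        (uid + 0) >= ($2 + 0) && (uid + 0) < ($2 + $3) {
            print $1 + (uid - $2)
            found = 1
            exit
        }
        END { if (!found) print 65534 }
    ' "$map_file"
}

# Assumed example mapping: container UIDs 0..65535 <-> host UIDs 100000..165535.
map=$(mktemp)
printf '0 100000 65536\n' > "$map"
map_host_uid 0 "$map"        # host root has no mapping: prints 65534
map_host_uid 100000 "$map"   # first remapped UID is container root: prints 0
rm -f "$map"
```

Under that assumed mapping, files owned by real root on the host fall outside the mapped range, which matches the 65534 owner in the ls output and the Permission denied errors. Ownership by the first UID of the remapped range would appear as root inside the container, which is the direction the proposed hook change takes.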