
Bug 1563025

Summary: docker container requires root or sysadmin priv to run systemd
Product: Red Hat Enterprise Linux 7
Reporter: Mike Chang <mschang>
Component: systemd
Assignee: systemd-maint
Status: CLOSED WONTFIX
QA Contact: qe-baseos-daemons
Severity: medium
Docs Contact:
Priority: unspecified
Version: 7.4
CC: antoine.tran, bblaskov, msekleta, qe-baseos-daemons, systemd-maint-list, systemd-maint
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1392526
Environment:
Last Closed: 2021-02-15 07:38:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1392526
Bug Blocks:

Description Mike Chang 2018-04-03 00:29:22 UTC
docker host os    : centos:7.4.1708
docker base image : centos:7.4.1708

# docker version
Client:
 Version:	17.12.0-ce
 API version:	1.35
 Go version:	go1.9.2
 Git commit:	c97c6d6
 Built:	Wed Dec 27 20:10:14 2017
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.0-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.2
  Git commit:	c97c6d6
  Built:	Wed Dec 27 20:12:46 2017
  OS/Arch:	linux/amd64
  Experimental:	false

systemd version: 42.el7

The latest docker and systemd versions still require either --privileged=true or --cap-add SYS_ADMIN to run systemd; otherwise it fails with:

[!!!!!!] Failed to mount API filesystems, freezing.

How can we run systemd in a docker container without root or SYS_ADMIN privileges?

+++ This bug was initially created as a clone of Bug #1392526 +++

Description of problem:
The systemd component in a Docker container, such as centos:7, needs too much privilege (either docker run --privileged ... or docker run --cap-add SYS_ADMIN ...).

Version-Release number of selected component (if applicable):
19.el7_2.13

How reproducible:
Follow the "Systemd integration" chapter of https://hub.docker.com/_/centos/ in an attempt to run httpd with systemd. The steps are below.

Steps to Reproduce:
1. Create a file named Dockerfile with this content:
FROM centos:7
MAINTAINER "you" <your>
ENV container docker
RUN (cd /lib/systemd/system/sysinit.target.wants/; for i in *; do [ $i == systemd-tmpfiles-setup.service ] || rm -f $i; done); \
rm -f /lib/systemd/system/multi-user.target.wants/*;\
rm -f /etc/systemd/system/*.wants/*;\
rm -f /lib/systemd/system/local-fs.target.wants/*; \
rm -f /lib/systemd/system/sockets.target.wants/*udev*; \
rm -f /lib/systemd/system/sockets.target.wants/*initctl*; \
rm -f /lib/systemd/system/basic.target.wants/*;\
rm -f /lib/systemd/system/anaconda.target.wants/*;
VOLUME [ "/sys/fs/cgroup" ]
CMD ["/usr/sbin/init"]
RUN yum -y install httpd; yum clean all; systemctl enable httpd.service
EXPOSE 80
2. docker build --rm -t local/c7-systemd-httpd .
3. docker run -ti -v /sys/fs/cgroup:/sys/fs/cgroup:ro -p 80:80 local/c7-systemd-httpd

Actual results:
[!!!!!!] Failed to mount API filesystems, freezing.
Then nothing more happens; the container freezes.

Expected results:
[  OK  ] Reached target Paths.
[  OK  ] Reached target Local File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Listening on Journal Socket.
         Starting Create Volatile Files and Directories...
...
[  OK  ] Reached target Multi-User System.


Additional info:
Either of these commands gives the expected result:
docker run -ti --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:ro -p 80:80 local/c7-systemd-httpd
docker run -ti --privileged -v /sys/fs/cgroup:/sys/fs/cgroup:ro -p 80:80 local/c7-systemd-httpd
Tested with docker-engine 1.12.2 and CentOS 7.2.1511.

When systemd is used just to start one service in a container, we should not have to grant this much privilege just for systemd.

--- Additional comment from Antoine TRAN on 2016-11-07 11:53:42 EST ---

A bug report was filed previously at https://github.com/CentOS/sig-cloud-instance-images/issues/54 and I was then directed here.

--- Additional comment from Lukáš Nykrýn on 2016-11-08 03:14:28 EST ---

I would guess that systemd is trying to mount /run, because it wants to have it on tmpfs. Try to use  "-v /run"

--- Additional comment from Antoine TRAN on 2016-11-08 04:02:35 EST ---

docker run -ti -v /run -v /sys/fs/cgroup:/sys/fs/cgroup:ro -p 80:80 local/c7-systemd-httpd
gives
[!!!!!!] Failed to mount API filesystems, freezing.
Deleting the container gives
Error response from daemon: devmapper: Unknown device 3ba40886392a40ad357a0f930f47814d95db573e2eaa809ac2d3335038c10d65

This latter message does not appear without -v /run.

With -v /run:/run, or --tmpfs /run --tmpfs /var --tmpfs /var/run, I have the same error.

--- Additional comment from Jan Synacek on 2016-11-09 06:11:08 EST ---

Why is this a bug in systemd?

--- Additional comment from Antoine TRAN on 2016-11-09 08:15:02 EST ---

Although systemd works in a normal Linux distribution, it does not work in a restricted environment (like docker).

--- Additional comment from Jan Synacek on 2017-08-03 03:49:42 EDT ---

This is no longer reproducible with docker-1.12.6-48.git0fdc778.el7.x86_64 and systemd-219-42.el7.x86_64.

Fixed in RHEL-7.4.

Comment 2 Antoine TRAN 2018-04-03 11:26:06 UTC
I would like to point out that the issue is about docker-ce and systemd, not the docker package. Of course, I understand that Red Hat maintains its own packages and not external ones. But in this particular issue, I guess the cause (still to be determined) is systemd requiring too much privilege, rather than something to fix in the docker/docker-ce package.

I can reproduce with systemd-219-42.el7_4.7.x86_64 / CentOS Linux release 7.4.1708 (Core) in host and in docker image / docker-ce-17.12.1.ce-1.el7.centos.x86_64.

Does anyone know what the fix in the docker package was?

Comment 3 Mike Chang 2018-04-03 17:08:19 UTC
Is this a known defect and slated for a fix?

Comment 4 Antoine TRAN 2018-04-03 17:51:59 UTC
This is a known defect in the docker world: anyone who wants to use systemd adds at best the SYS_ADMIN capability, and at worst full privileges. Any Google search about systemd/docker shows this.

As to the second part, I don't know the answer.

Comment 5 Mike Chang 2018-04-03 20:50:57 UTC
Thanks Antoine for the update.

Is there any known workaround? Granting SYS_ADMIN exposes a lot of privileges just for systemd at startup. For example, is there a way to revoke the privilege from a running container?

Comment 6 Antoine TRAN 2018-04-04 07:04:53 UTC
Not giving SYS_ADMIN will make systemd fail, as shown in previous post:
"[!!!!!!] Failed to mount API filesystems, freezing."

There is currently no known workaround other than granting this privilege. As far as I know, systemd does something that requires mounting, and we use SYS_ADMIN to give it at least the mount privilege. If you know systemd very well, maybe you can help me determine exactly which mounts systemd performs, so that we might solve this with
docker run --tmpfs [MountPath]
or
docker run -v [MountPath]:[MountPath]:ro

I just couldn't run strace on /sbin/init, since it needs to be PID 1.

Comment 7 Michal Sekletar 2018-04-04 08:47:55 UTC
(In reply to Antoine TRAN from comment #6)
> Not giving SYS_ADMIN will make systemd fail, as shown in previous post:
> "[!!!!!!] Failed to mount API filesystems, freezing."
> 
> There is currently no known workaround except giving this privilege. To my
> knowledge, I believe systemd does something that causes it to mount, and we
> use SYS_ADMIN to give it at least mount privilege. If you know systemd very
> well, maybe you can help me determine what are the exact mounts systemd is
> doing? So that we might solve this by doing
> docker run --tmpfs [MountPath]
> or
> docker run -v [MountPath]:[MountPath]:ro

This is the table of API filesystems that systemd mounts during boot,

https://github.com/systemd/systemd/blob/master/src/core/mount-setup.c#L77

The failing mount will be in that table.
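
To see which candidate paths are already mount points inside the container, one independent check (not necessarily what systemd itself does) is to look them up in /proc/self/mountinfo, where the fifth field of each line is the mount point. A minimal sketch; the helper name is_mount_point is hypothetical, and the paths in the usage comment are only examples:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the path in $1 is listed as a mount point.
# $2 optionally names a file in mountinfo format; it defaults to the live
# table of the current process.
is_mount_point() {
    awk -v p="$1" '$5 == p { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mountinfo}"
}

# Example use inside the container (paths are illustrative):
#   for p in /sys /proc /dev /run; do
#       is_mount_point "$p" && echo "$p: mount point" || echo "$p: not a mount point"
#   done
```

A path reported as "not a mount point" here is one that systemd would presumably try to mount itself, which is where the missing privilege would matter.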

> I just couldn't do strace into /sbin/init since it needs PID 1.

I think you should be able to start the container with bash as PID 1, find the PID of that bash process from outside the container, attach strace to that PID, and then exec systemd inside the shell process.
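
The procedure described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: the container name systemd-trace and the log path /tmp/systemd.trace are hypothetical, the image name is the one built in the reproduction steps, and attaching strace from the host normally requires root.

```shell
#!/bin/sh
# Sketch only. NAME and TRACE_LOG are hypothetical; IMAGE is the image built
# in the reproduction steps of this report.
IMAGE=local/c7-systemd-httpd
NAME=systemd-trace
TRACE_LOG=/tmp/systemd.trace

# Terminal 1: start the container with bash (not systemd) as PID 1,
# without --privileged or --cap-add SYS_ADMIN.
start_container() {
    docker run -ti --name "$NAME" \
        -v /sys/fs/cgroup:/sys/fs/cgroup:ro "$IMAGE" /bin/bash
}

# Terminal 2, on the host: look up the host PID of the container's PID 1
# and attach strace to it, following forks and writing to TRACE_LOG.
attach_strace() {
    pid=$(docker inspect --format '{{.State.Pid}}' "$NAME") || return 1
    strace -f -o "$TRACE_LOG" -p "$pid"
}

# Terminal 1 again, at the bash prompt inside the container:
#   exec /usr/sbin/init
```

Run start_container in one terminal and attach_strace in a second one; once strace is attached, typing exec /usr/sbin/init at the container prompt replaces bash with systemd while keeping PID 1, so the trace captures the early boot syscalls.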

Comment 8 Antoine TRAN 2018-04-04 09:49:56 UTC
Ok, thank you, I was able to trace exactly what systemd does with strace.

Without SYS_ADMIN:
name_to_handle_at(AT_FDCWD, "/sys", {handle_bytes=128}, 0x7ffeae9040c0, AT_SYMLINK_FOLLOW) = -1 EPERM (Operation not permitted)
name_to_handle_at(AT_FDCWD, "/dev", {handle_bytes=128}, 0x7ffeae9040c0, AT_SYMLINK_FOLLOW) = -1 EPERM (Operation not permitted)
access("/sys/fs/smackfs/", F_OK)        = -1 ENOENT (No such file or directory)
name_to_handle_at(AT_FDCWD, "/dev/shm", {handle_bytes=128}, 0x7ffeae9040c0, AT_SYMLINK_FOLLOW) = -1 EPERM (Operation not permitted)
name_to_handle_at(AT_FDCWD, "/run", {handle_bytes=128}, 0x7ffeae9040c0, AT_SYMLINK_FOLLOW) = -1 EPERM (Operation not permitted)
name_to_handle_at(AT_FDCWD, "/sys/fs/cgroup/systemd", {handle_bytes=128}, 0x7ffeae9040c0, AT_SYMLINK_FOLLOW) = -1 EPERM (Operation not permitted)
name_to_handle_at(AT_FDCWD, "/sys/fs/pstore", {handle_bytes=128}, 0x7ffeae9040c0, AT_SYMLINK_FOLLOW) = -1 ENOSYS (Function not implemented)
access("/sys/firmware/efi", F_OK)       = -1 ENOSYS (Function not implemented)
open("/dev/console", O_WRONLY|O_NOCTTY|O_CLOEXEC) = -1 ENOSYS (Function not implemented)
ioctl(4, TCGETS, 0x7ffeae904060)        = -1 ENOSYS (Function not implemented)
ioctl(4, TIOCGWINSZ, 0x7ffeae904100)    = -1 ENOSYS (Function not implemented)
writev(4, [{"[", 1}, {"\33[1;31m!!!!!!\33[0m", 17}, {"] ", 2}, {"Failed to mount API filesystems,"..., 42}, {"\n", 1}], 5) = -1 ENOSYS (Function not implemented)


With SYS_ADMIN:
name_to_handle_at(AT_FDCWD, "/sys", {handle_bytes=128}, 0x7ffc15b88c00, AT_SYMLINK_FOLLOW) = -1 EOPNOTSUPP (Operation not supported)
name_to_handle_at(AT_FDCWD, "/", {handle_bytes=128}, 0x7ffc15b88c04, AT_SYMLINK_FOLLOW) = -1 EOPNOTSUPP (Operation not supported)
stat("/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
stat("/", {st_mode=S_IFDIR|0755, st_size=18, ...}) = 0
name_to_handle_at(AT_FDCWD, "/proc", {handle_bytes=128}, 0x7ffc15b88c00, AT_SYMLINK_FOLLOW) = -1 EOPNOTSUPP (Operation not supported)
name_to_handle_at(AT_FDCWD, "/", {handle_bytes=128}, 0x7ffc15b88c04, AT_SYMLINK_FOLLOW) = -1 EOPNOTSUPP (Operation not supported)
stat("/proc", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
stat("/", {st_mode=S_IFDIR|0755, st_size=18, ...}) = 0
name_to_handle_at(AT_FDCWD, "/dev", {handle_bytes=128 => 12, handle_type=1, f_handle=0x3e96c45a159b410700000000}, [1183], AT_SYMLINK_FOLLOW) = 0
name_to_handle_at(AT_FDCWD, "/", {handle_bytes=128}, 0x7ffc15b88c04, AT_SYMLINK_FOLLOW) = -1 EOPNOTSUPP (Operation not supported)
name_to_handle_at(AT_FDCWD, "/sys/kernel/security", {handle_bytes=128}, 0x7ffc15b88c00, AT_SYMLINK_FOLLOW) = -1 EOPNOTSUPP (Operation not supported)
name_to_handle_at(AT_FDCWD, "/sys/kernel", {handle_bytes=128}, 0x7ffc15b88c04, AT_SYMLINK_FOLLOW) = -1 EOPNOTSUPP (Operation not supported)


name_to_handle_at is a syscall that docker-ce > 1.12 blocks unless SYS_ADMIN is granted. I guess the Red Hat-maintained docker or docker-latest packages either do not implement seccomp or use a less restrictive seccomp profile (https://github.com/justincormack/docker/blob/master/profiles/seccomp/seccomp_default.go).

From docker's point of view, it is normal and by design that at least SYS_ADMIN (which permits this syscall) has to be added for systemd.
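
To pull the relevant failures out of a longer strace log, a small filter can list the paths for which name_to_handle_at returned EPERM. A minimal sketch; the helper name eperm_paths and the log file name in the usage comment are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the path argument of every name_to_handle_at
# call in the strace log $1 that failed with EPERM. Lines with other
# syscalls or other errno values (ENOENT, ENOSYS, EOPNOTSUPP) are ignored.
eperm_paths() {
    sed -n '/= -1 EPERM/s/.*name_to_handle_at(AT_FDCWD, "\([^"]*\)".*/\1/p' "$1"
}

# Example use:
#   eperm_paths /tmp/systemd.trace
```

Applied to the "Without SYS_ADMIN" trace above, this would list /sys, /dev, /dev/shm, /run and /sys/fs/cgroup/systemd, while the "With SYS_ADMIN" trace would produce no output because its failures are EOPNOTSUPP rather than EPERM.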


I looked at the systemd code, in particular https://github.com/systemd/systemd/blob/v238/src/core/mount-setup.c

------------------------
        r = path_is_mount_point(p->where, NULL, AT_SYMLINK_FOLLOW);
        if (r < 0 && r != -ENOENT) {
                log_full_errno(priority, r, "Failed to determine whether %s is a mount point: %m", p->where);
                return (p->mode & MNT_FATAL) ? r : 0;
        }
        if (r > 0)
                return 0;

        /* Skip securityfs in a container */
        if (!(p->mode & MNT_IN_CONTAINER) && detect_container() > 0)
                return 0;

-------------------------
Do you think we could skip this check in container mode like this (moving the container check up)?

------------------------

        /* Skip securityfs in a container */
        if (!(p->mode & MNT_IN_CONTAINER) && detect_container() > 0)
                return 0;

        r = path_is_mount_point(p->where, NULL, AT_SYMLINK_FOLLOW);
        if (r < 0 && r != -ENOENT) {
                log_full_errno(priority, r, "Failed to determine whether %s is a mount point: %m", p->where);
                return (p->mode & MNT_FATAL) ? r : 0;
        }
        if (r > 0)
                return 0;


-------------------------

Comment 9 Antoine TRAN 2018-04-05 08:02:17 UTC
@Mike Chang:
I believe I am asking the wrong entity about this kind of bug. I created an issue for the systemd developers at https://github.com/systemd/systemd/issues/8657 (a Request For Enhancement).

Thank you for your time, you can close this issue.

Comment 10 Antoine TRAN 2018-04-05 12:25:18 UTC
@Mike Chang: according to systemd, this is fixed in their upstream version v238 (https://github.com/systemd/systemd/blob/v238/src/basic/mount-util.c). Currently, the latest version available in CentOS is 219-42.el7_4.10.

Do you happen to know whether this version already integrates the upstream fix (with all the backports, now that the release number is "-42") and, if not, when v238 will be available in the Red Hat family?

Thank you.

Comment 13 RHEL Program Management 2021-02-15 07:38:15 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.