Description of problem:
This might not be util-linux at all, but I don't know of a more appropriate component (systemd?). Please reassign if necessary.

First, I want to acknowledge that the `/proc` mount is mounted as shared in Fedora. However, Fedora's behavior with respect to namespaces created with CLONE_NEWUSER, CLONE_NEWNS and CLONE_NEWPID does not appear to adhere to the specification:

   --mount-proc[=mountpoint]
          Just before running the program, mount the proc filesystem at mountpoint (default is /proc). This is useful when creating a new PID namespace. It also implies creating a new mount namespace since the /proc mount would otherwise mess up existing programs on the system. The new proc filesystem is explicitly mounted as private (with MS_PRIVATE|MS_REC).

From everything I can gather, CLONE_NEWUSER and CLONE_NEWNS without CLONE_NEWPID and a subsequent fork can never mount a namespace-specific `/proc` in Fedora 24. I can reproduce this consistently in all of my own code as well. Observe:

   bash-4.3$ unshare -U --mount-proc
   unshare: mount /proc failed: Operation not permitted
   <CLONE_NEWUSER + `--mount-proc`, which implies CLONE_NEWNS. unshare has all caps in the new namespace. Why didn't we mount /proc?>

   bash-4.3$ unshare -Up --mount-proc
   unshare: mount /proc failed: Operation not permitted
   <OK, CLONE_NEWPID but no fork; I buy that, nothing changed compared with the first example for the unshare process itself>

   bash-4.3$ unshare -Upf --mount-proc
   <Success! But why?>

It's important on Fedora that `--mount-proc` in `unshare` first performs the equivalent of `--make-rprivate` on `/proc`. It does so:

   if (procmnt &&
       (mount("none", procmnt, NULL, MS_PRIVATE|MS_REC, NULL) != 0 ||
        mount("proc", procmnt, "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, NULL) != 0))
           err(EXIT_FAILURE, _("mount %s failed"), procmnt);

And the first mount operation succeeds with `unshare -Umr` without CLONE_NEWPID and a fork, but the second mount fails:
   bash-4.3$ unshare -Umr
   <-r is required since the bash exec'ed by unshare doesn't inherit its caps>
   -bash-4.3# mount --make-rprivate /proc
   -bash-4.3# mount -t proc proc /proc
   mount: permission denied
   <???>
   -bash-4.3# exit

Version-Release number of selected component (if applicable):
2.24-2.fc24.x86_64

How reproducible:
Always

Steps to Reproduce:
See above.

Actual results:
CLONE_NEWUSER with CLONE_NEWNS cannot mount its own /proc without CLONE_NEWPID and a fork.

Expected results:
CLONE_NEWUSER with CLONE_NEWNS should be able to mount its own /proc on its own.

Additional info:
From "man 7 mount_namespaces", NOTES:

   Since, when one uses unshare(1) to create a mount namespace, the goal is commonly to provide full isolation of the mount points in the new namespace, unshare(1) (since util-linux version 2.27) in turn reverses the step performed by systemd(1), by making all mount points private in the new namespace. That is, unshare(1) performs the equivalent of the following in the new mount namespace:

       mount --make-rprivate /

From "man 7 user_namespaces":

   Holding CAP_SYS_ADMIN within the user namespace associated with a process's mount namespace allows that process to create bind mounts and mount the following types of filesystems:

       * /proc (since Linux 3.8)
       [snip-snip]

This last quote directly specifies the prescribed behavior.
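To reproduce this outside the shell, here is a minimal C sketch of the `unshare -Umr` transcript above: unshare(CLONE_NEWUSER | CLONE_NEWNS), an optional root mapping in the spirit of `-r`, then the same two mount(2) calls quoted from unshare's --mount-proc code. The function names and the step enum are hypothetical, not util-linux code. The mapping follows the procedure described in user_namespaces(7) (write "deny" to setgroups, then single-write uid_map/gid_map entries); it only matters for keeping capabilities across a later execve(2), which is why `-r` was needed for the interactive shell. On the setup described in this report, one would expect the result STEP_MOUNT_PROC with EPERM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names for the step that failed. */
enum proc_step { STEP_OK, STEP_UNSHARE, STEP_MAP_ROOT, STEP_MAKE_PRIVATE, STEP_MOUNT_PROC };

/* Format a one-line ID map "<inside> <outside> <count>\n"; -1 if buf is too small. */
int format_id_map(char *buf, size_t len, unsigned inside, unsigned outside, unsigned count)
{
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write text with one write(2) call, as uid_map/gid_map require; errno preserved on failure. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t w = write(fd, text, len);
    int saved = errno;
    close(fd);
    errno = saved;
    return (w >= 0 && (size_t)w == len) ? 0 : -1;
}

/*
 * Roughly "unshare -Um[r] --mount-proc=<procmnt>" WITHOUT a new PID namespace.
 * Changes the caller's namespaces, so run it in a throwaway single-threaded
 * process. On failure, *err holds the errno of the failing call.
 */
enum proc_step try_userns_proc_mount(const char *procmnt, int map_root, int *err)
{
    uid_t uid = getuid();   /* read before unshare(); afterwards they are unmapped */
    gid_t gid = getgid();
    char line[64];

    *err = 0;
    if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0) {
        *err = errno;
        return STEP_UNSHARE;
    }

    /* Like -r: map the outer UID/GID to 0 so a later execve() keeps capabilities. */
    if (map_root &&
        (write_file("/proc/self/setgroups", "deny") != 0 ||
         format_id_map(line, sizeof line, 0, (unsigned)uid, 1) != 0 ||
         write_file("/proc/self/uid_map", line) != 0 ||
         format_id_map(line, sizeof line, 0, (unsigned)gid, 1) != 0 ||
         write_file("/proc/self/gid_map", line) != 0)) {
        *err = errno;
        return STEP_MAP_ROOT;
    }

    /* First call quoted above: no propagation to or from the parent namespace. */
    if (mount("none", procmnt, NULL, MS_PRIVATE | MS_REC, NULL) != 0) {
        *err = errno;
        return STEP_MAKE_PRIVATE;
    }

    /* Second call: a fresh proc instance; the step reported to fail with EPERM. */
    if (mount("proc", procmnt, "proc", MS_NOSUID | MS_NOEXEC | MS_NODEV, NULL) != 0) {
        *err = errno;
        return STEP_MOUNT_PROC;
    }
    return STEP_OK;
}
```

Usage: call `try_userns_proc_mount("/proc", 1, &err)` from a short-lived process and print the returned step and `strerror(err)`.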
> unshare --user --mount-proc

In this case you want to create a user namespace (root-like permissions in the namespace) and mount the *shared system* /proc. I doubt that makes sense from a security point of view, because as root in the user namespace you could access everything that is otherwise hidden in the system /proc. Let's imagine:

   $ cat /proc/1/maps
   cat: /proc/1/maps: Permission denied

If "unshare --user --mount-proc" succeeded, then I could bypass the "Permission denied" and access other processes. Right?

IMHO requiring CLONE_NEWPID to also unshare /proc makes sense. And I guess that CLONE_NEWPID requires fork() to create an "init" process in the new PID namespace; without fork() the new /proc would not contain any process.

   $ unshare --user --pid --fork --mount-proc

seems correct. Does it make sense? ;-)
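Karel's point about fork() can be illustrated with a minimal C sketch of the `unshare -Upf --mount-proc` sequence. Per unshare(2), CLONE_NEWPID does not move the caller into the new PID namespace; only its first child becomes PID 1 there, so the proc mount happens in that child. Because the user namespace is created in the same call, the new PID namespace is owned by it, which appears to be the condition the proc mount needs. The function name and return convention are hypothetical, not util-linux code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch of "unshare -Upf --mount-proc".
 * Returns 0 if the child mounted proc, a positive errno from the child's
 * failing mount, or -1 if unshare/fork/waitpid failed in the parent.
 * Changes the caller's namespaces: run it in a throwaway, single-threaded process.
 */
int mount_proc_in_new_pidns(void)
{
    if (unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID) != 0)
        return -1;

    /* The caller stays in its old PID namespace; the child will be PID 1 in the new one. */
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Keep our mounts from propagating back (cf. "mount --make-rprivate /"). */
        if (mount("none", "/", NULL, MS_PRIVATE | MS_REC, NULL) != 0)
            _exit(errno);
        /* This PID namespace is owned by our new user namespace, so proc
         * now reflects a namespace we control. */
        if (mount("proc", "/proc", "proc", MS_NOSUID | MS_NOEXEC | MS_NODEV, NULL) != 0)
            _exit(errno);
        _exit(0);
    }

    int status;
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Even with its own PID namespace, the proc mount can still be refused if the currently visible /proc has been partly mounted over, as is common inside container runtimes.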
Karel, I'm not asking for elevated access to a shared /proc. The issue is that --mount-proc implies CLONE_NEWNS in the code, so the actual combination is CLONE_NEWUSER and CLONE_NEWNS, i.e. a user namespace plus a mount namespace. If I have NEWNS, I should be able to mount my own /proc in any case.

If what you're saying is correct, then the only way to mount /proc is to have CLONE_NEWPID, yet the documentation is adamant that /proc can be mounted with NEWUSER and NEWNS, as specific behavior examples are given in the 4.08 man-pages (http://man7.org/linux/man-pages/man7/user_namespaces.7.html).

CLONE_NEWUSER inherently cannot exceed the access permissions of the outer namespace and the topmost user, so `cat /proc/1/maps` is only possible if you already have access to it in the parent namespace. Holding CAP_SYS_ADMIN in your new user namespace will not magically grant you access to a resource that the constraining namespace has no access to. And if I mount a /proc within my own mount and user namespaces, my /proc should, I presume, be limited to what I have access to in my new namespace, i.e. behavior similar to hidepid=2 but constrained by namespace. (I haven't found documentation that actually specifies what a private /proc should contain under NEWUSER + NEWNS without NEWPID.)

And if you're right, then there is a huuuuuuuge ;) documentation problem. As I pointed out, `man 7 user_namespaces` states unambiguously that:

   Holding CAP_SYS_ADMIN within the user namespace associated with a process's mount namespace allows that process to create bind mounts and mount the following types of filesystems:

       * /proc (since Linux 3.8)

And significant portions of `man 7 mount_namespaces` would have to be rewritten (http://man7.org/linux/man-pages/man7/mount_namespaces.7.html). Incidentally, `man 7 mount_namespaces` is absent entirely from F25.
The --mount-proc option was designed for CLONE_NEWPID. I cannot imagine a usable --mount-proc without CLONE_NEWNS; I guess nobody wants to mess up the system /proc with another, unshared /proc. You don't have to use this option; you can mount your /proc manually after the unshare(1) call.

Frankly, I'm not sure about the relation between CLONE_NEWUSER and CLONE_NEWPID. Comment #1 was only my guess :-) It would be better to ask Eric (added to CC:) for more information.
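Mounting manually after unshare(1) does not get around the proc check. Without a new PID namespace, the option that should work is a recursive bind mount of the /proc that is already visible, which only changes where the existing view appears. A minimal C sketch with a hypothetical function name:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Hypothetical sketch: in a new user + mount namespace WITHOUT a new PID
 * namespace, expose the existing /proc at `target` via a recursive bind mount.
 * Returns 0 on success or a positive errno. Changes the caller's namespaces.
 */
int bind_existing_proc(const char *target)
{
    if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0)
        return errno;

    /* Keep these mounts out of the parent namespace's propagation. */
    if (mount("none", "/", NULL, MS_PRIVATE | MS_REC, NULL) != 0)
        return errno;

    /* MS_REC: submounts under /proc (if any) are locked in this namespace,
     * and a non-recursive bind of a tree with locked submounts is refused. */
    if (mount("/proc", target, NULL, MS_BIND | MS_REC, NULL) != 0)
        return errno;

    return 0;
}
```

The resulting mount shows the same processes as the original /proc, so it is isolation of the mount point only, not of the contents.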
To set the stage for answering the question, I need to add a little bit of information.

The man pages are not the specification, and they frequently omit fiddly details that are too difficult to explain clearly. There have been a couple of changes, especially to the mounting of proc and sysfs, to keep things secure, and I don't believe the man pages have always been updated.

The case being tested is what can be done with a user namespace, where getting all of the fiddly details exactly right is important so that people can't use what user namespaces allow to break preexisting setups.

The practical rules for mounting proc and sysfs as root in a user namespace are:

- The user namespace root must have created a mount namespace for the mounts to happen in.

- If you don't have a namespace or namespaces that would affect the contents of proc or sysfs when they are mounted, you need to use a bind mount.

- If, apart from the change in the filesystem's contents, the view of proc or sysfs that you would get is not equivalent to your existing view (AKA someone more privileged than you has mounted over parts of proc or sysfs in significant ways), then you are not allowed to mount a new proc or sysfs.

  Similarly, the mount options for proc and sysfs are restricted to the mount options of the mount that is already visible.

Proc and sysfs have to be restricted so that a less privileged user cannot get more access to sensitive files that a more privileged user made unavailable.

In short, the rule for proc and sysfs is as close to a bind mount as possible.
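The rules above can be summarised as a decision function. This is a deliberately simplified model in C, not the kernel's implementation; the struct fields are hypothetical inputs standing for the listed conditions. For proc, the namespace that affects the contents is a PID namespace owned by your user namespace; for sysfs it is a network namespace. The option rule is modelled as "every restricting flag on the visible mount must also be requested".

```c
#include <stdbool.h>
#include <sys/mount.h>

/* Simplified, hypothetical model of the rules above; not kernel code. */
enum mount_verdict { VERDICT_DENY, VERDICT_BIND_ONLY, VERDICT_NEW_MOUNT };

struct mount_request {
    bool own_mount_ns;            /* mount ns created by this user ns root */
    bool own_content_ns;          /* proc: PID ns owned by this user ns; sysfs: net ns */
    bool existing_fully_visible;  /* nothing significant mounted over visible proc/sysfs */
    unsigned long visible_flags;  /* flags of the mount that is already visible */
    unsigned long requested_flags;
};

/* Restricting flags that a new mount may not drop. */
#define RESTRICT_FLAGS (MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC)

enum mount_verdict judge_proc_sysfs_mount(const struct mount_request *r)
{
    if (!r->own_mount_ns)
        return VERDICT_DENY;        /* rule 1: needs your own mount namespace */
    if (!r->own_content_ns)
        return VERDICT_BIND_ONLY;   /* rule 2: contents would not change, so bind mount */
    if (!r->existing_fully_visible)
        return VERDICT_DENY;        /* rule 3: would reveal over-mounted files */

    unsigned long kept = r->visible_flags & RESTRICT_FLAGS;
    if ((r->requested_flags & kept) != kept)
        return VERDICT_DENY;        /* options may not be looser than the visible mount */
    return VERDICT_NEW_MOUNT;
}
```

In these terms, `unshare -U --mount-proc` has `own_content_ns == false` and lands on VERDICT_BIND_ONLY, while `unshare -Upf --mount-proc` can reach VERDICT_NEW_MOUNT.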
Thanks, Eric. I have added a short note to the unshare(1) man page saying that mounting sysfs and procfs in a user namespace has to be restricted to avoid access to sensitive data. IMHO it's good enough :-)