Bug 755216

Summary: multiple pam_namespace unmount problems
Product: [Fedora] Fedora Reporter: Anders Blomdell <anders.blomdell>
Component: pamAssignee: Tomas Mraz <tmraz>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 16CC: bressers, kzak, tmraz
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: pam-1.1.5-5.fc15 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-08 23:01:16 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 712089    
Bug Blocks:    
Attachments:
Description Flags
Test scripts
none
Program demonstrating possible implementation
none
Patch to do early unmount in original namespace
none
Proposed patch with the make rslave / approach none

Description Anders Blomdell 2011-11-19 15:58:49 UTC
Created attachment 534577 [details]
Test scripts

Description of problem:

pam_namespace has various unmount problems (see attached test scripts and output)

Version-Release number of selected component (if applicable):

pam-1.1.4-4


How reproducible:

Always

Steps to Reproduce:
1. Add 'session required pam_namespace.so'
2. Create '/namespace' and put the two attached scripts in it
3. Run '/namespace/trigger nobody'
  
Actual results:

A daemon subprocess gains access to parent namespace and parent namespace gets polluted with bogus mounts.

Problem 1: daemon gets access to original namespace
PARENT FIRST:
DAEMON BEFORE:
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir1/hide1.nobody
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir2/hide2.nobody
PARENT BEFORE DAEMON EXIT:
├─/namespace/dir1                /dev/mapper/vg0-root[/namespace/dir1] ext4        rw,relatime,user_xattr,barrier=1
└─/namespace/dir2                /dev/mapper/vg0-root[/namespace/dir2] ext4        rw,relatime,user_xattr,barrier=1
DAEMON AFTER:
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir1/dir1
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir2/dir2
PARENT AFTER DAEMON EXIT:

Problem 2: only some of directories gets unmounted
PARENT FIRST:
DAEMON BEFORE:
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir1/hide1.nobody
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir2/hide2.nobody
PARENT BEFORE DAEMON EXIT:
├─/namespace/dir1                /dev/mapper/vg0-root[/namespace/dir1] ext4        rw,relatime,user_xattr,barrier=1
└─/namespace/dir2                /dev/mapper/vg0-root[/namespace/dir2] ext4        rw,relatime,user_xattr,barrier=1
DAEMON AFTER:
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir1/hide1.nobody
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir2/dir2
PARENT AFTER DAEMON EXIT:
├─/namespace/dir1                /dev/mapper/vg0-root[/namespace/dir1] ext4        rw,relatime,user_xattr,barrier=1
└─/namespace/dir2                /dev/mapper/vg0-root[/namespace/dir2] ext4        rw,relatime,user_xattr,barrier=1

Problem 3: killing pam_namespace owner poisons parent namespace
PARENT FIRST:
DAEMON BEFORE:
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir1/hide1.nobody
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir2/hide2.nobody
PARENT BEFORE DAEMON EXIT:
├─/namespace/dir1                /dev/mapper/vg0-root[/namespace/dir1] ext4        rw,relatime,user_xattr,barrier=1
└─/namespace/dir2                /dev/mapper/vg0-root[/namespace/dir2] ext4        rw,relatime,user_xattr,barrier=1
/namespace/trigger: line 72:   528 Killed                  su $1 -s '/bin/bash' -c "(`dirname $0`/show_mount &) ; sleep 1"
DAEMON AFTER:
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir1/hide1.nobody
-rw-r--r-- 1 root root 0 Nov 19 16:55 /namespace/dir2/hide2.nobody
PARENT AFTER DAEMON EXIT:
├─/namespace/dir1                /dev/mapper/vg0-root[/namespace/dir1] ext4        rw,relatime,user_xattr,barrier=1
└─/namespace/dir2                /dev/mapper/vg0-root[/namespace/dir2] ext4        rw,relatime,user_xattr,barrier=1
#


Expected results:

Parent and child session namespaces should be separated at all times.


Additional info:

If the problem is fixable, pam_namespace should probably have open/close argument in order to do setup/teardown in the right order to other pam modules.

Comment 1 Anders Blomdell 2011-11-19 16:09:02 UTC
Perhaps a combination of pam_systemd and pam_exec would be a safe way to setup a separate namespace environment.

Comment 2 Anders Blomdell 2011-11-21 11:53:15 UTC
Created attachment 534766 [details]
Program demonstrating possible implementation

Comment 3 Anders Blomdell 2011-11-21 17:32:46 UTC
Created attachment 534827 [details]
Patch to do early unmount in original namespace

Unmounting in original namespace is done almost immediately by an auxillary task. 
No unmounting in done child namespace, since there may still be lingering processes in that namespace.
At pam_close_session, yet another namespace is created and unmounting as done in that (in case the process does another pam_open_session).

NB: The sequence corresponding to

PAM caller                                Unmount thread

# In namespace A
mount --bind /some/dir /some/dir
unshare
# In namespace B
mount --make-private /some/dir
mount --bind /some/other/dir /some/dir 
                                          # In namespace A
                                          umount /some/dir
# Do stuff in private namespace B
unshare
# In namespace C
umount /some/dir
umont /some/dir

Unfortunately exposes /some/dir to namespace A (possibly the global), and hence is suspectible to races. The only way around this as far as I can see, is to allow 'mount --make-private --bind /some/other/dir /some/dir' in the kernel.

Comment 4 Anders Blomdell 2011-11-21 17:55:48 UTC
It looks like the current scheme will make kernel leak memory (mount fails after some thousand cycles), perhaps it should be rewritten using cgroup (a la systemd) to clean up when the last child has exited a namespace.

Comment 5 Tomas Mraz 2011-11-21 18:14:49 UTC
Hmm, this is clearly a serious problem, unfortunately not really solveable cleanly.

We really need the change in the kernel allowing simultaneous MS_BIND with MS_PRIVATE.

Comment 6 Tomas Mraz 2011-11-21 18:27:13 UTC
BTW I am not quite sure that pam_namespace was primarily designed to work with su. Its main purpose is login and sshd.

Comment 7 Anders Blomdell 2011-11-21 19:59:49 UTC
(In reply to comment #5)
> Hmm, this is clearly a serious problem, unfortunately not really solveable
> cleanly.
Or use some variation on the scheme in fc15, where I think 'mount --bin /some/dir /some/dir ; mount --make-private' was done at startup (sandbox), then I think that 'unshare' followed by 'mount --bind /some/other/dir /some/dir' in pam_namespace would be free from races (an extra 'unshare'/'unmount' still necessary at logout though). Checking that the /some/dir is in the right state would be almost neccessary, though.

> We really need the change in the kernel allowing simultaneous MS_BIND with
> MS_PRIVATE.
Would be nice to able to get a handle to old namespace from 'unshare', in order to cleanly get it back at pam_close_session as well.
Will you push this upstream?

(In reply to comment #6)
> BTW I am not quite sure that pam_namespace was primarily designed to work with
> su. Its main purpose is login and sshd.

I know that, I found the issue with gdm-password, which was mainly hit by lots of mounts hanging around since gdm kills off all its know children at logout (including the child doing pam_close_session :-( ). But su is so much simpler doing stress testing with :-)


Another question is if the probable memory leak is of any concern, if so i think that a 'cgroup' solution would be neccessary.

If I can help out, I would probably need some explanation of the finer points of the protect-mounting logic, and some good test-cases. Comments more than welcome!

Comment 8 Tomas Mraz 2011-11-22 13:17:58 UTC
(In reply to comment #7)
> (In reply to comment #5)
> > Hmm, this is clearly a serious problem, unfortunately not really solveable
> > cleanly.
> Or use some variation on the scheme in fc15, where I think 'mount --bin
> /some/dir /some/dir ; mount --make-private' was done at startup (sandbox), then
> I think that 'unshare' followed by 'mount --bind /some/other/dir /some/dir' in
> pam_namespace would be free from races (an extra 'unshare'/'unmount' still
> necessary at logout though). Checking that the /some/dir is in the right state
> would be almost neccessary, though.
I am not quite sure that putting one hack above another (the current way the private bind mounts are handled) is a good idea how to fix this.

Also I think you have a small bug in your analysis of how pam_namespace behaves - there is always unshare() first and then any bind mounts done (even the protecting bind mounts). Unfortunately because the / is shared these bind mounts are already shared so it does not really matter whether they are done before or after the unshare().

> > We really need the change in the kernel allowing simultaneous MS_BIND with
> > MS_PRIVATE.
> Would be nice to able to get a handle to old namespace from 'unshare', in order
> to cleanly get it back at pam_close_session as well.
I am not quite sure why we would need the old namespace as in ideal situation the old namespace should not be affected by the pam_namespace actions at all.

> Will you push this upstream?
Which upstream? You mean the kernel upstream? Unfortunately there is no patch for the MS_BIND|MS_PRIVATE mount support in kernel yet and I am not a kernel developer.

> (In reply to comment #6)
> > BTW I am not quite sure that pam_namespace was primarily designed to work with
> > su. Its main purpose is login and sshd.
> 
> I know that, I found the issue with gdm-password, which was mainly hit by lots
> of mounts hanging around since gdm kills off all its know children at logout
> (including the child doing pam_close_session :-( ). But su is so much simpler
> doing stress testing with :-)
If gdm kills its pam session handling children that is a bug in gdm and should be fixed. Can you please report it against gdm?

> Another question is if the probable memory leak is of any concern, if so i
> think that a 'cgroup' solution would be neccessary.
It could be probably limited by maxlogins setting in limits.conf for the namespaced user.

> If I can help out, I would probably need some explanation of the finer points
> of the protect-mounting logic, and some good test-cases. Comments more than
> welcome!
The protect mounting is not so interesting as it comes in play only in case the polyinstantiated directory is user-owned or the instance parents are user-owned. However the problem is still there due to the necessary bind mount of the polydir to flag it private in the next mount() call.

Comment 9 Anders Blomdell 2011-11-22 14:49:43 UTC
> Also I think you have a small bug in your analysis of how pam_namespace behaves
> - there is always unshare() first and then any bind mounts done (even the
> protecting bind mounts). Unfortunately because the / is shared these bind
> mounts are already shared so it does not really matter whether they are done
> before or after the unshare().
Right, it's my patch that does unmounting in the original namespace (so for clarity, the protect mount should be done before unshare).

> > Will you push this upstream?
> Which upstream? You mean the kernel upstream? Unfortunately there is no patch
> for the MS_BIND|MS_PRIVATE mount support in kernel yet and I am not a kernel
> developer.
OK, will see if I can get around to this at some distant future...

> > of mounts hanging around since gdm kills off all its know children at logout
> > (including the child doing pam_close_session :-( ). But su is so much simpler
> > doing stress testing with :-)
> If gdm kills its pam session handling children that is a bug in gdm and should
> be fixed. Can you please report it against gdm?
Will do. https://bugzilla.redhat.com/show_bug.cgi?id=755964

> > If I can help out, I would probably need some explanation of the finer points
> > of the protect-mounting logic, and some good test-cases. Comments more than
> > welcome!
> The protect mounting is not so interesting as it comes in play only in case the
> polyinstantiated directory is user-owned or the instance parents are
> user-owned. However the problem is still there due to the necessary bind mount
> of the polydir to flag it private in the next mount() call.
OK, will leave it as it is now then...

Comment 10 Anders Blomdell 2011-11-23 09:08:01 UTC
One more point on the problems you can run into if bind-mounts are not teare down immediately; assume that /work is automounted, then the following really confuses things:

(
echo '# Empty work'
findmnt | cut -c1-30 | grep work
ls -ld /work/Fedora-16/.
echo '# First primary mount on work'
findmnt | cut -c1-30 | grep work
unshare --mount -- /bin/sh -c "(
    mount --bind /work /work ; 
    mount --make-private /work ; 
    mount --bind /tmp /work ; 
    echo '# Detached mount of work' ;
    findmnt | cut -c1-30 | grep work ; 
    sleep 5 ; 
    echo '# Before detached unmount of private work' ;
    findmnt | cut -c1-30 | grep work; 
    umount /work ;
    echo '# Before detached unmount of bound work' ;
    findmnt | cut -c1-30 | grep work; 
    echo '# Weird detached automount behaviour' ;
    ls -ld /work/Fedora-15/. ;
    umount /work ;
    echo '# Detached unmount') &"
sleep 1
echo '# Second primary mount on work'
ls -ld /work/Fedora-15/.
findmnt | cut -c1-30 | grep work
echo '# First primary mount on work no longer accessible'
ls -ld /work/Fedora-16/.
sleep 10
echo '# And the final remaining cruft'
findmnt | cut -c1-30 | grep work
)

Which gives the following output (no wonder I had problems understanding what went wrong with my machines :-():

# Empty work
├─/work                       
drwxr-xr-x 4 root root 4096 Nov 23 09:37 /work/Fedora-16/.
# First primary mount on work
├─/work                       
│ └─/work/Fedora-16           
# Detached mount of work
├─/work                       
│ ├─/work/Fedora-16           
│ └─/work                     
│   └─/work                   
# Second primary mount on work
drwxr-xr-x 4 root root 4096 Nov 23 09:36 /work/Fedora-15/.
├─/work                       
│ ├─/work/Fedora-16           
│ ├─/work                     
│ │ └─/work/Fedora-15         
│ └─/work/Fedora-15           
# First primary mount on work no longer accessible
ls: cannot access /work/Fedora-16/.: Too many levels of symbolic links
# Before detached unmount of private work
├─/work                       
│ ├─/work/Fedora-16           
│ ├─/work                     
│ │ └─/work                   
│ └─/work/Fedora-15           
# Before detached unmount of bound work
├─/work                       
│ ├─/work/Fedora-16           
│ ├─/work                     
│ └─/work/Fedora-15           
# Weird detached automount behaviour
ls: cannot access /work/Fedora-15/.: Too many levels of symbolic links
# Detached unmount
# And the final remaining cruft
├─/work                       
│ ├─/work/Fedora-16           
│ ├─/work                     
│ │ └─/work/Fedora-15         
│ └─/work/Fedora-15

Comment 11 Anders Blomdell 2012-01-20 12:47:28 UTC
Problem exists in 1.1.5-1 as well.

Comment 12 Tomas Mraz 2012-01-20 14:35:47 UTC
Yes, Al Viro shown a method how to avoid this problem (avoid the bind mounts leaking to the parent namespace) just recently. I need to write a proper patch based on this method.

Comment 13 Anders Blomdell 2012-01-20 14:45:51 UTC
Any pointers to this method?

Comment 14 Anders Blomdell 2012-01-20 15:37:40 UTC
Could it be this https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=752540 ?

So what should be done in the new namespace is essentially:

# 1. Make changes in this namespace to others
mount --make-slave "none" / 
# 2. Protect mount 
mount --bind /some/private/dir /some/dir/to/hide
# 3. Make subsequent mount changes to /some/private/dir invisible to us
mount --make-private "none" /some/dir/to/hide

If step 3 is necessary, is probably a matter of opinion.

Comment 15 Tomas Mraz 2012-01-20 15:57:31 UTC
Created attachment 556538 [details]
Proposed patch with the make rslave / approach

Yes, this is the approach.

Comment 16 Tomas Mraz 2012-01-20 15:58:21 UTC
The step 3 is not necessary.

Comment 17 Anders Blomdell 2012-01-20 20:04:34 UTC
Looks like it still unmounts in the child namespace, giving remaining daemon processes access to the original namespace. Will test this hypotesis on monday. This is the part of my patch that I think should be retained.



@@ -2142,6 +2252,22 @@ 
     	/* nothing to reset */
     	return PAM_SUCCESS;
     	
+    /*
+     * Unmounting should not be done in the namespace of any remaining
+     * children (daemons), so create yet another namespace and unmount in 
+     * from that namespace (ideally we would like to regain the original 
+     * namespace, so at best this is a kludge).
+     */
+    if (idata.flags & PAMNS_DEBUG)
+        pam_syslog(idata.pamh, LOG_DEBUG, "Unsharing child namespace for %d",
+		   getpid());
+
+    if (unshare(CLONE_NEWNS) < 0) {
+	pam_syslog(idata.pamh, LOG_ERR,
+		   "Unable to unshare from child namespace, %m");
+	return PAM_SESSION_ERR;
+    }
+
     idata.polydirs_ptr = polyptr;
 
     if (idata.flags & PAMNS_DEBUG)

Comment 18 Tomas Mraz 2012-01-23 07:37:37 UTC
Please use a new bug report for this unmount problem. The actual solution for this problem would be to drop the unmounting altogether. The unmounts should happen automatically as soon as there is not process running in the child namespace. Doing an extra unshare() for unmounts would be basically a no-op.

Comment 19 Anders Blomdell 2012-01-23 08:10:21 UTC
(In reply to comment #18)
> Please use a new bug report for this unmount problem. 
Will do after I have verified it to still be a problem.

> The actual solution for this problem would be to drop the unmounting
> altogether. The unmounts should happen automatically as soon as there 
> is no process running in the child namespace. 
True.

> Doing an extra unshare() for unmounts would be basically a no-op.
(See also comment #3)
Unless the pam_close_session is followed by a pam_open_session (in the same process), since then we would stack mounts on top of each other (I would rather be safe than sorry).

Comment 20 Tomas Mraz 2012-01-23 08:16:25 UTC
(In reply to comment #19)

> > Doing an extra unshare() for unmounts would be basically a no-op.
> (See also comment #3)
> Unless the pam_close_session is followed by a pam_open_session (in the same
> process), since then we would stack mounts on top of each other (I would rather
> be safe than sorry).

That would be a bug in the calling application - session calls were never supposed to be called multiple times in one process. And even harmless bug as stacking them on top of the same mounts would be equivalent to stacking them on unmounted mounts in a separate namespace except of a small difference in resource consumption.

Comment 21 Anders Blomdell 2012-01-23 09:02:14 UTC
(In reply to comment #20)
> That would be a bug in the calling application - session calls were never
> supposed to be called multiple times in one process. 
Should I file a separate documentation bug?, this is not stated in the PAM(3) man-pages.

> And even harmless bug as stacking them on top of the same mounts would be 
> equivalent to stacking them on unmounted mounts in a separate namespace 
> except of a small difference in resource consumption.
Unless the second pam_open_session is for a user that should not use a namespace mount, in that case that user might place sensitive data in a location accessible by less trusted users.

BUT: I agree that this is a minor problem, and is a smaller problem than the unfortunate fact that session module hooks are run in the same order for both open_session and close_session, possibly exposing different environments (mounted filesystems come to mind) during open_session and close_session. IMHO the session
(in /etc/pam.d/*) module keyword should be augmented with session_open/session_close keywords (which would make it possible to get rid of pam_selinux.so's open/close arguments), at the cost of making the session rules twice as many :-)

Personally I'm fine with just scrapping the unmount, but I'm not sure it's the right solution.

Comment 22 Tomas Mraz 2012-01-23 09:19:34 UTC
I am fairly sure that it is the right solution although I will probably leave the unmounting (without the additional unshare() call) there. I'll just make it optional and non-default and add a note about the possible unwanted circumstances to the documentation.

As for the multiple session calls in a single process - yes, you can open a documentation bug report for that in the upstream Trac, where it can be discussed further.

Comment 23 Fedora Update System 2012-01-31 20:57:47 UTC
pam-1.1.5-5.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/pam-1.1.5-5.fc15

Comment 24 Fedora Update System 2012-01-31 20:58:05 UTC
pam-1.1.5-5.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/pam-1.1.5-5.fc16

Comment 25 Fedora Update System 2012-02-01 19:24:23 UTC
Package pam-1.1.5-5.fc15:
* should fix your issue,
* was pushed to the Fedora 15 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing pam-1.1.5-5.fc15'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-1122/pam-1.1.5-5.fc15
then log in and leave karma (feedback).

Comment 26 Fedora Update System 2012-02-08 23:01:16 UTC
pam-1.1.5-5.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 27 Fedora Update System 2012-02-14 09:05:55 UTC
pam-1.1.5-5.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.