Bug 452122 - lazy umount causes pwd to fail silently
Summary: lazy umount causes pwd to fail silently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: autofs
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Ian Kent
QA Contact: Brock Organ
URL:
Whiteboard:
Depends On: 452120
Blocks: 233481 426502 453706 993477
 
Reported: 2008-06-19 14:52 UTC by Ian Kent
Modified: 2018-10-20 02:43 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Autofs uses "umount -l" to clear active mounts at restart. This method results in getcwd() failing because the point from which the path is constructed has been detached from the mount tree. To resolve this a miscellaneous device node for routing ioctl commands to these mount points has been implemented in the autofs4 kernel module and a library added to autofs. This provides the ability to re-construct a mount tree from existing mounts and then re-connect them.
Clone Of:
Clones: 993477
Environment:
Last Closed: 2009-09-02 11:59:33 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2009:1397
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: autofs bug fix update
Last Updated: 2009-09-01 12:02:15 UTC

Description Ian Kent 2008-06-19 14:52:17 UTC
+++ This bug was initially created as a clone of Bug #452120 +++

+++ This bug was initially created as a clone of Bug #431716 +++

Description of problem:

We have an odd problem with automounts from failover nfs servers which is poorly
understood but I think it boils down to the following reproducible problem: if
you lazily umount a filesystem then any processes with that filesystem still
open will see ".." linked to "." in the root directory of the mount and
therefore getcwd will no longer work.  A subsidiary bug is that getcwd pretends
to return successfully in this situation instead of flagging an error.

Version-Release number of selected component (if applicable):
Currently running with kernel-2.6.23.1-49.fc8

How reproducible:
Always

Steps to Reproduce:

Note that /auto/scratch/imc is auto-mounted over nfs in the following.

$ cd /auto/scratch/imc/foo
$ ls -al
total 6
drwxr-xr-x  3 imc support  512 Feb  6 16:12 .
drwxr-xr-x 59 imc support 3072 Feb  6 16:12 ..
drwxr-xr-x  2 imc support  512 Feb  6 16:12 bar
-rw-r--r--  1 imc support    6 Feb  6 16:12 hello
$ ls -ldi . .. ../.. /auto/scratch/imc
 290882 drwxr-xr-x  3 imc  support  512 Feb  6 16:12 .
   9148 drwxr-xr-x 59 imc  support 3072 Feb  6 16:12 ..
2644455 drwxr-xr-x  3 root root       0 Feb  6 16:12 ../..
   9148 drwxr-xr-x 59 imc  support 3072 Feb  6 16:12 /auto/scratch/imc
$ su -c 'umount -l /auto/scratch/imc'
Password: 
umount: /auto/scratch/imc: not mounted
$ /bin/pwd
foo
$ ls -ldi . .. ../.. /auto/scratch/imc
290882 drwxr-xr-x  3 imc support  512 Feb  6 16:12 .
  9148 drwxr-xr-x 59 imc support 3072 Feb  6 16:12 ..
  9148 drwxr-xr-x 59 imc support 3072 Feb  6 16:12 ../..
  9148 drwxr-xr-x 59 imc support 3072 Feb  6 16:12 /auto/scratch/imc
$ ls -al
total 6
drwxr-xr-x  3 imc support  512 Feb  6 16:12 .
drwxr-xr-x 59 imc support 3072 Feb  6 16:12 ..
drwxr-xr-x  2 imc support  512 Feb  6 16:12 bar
-rw-r--r--  1 imc support    6 Feb  6 16:12 hello
$ (cd .. && /bin/pwd)
/auto/scratch/imc
$ (cd bar && /bin/pwd)
/auto/scratch/imc/foo/bar
$ /bin/pwd
foo

Note that after the umount the ".." entry of (what was) /auto/scratch/imc points
to itself instead of to the real parent.  There may be a good reason for this,
but if so I'd be interested to find out what it is.  The /bin/pwd command claims
that "foo" is the correct name of the current directory.  If I'd been in
/auto/scratch/imc at the time of the umount then it would have claimed that ""
is the correct name of the current directory without reporting an error.
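
A caller can guard against this silent failure by rejecting any getcwd result that is not an absolute path. The following is a minimal Python sketch of such a check. The helper names `is_absolute_cwd` and `safe_getcwd` are hypothetical and not part of any tool discussed in this report, and whether the underlying call hands back a relative string or raises an error is assumed to vary with the kernel and C library in use.

```python
import os
from typing import Optional


def is_absolute_cwd(path: str) -> bool:
    """Return True only if path looks like a usable absolute directory name."""
    # Both "foo" and "" (the values described above) fail this check.
    return path.startswith("/")


def safe_getcwd() -> Optional[str]:
    """Return the current directory, or None if it cannot be named."""
    try:
        path = os.getcwd()
    except OSError:
        # Assumption: some systems report the failure as an error instead.
        return None
    return path if is_absolute_cwd(path) else None
```

A program that calls `safe_getcwd()` and gets `None` can then report the problem rather than carry on with a meaningless directory name.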

Even though the filesystem is unmounted, you can still list it and change to
directories within it (which seems to be the point of lazy unmounting). 
However, because the filesystem is automounted, doing "cd .." remounts it and
things are fine from then on.  However, any shell process which hasn't done a cd
since the umount will still be affected.

Although we don't usually use "umount -l" in real-world situations, I quite
often discover long-lived shells with no pwd so I'm assuming the auto-mounter
has tried to umount the filesystem at some point, possibly when the nfs server
became temporarily unavailable owing to a network problem or scheduled reboot.

-- Additional comment from cebbert on 2008-02-06 15:11 EST --
(In reply to comment #0)
> Note that after the umount the ".." entry of (what was) /auto/scratch/imc points
> to itself instead of to the real parent.  There may be a good reason for this,
> but if so I'd be interested to find out what it is.

The parent is gone, so you are in the root of a disconnected tree. If you are in
/ and try 'cd ..' it silently succeeds there too (and leaves you in the same
directory.)

-- Additional comment from jmoyer on 2008-02-06 15:25 EST --
Ian, would it be okay with you if we closed this bug as a duplicate of bug 287411?

Essentially, when restarting autofs, it will do a umount -l of busy mount
points, since there is no way to re-attach to existing mounts currently.  Ian
Kent is working on a solution to fix that problem, so then we won't rely on
umount -l for the forced restart case and I think that should resolve your problem.

-- Additional comment from imc.ac.uk on 2008-02-07 07:00 EST --
It's in a disconnected tree, yes, but the parent is not gone - it is
/auto/scratch in the local filesystem, same as before.  However,  once you leave
the disconnected tree by changing to the parent, you can't get back - which of
course means that getcwd still wouldn't work even if the parent link were
preserved, I now realise.

But since you are in a disconnected tree which by definition doesn't have a name
within the filesystem, getcwd shouldn't succeed.  If there is a good reason why
it should succeed then I have a bug to raise against vim. ;-)

Bug 287411 does indeed look like the same bug.  (If I'd searched for "cwd"
instead of "pwd" then I'd have found it, I guess.)  Several of our workstations
updated their copies of autofs in late January which is consistent with that
being the cause of our current crop of lost cwds.

Thanks.

-- Additional comment from jmoyer on 2008-02-07 16:40 EST --
(In reply to comment #3)
> Bug 287411 does indeed look like the same bug.  (If I'd searched for "cwd"
> instead of "pwd" then I'd have found it, I guess.)  Several of our workstations
> updated their copies of autofs in late January which is consistent with that
> being the cause of our current crop of lost cwds.

287411 is against F-7, so I guess we should keep both bugs open.  I'll add a
dependency.

-- Additional comment from ikent on 2008-02-08 00:46 EST --
(In reply to comment #3)
> It's in a disconnected tree, yes, but the parent is not gone - it is
> /auto/scratch in the local filesystem, same as before.  However,  once you leave
> the disconnected tree by changing to the parent, you can't get back - which of
> course means that getcwd still wouldn't work even if the parent link were
> preserved, I now realise.

As I've finally realized, "umount -l" is not appropriate for
mounts that we expect to work, as normal, until they aren't
in use any more. This is what autofs currently does for active
mounts at restart which causes the problem.

The lazy umount actually has very limited function as no path
lookups can be done within the mount point once it's lazy
umounted. But once autofs is restarted, path lookups can
again be done and new mounts performed, and this
works well. Unfortunately, anything that needs to walk
back up the mount tree to construct a path, such as
getcwd(2) and the proc filesystem /proc/<pid>/cwd, cannot
work because the point from which the path is constructed has
been detached from the mount tree. It's normal for detached
mounts to point to themselves in the kernel, after all they
are essentially waiting to go away.
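
The effect described above can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the `Dir` class, `build_path` and `lazy_detach` are hypothetical names, and the rule that a walk ending anywhere other than the real root yields no leading "/" is an assumption chosen to match the output shown in the original report.

```python
class Dir:
    """A directory in a toy tree; a root is its own parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent if parent is not None else self


def build_path(node, global_root):
    """Walk parent links upward, as a getcwd-style lookup would."""
    parts = []
    while node.parent is not node:
        parts.append(node.name)
        node = node.parent
    joined = "/".join(reversed(parts))
    # Only a walk that ends at the real root earns a leading "/".
    return "/" + joined if node is global_root else joined


def lazy_detach(node):
    """Model a lazy umount: the mount root now points to itself."""
    node.parent = node


root = Dir("")
auto = Dir("auto", root)
scratch = Dir("scratch", auto)
imc = Dir("imc", scratch)
foo = Dir("foo", imc)

print(build_path(foo, root))  # /auto/scratch/imc/foo
lazy_detach(imc)
print(build_path(foo, root))  # foo
```

In this model the walk from `foo` stops at `imc` because `imc` has become its own parent, so the path can never reach the real root.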

> 
> But since you are in a disconnected tree which by definition doesn't have a name
> within the filesystem, getcwd shouldn't succeed.  If there is a good reason why
> it should succeed then I have a bug to raise against vim. ;-)

The issue here is with checking the return of the kernel
function d_path. I can't remember now what it returns in
this case but if it returns NULL (I think that was it)
then that should be checked by system calls and filesystems
that use it.

But that's a different issue to being able to cleanly restart
autofs in the presence of busy mounts.

The actual problem with autofs is that it can't reconnect
to existing mounts. One immediately thinks that just adding
the ability to remount the autofs filesystem would solve
it, but alas, that can't work. This is because autofs direct
mounts and the implementation of "on demand mount and expire"
of nested mount trees (like those in multiple offset automount
maps) have the mount location mounted on top of them. The
upshot of this is that we can't use anything that needs
to walk the path because the VFS will skip over the autofs
mount point we need and end up at the mount above it.
For the same reason an ioctl file descriptor can't be
opened on these mounts either.
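
The covering problem can be pictured with a small toy model. This is a sketch in Python under stated assumptions, not the autofs4 interface: `Mount`, `lookup_by_path` and `lookup_covered` are hypothetical names, and the mount table is reduced to a list in stacking order.

```python
from typing import List, NamedTuple, Optional


class Mount(NamedTuple):
    path: str
    fstype: str


def lookup_by_path(mounts: List[Mount], path: str) -> Optional[Mount]:
    """Model a path walk: the most recently stacked mount wins."""
    for mount in reversed(mounts):
        if mount.path == path:
            return mount
    return None


def lookup_covered(mounts: List[Mount], path: str, fstype: str) -> Optional[Mount]:
    """Model a side-channel lookup that can name a covered mount."""
    for mount in mounts:
        if mount.path == path and mount.fstype == fstype:
            return mount
    return None


# An autofs direct mount with the real filesystem mounted on top of it.
table = [
    Mount("/auto/scratch/imc", "autofs"),
    Mount("/auto/scratch/imc", "nfs"),
]
print(lookup_by_path(table, "/auto/scratch/imc").fstype)            # nfs
print(lookup_covered(table, "/auto/scratch/imc", "autofs").fstype)  # autofs
```

A lookup that only takes a path always lands on the top mount, so reaching the autofs mount underneath needs some other way of identifying it.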

The way that I'm resolving this is to add code to 
implement a device node for the autofs4 kernel
module so it can be used to route ioctl control
commands to these mounts. Then there are the changes
to autofs itself to reconnect the mounts at startup
which is quite tricky.

> 
> Bug 287411 does indeed look like the same bug.  (If I'd searched for "cwd"
> instead of "pwd" then I'd have found it, I guess.)  Several of our workstations
> updated their copies of autofs in late January which is consistent with that
> being the cause of our current crop of lost cwds.

I'm fairly sure it's the issue you're seeing.
I've been working on it for a while now and I've made a lot
of progress. I can't commit to a time frame because I don't
know what, if any, difficulties I will run into. I can say
that I have the foundation for this mostly done, that being
the kernel device node code and the interface code for autofs
to use it. The testing done so far shows it works fine but
not all functions have been exercised. Still to be done is
code in autofs to actually do the reconnect to active mounts
at startup.

Sorry about the rant but I wanted to try and give you an idea
of what is really happening and where we are at with it. Hopefully
it is, at least partly, understandable.

Ian

-- Additional comment from ikent on 2008-06-19 10:51 EST --
This bug has been created to track the kernel component
of the required fix for this issue.

Comment 1 Ian Kent 2008-06-19 14:53:16 UTC
This bug has been created to track the user space component
of this issue.

Comment 2 RHEL Program Management 2008-06-19 15:32:07 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Ian Kent 2008-09-16 02:58:59 UTC
This has to be deferred until 5.4 as the kernel patches
are still in test and the user space component has not yet
been made available for public testing.

Comment 19 Ian Kent 2009-05-21 02:54:57 UTC
This issue has been fixed in the latest autofs package
autofs-5.0.1-0.rc2.125.

The autofs RHTS test bubzillas/bz452122 can be used to verify
the correction.

Comment 21 Chris Ward 2009-07-03 18:03:52 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 24 Ruediger Landmann 2009-09-01 00:28:31 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Autofs uses "umount -l" to clear active mounts at restart. This method results in getcwd() failing because the point from which the path is constructed has been detached from the mount tree. To resolve this a miscellaneous device node for routing ioctl commands to these mount points has been implemented in the autofs4 kernel module and a library added to autofs. This provides the ability to re-construct a mount tree from existing mounts and then re-connect them.

Comment 25 errata-xmlrpc 2009-09-02 11:59:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1397.html

