Bug 287411 - Kernel loses track of CWD of process
Keywords: Tracking
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 10
Hardware: All
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Ian Kent
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 293491
Depends On: 431716
Blocks:
 
Reported: 2007-09-12 09:56 UTC by Jeremy Sanders
Modified: 2009-12-18 05:58 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-12-18 05:58:18 UTC
Type: ---
Embargoed:


Attachments
Corrections for the mount/expire race patch in 2.6.22 (969 bytes, patch)
2007-09-18 11:15 UTC, Ian Kent
sysrq-t dump (198.41 KB, text/plain)
2007-12-10 18:29 UTC, Orion Poplawski
/var/log/debug.gz (7.60 KB, application/octet-stream)
2009-05-27 15:21 UTC, Orion Poplawski


Links:
Linux Kernel Bugzilla bug 8938 (last updated 2019-03-05 04:59:32 UTC)

Description Jeremy Sanders 2007-09-12 09:56:06 UTC
We have a fairly serious problem where the shell loses track of its current
working directory. This happens when in an autofs-mounted NFS home directory
(actually it is a bind mount, as the NFS server is the local machine).

kernel is:
Linux xpc17.ast.cam.ac.uk 2.6.22.4-65.fc7 #1 SMP Tue Aug 21 21:50:50 EDT 2007
x86_64 x86_64 x86_64 GNU/Linux
(kernel-2.6.22.4-65.fc7)

The problem is intermittent, and occurs randomly in time across a whole set of
shells.

Here is some example output:

xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ ipython
Python 2.5 (r25:51908, Apr 10 2007, 10:27:40)
Type "copyright", "credits" or "license" for more information.
..
In [1]: import os
In [2]: os.path.abspath('out.eps')
Out[2]: 'jssxmm/cen/rgs_099psf_comb_rembad/out.eps'
In [3]:
Do you really want to exit ([y]/n)? y

xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ which abs
Can't get current working directory

xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ pwd
/data/jss/xmm/cen/rgs_099psf_comb_rembad

xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ tcsh
[jss@xpc17 rgs_099psf_comb_rembad]$ pwd
jssxmm/cen/rgs_099psf_comb_rembad
[jss@xpc17 rgs_099psf_comb_rembad]$ which abc
/data/soft3/heasoft/headas-6.3.1/x86_64-unknown-linux-gnu/bin/abc
[jss@xpc17 rgs_099psf_comb_rembad]$ exit

xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ which abc
Can't get current working directory

xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ bash
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ which abc
/data/soft3/heasoft/headas-6.3.1/x86_64-unknown-linux-gnu/bin/abc
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ exit

If you examine the cwd symlink in /proc for the process, you find it has been
mangled:

xpc17:~:$ ls -l /proc/16141/cwd
lrwxrwxrwx 1 jss users 0 2007-09-12 10:47 /proc/16141/cwd ->
jssxmm/cen/rgs_099psf_comb_rembad

New shells in new terminals have mangled cwds:
[starts bash (pid 29643) in new terminal]
xpc17:~:$ ls -l /proc/29643/cwd
lrwxrwxrwx 1 jss users 0 2007-09-12 10:40 /proc/29643/cwd -> jss
[why's that not an absolute path?]
[enter command cd /home/jss]
xpc17:~:$ ls -l /proc/29643/cwd
lrwxrwxrwx 1 jss users 0 2007-09-12 10:40 /proc/29643/cwd -> /home/jss
[enter command cd]
xpc17:~:$ ls -l /proc/29643/cwd

This normally just works, but randomly breaks for a whole set of users.

versions of possible culprits:
* kernel-2.6.22.4-65.fc7
* autofs-5.0.1-27
* nfs-utils-1.1.0-3.fc7

Comment 1 Chuck Ebbert 2007-09-12 18:22:04 UTC
What does 'mount' output while this is happening? (Exactly what is mounted where
isn't given in the report.)


Comment 2 Jeremy Sanders 2007-09-12 19:07:09 UTC
Sorry for leaving that out:

xpc17:~:$ mount
/dev/sda1 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda3 on /xpc17_data1 type ext3 (rw,noatime)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
nodev on /dev/oprofile type oprofilefs (rw)
/xpc17_data1/home/jss on /home/jss type none (rw,bind)
xalph3.ast.cam.ac.uk:/soft3 on /data/soft3 type nfs (rw,intr,addr=131.111.68.53)
/xpc17_data1/data/jss on /data/jss type none (rw,bind)
/xpc17_data1/scratch/jss on /scratch/jss type none (rw,bind)
xpc9.ast.cam.ac.uk:/xpc9_data1/scratch/jgraham on /scratch/jgraham type nfs
(rw,intr,addr=131.111.68.181)
xalph3.ast.cam.ac.uk:/soft3/caldb on /data/caldb type nfs
(rw,intr,addr=131.111.68.53)



Comment 3 Ian Kent 2007-09-13 12:58:18 UTC
(In reply to comment #0)
> We have a fairly serious problem where the shell loses track of its current
> working directory. This happens when in an autofs-mounted NFS home directory
> (actually it is a bind mount, as the NFS server is the local machine).
> 

When did the problem start happening?


Comment 4 Jeremy Sanders 2007-09-13 13:11:47 UTC
We were using RHEL 4 previously, and moved to Fedora 7 from that. We noticed it
shortly after the move, around the start of June 2007. So the problem's
somewhere between 2.6.9 and 2.6.21. The problem only happens every few weeks, so
it's hard to provoke it.

Comment 5 Jeremy Sanders 2007-09-13 13:18:23 UTC
I'll add that when I log in, the shell immediately has a broken CWD:

[start terminal]
xpc17:~:$ pwd
/home/jss
xpc17:~:$ which abc
Can't get current working directory

But this doesn't happen with the root user. The root homedir is not on the
autofs mounted home partition, however.

This is what the NIS auto.master map looks like:
/home auto.home intr
/data auto.data intr
/data1 auto.data1 intr
/scratch auto.scratch intr



Comment 6 Jeff Moyer 2007-09-17 20:11:36 UTC
*** Bug 293491 has been marked as a duplicate of this bug. ***

Comment 7 Ian Kent 2007-09-18 05:48:17 UTC
Orion, what version of nfs-utils are you using?
Did this problem start happening recently or have you
been seeing it for some time?
Has the problem continued through various kernel versions
or started with some particular kernel version?
Is there anything in the log?

Ian


Comment 8 Ian Kent 2007-09-18 05:55:28 UTC
(In reply to comment #2)

> xalph3.ast.cam.ac.uk:/soft3 on /data/soft3 type nfs (rw,intr,addr=131.111.68.53)
> /xpc17_data1/data/jss on /data/jss type none (rw,bind)
> /xpc17_data1/scratch/jss on /scratch/jss type none (rw,bind)
> xpc9.ast.cam.ac.uk:/xpc9_data1/scratch/jgraham on /scratch/jgraham type nfs
> (rw,intr,addr=131.111.68.181)
> xalph3.ast.cam.ac.uk:/soft3/caldb on /data/caldb type nfs
> (rw,intr,addr=131.111.68.53)

Are /soft3 and /soft3/caldb distinct filesystems on
xalph3.ast.cam.ac.uk?

Do you mount distinct exports from servers multiple
times, possibly to multiple locations, possibly with
different mount options?

Ian


Comment 9 Jeremy Sanders 2007-09-18 08:46:00 UTC
xalph3:/soft3 and xalph3:/soft3/caldb are the same file system, so are
xpc17:/xpc17_data1/data/jss, xpc17:/xpc17_data1/scratch/jss and 
xpc17:/xpc17_data1/home/jss

They all should have the same mount options (all the nfs partitions are mounted
with autofs, which has the same mount options for each mount point). This is
what /proc/mounts currently shows (though the problem seems to have stopped
for the moment, without rebooting):

rootfs / rootfs rw 0 0
/dev/root / ext3 rw,noatime,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/sda3 /xpc17_data1 ext3 rw,noatime,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0
nodev /dev/oprofile oprofilefs rw 0 0
auto.data1 /data1 autofs
rw,fd=5,pgrp=29224,timeout=300,minproto=5,maxproto=5,indirect 0 0
auto.data /data autofs
rw,fd=10,pgrp=29224,timeout=300,minproto=5,maxproto=5,indirect 0 0
auto.home /home autofs
rw,fd=15,pgrp=29224,timeout=300,minproto=5,maxproto=5,indirect 0 0
auto.scratch /scratch autofs
rw,fd=20,pgrp=29224,timeout=300,minproto=5,maxproto=5,indirect 0 0
/dev/sda3 /home/jss ext3 rw,noatime,data=ordered 0 0
xalph3.ast.cam.ac.uk:/soft3 /data/soft3 nfs
rw,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=xalph3.ast.cam.ac.uk
0 0
/dev/sda3 /data/jss ext3 rw,noatime,data=ordered 0 0
xalph3.ast.cam.ac.uk:/soft3/caldb /data/caldb nfs
rw,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=xalph3.ast.cam.ac.uk
0 0


Comment 10 Ian Kent 2007-09-18 10:12:20 UTC
(In reply to comment #9)
> xalph3:/soft3 and xalph3:/soft3/caldb are the same file system, so are
> xpc17:/xpc17_data1/data/jss, xpc17:/xpc17_data1/scratch/jss and 
> xpc17:/xpc17_data1/home/jss
> 
> They all should have the same mount options (all the nfs partitions are mounted
> with autofs, which has the same mount options for each mount point). This is
> what /proc/mounts currently shows (though the problem seems to have stopped
> for the moment, without rebooting):

The point of the question is that, with the Fedora kernel above,
various sequences of events fail, like:

1) user logs in, causes mount of say fs A as rw.
2) pwd is changed away and A expires.
3) a mount of A+some path is mounted ro due to other activity.
4) user returns home, automount will not be able to mount the
   home directory.

If you think you're seeing this type of failure then you
can overcome it by changing the line '#OPTIONS=""' to
'OPTIONS="-O nosharecache"' in /etc/sysconfig/autofs,
provided you have nfs-utils-1.1.0 revision 3 or later.
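
For illustration, the resulting line in /etc/sysconfig/autofs would
look like this (the sed command below is just one way to apply it):

# /etc/sysconfig/autofs
OPTIONS="-O nosharecache"

# or, non-interactively:
sed -i 's|^#OPTIONS=""|OPTIONS="-O nosharecache"|' /etc/sysconfig/autofs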

Login itself has always been a bit iffy when it thinks the
user home doesn't exist, but still I'm quite puzzled by the
apparent loss of pwd in the kernel. I can't really relate
that to any changes. There are, however, two mistakes that
were discovered after an autofs patch went into 2.6.22 which
may be worth trying.

Given a patch, would you be able to patch and build a kernel
for testing?

Ian

Comment 11 Jeremy Sanders 2007-09-18 10:38:35 UTC
It seems unlikely that the nfs partitions are being remounted with different
options here. It's very rare for anything other than the automounter to
mount nfs drives.

I'd be willing to patch a kernel and test it. The problem is pretty random, so I
can't guarantee how long it might take to reoccur.

(By the way, the report that the problem had gone away temporarily is incorrect
- it is still present).


Comment 12 Ian Kent 2007-09-18 11:15:52 UTC
Created attachment 198321 [details]
Corrections for the mount/expire race patch in 2.6.22

This patch is against the current Fedora kernel CVS
but (hopefully) should apply to your kernel.

Ian

Comment 13 Ian Kent 2007-09-18 11:26:01 UTC
The above patch corrects the two mistakes I mentioned.

For an explanation of the purpose of the original
patch look at:
http://marc.info/?l=linux-kernel&m=117151296416003&w=2

The first hunk deals with process wakeup order when
more than one process is waiting on an expire during
a mount request (see comment above).

The second hunk also deals with the case of multiple
processes competing for a mount request, only one can
succeed and so others must look for the dentry upon
which the mount occurred, if in fact it succeeded.

Ian


Comment 14 Ian Kent 2007-09-18 11:28:06 UTC
(In reply to comment #13)
> The first hunk deals with process wakeup order when
> more than one process is waiting on an expire during
> a mount request (see comment above).

Oops, I didn't actually add the comment bit in this
patch, sorry.

Ian


Comment 15 Ian Kent 2007-09-18 11:58:02 UTC
Oh and btw, are there any clues at all in the system log?

Comment 16 Jeremy Sanders 2007-09-18 12:04:16 UTC
Hmmm - I forgot about this. I'm not sure it is relevant (apologies for the
nvidia tainted kernel - not my fault!).

WARNING: at fs/inotify.c:172 set_dentry_child_flags() (Tainted: PF     )

Call Trace:
 [<ffffffff810b9024>] set_dentry_child_flags+0x6e/0x14d
 [<ffffffff81044475>] autoremove_wake_function+0x0/0x2e
 [<ffffffff810b923a>] remove_watch_no_event+0x38/0x47
 [<ffffffff810b9349>] inotify_remove_watch_locked+0x18/0x3b
 [<ffffffff810b962e>] inotify_rm_wd+0x7e/0xa1
 [<ffffffff810b9aa6>] sys_inotify_rm_watch+0x46/0x63
 [<ffffffff81009b5e>] system_call+0x7e/0x83



Comment 17 Ian Kent 2007-09-18 13:05:47 UTC
(In reply to comment #16)
> Hmmm - I forgot about this. I'm not sure it is relevant (apologies for the
> nvidia tainted kernel - not my fault!).
> 
> WARNING: at fs/inotify.c:172 set_dentry_child_flags() (Tainted: PF     )

It may be a clue as to what's happening but I can't see
that this, in itself, would cause the problem you're seeing.

No automount messages?

Ian

Comment 18 Jeremy Sanders 2007-09-18 13:15:02 UTC
Sorry - I missed this one in an old message log:

Sep 12 10:25:28 xpc17 automount[2158]: umount_autofs_indirect: ask umount
returned busy /scratch
Sep 12 10:25:28 xpc17 automount[2158]: umount_autofs_indirect: ask umount
returned busy /home
Sep 12 10:25:29 xpc17 automount[2158]: umount_autofs_indirect: ask umount
returned busy /data
Sep 12 10:28:00 xpc17 mountd[2330]: authenticated mount request from
xserv2.ast.cam.ac.uk:783 for /xpc17_data1/home/jss (/xpc17_data1)
Sep 12 10:28:00 xpc17 mountd[2330]: authenticated mount request from
xserv2.ast.cam.ac.uk:786 for /xpc17_data1/data/jss (/xpc17_data1)

The machine has been up since this message.

Comment 19 Ian Kent 2007-09-18 13:17:46 UTC
And could you post a listing of a directory that contains
(at least) one of these bad entries, please?

Comment 20 Jeremy Sanders 2007-09-18 13:20:37 UTC
I should add the above message happened just after an autofs update.

Do you mean these:

xpc17:/var/log:# ls -l /home
total 20
drwxr-xr-x 165 jss users 20480 2007-09-18 14:11 jss

xpc17:/var/log:# ls -l /data
total 8
drwxr-xr-x 41 jss  users 4096 2007-08-31 15:12 jss
drwxr-xr-x 95 star star  4096 2007-08-28 12:45 soft3

xpc17:/var/log:# ls -l /scratch/
total 8
drwxr-xr-x 24 jgraham users 4096 2007-09-04 11:40 jgraham
drwxr-xr-x 18 jss     users 4096 2007-03-06 10:16 jss

Or the root directory?
xpc17:/var/log:# ls -l /
total 160
drwxr-xr-x   2 root root  4096 2007-09-18 06:13 bin
drwxr-xr-x   3 root root  4096 2007-08-29 05:47 boot
lrwxrwxrwx   1 root root    11 2007-06-06 14:40 caldb -> /data/caldb
drwxr-xr-x   4 root root     0 2007-09-18 14:11 data
drwxr-xr-x   2 root root     0 2007-09-12 10:25 data1
drwxr-xr-x  13 root root  4540 2007-08-30 11:56 dev
drwxr-xr-x  95 root root 12288 2007-09-18 14:19 etc
drwxr-xr-x   3 root root     0 2007-09-18 12:09 home
lrwxrwxrwx   1 root root    26 2007-06-06 14:40 iraf -> /data/soft3/irafv2.12.PCIX
drwxr-xr-x  12 root root  4096 2007-09-17 05:34 lib
drwxr-xr-x   8 root root  4096 2007-09-13 05:35 lib64
drwx------   2 root root 16384 2007-06-06 13:24 lost+found
drwxr-xr-x   2 root root  4096 2007-08-29 09:43 media
drwxr-xr-x   5 root root  4096 2007-06-07 05:14 mnt
drwxr-xr-x   2 root root  4096 2007-04-17 13:46 opt
dr-xr-xr-x 221 root root     0 2007-08-29 09:42 proc
drwxr-x---  17 root root  4096 2007-09-18 14:11 root
drwxr-xr-x   2 root root 12288 2007-09-13 05:38 sbin
drwxr-xr-x   4 root root     0 2007-09-18 14:19 scratch
drwxr-xr-x   2 root root  4096 2007-06-06 13:24 selinux
drwxr-xr-x   3 root root  4096 2007-07-02 05:23 share
drwxr-xr-x   2 root root  4096 2007-04-17 13:46 srv
lrwxrwxrwx   1 root root    25 2007-06-06 14:40 star -> /data/soft3/starlink/star
drwxr-xr-x  12 root root     0 2007-08-29 09:42 sys
drwxrwxrwt  16 root root  4096 2007-09-18 14:10 tmp
drwxr-xr-x  16 root root  4096 2007-09-12 10:25 usr
drwxr-xr-x  22 root root  4096 2007-06-06 14:42 var
drwxr-xr-x   6 root root  4096 2005-07-01 14:59 xpc17_data1



Comment 21 Orion Poplawski 2007-09-18 13:55:20 UTC
(In reply to comment #7)
> Orion, what version of nfs-utils are you using?

nfs-utils-1.1.0-3.fc7

> Did this problem start happening recently or have you
> been seeing it for some time?
> Has the problem continued through various kernel versions
> or started with some particular kernel version?

Hard to say.  I'll ask my long-time F7 user if he can recall.  I've just
completed upgrading our systems from F5 to F7, so I don't have much history on F7.

> Is there anything in the log?

Not really.

Comment 22 Ian Kent 2007-09-18 14:09:51 UTC
(In reply to comment #20)
> I should add the above message happened just after an autofs update.

As I said, that function shouldn't cause what you're seeing but
I'll look deeper into the other functions in the call trace.

> 
> Do you mean these:

Think so, I wanted the automount directory that contains
problem mount points. I'm after a listing of a directory
immediately following a badness event. If you wait a little
while the VFS may reclaim any evidence of a problem that
may have occurred. So try for one straight after a problem
happens.

Ian


Comment 23 Jeremy Sanders 2007-09-19 14:18:23 UTC
We've just had another system start doing it in the last hour. There are no
messages in the logs from auto* recently. There were no unusual directory
entries. Each new terminal again appears to have an incorrect home directory as
CWD. We'll have to try patching the kernel...

Comment 24 Ian Kent 2007-09-19 14:55:46 UTC
(In reply to comment #23)
> We've just had another system start doing it in the last hour. There are no
> messages in the logs from auto* recently. There were no unusual directory
> entries. Each new terminal again appears to have an incorrect home directory as
> CWD. We'll have to try patching the kernel...

Mmm .. 

We really need to narrow the search on this if we're to
make progress. Perhaps a better approach would be to try
the kernel that was originally shipped with F-7 first, or
has that been seen to be a problem already? This kernel
has been around for a while without reports of problems.

This would at least narrow the search and save patching
the kernel for the moment at least.

Ian


Comment 25 Orion Poplawski 2007-09-19 15:35:59 UTC
I've installed a variety of F7 kernels on some of our desktops as well as a
patched version of 2.6.22.6-81.  We'll see....


Comment 26 Jeremy Sanders 2007-09-19 15:41:55 UTC
I think it was in the original F7 kernel. I've found a note I made on our trac
database about the issue later (23rd July), when kernel-2.6.21-1.3228.fc7 was
installed:

"I have noticed a couple of times (with kernel-2.6.21-1.3228.fc7 and before)
that the kernel appears to lose track of the current directory of running
processes."

We'll try running the earliest kernel in case my memory is mistaken. I believe
there was only the install kernel before 1.3228.

By the way, it appears new logins from the console are okay. I think they are
broken when started from KDE, because the KDE desktop appears to have a broken
CWD (from proc).

Comment 27 Orion Poplawski 2007-09-19 15:47:14 UTC
We're using KDE too, wonder if that's meaningful...


Comment 28 Ian Kent 2007-09-27 14:21:35 UTC
Is there anything to report on this when using other
kernels, guys?

Ian


Comment 29 Jeremy Sanders 2007-09-27 14:33:51 UTC
The system running 2.6.21-1.3194.fc7 hasn't failed yet. It can take a few weeks
however to show anything.

Comment 30 Orion Poplawski 2007-09-27 22:32:46 UTC
No failures so far for me either.  I'm closely monitoring the systems though
and should be alerted as soon as it re-occurs.


Comment 31 Ian Kent 2007-09-28 03:43:49 UTC
One thing that may help is a <sys-rq>-t dump of a system
immediately (at least before a reboot, hehe) after this
happens.
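
If there's no console keyboard handy, the dump can also be captured
from a shell; a sketch, as root, assuming sysrq is enabled:

echo 1 > /proc/sys/kernel/sysrq    # enable sysrq if needed
echo t > /proc/sysrq-trigger       # dump all task states to the kernel log
dmesg > sysrq-t.txt                # save the result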


Comment 32 Orion Poplawski 2007-12-10 18:29:11 UTC
Created attachment 283051 [details]
sysrq-t dump

Looks like this got triggered when the 1:5.0.2-17 update was installed and
autofs restarted.

[root@vault ~]# ls -l /proc/*/cwd 2>/dev/null | grep  '> $'
lrwxrwxrwx 1 root root 0 2007-12-09 04:47 /proc/15649/cwd ->
lrwxrwxrwx 1 ivm  cora 0 2007-12-09 04:47 /proc/4035/cwd ->
lrwxrwxrwx 1 ivm  cora 0 2007-12-09 04:47 /proc/4043/cwd ->
lrwxrwxrwx 1 ivm  cora 0 2007-12-09 04:47 /proc/4056/cwd ->
lrwxrwxrwx 1 ivm  cora 0 2007-12-09 04:47 /proc/4058/cwd ->
lrwxrwxrwx 1 ivm  cora 0 2007-12-09 04:47 /proc/4060/cwd ->
lrwxrwxrwx 1 ivm  cora 0 2007-12-09 04:47 /proc/4063/cwd ->
lrwxrwxrwx 1 ivm  cora 0 2007-12-09 04:47 /proc/4065/cwd ->
lrwxrwxrwx 1 ivm  cora 0 2007-12-09 04:47 /proc/4066/cwd ->
lrwxrwxrwx 1 root root 0 2007-12-09 04:47 /proc/4100/cwd ->
lrwxrwxrwx 1 root root 0 2007-12-09 04:47 /proc/4309/cwd ->
[root@vault ~]# ps -fu ivm
UID	   PID	PPID  C STIME TTY	   TIME CMD
ivm	  4035	   1  0 Nov21 ?        00:35:41 Xvnc :1 -desktop vault:1 (ivm)
-httpd /usr
ivm	  4043	   1  0 Nov21 ?        00:00:00 /bin/sh /etc/xdg/xfce4/xinitrc
ivm	  4051	   1  0 Nov21 ?        00:00:00 /usr/bin/ssh-agent -s
ivm	  4056	   1  0 Nov21 ?        00:00:00 xfce-mcs-manager
ivm	  4058	   1  0 Nov21 ?        00:00:01 xfwm4 --daemon
ivm	  4060	4043  0 Nov21 ?        00:01:43 xfdesktop
ivm	  4063	4043  0 Nov21 ?        00:15:24 /usr/bin/xfce4-panel
ivm	  4065	   1  0 Nov21 ?        00:01:51 /usr/libexec/gam_server
ivm	  4066	4063  0 Nov21 ?        00:00:52
/usr/libexec/xfce4/panel-plugins/xfce4-men
ivm	  4100	   1  0 Nov21 ?        00:00:01 xterm
ivm	  4102	4100  0 Nov21 pts/1    00:00:01 -csh
ivm	  4309	   1  0 Nov21 ?        00:10:28 xterm
ivm	  4311	4309  0 Nov21 pts/2    00:00:00 -csh
ivm	 15649	   1  0 Dec03 ?        00:00:01 xterm
ivm	 15651 15649  0 Dec03 pts/0    00:00:00 -csh
ivm	 23686	4102  0 Dec07 pts/1    00:00:00 /bin/bash
/opt/local/bin/readsolartape /de
ivm	 23691	4311  0 Dec07 pts/2    00:00:00 /bin/bash
/opt/local/bin/readsolartape /de

Comment 33 Orion Poplawski 2007-12-10 18:30:03 UTC
The above is with 2.6.23.1-49.fc8

Comment 34 Ian Kent 2007-12-11 02:14:03 UTC
So could we try and duplicate this by having a few logins
on a test machine and then restarting autofs?


Comment 35 Orion Poplawski 2007-12-17 19:09:10 UTC
Looks like it.  Just saw it triggered on most of my F7 machines when today's
autofs update was applied.

Comment 36 Ian Kent 2007-12-18 02:02:29 UTC
OK. I've started setting up a KDE install to try and duplicate it.

Comment 37 Jeff Moyer 2008-01-09 17:08:53 UTC
(In reply to comment #16)
> Hmmm - I forgot about this. I'm not sure it is relevant (apologies for the
> nvidia tainted kernel - not my fault!).
> 
> WARNING: at fs/inotify.c:172 set_dentry_child_flags() (Tainted: PF     )
> 
> Call Trace:
>  [<ffffffff810b9024>] set_dentry_child_flags+0x6e/0x14d
>  [<ffffffff81044475>] autoremove_wake_function+0x0/0x2e
>  [<ffffffff810b923a>] remove_watch_no_event+0x38/0x47
>  [<ffffffff810b9349>] inotify_remove_watch_locked+0x18/0x3b
>  [<ffffffff810b962e>] inotify_rm_wd+0x7e/0xa1
>  [<ffffffff810b9aa6>] sys_inotify_rm_watch+0x46/0x63
>  [<ffffffff81009b5e>] system_call+0x7e/0x83

See bz #248355.

Comment 38 Christopher Brown 2008-01-13 23:36:42 UTC
*** Bug 248355 has been marked as a duplicate of this bug. ***

Comment 39 Christopher Brown 2008-01-13 23:42:34 UTC
Upstream kernel.org bug:

http://bugzilla.kernel.org/show_bug.cgi?id=8938

Please could all reporters add themselves to this bug upstream. There appears to
be a fix waiting to be confirmed for 2.6.24-rc6, therefore it would be
appreciated if 2.6.24 could be tested when it arrives.

Cheers
Chris

Comment 40 Ian Kent 2008-01-14 01:38:37 UTC
(In reply to comment #39)
> Upstream kernel.org bug:
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=8938

This isn't the bug described above.

Ian


Comment 41 Ian Kent 2008-01-14 01:41:32 UTC
*** Bug 248355 has been marked as a duplicate of this bug. ***

Comment 42 Ian Kent 2008-01-14 01:52:08 UTC
My investigation has shown that restarting autofs with
busy mounts results in this problem. I'm working to
resolve it but it is not a trivial problem.
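
For anyone wanting to try and duplicate it, a minimal sketch based on
comments 32-35 (the user and paths are made up):

# shell A: park a shell inside an automounted filesystem
cd /home/testuser

# shell B, as root: restart the automounter while the mount is busy
service autofs restart

# shell A again: the cwd link may now be broken or relative
ls -l /proc/$$/cwd
/bin/pwd    # may now fail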

Ian


Comment 43 Jon Stanley 2008-03-28 01:45:48 UTC
Adding Tracking keyword

Comment 44 Bug Zapper 2008-05-14 14:20:06 UTC
This message is a reminder that Fedora 7 is nearing the end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 7. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '7'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 7's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 7 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. If possible, it is recommended that you try the newest available Fedora distribution to see if your bug still exists.

Please read the Release Notes for the newest Fedora distribution to make sure it will meet your needs:
http://docs.fedoraproject.org/release-notes/

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 45 Bug Zapper 2008-11-26 07:47:23 UTC
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 46 Bug Zapper 2009-01-09 07:15:06 UTC
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 47 Orion Poplawski 2009-05-21 14:16:20 UTC
I can reproduce this as well by hibernating/resuming without an autofs restart.

Comment 48 Ian Kent 2009-05-21 16:20:51 UTC
(In reply to comment #47)
> I can reproduce this as well by hibernating/resuming without an autofs restart.  

That could only happen if the sleep/wake functionality is doing
something like restarting autofs.

In any case this should be resolved in autofs-5.0.4 with a
sufficiently recent kernel. F-10 doesn't have such a kernel;
F-11 does, and its autofs is based on 5.0.4.

Note that this cannot be fixed without a kernel that supports
the ioctl re-implementation that has been done for this and the
changes in autofs to use it.

Comment 49 Orion Poplawski 2009-05-21 17:29:08 UTC
Does the kernel-2.6.29.3-60.fc10 in updates-testing support this?

Comment 50 Ian Kent 2009-05-22 01:42:21 UTC
(In reply to comment #49)
> Does the kernel-2.6.29.3-60.fc10 in updates-testing support this?  

Yep, it certainly should.
So maybe I need to start thinking about back porting the autofs
active-restart (as I'm calling this) patches or update F-10 to
5.0.4 .... mmmm.

Comment 51 Orion Poplawski 2009-05-26 19:49:28 UTC
(In reply to comment #48)
> (In reply to comment #47)
> > I can reproduce this as well by hibernating/resuming without an autofs restart.  
> 
> That could only happen if the sleep/wake functionality is doing
> something like restarting autofs.
> 
> In any case this should be resolved in autofs-5.0.4 with a
> sufficiently recent kernel. F-10 doesn't have such a kernel;
> F-11 does, and its autofs is based on 5.0.4.

Well, I'm now running F-11.  After a hibernate/resume I have lost current working directories but automount has not been restarted:

[root@orca ~]# cwdcheck     
[root@orca ~]# pm-hibernate
[root@orca ~]# cwdcheck
CWD problems           
[root@orca ~]# ls -l /proc/*/cwd                            
lrwxrwxrwx. 1 root      root      0 2009-05-26 13:39 /proc/1021/cwd -> /
....
lrwxrwxrwx. 1 orion     cora      0 2009-05-26 13:39 /proc/22695/cwd -> src/IT_TEST
lrwxrwxrwx. 1 orion     cora      0 2009-05-26 13:39 /proc/2348/cwd ->

Note that some are broken links.  "src/IT_TEST" is in /home/orion which is nfs automounted.  Looks like "/home/orion" was simply stripped from the cwd.

2.6.29.3-155.fc11.i686.PAE
autofs-5.0.4-24.i586
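
(cwdcheck above is a local script; a minimal sketch of the same idea,
flagging any process whose /proc/*/cwd link is empty or not absolute:)

#!/bin/bash
# cwdcheck (sketch): report broken current-working-directory links
for link in /proc/[0-9]*/cwd; do
    target=$(readlink "$link" 2>/dev/null) || continue  # skip unreadable
    case "$target" in
        /*) ;;                              # absolute path: looks sane
        *)  echo "CWD problems"; exit 1 ;;  # empty or relative: broken
    esac
done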

Comment 52 Ian Kent 2009-05-27 02:34:34 UTC
(In reply to comment #51)
> (In reply to comment #48)
> > (In reply to comment #47)
> > > I can reproduce this as well by hibernating/resuming without an autofs restart.  
> > 
> > That could only happen if the sleep/wake functionality is doing
> > something like restarting autofs.
> > 
> > In any case this should be resolved in autofs-5.0.4 with a
> > sufficiently recent kernel. F-10 doesn't have such a kernel;
> > F-11 does, and its autofs is based on 5.0.4.
> 
> Well, I'm now running F-11.  After a hibernate/resume I have lost current
> working directories but automount has not been restarted:

Was it an upgrade or a fresh install?
IOW, is the functionality enabled in the config?

Check that USE_MISC_DEVICE is not commented and is set to "yes"
in /etc/sysconfig/autofs.
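
A quick way to check (sketch):

grep '^USE_MISC_DEVICE=' /etc/sysconfig/autofs
# expected output:
# USE_MISC_DEVICE="yes"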

It would also be useful to enable debug logging and check that
autofs has in fact been restarted over the hibernate cycle.

Just in case, instructions to setup debug logging are at
http://people.redhat.com/jmoyer.
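
(In outline, one common setup, assuming autofs 5.x with rsyslog; the
page above has the authoritative steps:)

# in /etc/sysconfig/autofs, set:
#   LOGGING="debug"
# then route daemon-facility messages to a file, e.g. in /etc/rsyslog.conf:
#   daemon.*        /var/log/debug
service rsyslog restart
service autofs restart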

Ian

Comment 53 Orion Poplawski 2009-05-27 15:21:42 UTC
Created attachment 345627 [details]
/var/log/debug.gz

Well, USE_MISC_DEVICE="yes" was missing (I run cfengine and it had replaced /etc/sysconfig/autofs).  Added it in but it didn't have any effect.  I've attached /var/log/debug.gz with autofs debug logging turned on.  Hibernated at 9:04:56, awake at 9:06:00.

Comment 54 Ian Kent 2009-05-28 00:04:16 UTC
(In reply to comment #53)
> Created an attachment (id=345627) [details]
> /var/log/debug.gz
> 
> Well, USE_MISC_DEVICE="yes" was missing (I run cfengine and it had replaced
> /etc/sysconfig/autofs).  Added it in but it didn't have any effect.  I've
> attached /var/log/debug.gz with autofs debug logging turned on.  Hibernated at
> 9:04:56, awake at 9:06:00.  

Thanks for the log.
It is quite interesting.

It shows autofs wasn't restarted and hasn't unlinked the mounts,
but what you're seeing, the path corruption, must be due to the
mount being unlinked. The cwd proc file uses a kernel call to
calculate the path and it returns the path up to the point at
which the mount was unlinked.

That leaves us with the question of what happens when a machine
hibernates? I'll ask around and see if anyone can enlighten me
on what goes on at hibernate time.

Comment 55 Orion Poplawski 2009-05-28 04:58:36 UTC
My first thought is that /etc/NetworkManager/dispatcher.d/05-netfs may unmount NFS mounts when NM detects that the network is down (which you see in the debug log when the machine is waking up).  I've been trying to trick NM into leaving the network alone (I removed shutting NM down from the hibernate sequence) but it is too clever by half.

Comment 56 Ian Kent 2009-05-28 09:02:43 UTC
Oh look, in /etc/init.d/netfs

if [ -n "$NFSMTAB" ]; then
        __umount_loop '$3 ~ /^nfs/ && $2 != "/" {print $2}' \
                /proc/mounts \
                $"Unmounting NFS filesystems: " \
                $"Unmounting NFS filesystems (retry): " \
                "-f -l"
fi

The "-f -l" is $5 to __umount_loop() which looks like it's used
as arguments to umount. That would explain why were seeing the
problem but is not solution, of course.
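
The effect of a lazy detach on a process's cwd is easy to demonstrate
(a sketch, as root; the paths are made up):

mkdir -p /srv/demo /mnt/demo
mount --bind /srv/demo /mnt/demo
cd /mnt/demo            # park this shell inside the mount
umount -l /mnt/demo     # lazy detach succeeds despite the busy cwd
ls -l /proc/$$/cwd      # target is now truncated or unreachable
/bin/pwd                # may fail, as in the reports above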

Another thing that puzzles me, after a quick look at the netfs init
script, is why does NetworkManager think it's OK to allow netfs to
"fuser -k" processes to get rid of mounts at hibernate? Maybe the
expected use of this is in fact intended to be only at shutdown.

Comment 57 Orion Poplawski 2009-05-28 16:03:30 UTC
Removing 05-netfs does prevent the problem from occurring.

/etc/NetworkManager/dispatcher.d/05-netfs is part of initscripts.  This approach does make more sense for the laptop case where the machine is likely to be in another location when it wakes up.  The better long term solution is probably more graceful handling of the disappearance of NFS servers, but I'm not sure how realistic that is.  In the meantime, I'm not sure how you distinguish between the desktop and laptop case.
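
(A less destructive variant of removing the file, assuming NM's
dispatcher skips non-executable scripts:)

chmod -x /etc/NetworkManager/dispatcher.d/05-netfs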

Comment 58 Orion Poplawski 2009-05-28 16:12:51 UTC
BTW - I can confirm that this bug (restarted autofs) is fixed on F-11.  Shall I file a new bug on initscripts?

Comment 59 Ian Kent 2009-05-29 05:15:21 UTC
(In reply to comment #58)
> BTW - I can confirm that this bug (restarted autofs) is fixed on F-11.  Shall I
> file a new bug on initscripts?  

I'm not sure initscripts is the right component.

Running this at shutdown should be fine but NetworkManager
is running it at hibernate so perhaps NetworkManager is
the component that needs to work out what to do when the
machine is preparing to hibernate.

Comment 60 Orion Poplawski 2009-05-29 15:20:48 UTC
It is initscripts that put 05-netfs into /etc/NetworkManager/dispatcher.d.  NM is just doing what it is told - running a script when a network interface goes down.  Now, I'd prefer that NM never thought of the network as "down" during the hibernate/thaw process - but I don't think that is possible.  I've filed bug #503199.

It would be nice to see the fixed autofs in F-10 if it is not too much trouble, as I'll probably be running that for a while.

Thanks for all the help.

Comment 61 Orion Poplawski 2009-05-29 21:19:03 UTC
So, is there realistically any way for the system to unmount busy nfs mounts and then re-mount them and have the kernel keep track of it properly to avoid this issue?

Comment 62 Ian Kent 2009-06-08 03:43:42 UTC
(In reply to comment #61)
> So, is there realistically any way for the system to unmount busy nfs mounts
> and then re-mount them and have the kernel keep track of it properly to avoid
> this issue?  

I don't think so, since mounts must be present for the kernel
(and other systems) to know about them.

Flushing active RPCs and then leaving the mounts in place would be
a better approach, as NFSv3 and below should be stateless, but
there are no doubt problems with this I'm not aware of; NFSv4
would probably have difficulties.

Ian

Comment 63 Bug Zapper 2009-11-18 09:15:25 UTC
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 64 Bug Zapper 2009-12-18 05:58:18 UTC
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

