We have a fairly serious problem where the shell loses track of its current
working directory. This happens when in an autofs-mounted NFS home directory
(actually it is a bind mount, as the NFS server is the local machine).

The kernel is:
Linux xpc17.ast.cam.ac.uk 2.6.22.4-65.fc7 #1 SMP Tue Aug 21 21:50:50 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
(kernel-2.6.22.4-65.fc7)

The problem is intermittent, and occurs randomly in time across a whole set of
shells. Here is some example output:

xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ ipython
Python 2.5 (r25:51908, Apr 10 2007, 10:27:40)
Type "copyright", "credits" or "license" for more information.
..
In [1]: import os
In [2]: os.path.abspath('out.eps')
Out[2]: 'jssxmm/cen/rgs_099psf_comb_rembad/out.eps'
In [3]: Do you really want to exit ([y]/n)? y
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ which abs
Can't get current working directory
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ pwd
/data/jss/xmm/cen/rgs_099psf_comb_rembad
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ tcsh
[jss@xpc17 rgs_099psf_comb_rembad]$ pwd
jssxmm/cen/rgs_099psf_comb_rembad
[jss@xpc17 rgs_099psf_comb_rembad]$ which abc
/data/soft3/heasoft/headas-6.3.1/x86_64-unknown-linux-gnu/bin/abc
[jss@xpc17 rgs_099psf_comb_rembad]$ exit
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ which abc
Can't get current working directory
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ bash
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ which abc
/data/soft3/heasoft/headas-6.3.1/x86_64-unknown-linux-gnu/bin/abc
xpc17:/data/jss/xmm/cen/rgs_099psf_comb_rembad:$ exit

If you examine the cwd symlink in /proc for the process, you find it has been
mangled:

xpc17:~:$ ls -l /proc/16141/cwd
lrwxrwxrwx 1 jss users 0 2007-09-12 10:47 /proc/16141/cwd -> jssxmm/cen/rgs_099psf_comb_rembad

New shells in new terminals have mangled cwds:

[starts bash (pid 29643) in new terminal]
xpc17:~:$ ls -l /proc/29643/cwd
lrwxrwxrwx 1 jss users 0 2007-09-12 10:40 /proc/29643/cwd -> jss
[why's that not an absolute path?]
[enter command cd /home/jss]
xpc17:~:$ ls -l /proc/29643/cwd
lrwxrwxrwx 1 jss users 0 2007-09-12 10:40 /proc/29643/cwd -> /home/jss
[enter command cd]
xpc17:~:$ ls -l /proc/29643/cwd

This normally just works, but randomly breaks for a whole set of users.

Versions of possible culprits:
* kernel-2.6.22.4-65.fc7
* autofs-5.0.1-27
* nfs-utils-1.1.0-3.fc7
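For anyone wanting to spot affected processes quickly, a loop over /proc is
enough; this is only a sketch of the idea (the assumption being that an empty
or non-absolute cwd link indicates a hit), not a script we have validated:

  # flag processes whose /proc/PID/cwd link is empty or not an absolute path
  for pid in /proc/[0-9]*; do
      link=$(readlink "$pid/cwd" 2>/dev/null) || continue
      case "$link" in
          /*) ;;                                        # looks sane
          *)  echo "suspect cwd for ${pid#/proc/}: '$link'" ;;
      esac
  done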
What does 'mount' output while this is happening? (Exactly what is mounted where isn't given in the report.)
Sorry for leaving that out:

xpc17:~:$ mount
/dev/sda1 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda3 on /xpc17_data1 type ext3 (rw,noatime)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
nodev on /dev/oprofile type oprofilefs (rw)
/xpc17_data1/home/jss on /home/jss type none (rw,bind)
xalph3.ast.cam.ac.uk:/soft3 on /data/soft3 type nfs (rw,intr,addr=131.111.68.53)
/xpc17_data1/data/jss on /data/jss type none (rw,bind)
/xpc17_data1/scratch/jss on /scratch/jss type none (rw,bind)
xpc9.ast.cam.ac.uk:/xpc9_data1/scratch/jgraham on /scratch/jgraham type nfs (rw,intr,addr=131.111.68.181)
xalph3.ast.cam.ac.uk:/soft3/caldb on /data/caldb type nfs (rw,intr,addr=131.111.68.53)
(In reply to comment #0)
> We have a fairly serious problem where the shell loses track of its current
> working directory. This is when in an autofs mounted NFS home directory
> (actually it is bind mount as the NFS server is the local machine).

When did the problem start happening?
We were using RHEL 4 previously, and moved to Fedora 7 from that. We noticed it shortly after the move, around the start of June 2007. So the problem's somewhere between 2.6.9 and 2.6.21. The problem only happens every few weeks, so it's hard to provoke it.
I'll add that when I log in the shell immediately has a broken CWD:

[start terminal]
xpc17:~:$ pwd
/home/jss
xpc17:~:$ which abc
Can't get current working directory

But this doesn't happen with the root user. The root homedir is not on the
autofs mounted home partition, however.

This is what the NIS auto.master map looks like:

/home     auto.home     intr
/data     auto.data     intr
/data1    auto.data1    intr
/scratch  auto.scratch  intr
*** Bug 293491 has been marked as a duplicate of this bug. ***
Orion, what version of nfs-utils are you using?

Did this problem start happening recently or have you been seeing it for some
time? Has the problem continued through various kernel versions or started
with some particular kernel version?

Is there anything in the log?

Ian
(In reply to comment #2)
> xalph3.ast.cam.ac.uk:/soft3 on /data/soft3 type nfs (rw,intr,addr=131.111.68.53)
> /xpc17_data1/data/jss on /data/jss type none (rw,bind)
> /xpc17_data1/scratch/jss on /scratch/jss type none (rw,bind)
> xpc9.ast.cam.ac.uk:/xpc9_data1/scratch/jgraham on /scratch/jgraham type nfs
> (rw,intr,addr=131.111.68.181)
> xalph3.ast.cam.ac.uk:/soft3/caldb on /data/caldb type nfs
> (rw,intr,addr=131.111.68.53)

Are /soft3 and /soft3/caldb distinct filesystems on xalph3.ast.cam.ac.uk?

Do you mount distinct exports from servers multiple times, possibly to
multiple locations, possibly with different mount options?

Ian
xalph3:/soft3 and xalph3:/soft3/caldb are the same file system, so are
xpc17:/xpc17_data1/data/jss, xpc17:/xpc17_data1/scratch/jss and
xpc17:/xpc17_data1/home/jss.

They all should have the same mount options (all the nfs partitions are
mounted with autofs, which has the same mount options for each mount point).
This is what /proc/mounts currently shows (though the problem seems to have
temporarily stopped currently without rebooting):

rootfs / rootfs rw 0 0
/dev/root / ext3 rw,noatime,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/sda3 /xpc17_data1 ext3 rw,noatime,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0
nodev /dev/oprofile oprofilefs rw 0 0
auto.data1 /data1 autofs rw,fd=5,pgrp=29224,timeout=300,minproto=5,maxproto=5,indirect 0 0
auto.data /data autofs rw,fd=10,pgrp=29224,timeout=300,minproto=5,maxproto=5,indirect 0 0
auto.home /home autofs rw,fd=15,pgrp=29224,timeout=300,minproto=5,maxproto=5,indirect 0 0
auto.scratch /scratch autofs rw,fd=20,pgrp=29224,timeout=300,minproto=5,maxproto=5,indirect 0 0
/dev/sda3 /home/jss ext3 rw,noatime,data=ordered 0 0
xalph3.ast.cam.ac.uk:/soft3 /data/soft3 nfs rw,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=xalph3.ast.cam.ac.uk 0 0
/dev/sda3 /data/jss ext3 rw,noatime,data=ordered 0 0
xalph3.ast.cam.ac.uk:/soft3/caldb /data/caldb nfs rw,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=xalph3.ast.cam.ac.uk 0 0
(In reply to comment #9)
> xalph3:/soft3 and xalph3:/soft3/caldb are the same file system, so are
> xpc17:/xpc17_data1/data/jss, xpc17:/xpc17_data1/scratch/jss and
> xpc17:/xpc17_data1/home/jss
>
> They all should have the same mount options (all the nfs partitions are mounted
> with autofs, which has the same mount options for each mount point). This is
> what /proc/mounts currently shows (though the problem seems to have temporarily
> stopped currently without rebooting):

The point of the question is that, with the Fedora kernel above, various
sequences of events fail, like:

1) user logs in, causes mount of say fs A as rw.
2) pwd is changed away and A expires.
3) a mount of A + some path is mounted ro due to other activity.
4) user returns home, automount will not be able to mount the home directory.

If you think you're seeing this type of failure then you can overcome it by
changing the line '#OPTIONS=""' to 'OPTIONS="-O nosharecache"' in
/etc/sysconfig/autofs, provided you have nfs-utils-1.1.0 revision 3 or later.

Login itself has always been a bit iffy when it thinks the user home doesn't
exist, but I'm still quite puzzled by the apparent loss of pwd in the kernel;
I can't really relate that to any changes. There are, however, two mistakes
that were discovered after an autofs patch went into 2.6.22 which may be worth
trying. Given a patch, would you be able to patch and build a kernel for
testing?

Ian
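For concreteness, the /etc/sysconfig/autofs change described above would look
something like this (a sketch of the edit only; the surrounding contents of
the file will differ per install, and the restart command is the usual service
invocation, assumed here):

  # /etc/sysconfig/autofs
  # was:
  #OPTIONS=""
  # becomes:
  OPTIONS="-O nosharecache"

  # then restart the automounter so the option takes effect (assumed):
  #   service autofs restart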
It seems unlikely that the nfs partitions are being remounted with different
options here; it's very rare for anything other than the automounter to mount
nfs drives.

I'd be willing to patch a kernel and test it. The problem is pretty random, so
I can't guarantee how long it might take to reoccur.

(By the way, the report that the problem had gone away temporarily is
incorrect - it is still present.)
Created attachment 198321 [details]
Corrections for the mount/expire race patch in 2.6.22

This patch is against the current Fedora kernel CVS but (hopefully) should
apply to your kernel.

Ian
The above patch corrects the two mistakes I mentioned. For an explanation of
the purpose of the original patch look at:

http://marc.info/?l=linux-kernel&m=117151296416003&w=2

The first hunk deals with process wakeup order when more than one process is
waiting on an expire during a mount request (see comment above).

The second hunk also deals with the case of multiple processes competing for a
mount request: only one can succeed, so the others must look for the dentry
upon which the mount occurred, if in fact it succeeded.

Ian
(In reply to comment #13)
> The first hunk deals with process wakeup order when
> more than one process is waiting on an expire during
> a mount request (see comment above).

Oops, I didn't actually add the comment bit in this patch, sorry.

Ian
Oh and btw, are there any clues at all in the system log?
Hmmm - I forgot about this. I'm not sure it is relevant (apologies for the
nvidia tainted kernel - not my fault!).

WARNING: at fs/inotify.c:172 set_dentry_child_flags() (Tainted: PF )

Call Trace:
 [<ffffffff810b9024>] set_dentry_child_flags+0x6e/0x14d
 [<ffffffff81044475>] autoremove_wake_function+0x0/0x2e
 [<ffffffff810b923a>] remove_watch_no_event+0x38/0x47
 [<ffffffff810b9349>] inotify_remove_watch_locked+0x18/0x3b
 [<ffffffff810b962e>] inotify_rm_wd+0x7e/0xa1
 [<ffffffff810b9aa6>] sys_inotify_rm_watch+0x46/0x63
 [<ffffffff81009b5e>] system_call+0x7e/0x83
(In reply to comment #16)
> Hmmm - I forgot about this. I'm not sure it is relevant (apologies for the
> nvidia tainted kernel - not my fault!).
>
> WARNING: at fs/inotify.c:172 set_dentry_child_flags() (Tainted: PF )

It may be a clue as to what's happening, but I can't see that this, in itself,
would cause the problem you're seeing.

No automount messages?

Ian
Sorry - I missed this one in an old message log:

Sep 12 10:25:28 xpc17 automount[2158]: umount_autofs_indirect: ask umount returned busy /scratch
Sep 12 10:25:28 xpc17 automount[2158]: umount_autofs_indirect: ask umount returned busy /home
Sep 12 10:25:29 xpc17 automount[2158]: umount_autofs_indirect: ask umount returned busy /data
Sep 12 10:28:00 xpc17 mountd[2330]: authenticated mount request from xserv2.ast.cam.ac.uk:783 for /xpc17_data1/home/jss (/xpc17_data1)
Sep 12 10:28:00 xpc17 mountd[2330]: authenticated mount request from xserv2.ast.cam.ac.uk:786 for /xpc17_data1/data/jss (/xpc17_data1)

The machine has been up since this message.
And could you post a listing of a directory that contains (at least) one of these bad entries please.
I should add the above message happened just after an autofs update.

Do you mean these:

xpc17:/var/log:# ls -l /home
total 20
drwxr-xr-x 165 jss users 20480 2007-09-18 14:11 jss
xpc17:/var/log:# ls -l /data
total 8
drwxr-xr-x 41 jss users 4096 2007-08-31 15:12 jss
drwxr-xr-x 95 star star 4096 2007-08-28 12:45 soft3
xpc17:/var/log:# ls -l /scratch/
total 8
drwxr-xr-x 24 jgraham users 4096 2007-09-04 11:40 jgraham
drwxr-xr-x 18 jss users 4096 2007-03-06 10:16 jss

Or the root directory?

xpc17:/var/log:# ls -l /
total 160
drwxr-xr-x 2 root root 4096 2007-09-18 06:13 bin
drwxr-xr-x 3 root root 4096 2007-08-29 05:47 boot
lrwxrwxrwx 1 root root 11 2007-06-06 14:40 caldb -> /data/caldb
drwxr-xr-x 4 root root 0 2007-09-18 14:11 data
drwxr-xr-x 2 root root 0 2007-09-12 10:25 data1
drwxr-xr-x 13 root root 4540 2007-08-30 11:56 dev
drwxr-xr-x 95 root root 12288 2007-09-18 14:19 etc
drwxr-xr-x 3 root root 0 2007-09-18 12:09 home
lrwxrwxrwx 1 root root 26 2007-06-06 14:40 iraf -> /data/soft3/irafv2.12.PCIX
drwxr-xr-x 12 root root 4096 2007-09-17 05:34 lib
drwxr-xr-x 8 root root 4096 2007-09-13 05:35 lib64
drwx------ 2 root root 16384 2007-06-06 13:24 lost+found
drwxr-xr-x 2 root root 4096 2007-08-29 09:43 media
drwxr-xr-x 5 root root 4096 2007-06-07 05:14 mnt
drwxr-xr-x 2 root root 4096 2007-04-17 13:46 opt
dr-xr-xr-x 221 root root 0 2007-08-29 09:42 proc
drwxr-x--- 17 root root 4096 2007-09-18 14:11 root
drwxr-xr-x 2 root root 12288 2007-09-13 05:38 sbin
drwxr-xr-x 4 root root 0 2007-09-18 14:19 scratch
drwxr-xr-x 2 root root 4096 2007-06-06 13:24 selinux
drwxr-xr-x 3 root root 4096 2007-07-02 05:23 share
drwxr-xr-x 2 root root 4096 2007-04-17 13:46 srv
lrwxrwxrwx 1 root root 25 2007-06-06 14:40 star -> /data/soft3/starlink/star
drwxr-xr-x 12 root root 0 2007-08-29 09:42 sys
drwxrwxrwt 16 root root 4096 2007-09-18 14:10 tmp
drwxr-xr-x 16 root root 4096 2007-09-12 10:25 usr
drwxr-xr-x 22 root root 4096 2007-06-06 14:42 var
drwxr-xr-x 6 root root 4096 2005-07-01 14:59 xpc17_data1
(In reply to comment #7)
> Orion, what version of nfs-utils are you using?

nfs-utils-1.1.0-3.fc7

> Did this problem start happening recently or have you
> been seeing it for some time?
> Has the problem continued through various kernel versions
> or started with some particular kernel version?

Hard to say. I'll ask my long time F7 user if he can recall. I've just
completed upgrading our systems from F5 to F7 so I don't have much history on
F7.

> Is there anything in the log?

Not really.
(In reply to comment #20)
> I should add the above message happened just after an autofs update.

As I said, that function shouldn't cause what you're seeing but I'll look
deeper into the other functions in the call trace.

> Do you mean these:

Think so. I wanted the automount directory that contains problem mount points.
I'm after a listing of a directory immediately following a badness event. If
you wait a little while the VFS may reclaim any evidence of a problem that may
have occurred, so try for one straight after a problem happens.

Ian
We've just had another system start doing it in the last hour. There are no messages in the logs from auto* recently. There were no unusual directory entries. Each new terminal again appears to have an incorrect home directory as CWD. We'll have to try patching the kernel...
(In reply to comment #23)
> We've just had another system start doing it in the last hour. There are no
> messages in the logs from auto* recently. There were no unusual directory
> entries. Each new terminal again appears to have an incorrect home directory as
> CWD. We'll have to try patching the kernel...

Mmm .. we really need to narrow the search on this if we're to make progress.
Perhaps a better approach would be to try the kernel that was originally
shipped with F-7 first, or has that been seen to be a problem already? This
kernel has been around for a while without reports of problems. This would at
least narrow the search and save patching the kernel, for the moment at least.

Ian
I've installed a variety of F7 kernels on some of our desktops as well as a patched version of 2.6.22.6-81. We'll see....
I think it was in the original F7 kernel. I've found a note I made on our trac
database about the issue later (23rd July), when kernel-2.6.21-1.3228.fc7 was
installed: "I have noticed a couple of times (with kernel-2.6.21-1.3228.fc7
and before) that the kernel appears to lose track of the current directory of
running processes."

We'll try running the earliest kernel in case my memory is mistaken. I believe
there was only the install kernel before 1.3228.

By the way, it appears new logins from the console are okay. I think they are
broken when started from KDE, because the KDE desktop appears to have a broken
CWD (from proc).
We're using KDE too, wonder if that's meaningful...
Is there anything to report on this when using other kernels guys? Ian
The system running 2.6.21-1.3194.fc7 hasn't failed yet. It can take a few weeks however to show anything.
No failures again so far for me too. I'm closely monitoring the systems though and should be alerted as soon as it re-occurs.
One thing that may help is a <sys-rq>-t dump of a system immediately (at least before a reboot, hehe) after this happens.
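For reference, one way to capture that dump (assuming sysrq is enabled on the
box) is:

  echo 1 > /proc/sys/kernel/sysrq     # make sure sysrq is enabled
  echo t > /proc/sysrq-trigger        # dump task states to the kernel log
  dmesg > /tmp/sysrq-t.txt            # or pull it from /var/log/messages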
Created attachment 283051 [details]
sysrq-t dump

Looks like this got triggered when the 1:5.0.2-17 update was installed and
autofs restarted.

[root@vault ~]# ls -l /proc/*/cwd 2>/dev/null | grep '> $'
lrwxrwxrwx 1 root root 0 2007-12-09 04:47 /proc/15649/cwd ->
lrwxrwxrwx 1 ivm cora 0 2007-12-09 04:47 /proc/4035/cwd ->
lrwxrwxrwx 1 ivm cora 0 2007-12-09 04:47 /proc/4043/cwd ->
lrwxrwxrwx 1 ivm cora 0 2007-12-09 04:47 /proc/4056/cwd ->
lrwxrwxrwx 1 ivm cora 0 2007-12-09 04:47 /proc/4058/cwd ->
lrwxrwxrwx 1 ivm cora 0 2007-12-09 04:47 /proc/4060/cwd ->
lrwxrwxrwx 1 ivm cora 0 2007-12-09 04:47 /proc/4063/cwd ->
lrwxrwxrwx 1 ivm cora 0 2007-12-09 04:47 /proc/4065/cwd ->
lrwxrwxrwx 1 ivm cora 0 2007-12-09 04:47 /proc/4066/cwd ->
lrwxrwxrwx 1 root root 0 2007-12-09 04:47 /proc/4100/cwd ->
lrwxrwxrwx 1 root root 0 2007-12-09 04:47 /proc/4309/cwd ->

[root@vault ~]# ps -fu ivm
UID        PID  PPID  C STIME TTY          TIME CMD
ivm       4035     1  0 Nov21 ?        00:35:41 Xvnc :1 -desktop vault:1 (ivm) -httpd /usr
ivm       4043     1  0 Nov21 ?        00:00:00 /bin/sh /etc/xdg/xfce4/xinitrc
ivm       4051     1  0 Nov21 ?        00:00:00 /usr/bin/ssh-agent -s
ivm       4056     1  0 Nov21 ?        00:00:00 xfce-mcs-manager
ivm       4058     1  0 Nov21 ?        00:00:01 xfwm4 --daemon
ivm       4060  4043  0 Nov21 ?        00:01:43 xfdesktop
ivm       4063  4043  0 Nov21 ?        00:15:24 /usr/bin/xfce4-panel
ivm       4065     1  0 Nov21 ?        00:01:51 /usr/libexec/gam_server
ivm       4066  4063  0 Nov21 ?        00:00:52 /usr/libexec/xfce4/panel-plugins/xfce4-men
ivm       4100     1  0 Nov21 ?        00:00:01 xterm
ivm       4102  4100  0 Nov21 pts/1    00:00:01 -csh
ivm       4309     1  0 Nov21 ?        00:10:28 xterm
ivm       4311  4309  0 Nov21 pts/2    00:00:00 -csh
ivm      15649     1  0 Dec03 ?        00:00:01 xterm
ivm      15651 15649  0 Dec03 pts/0    00:00:00 -csh
ivm      23686  4102  0 Dec07 pts/1    00:00:00 /bin/bash /opt/local/bin/readsolartape /de
ivm      23691  4311  0 Dec07 pts/2    00:00:00 /bin/bash /opt/local/bin/readsolartape /de
The above is with 2.6.23.1-49.fc8
So could we try and duplicate this by having a few logins on a test machine and then restarting autofs?
Looks like it. Just saw it triggered on most of my F7 machines when today's autofs update was applied.
OK. I've started setting up a KDE install to try and duplicate it.
(In reply to comment #16)
> Hmmm - I forgot about this. I'm not sure it is relevant (apologies for the
> nvidia tainted kernel - not my fault!).
>
> WARNING: at fs/inotify.c:172 set_dentry_child_flags() (Tainted: PF )
>
> Call Trace:
> [<ffffffff810b9024>] set_dentry_child_flags+0x6e/0x14d
> [<ffffffff81044475>] autoremove_wake_function+0x0/0x2e
> [<ffffffff810b923a>] remove_watch_no_event+0x38/0x47
> [<ffffffff810b9349>] inotify_remove_watch_locked+0x18/0x3b
> [<ffffffff810b962e>] inotify_rm_wd+0x7e/0xa1
> [<ffffffff810b9aa6>] sys_inotify_rm_watch+0x46/0x63
> [<ffffffff81009b5e>] system_call+0x7e/0x83

See bz #248355.
*** Bug 248355 has been marked as a duplicate of this bug. ***
Upstream kernel.org bug:

http://bugzilla.kernel.org/show_bug.cgi?id=8938

Please could all reporters add themselves to this bug upstream. There appears
to be a fix waiting to be confirmed for 2.6.24-rc6, so it would be appreciated
if 2.6.24 could be tested when it arrives.

Cheers
Chris
(In reply to comment #39)
> Upstream kernel.org bug:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=8938

This isn't the bug described above.

Ian
My investigation has shown that restarting autofs with busy mounts results in this problem. I'm working to resolve it but it is not a trivial problem. Ian
Adding Tracking keyword
This message is a reminder that Fedora 7 is nearing the end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining and
issuing updates for Fedora 7. It is Fedora's policy to close all bug reports
from releases that are no longer maintained. At that time this bug will be
closed as WONTFIX if it remains open with a Fedora 'version' of '7'.

Package Maintainer: If you wish for this bug to remain open because you plan
to fix it in a currently maintained version, simply change the 'version' to a
later Fedora version prior to Fedora 7's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may
not be able to fix it before Fedora 7 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version of
Fedora please change the 'version' of this bug. If you are unable to change
the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a more recent
Fedora release includes newer upstream software that fixes bugs or makes them
obsolete. If possible, it is recommended that you try the newest available
Fedora distribution to see if your bug still exists. Please read the Release
Notes for the newest Fedora distribution to make sure it will meet your needs:
http://docs.fedoraproject.org/release-notes/

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining and
issuing updates for Fedora 8. It is Fedora's policy to close all bug reports
from releases that are no longer maintained. At that time this bug will be
closed as WONTFIX if it remains open with a Fedora 'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you plan
to fix it in a currently maintained version, simply change the 'version' to a
later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may
not be able to fix it before Fedora 8 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version of
Fedora please change the 'version' of this bug to the applicable version. If
you are unable to change the version, please add a comment here and someone
will do it for you.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a more recent
Fedora release includes newer upstream software that fixes bugs or makes them
obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.
I can reproduce this as well by hibernating/resuming without an autofs restart.
(In reply to comment #47)
> I can reproduce this as well by hibernating/resuming without an autofs restart.

That could only happen if the sleep/wake functionality is doing something like
restarting autofs.

In any case this should be resolved in autofs-5.0.4 and with a sufficiently
recent kernel; F-10 doesn't have such a kernel, but F-11 does, and its autofs
is based on 5.0.4.

Note that this cannot be fixed without a kernel that supports the ioctl
re-implementation that has been done for this and the changes in autofs to use
it.
Does the kernel-2.6.29.3-60.fc10 in updates-testing support this?
(In reply to comment #49)
> Does the kernel-2.6.29.3-60.fc10 in updates-testing support this?

Yep, it certainly should. So maybe I need to start thinking about back porting
the autofs active-restart (as I'm calling this) patches or update F-10 to
5.0.4 .... mmmm.
(In reply to comment #48)
> (In reply to comment #47)
> > I can reproduce this as well by hibernating/resuming without an autofs restart.
>
> That could only happen if the sleep/wake functionality is doing
> something like restarting autofs.
>
> In any case this should be resolved in autofs-5.0.4 and with a
> sufficiently recent kernel and F-10 doesn't have such a kernel
> F-11 does and its autofs is based on 5.0.4.

Well, I'm now running F-11. After a hibernate/resume I have lost current
working directories but automount has not been restarted:

[root@orca ~]# cwdcheck
[root@orca ~]# pm-hibernate
[root@orca ~]# cwdcheck
CWD problems
[root@orca ~]# ls -l /proc/*/cwd
lrwxrwxrwx. 1 root root 0 2009-05-26 13:39 /proc/1021/cwd -> /
....
lrwxrwxrwx. 1 orion cora 0 2009-05-26 13:39 /proc/22695/cwd -> src/IT_TEST
lrwxrwxrwx. 1 orion cora 0 2009-05-26 13:39 /proc/2348/cwd ->

Note that some are broken links. "src/IT_TEST" is in /home/orion which is nfs
automounted. Looks like "/home/orion" was simply stripped from the cwd.

2.6.29.3-155.fc11.i686.PAE
autofs-5.0.4-24.i586
(In reply to comment #51)
> (In reply to comment #48)
> > (In reply to comment #47)
> > > I can reproduce this as well by hibernating/resuming without an autofs restart.
> >
> > That could only happen if the sleep/wake functionality is doing
> > something like restarting autofs.
> >
> > In any case this should be resolved in autofs-5.0.4 and with a
> > sufficiently recent kernel and F-10 doesn't have such a kernel
> > F-11 does and its autofs is based on 5.0.4.
>
> Well, I'm now running F-11. After a hibernate/resume I have lost current
> working directories but automount has not be restarted:

Was it an upgrade or a fresh install? IOW, is the functionality enabled in the
config? Check that USE_MISC_DEVICE is not commented and is set to "yes" in
/etc/sysconfig/autofs.

It would also be useful to enable debug logging and check whether autofs has
in fact been restarted over the hibernate cycle. Just in case, instructions to
set up debug logging are at http://people.redhat.com/jmoyer.

Ian
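In other words, something along these lines (a sketch; the option names are
taken from the shipped /etc/sysconfig/autofs, and the linked instructions
above are the authoritative reference for the logging setup):

  # confirm the miscellaneous-device (ioctl) interface is enabled
  grep '^USE_MISC_DEVICE' /etc/sysconfig/autofs   # expect: USE_MISC_DEVICE="yes"
  ls -l /dev/autofs                               # node should exist once automount is running

  # enable debug logging in /etc/sysconfig/autofs (assumed option), then
  # restart autofs and watch syslog over a hibernate cycle:
  #   LOGGING="debug"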
Created attachment 345627 [details]
/var/log/debug.gz

Well, USE_MISC_DEVICE="yes" was missing (I run cfengine and it had replaced
/etc/sysconfig/autofs). Added it in but it didn't have any effect. I've
attached /var/log/debug.gz with autofs debug logging turned on. Hibernated at
9:04:56, awake at 9:06:00.
(In reply to comment #53)
> Created an attachment (id=345627) [details]
> /var/log/debug.gz
>
> Well, USE_MISC_DEVICE="yes" was missing (I run cfengine and it had replaced
> /etc/sysconfig/autofs). Added it in but it didn't have any effect. I've
> attached /var/log/debug.gz with autofs debug logging turned on. Hibernated at
> 9:04:56, awake at 9:06:00.

Thanks for the log. It is quite interesting. It shows autofs wasn't restarted
and hasn't unlinked the mounts, but what you're seeing, the path corruption,
must be due to the mount being unlinked. The cwd proc file uses a kernel call
to calculate the path and it returns the path up to the point at which the
mount was unlinked.

That leaves us with the question of what happens when a machine hibernates.
I'll ask around and see if anyone can enlighten me on what goes on at
hibernate time.
My first thought is that /etc/NetworkManager/dispatcher.d/05-netfs may unmount NFS mounts when NM detects that the network is down (which you see in the debug log when the machine is waking up). I've been trying to trick NM into leaving the network alone (I removed shutting NM down from the hibernate sequence) but it is too clever by half.
Oh look, in /etc/init.d/netfs:

if [ -n "$NFSMTAB" ]; then
        __umount_loop '$3 ~ /^nfs/ && $2 != "/" {print $2}' \
            /proc/mounts \
            $"Unmounting NFS filesystems: " \
            $"Unmounting NFS filesystems (retry): " \
            "-f -l"

The "-f -l" is $5 to __umount_loop(), which looks like it's used as arguments
to umount. That would explain why we're seeing the problem, but is not a
solution, of course.

Another thing that puzzles me, after a quick look at the netfs init script, is
why NetworkManager thinks it's OK to allow netfs to "fuser -k" processes to
get rid of mounts at hibernate. Maybe the expected use of this is in fact
intended to be only at shutdown.
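The mechanism is easy to see in isolation with a lazy unmount; here's a rough
reproduction sketch using a throwaway tmpfs rather than the real NFS mounts
(run as root on a scratch machine):

  mkdir -p /mnt/demo
  mount -t tmpfs none /mnt/demo
  cd /mnt/demo                  # the shell's cwd now keeps the mount busy
  umount -l /mnt/demo           # lazy unmount detaches it anyway
  ls -l /proc/$$/cwd            # the cwd link no longer shows the full
                                # absolute path (exact output varies by kernel)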
Removing 05-netfs does prevent the problem from occurring. /etc/NetworkManager/dispatcher.d/05-netfs is part of initscripts. This approach does make more sense for the laptop case where the machine is likely to be in another location when it wakes up. The better long term solution is probably more graceful handling of the disappearance of NFS servers, but I'm not sure how realistic that is. In the meantime, I'm not sure how you distinguish between the desktop and laptop case.
BTW - I can confirm that this bug (restarted autofs) is fixed on F-11. Shall I file a new bug on initscripts?
(In reply to comment #58)
> BTW - I can confirm that this bug (restarted autofs) is fixed on F-11. Shall I
> file a new bug on initscripts?

I'm not sure initscripts is the right component. Running this at shutdown
should be fine but NetworkManager is running it at hibernate so perhaps
NetworkManager is the component that needs to work out what to do when the
machine is preparing to hibernate.
It is initscripts that put 05-netfs into /etc/NetworkManager/dispatcher.d. NM
is just doing what it is told - running a script when a network interface goes
down. Now, I'd prefer that NM never thought of the network as "down" during
the hibernate/thaw process - but I don't think that is possible. I've filed
bug #503199.

It would be nice to see the fixed autofs in F-10 if it is not too much trouble,
as I'll probably be running that for a while. Thanks for all the help.
So, is there realistically any way for the system to unmount busy nfs mounts and then re-mount them and have the kernel keep track of it properly to avoid this issue?
(In reply to comment #61)
> So, is there realistically any way for the system to unmount busy nfs mounts
> and then re-mount them and have the kernel keep track of it properly to avoid
> this issue?

I don't think so, since mounts must be present for the kernel (and other
systems) to know about them. Flushing active RPCs and then leaving the mounts
in place would be a better approach, as NFSv3 and below should be stateless,
but there are no doubt problems with this I'm not aware of, and NFSv4 would
probably have difficulties.

Ian
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining and
issuing updates for Fedora 10. It is Fedora's policy to close all bug reports
from releases that are no longer maintained. At that time this bug will be
closed as WONTFIX if it remains open with a Fedora 'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you plan
to fix it in a currently maintained version, simply change the 'version' to a
later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may
not be able to fix it before Fedora 10 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version of
Fedora please change the 'version' of this bug to the applicable version. If
you are unable to change the version, please add a comment here and someone
will do it for you.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a more recent
Fedora release includes newer upstream software that fixes bugs or makes them
obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.