Description of problem:

Since an upgrade to Fedora 8, one of our workstations started to behave erratically. Mainly, programs that were run from users' home directories suddenly lost their working directories. This is a major regression from Fedora 7, where we were able to run this kind of installation for ages. Downgrading the autofs package to the version used with Fedora 7 does not change the behavior. We traced this problem to the current kernel version of Fedora 8, so it seems to be a kernel problem.

We are using a heavily networked infrastructure with autofs-mounted home directories. The paths are resolved with LDAP using RFC2307bis entries without wildcards. All entries are actually in the auto_home map.

After a while, the current working directory suddenly disappears. E.g. with firefox:

    [henning@forge tmp]$ ps auxw | grep firefox
    henning 7293 0.0 0.0 4652 1104 ? S 21:42 0:00 /bin/sh /usr/lib/firefox-2.0.0.10/firefox
    [henning@forge tmp]$ ls -la /proc/7293/cwd
    lrwxrwxrwx 1 henning henning 0 2008-01-04 21:42 /proc/7293/cwd ->

The symlink for /proc/<pid>/cwd no longer points to the user's home directory where firefox was started.

This even happens with TIMEOUT=0 in the /etc/sysconfig/autofs configuration file. While the mounts never time out, processes still lose their working directory.

Version-Release number of selected component (if applicable):

2.6.23.9-85.fc8
The XEN kernel (2.6.21-2952.fc8xen) does not exhibit this behavior.

How reproducible:

Always

Steps to Reproduce:
1. cd <autofs mounted directory>
2. bash & (note the PID)
3. run "while [ 1 ]; do ( date; ls -la /proc/<noted PID>/cwd ; sleep 1 ) >> /tmp/logfile.losing_homedir ; done &"
4. fg %1
5. in another shell: tail -f /tmp/logfile.losing_homedir
6. wait for a while

Actual results:

After a while, the /proc/<pid>/cwd link suddenly points to nothing.

Expected results:

The /proc/<pid>/cwd link should point to the <autofs mounted directory> the whole time.

Additional info:
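The logging loop from the steps to reproduce can also be written as a small set of shell helpers. This is only a sketch, not something taken from the report: the function names and the PROC_ROOT override are made up for illustration, and it assumes that a lost working directory reads back as an empty link target, as in the ls output above.

```shell
#!/bin/sh
# Sketch: log the cwd link of one process at a fixed interval.
# PROC_ROOT is a hypothetical override (not part of the report) so the
# helpers can be pointed at a test directory instead of /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print the target of <PROC_ROOT>/<pid>/cwd.
# Prints nothing if the link is unreadable or its target is empty.
cwd_target() {
    readlink "$PROC_ROOT/$1/cwd" 2>/dev/null || true
}

# Classify a link target string: an empty target is treated as "lost"
# (assumption based on the ls output shown in the report).
cwd_state() {
    if [ -n "$1" ]; then
        echo ok
    else
        echo lost
    fi
}

# Append "<date> <pid> <state> <target>" to a log file <count> times,
# sleeping <interval> seconds (default 1) between samples.
watch_cwd() {
    pid=$1
    logfile=$2
    count=$3
    interval=${4:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        target=$(cwd_target "$pid")
        echo "$(date) $pid $(cwd_state "$target") $target" >> "$logfile"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, after noting the PID of the background bash, something like `watch_cwd <noted PID> /tmp/logfile.losing_homedir 600` would take one sample per second for ten minutes; a line containing "lost" marks the moment the link went empty.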
Created attachment 290870
logfile from a bash losing its working directory
Created attachment 290871
logfile from a firefox browser losing its working directory.
(In reply to comment #0)
> Version-Release number of selected component (if applicable):
>
> 2.6.23.9-85.fc8
> The XEN kernel (2.6.21-2952.fc8xen) does not exhibit this behavior.

Are you sure about this?
What evidence allowed you to arrive at this?
Are you using KDE?

> How reproducible:
>
> Always

But I'm not able to reproduce the problem and I've been trying for a while now. What else can you tell me about your environment?

Ian
Yes, I am very sure about this. I arrived at this conclusion by running the test as described above.

I attached a bunch of relevant files from the client and a dump of the relevant entries in the LDAP server. If you need more, just tell me.

We have been running FC8 on XEN (base system is CentOS5) for ages and never experienced these problems. I will try downgrading to the latest known FC7 kernel for testing next week.

We are not using KDE at all, only Gnome for desktops. But even a bash started from a text prompt shows this behavior.

What else can I tell you? This is pretty much a vanilla FC8 installation with all upgrades installed. What info would you need?

autofs: autofs-5.0.2-24
kernel: 2.6.23.9-85.fc8
openldap: 2.3.39-1.fc8

The mounted filesystems are exported by a CentOS5 x86_64 server using NFSv3 over TCP. We use autofs indirect maps:

    server:/mnt/disk0/home/henning /home/henning nfs rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=shirley.intermeta.de 0 0

(there are a bunch more mounts under /home)

All user information (UID, GID) also comes from LDAP.
Doing an "ls -la /proc/*/cwd" as root yields:

    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/1545/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/16498/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/16513/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/16518/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/31924/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32145/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32146/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32147/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32173/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32180/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32218/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32237/cwd ->
    lrwxrwxrwx 1 root root 0 2008-01-05 07:42 /proc/32248/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/8112/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/8155/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/8159/cwd ->
    lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/8315/cwd ->

so there are a number of processes which have lost their cwd.

I've rebooted the box various times and the problem persists. There are no Oops messages or other errors in dmesg or the system log.

The only unusual thing that we run on the box is the nvidia gfx driver. I know that you guys are sensitive to this, so I tried without the driver first and it yields the same results.
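A listing like the one above can be reduced to just the affected PIDs with a small filter. The following is a sketch only; the function name lost_cwd_pids is made up for illustration, and it assumes the "ls -la" output format shown here, where a lost cwd appears as a line ending in "->" with no target.

```shell
#!/bin/sh
# Sketch: read "ls -la /proc/*/cwd" output on stdin and print the PID
# of every entry whose link target is empty.
# Assumption: a lost cwd shows up as a line ending in "->".
lost_cwd_pids() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # Match the "/proc/<pid>/cwd" field followed by a final "->".
            if ($i ~ /^\/proc\/[0-9]+\/cwd$/ && $(i + 1) == "->" && i + 1 == NF) {
                split($i, parts, "/")   # parts[3] holds the PID
                print parts[3]
            }
        }
    }'
}
```

It could be used as `ls -la /proc/*/cwd 2>/dev/null | lost_cwd_pids`, run as root so that the links of all processes are readable.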
Created attachment 290883
bunch of configuration files and LDAP dumps relevant to this problem
(In reply to comment #4)
> Yes, I am very sure about this. I arrived to this conclusion by running the test
> as described above.

OK, that's new information.
We have another bug logged on this issue, 287411, but I'd like to keep this bug separate for now while I see what new information I can get from your experience.

> I attached you a bunch of relevant files from the client and a dump of the
> relevant entries in the LDAP Server. If you need more, just tell me.
>
> We are running FC8 on XEN (base system is CentOS5) for ages and never
> experienced these problems. I will try downgrading to the latest known FC7
> kernel for testing next week.
>
> We are not using KDE at all, only Gnome for desktops. But even a bash started
> from a text prompt shows this behavior.

OK, that's also different.

> What else can I tell you? This is pretty much a vanilla FC8 installation with
> all upgrades installed. What info would you need?

Not sure, but a debug log from autofs from a time prior to this happening may be useful. Please make sure you are actually sending daemon.* to syslog, and enable debug logging in the autofs configuration.

> autofs: autofs-5.0.2-24
> kernel: 2.6.23.9-85.fc8
> openldap: 2.3.39-1.fc8
>
> The mounted filesystems are exported by a CentOS5 x86_64 server using NFSv3 over
> TCP. We use autofs indirect maps:
>
> server:/mnt/disk0/home/henning /home/henning nfs
> rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=shirley.intermeta.de
> 0 0
>
> (there are a bunch more mounts under /home)
>
> All user information (UID, GID) also comes from LDAP.
>
> doing a "ls -la /proc/*/cwd" as root yields:
>
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/1545/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/16498/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/16513/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/16518/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/31924/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32145/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32146/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32147/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32173/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32180/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32218/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/32237/cwd ->
> lrwxrwxrwx 1 root root 0 2008-01-05 07:42 /proc/32248/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/8112/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/8155/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/8159/cwd ->
> lrwxrwxrwx 1 henning henning 0 2008-01-05 07:42 /proc/8315/cwd ->
>
> so there are a number of processes which have lost their cwd.
>
> I've booted the box various times and the problem persists. There are no OOps or
> other errors in dmesg or the system log.
>
> The only unusual thing that we run on the box is the nvidia gfx driver. I know
> that you guys are sensitive to this, so I tried without the driver first and it
> yields the same results.

Personally, I don't really care; I just want to get more info about this so I can find out what's wrong. I also have to run the Nvidia binaries on my Fedora install.

Ian
(In reply to comment #4)
> Yes, I am very sure about this. I arrived to this conclusion by running the test
> as described above.

The main reason I want to nail this down is that I've not been able to get a clear indication of broken and not broken kernel versions so far. I really need to start by finding out if this started after I made a change to the autofs4 kernel module or not. If I don't, I'll just end up going round in circles.

Ian
We are using the livna RPMs for nvidia; this should save you time installing the driver.

I activated debug logging for autofs on that box and we log *.* to a file anyway. As soon as I have some debug logs, I will attach them here.

Thanks a lot for looking into this.
(In reply to comment #8)
> We are using the livna RPMs for nvidia, this should save you time installing the
> driver.

I use atrpms.

> I activated debug logging for autofs on that box and we log *.* to a file
> anyway. As soon as I have some debug logs, I will attach them here.
>
> Thanks a lot for looking into this.

I updated my machine to F8 today. I have

    [raven@raven ~]$ uname -a
    Linux raven.themaw.net 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:36 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
    [raven@raven ~]$ rpm -q autofs
    autofs-5.0.2-24.x86_64

Using a file map auto.home2

    [raven@raven ~]$ cat /etc/auto.home2
    auto shark:/export/&

Carrying out the exact same steps outlined, I've not seen a problem yet.
I'll keep trying.

Ian
You are testing with x86_64! My box is an i386.

    % uname -an
    Linux forge.intermeta.de 2.6.23.9-85.fc8 #1 SMP Fri Dec 7 15:49:59 EST 2007 i686 i686 i386 GNU/Linux

Here is a 100% sure way to reproduce the problem:

1. open xterm
2. echo $$ --> yields a PID
3. ls -la /proc/$$/cwd
   lrwxrwxrwx 1 henning henning 0 2008-01-05 22:39 /proc/3341/cwd -> /home/henning
4. open another xterm
5. become root, run /sbin/service autofs restart
6. in the first xterm, repeat the ls -la /proc/$$/cwd command:
   lrwxrwxrwx 1 henning henning 0 2008-01-05 22:39 /proc/3341/cwd ->

=> cwd is gone.

Here is another data point that might be a different bug but looks related.

1. pwd
   /home/henning
2. ls -la /proc/$$/cwd
   lrwxrwxrwx 1 henning henning 0 2008-01-05 22:42 /proc/3581/cwd -> /home/henning
3. df -h .
   server:/mnt/disk0/home/henning 1.5T 343G 1.2T 23% /home/henning
4. /usr/sbin/lsof -p $$ | grep cwd
   bash 3581 henning cwd DIR 0,25 16384 159777672 /home/henning (server:/mnt/disk0/home/windows)

Whoops! /mnt/disk0/home/windows is mounted as "/home/windows", not "/home/henning". That somehow got confused by lsof. Where does it get its information from?
/proc/mounts seems ok:

    # cat /proc/mounts
    /dev/root / ext3 rw,relatime,data=ordered 0 0
    /dev /dev tmpfs rw,relatime 0 0
    /proc /proc proc rw,relatime 0 0
    /sys /sys sysfs rw,relatime 0 0
    /proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
    devpts /dev/pts devpts rw,relatime 0 0
    /dev/sda1 /boot ext3 rw,relatime,data=ordered 0 0
    tmpfs /dev/shm tmpfs rw,relatime 0 0
    /dev/mapper/system_vg-lv_var /var ext3 rw,relatime,data=ordered 0 0
    /dev/mapper/system_vg-data_lv /mnt/disk0 ext3 rw,relatime,data=ordered 0 0
    /dev/mapper/system_vg-data_lv /usr/src ext3 rw,relatime,data=ordered 0 0
    none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
    sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
    nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
    -hosts /net autofs rw,relatime,fd=6,pgrp=3673,timeout=0,minproto=5,maxproto=5,indirect 0 0
    auto_cd /mnt/cd autofs rw,relatime,fd=12,pgrp=3673,timeout=0,minproto=5,maxproto=5,indirect 0 0
    auto_home /home autofs rw,relatime,fd=18,pgrp=3673,timeout=0,minproto=5,maxproto=5,indirect 0 0
    server:/mnt/disk0/home/henning /home/henning nfs rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=server 0 0
    server:/mnt/media/video /home/video nfs rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=server 0 0
    server:/mnt/disk0/home/windows /home/windows nfs rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=server 0 0
    /dev/mapper/system_vg-data_lv /home/scratch ext3 rw,relatime,data=ordered 0 0
    /dev/mapper/system_vg-data_lv /home/henning.stand ext3 rw,relatime,data=ordered 0 0
    /dev/mapper/system_vg-data_lv /home/rpmbuild ext3 rw,relatime,data=ordered 0 0

I also attached a dump of my full log (*.* /var/log/fulllog in /etc/rsyslog.conf) around an autofs restart.
I also noticed that you only have a single entry in your automounter map. If this is a race, there is a good chance that you need multiple entries. Try getting a number of mounts and use them like this:

auto.master:

    /test /etc/auto.test

auto.test:

    test1: server:/export/&
    test2: server:/export/&
    test3: server:/export/&

(your server must actually have and export /export/test1 - test3)

Make sure that you actually mount multiple directories from the server, not just one.

Restarting autofs makes this behavior 100% reproducible. Open a shell, "cd /test/test1", restart autofs. cwd gone.
Created attachment 290893
debug level dump of system log while autofs restarts
Thanks to your persistence I've now been able to reproduce this as well.

It is due to the "umount -l" that autofs does on its active mounts at startup. This was done to allow for restarting when mounts are still busy. It works fine for a process that is already accessing the old mount, and subsequent file requests simply get a new mount, so almost everything is OK.

But after the mount is detached, the proc filesystem can no longer walk the path from the mount up to the root to return the working directory string.

This is kind of a big deal for autofs.

Ian
Well, restarting autofs is one way to trigger this problem. I am, however, very sure that this is not the only way and that there is still more trouble, because it does not explain the first logs that I sent you. At that time, I ran autofs with TIMEOUT=0 (never unmount any mounted filesystem) and did not restart autofs at all (I found this out by accident yesterday). Still, some processes lost their cwd without any obvious reason.

As you asked about KDE and I use Gnome, could it be that the mount managers of the desktops (gnome-volume-manager) exhibit a similar problem?

I don't think that we can point our finger at autofs yet.
(In reply to comment #14)
> Well, restarting autofs is a way to trigger this problem. I am however very sure
> that this is not the only way and there is still more trouble, because that does
> not explain the first logs that I sent you. At this time, I ran autofs with
> TIMEOUT=0 (never unmount any mounted filesystem) and did not restart autofs at
> all (I found this out by accident yesterday). Still some processes lost their
> cwd without any obvious reason.

Mmmm ... but I think the "umount -l" is a problem and I still need to do some more testing as I also have more questions.

> As you asked about KDE and I use gnome, could it be that the mount managers of
> the desktops (gnome-volume-manager) exhibit a similar problem.

Interesting thought.
If this happens without umounting and without a restart then that is a very different situation. We really need to confirm if that is the case. Unfortunately (for me), I suspect not.

> I don't think that we can point our finger to autofs yet.

The finger may still be a little bent but it's straightening up (ouch).

Ian
OK, I ran our Fedora 8 boxes for a while and made sure that autofs never gets touched and/or restarted. I did not see any lost cwd's, so the "autofs is the culprit" suspicion is probably right.

Some other things still bother me:

as root, I can e.g. run "umount /home/henning" and have it disappear from df:

    [henning@forge ~]$ df -h
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/mapper/system_vg-root_lv
                           16G  5.5G  9.3G  38% /
    /dev/sda1             190M   19M  162M  11% /boot
    tmpfs                 1.7G   12K  1.7G   1% /dev/shm
    /dev/mapper/system_vg-lv_var
                          3.9G  799M  2.9G  22% /var
    /dev/mapper/system_vg-data_lv
                           89G   40G   45G  47% /mnt/disk0
    /mnt/disk0/src         89G   40G   45G  47% /usr/src
    server:/mnt/media/video
                          1.1T  527G  531G  50% /home/video
    server:/mnt/disk0/home/windows
                          1.5T  325G  1.2T  22% /home/windows
    server2:/mnt/disk0/install
                           35G   18G   16G  54% /home/install
    server:/mnt/disk0/archiv
                          1.5T  325G  1.2T  22% /home/archiv
    server:/mnt/disk0/mirror
                          1.5T  325G  1.2T  22% /home/mirror
    server:/mnt/media/mp3
                          1.1T  527G  531G  50% /home/mp3

    [henning@forge ~]$ pwd
    /home/henning
    [henning@forge ~]$ df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    server:/mnt/disk0/mirror
                          1.5T  325G  1.2T  22% /home/mirror

That is obviously wrong. /home/henning is at server:/mnt/disk0/home/henning, not /mnt/disk0/mirror.

This seems to be the same problem as the "lsof" display. This is annoying but (hopefully) only cosmetic.
/proc/mounts gets it right:

    # cat /proc/mounts | grep :
    server:/mnt/disk0/home/henning /home/henning nfs rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=server 0 0
    server:/mnt/media/video /home/video nfs rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=server 0 0
    server:/mnt/disk0/home/windows /home/windows nfs rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=server 0 0
    server:/mnt/media/mp3 /home/mp3 nfs rw,relatime,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=server 0 0
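To cross-check what df or lsof report against the kernel's view, the mount covering a path can be looked up directly in /proc/mounts by longest mount-point prefix. This is a rough sketch, not a tool from this report: mount_for_path is a made-up name, and the sketch ignores the octal escapes (such as \040 for a space) that /proc/mounts uses in mount points.

```shell
#!/bin/sh
# Sketch: print "<device> <mountpoint>" for the mount that covers a path,
# chosen as the longest mount point that is a prefix of the path.
# The second argument is a /proc/mounts-format file (default /proc/mounts).
mount_for_path() {
    path=$1
    mounts=${2:-/proc/mounts}
    awk -v p="$path" '
        {
            mp = $2
            # A mount point covers the path if it is the path itself,
            # the root, or a leading directory component of the path.
            if (mp == p || mp == "/" || index(p, mp "/") == 1) {
                # ">=" lets a later entry on the same mount point win.
                if (length(mp) >= best_len) {
                    best_len = length(mp)
                    best = $1 " " mp
                }
            }
        }
        END { if (best != "") print best }
    ' "$mounts"
}
```

With the /proc/mounts content shown above, `mount_for_path /home/henning` would be expected to name server:/mnt/disk0/home/henning, which can then be compared with the output of "df -h ." in the same directory.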
(In reply to comment #16)
> ok, I ran our Fedora 8 boxes for a while and made sure that autofs never gets
> touched and/or restarted. I did not see any lost cwd's, so the "autofs is the
> culprit" suspicion is probably right.

I can see how the restart is a problem and I'm working to fix that. Unfortunately, it's not straightforward or easy.

> Some other things still bother me:
>
> as root, I can e.g. run "umount /home/henning" and have it disappear from df:
>
> [henning@forge ~]$ df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/system_vg-root_lv
>                        16G  5.5G  9.3G  38% /
> /dev/sda1             190M   19M  162M  11% /boot
> tmpfs                 1.7G   12K  1.7G   1% /dev/shm
> /dev/mapper/system_vg-lv_var
>                       3.9G  799M  2.9G  22% /var
> /dev/mapper/system_vg-data_lv
>                        89G   40G   45G  47% /mnt/disk0
> /mnt/disk0/src         89G   40G   45G  47% /usr/src
> server:/mnt/media/video
>                       1.1T  527G  531G  50% /home/video
> server:/mnt/disk0/home/windows
>                       1.5T  325G  1.2T  22% /home/windows
> server2:/mnt/disk0/install
>                        35G   18G   16G  54% /home/install
> server:/mnt/disk0/archiv
>                       1.5T  325G  1.2T  22% /home/archiv
> server:/mnt/disk0/mirror
>                       1.5T  325G  1.2T  22% /home/mirror
> server:/mnt/media/mp3
>                       1.1T  527G  531G  50% /home/mp3
>
> [henning@forge ~]$ pwd
> /home/henning
> [henning@forge ~]$ df -h .
> Filesystem            Size  Used Avail Use% Mounted on
> server:/mnt/disk0/mirror
>                       1.5T  325G  1.2T  22% /home/mirror
>
> That is obviously wrong. /home/henning is at server:/mnt/disk0/home/henning,
> not /mnt/disk0/mirror.
>
> This seems to be the same problem as the "lsof" display. This is
> annoying but (hopefully) only cosmetic. /proc/mounts gets it right:

That is strange, but we would need to log a bug against nfs-utils or possibly util-linux, especially since the kernel has correct info.

Ian
I'm going to mark this a duplicate of 287411, as that was the original bug, to avoid further confusion.
This message is a reminder that Fedora 8 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.