Description of problem:
autofs hangs when using the /net setting in /etc/auto.master.

Version-Release number of selected component (if applicable):
autofs-4.1.3-28

How reproducible:
Always, but at random times. It never lasts 24 hours without failing.

Steps to Reproduce:
1. Uncomment the line with /net in /etc/auto.master.
2. Turn on the autofs service and have entries referring to /net/machinename/fs.
3. Wait for the machine to hang, either on logins or for some programs.

Actual results:
When the machine is hung, df hangs after listing some of the mounted partitions. Also, ls /net/machinename lists only some of the exported file systems, not all.

Expected results:
The machine should not hang. df should list all mounted partitions without hanging. ls /net/machinename should list all partitions exported by "machinename".

Additional info:
This problem is only being seen on modern motherboards (Asus P4P800 series or Abit IS7 series) using hyperthreaded CPUs. The problem persists with both smp and regular kernels. On older motherboards (with older ethernet hardware) autofs seems to work, but is occasionally very slow. This problem has existed in Fedora 2 and in the test versions of Fedora 3 and has been reported previously (although initially I thought that autofs-4.1.3-16 had fixed it on FC3test2). We have been seeing this problem ever since moving to autofs version 4. While it may be unrelated, this problem has also been observed on Sun SPARCstations running Solaris 8 ever since Sun's security patches were introduced in August of 2003. This problem has never been observed on Linux installations running autofs version 3.
It sounds to me like one of the servers which is automounted becomes unavailable. NFS accesses to the machine will block. Can you confirm that this is the case? Please get the output from the 'mount' command with no arguments, and test the liveness of each system listed.
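The liveness check requested above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the regular expression assumes the usual Linux `mount` line layout ("server:/export on /mountpoint type nfs (...)"), and probing TCP port 2049 is an assumed stand-in for a proper liveness test (a server exporting NFS over UDP only would be reported as unreachable).

```python
import re
import socket

# Matches lines such as: "tachyon:/aa on /net/tachyon/aa type nfs (rw,addr=10.0.0.5)"
NFS_LINE = re.compile(
    r"^(?P<server>[^:\s]+):(?P<export>\S+) on (?P<mountpoint>\S+) type nfs4?\b"
)


def nfs_servers(mount_output):
    """Return the sorted, de-duplicated server names of all NFS mounts."""
    servers = set()
    for line in mount_output.splitlines():
        match = NFS_LINE.match(line.strip())
        if match:
            servers.add(match.group("server"))
    return sorted(servers)


def is_alive(server, port=2049, timeout=3.0):
    """Assumption: a successful TCP connect to the NFS port means the server is up."""
    try:
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(mount_output):
    """Print one line per NFS server found in the given `mount` output."""
    for server in nfs_servers(mount_output):
        print(server, "alive" if is_alive(server) else "UNREACHABLE")
```

To use it, capture the output of `mount` (for example with `subprocess.run(["mount"], capture_output=True, text=True).stdout`) and pass the text to `report()`.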
This is not the case. The servers are all fine and seen on all other systems. The problem is not limited to hyperthreaded CPUs: I am seeing this problem on an early Pentium IV, but so far have not seen it on Pentium IIIs. The problem exists with both automount and amd. The output from messages is:

Nov 16 11:38:59 deepthought kernel: nfs_statfs: statfs error = 5
Nov 16 11:38:59 deepthought last message repeated 2 times
Nov 16 11:38:59 deepthought kernel: RPC: error 5 connecting to server math

This was an error on a directly mounted NFS partition with amd running. With both amd and autofs turned off, I have not seen any problems with NFS mounts so far. Also, when the error occurs, I cannot manually mount the partition that fails, and ps shows that the mount request is in state D and is therefore unkillable.
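For anyone trying to confirm the same symptom, the stuck requests can be picked out of `ps` output mechanically. The sketch below assumes the text comes from `ps -eo pid,stat,comm` (a header row followed by three columns); the function name is hypothetical.

```python
def uninterruptible(ps_output):
    """Return (pid, command) pairs for processes whose STAT column starts with 'D'."""
    hung = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D"):
            hung.append((int(fields[0]), fields[2].strip()))
    return hung
```

A process listed here is in uninterruptible sleep, which matches the unkillable mount request described above.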
Disable amd. Modify your /etc/auto.master to add --debug to all mount points. Append the following line to your /etc/syslog.conf:

*.* /var/log/debug

Restart syslogd. Restart the automounter. When the problem shows up again, please attach the contents of /var/log/debug to this bugzilla. Please also attach the contents of your maps, and the output of uname -a. I'm interested in seeing the output/logs from *one* system only.
I did as you asked. uname -a gives the following:

Linux deepthought.uchicago.edu 2.6.9-1.667 #1 Tue Nov 2 14:41:25 EST 2004 i686 i686 i386 GNU/Linux

/etc/auto.master is:

#
# $Id: auto.master,v 1.3 2003/09/29 08:22:35 raven Exp $
#
# Sample auto.master file
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# For details of the format look at autofs(5).
#/misc /etc/auto.misc --timeout=60
#/misc /etc/auto.misc
/net /etc/auto.net --debug

The following is the appropriate part of /var/log/debug when the partition /aa hung:

Nov 17 12:59:34 deepthought automount[1819]: sigchld: exp 14879 finished, switching from 2 to 1
Nov 17 12:59:34 deepthought automount[1819]: get_pkt: state 2, next 1
Nov 17 12:59:34 deepthought automount[1819]: st_ready(): state = 2
Nov 17 12:59:49 deepthought automount[1819]: sig 14 switching from 1 to 2
Nov 17 12:59:49 deepthought automount[1819]: get_pkt: state 1, next 2
Nov 17 12:59:49 deepthought automount[1819]: st_expire(): state = 1
Nov 17 12:59:49 deepthought automount[14883]: expire_proc: 11 remaining in /net
Nov 17 12:59:49 deepthought automount[1819]: expire_proc: exp_proc=14883
Nov 17 12:59:49 deepthought automount[1819]: handle_child: got pid 14883, sig 0 (0), stat 1
Nov 17 12:59:49 deepthought automount[1819]: sigchld: exp 14883 finished, switching from 2 to 1
Nov 17 12:59:49 deepthought automount[1819]: get_pkt: state 2, next 1
Nov 17 12:59:49 deepthought automount[1819]: st_ready(): state = 2
Nov 17 12:59:51 deepthought automount[1819]: handle_packet: type = 0
Nov 17 12:59:51 deepthought automount[1819]: handle_packet_missing: token 78, name tachyon/aa
Nov 17 12:59:51 deepthought automount[1819]: attempting to mount entry /net/tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: lookup(program): looking up tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: >> /usr/sbin/showmount: can't get address for tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: lookup(program): lookup for tachyon/aa failed
Nov 17 12:59:51 deepthought automount[14885]: failed to mount /net/tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: umount_multi: path=/net/tachyon/aa incl=1
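When the debug log grows large, the failing requests can be pulled out with a short filter. The following is a sketch only: the two patterns are modelled on the message wording in the excerpt above, the function names are hypothetical, and other autofs versions may word these messages differently.

```python
import re

# Patterns modelled on the log excerpt above; the wording is an assumption
# for any autofs version other than the one that produced that log.
FAILED = re.compile(r"automount\[(?P<pid>\d+)\]: failed to mount (?P<path>\S+)")
SHOWMOUNT = re.compile(r"showmount: can't get address for (?P<host>\S+)")


def failed_mounts(log_text):
    """Return every path the automounter reported it failed to mount."""
    return [m.group("path") for m in FAILED.finditer(log_text)]


def bad_showmount_hosts(log_text):
    """Return every name showmount could not resolve to an address."""
    return [m.group("host") for m in SHOWMOUNT.finditer(log_text)]
```

Run against the excerpt, `bad_showmount_hosts()` returns `tachyon/aa`, that is, the name handed to showmount includes the path component rather than just the host. Whether that is significant is not established by the log alone.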
I would like to see more of the log file. Please, if it isn't an inconvenience, attach the log in its entirety. If it is a big problem, then at least provide the lines that show the initial access to /net/tachyon. It looks like there should be more info on this. Thanks.
Created attachment 106923
/var/log/debug file in gzip format

Here is the log you asked for. Just to clarify, everything works fine immediately after a reboot. The hanging seems to occur sometime within 24 hours after each reboot.
This problem is more extensive than just autofs. Just having an NFS partition mounted from a remote machine is sufficient for the computer to hang within 24 hours. This is happening on 5 different machines with 3 different motherboards. I even tried turning on nscd, hoping that by looking at cached host info it would not hang if it got a timeout when querying DNS (we have heavy net traffic with occasional server not responding messages), but that did not help any. I'm afraid I'll have to go back to Fedora 1 until you guys find a fix.
I have a similar problem. My home directory is automounted from another machine. Sometimes when I boot and try to log in, gsm seems to hang and /var/log/messages shows "kernel: RPC: error 5 connecting to server [server]". However, unlike the original report, I am not using /net. I have /etc/auto.master configured to use /etc/auto.misc. My auto.misc file mounts my home directory as:

home -rw,soft,intr,rsize=8192,wsize=8192 blackbox:/exports/home

I have modified the auto.master file with auto.misc --debug and I will attach a log the next time it happens. My motherboard is a Gigabyte PE667 (P4 845 chipset) with a built-in Intel e100 eth0, if that helps. uname -a is:

Linux sheryl.levitt.org 2.6.9-1.681_FC3 #1 Thu Nov 18 15:10:10 EST 2004 i686 i686 i386 GNU/Linux
Created attachment 108404
syslog with automount --debug option
I think that I am seeing another manifestation of the same issue. With an NFS server on an FC3 machine, the response time to automount requests, whether via autofs or amd, and with clients running various distros (I tried RH7.3 and FC4devel), is unpredictable. Sometimes it is immediate, as expected, and sometimes it takes a long time. I measured waits of up to 30 seconds. When I tried to strace what is happening, it looks like the whole thing sits in a stat() call. Like this (with a server called 'zeno'):

stat("/net/zeno",

while one would expect

stat("/net/zeno", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
socket(PF_FILE, SOCK_STREAM, 0) = 3
....

and so on. Once past that point everything is normal. 'statd' is fine on both ends. OTOH, using 'strace' seems to make it _really_ hard to catch that thing in the act. Usually with strace the output scrolls off the screen immediately, as expected.
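A delay like this can also be measured without strace by timing the stat() call directly. This is a minimal sketch with hypothetical function names; it relies only on `os.stat` and `time.monotonic` from the standard library. Note that a path which is already mounted will normally return at once, so the interesting samples are the ones taken after the mount has expired.

```python
import os
import time


def time_stat(path):
    """Return (seconds, error) for one os.stat() call; error is None on success."""
    start = time.monotonic()
    try:
        os.stat(path)
        error = None
    except OSError as exc:
        error = exc
    return time.monotonic() - start, error


def slow_samples(path, count=5, threshold=1.0):
    """Time `count` stat() calls and return the durations that exceed `threshold` seconds."""
    durations = [time_stat(path)[0] for _ in range(count)]
    return [d for d in durations if d > threshold]
```

For the case described above one would call `time_stat("/net/zeno")`, ideally from cron or a loop with a pause longer than the automount timeout.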
I should have mentioned that my NFS client is FC3 but the server is FC2.
Hi, I have a similar problem on my laptop running FC2 (client). The server is either Windows 2K (smb client) or Debian or Mandrake. I float between multiple locations and usually suspend the computer in between. My network is either fixed or dhcp IP on the lan, or dhcp on wlan, depending on location. If I'm using an autofs mount and I suspend/resume or switch from lan to wlan, the mount is locked up and I can't even restart autofs. This hangs konqueror and nautilus completely. The only safe way to work (a total pain) is, before I know there is a network change, to (1) get out of all the shares (shells, browsers, apps, etc.), (2) issue a specific umount for all active mounts, (3) shut down autofs (/etc/init.d/autofs stop), and (4) restart autofs once on the different network. If I forget to do the above... I reach back into my extensive Windows 95 training and reboot the computer.
This may be off topic. I have yet another and truly strange problem with autofs. Except I'm not quite sure that autofs is the problem. Is there a mechanism in autofs or the kernel that might delete a symbolic link to an autofs mount point? I see this happening, but it seems random and happens infrequently, so it's hard to reproduce.

We have three dozen FC3 clients (2.6.9-1.681_FC3), where /opt is a symbolic link to /misc/opt, which is an autofs mount point. Autofs is configured entirely via LDAP. The LDAP and NFS server are the same machine, which is running FC2. There have been a handful of instances where the symbolic link /opt->misc/opt *is deleted* automatically, right in the middle of a user session. This seems to happen when /misc/opt is mounted and in use. It has happened on four different machines maybe six or seven times over the roughly six weeks that FC3 has been in use, so it's rare enough to make it very difficult to get clues as to what the problem is. It did happen to one user twice in one day.

The logs do not show anything that looks suspicious to me. Otherwise autofs seems to work correctly. The /opt->misc/opt link can be put back after it has disappeared, and it will work without autofs being restarted. NFS performance is ok as far as I can tell. We have a largish number of non-RPM-packaged applications in /opt, so it is going to be a major nuisance if the /opt->misc/opt link cannot be relied on to stay put.

Has anyone seen anything like this? I posted a question on the Fedora list, but there were no responses. What would be the method of choice to mount /opt automatically over NFS, if not a symbolic link to a subdirectory?

Mikko
Mikko,

Enable debugging for the automounter. When you notice the symbolic link has been removed, then please open another bugzilla, and attach the logs there. You can enable debugging for the automount daemon by modifying the /etc/sysconfig/autofs file. There is a line for DAEMONOPTIONS. The default looks like this:

DAEMONOPTIONS="--timeout=60"

Change that line to look like this:

DAEMONOPTIONS="--timeout=60 --debug"

Then, in /etc/syslog.conf, add a line that looks like so:

*.* /var/log/debug

Then restart the syslog daemon:

# service syslog restart

The output I'm looking for will be in /var/log/debug.

Thanks,
Jeff
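With three dozen clients, the DAEMONOPTIONS change may be easier to script than to make by hand. The sketch below only rewrites text that is passed to it. The function name is hypothetical, and it assumes the option line is written with double quotes exactly as in the default shown above.

```python
import re

# Assumes the double-quoted form shown in the default /etc/sysconfig/autofs line.
OPTION_LINE = re.compile(r'^(DAEMONOPTIONS=)"([^"]*)"', re.MULTILINE)


def add_debug(sysconfig_text):
    """Append --debug to DAEMONOPTIONS unless it is already present."""
    def patch(match):
        options = match.group(2).split()
        if "--debug" not in options:
            options.append("--debug")
        return '{}"{}"'.format(match.group(1), " ".join(options))

    return OPTION_LINE.sub(patch, sysconfig_text)
```

Read the file, pass its contents through `add_debug()`, and write the result back. Applying it twice leaves the line unchanged, so it is safe to rerun.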
Could you please try an rpm from here: http://people.redhat.com/jmoyer/autofs/fc4/4.1.3-123 I believe it will address your issues. Thanks.
This is a major improvement and things are now almost perfect. I have noticed that occasionally an automounted partition will unmount even when I have a shell that is cd'ed to that directory. In every case, I can remount it with ease (cd, then cd to that directory). However, shouldn't it stay mounted whenever any user is cd'ed to that directory?
I'm guessing that you were cd'd into a "scaffolding" directory used for setting up the directory hierarchy in multimount maps (such as /net). So, if your cwd is not an actual mounted directory, it can be unmounted out from under you. I believe Ian Kent (the upstream maintainer) put together a patch to address this. I will be considering that patch for the next round of updates.
Well, I don't directly cd to /net, but I do cd to /partition which is a symbolic link to /net/machinename/partition.
At least one problem reported here has been resolved. Namely, the automount daemon will not remove directories for which there is an active reference. As such, I'm closing this bug. There were, I think, 3 distinct bugs reported here. For the others, please file separate bugzillas. One bug per bugzilla, please. Thanks.