Bug 139190 - autofs hangs and often makes using the computer impossible
Summary: autofs hangs and often makes using the computer impossible
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: autofs
Version: 3
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jeff Moyer
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-11-13 20:36 UTC by Ed Friedman
Modified: 2007-11-30 22:10 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-04-17 21:45:39 UTC
Type: ---
Embargoed:


Attachments
/var/log/debug file in gzip format (396.08 KB, application/octet-stream)
2004-11-17 22:46 UTC, Ed Friedman
syslog with automount --debug option (55.48 KB, text/plain)
2004-12-12 14:28 UTC, Mark Levitt

Description Ed Friedman 2004-11-13 20:36:19 UTC
Description of problem: 
autofs hangs when using the /net setting in /etc/auto.master
 
Version-Release number of selected component (if applicable): 
autofs-4.1.3-28 
 
How reproducible: 
Always, but at random times.  It never lasts 24 hours without 
failing. 
 
Steps to Reproduce: 
1. Uncomment the line with /net in /etc/auto.master
2. Turn on the autofs service and have entries referring
to /net/machinename/fs
3. Wait for the machine to hang either on logins or for some programs.
 
   
Actual results: 
When machine is hung, df hangs after listing some of the mounted 
partitions.  Also, ls /net/machinename only lists some of the 
exported file systems, not all. 
 
Expected results: 
Machine should not hang.  df should list all mounted partitions 
without hanging.  ls /net/machinename should list all partitions 
exported by "machinename". 
 
Additional info: 
This problem is only being seen on modern motherboards (Asus P4P800 
series or Abit IS7 series) using hyperthreaded CPUs.  The problem
persists with both smp and regular kernels.  On older motherboards 
(with older ethernet hardware) autofs seems to work, but 
occasionally is very slow.  This problem has existed in Fedora 2 and
in the test versions of Fedora 3 and has been reported previously 
(although initially I thought that autofs-4.1.3-16 had fixed it on 
FC3test2).  Ever since moving to autofs version 4, we have been 
seeing this problem. 
 
While it may be unrelated, this problem was also observed on Sun 
SPARCstations running Solaris 8 ever since Sun's security patches 
introduced in August of 2003.  This problem has never been observed 
on Linux installations running autofs version 3.

Comment 1 Jeff Moyer 2004-11-15 18:54:56 UTC
It sounds to me like one of the servers which is automounted becomes
unavailable.  NFS accesses to the machine will block.  Can you confirm
that this is the case?  Please get the output from the 'mount' command
with no arguments, and test the liveness of each system listed.
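
A minimal sketch of the check requested above: pull the server names out of
'mount' output and probe each one. The sample output and its addresses are
hypothetical (no real 'mount' output was posted here), and probing TCP port
2049 is only an assumption about what counts as "alive"; a server that exports
over UDP only would need a different probe.

```python
import socket


def nfs_servers(mount_output):
    """Return the sorted server names found on NFS lines of `mount` output."""
    servers = set()
    for line in mount_output.splitlines():
        fields = line.split()
        # Expected shape: "server:/export on /mountpoint type nfs (options)"
        if len(fields) >= 5 and fields[1] == "on" and fields[3] == "type":
            if fields[4].startswith("nfs") and ":" in fields[0]:
                servers.add(fields[0].split(":", 1)[0])
    return sorted(servers)


def is_alive(host, port=2049, timeout=5.0):
    """Assumption: a TCP connect to the NFS port is a usable liveness probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical sample; host names are borrowed from this report, the rest is made up.
SAMPLE = """\
/dev/hda2 on / type ext3 (rw)
tachyon:/aa on /net/tachyon/aa type nfs (rw,addr=192.0.2.10)
math:/home on /net/math/home type nfs (rw,addr=192.0.2.11)
"""

print(nfs_servers(SAMPLE))
```

On a live system one would feed it the real 'mount' output and call
is_alive() on each name; the timeout keeps the probe itself from blocking the
way an NFS access would.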


Comment 2 Ed Friedman 2004-11-16 17:51:29 UTC
This is not the case.  The servers are all fine and seen on all other
systems.  The problem is not limited to hyperthreaded CPUs - I am
seeing this problem on an early Pentium IV, but so far have not seen
it on Pentium IIIs.

The problem exists with both automount and amd.  The output from
messages is:

Nov 16 11:38:59 deepthought kernel: nfs_statfs: statfs error = 5
Nov 16 11:38:59 deepthought last message repeated 2 times
Nov 16 11:38:59 deepthought kernel: RPC: error 5 connecting to server math

This was an error with regards to an NFS directly mounted partition
with amd running.  When I turn both amd and autofs off, I have not
seen any problems with NFS mounts so far.

Also, when the error occurs, I cannot manually mount the partition
that fails, and ps shows that the status of the mount request is D and
thus is unkillable.
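
For anyone wanting to capture the stuck processes when this happens, here is a
rough sketch that filters 'ps -eo pid,stat,comm' output for entries in state
D (uninterruptible sleep). The sample listing is hypothetical; only the
automount PID is taken from the logs in this report.

```python
def uninterruptible(ps_output):
    """Return (pid, command) pairs whose STAT column starts with 'D'."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D"):
            stuck.append((int(fields[0]), fields[2].strip()))
    return stuck


# Hypothetical listing in the shape of `ps -eo pid,stat,comm`.
SAMPLE = """\
  PID STAT COMMAND
 1819 Ss   automount
14890 D    mount
14901 R+   ps
"""

print(uninterruptible(SAMPLE))
```

Processes reported this way cannot be killed until the blocking I/O returns,
which matches the behaviour described above.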


Comment 3 Jeff Moyer 2004-11-16 18:11:39 UTC
Disable amd.  Modify your /etc/auto.master to add --debug to all mount
points.  Append the following line to your /etc/syslog.conf:

*.*              /var/log/debug

Restart syslogd.  Restart the automounter.  When the problem shows up
again, please attach the contents of /var/log/debug to this bugzilla.
Please also attach the contents of your maps, and the output of uname
-a.  I'm interested in seeing the output/logs from *one* system only.
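
The auto.master change can be sketched as a small text transform. This assumes
every non-comment, non-blank line in the file is a mount point entry, which
may not hold for every auto.master; add_debug is a hypothetical helper name,
not part of autofs.

```python
def add_debug(master_text):
    """Append --debug to each mount point entry that does not already have it."""
    out = []
    for line in master_text.splitlines():
        stripped = line.strip()
        is_entry = bool(stripped) and not stripped.startswith("#")
        if is_entry and "--debug" not in stripped.split():
            line = line.rstrip() + " --debug"
        out.append(line)
    return "\n".join(out) + "\n"


# Sample in the shape of the stock auto.master; comments are left untouched.
SAMPLE = """\
#/misc  /etc/auto.misc --timeout=60
/net    /etc/auto.net
"""

print(add_debug(SAMPLE), end="")
```

Running it twice gives the same result, so it is safe to reapply after a
package update rewrites the file.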

Comment 4 Ed Friedman 2004-11-17 19:57:24 UTC
I did as you asked.

uname -a gives the following:

Linux deepthought.uchicago.edu 2.6.9-1.667 #1 Tue Nov 2 14:41:25 EST 2004 i686 i686 i386 GNU/Linux

/etc/auto.master is:

#
# $Id: auto.master,v 1.3 2003/09/29 08:22:35 raven Exp $
#
# Sample auto.master file
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# For details of the format look at autofs(5).
#/misc  /etc/auto.misc --timeout=60
#/misc  /etc/auto.misc
/net    /etc/auto.net --debug

The following is the appropriate part of /var/log/debug when the
partition /aa hung:

Nov 17 12:59:34 deepthought automount[1819]: sigchld: exp 14879 finished, switching from 2 to 1
Nov 17 12:59:34 deepthought automount[1819]: get_pkt: state 2, next 1
Nov 17 12:59:34 deepthought automount[1819]: st_ready(): state = 2
Nov 17 12:59:49 deepthought automount[1819]: sig 14 switching from 1 to 2
Nov 17 12:59:49 deepthought automount[1819]: get_pkt: state 1, next 2
Nov 17 12:59:49 deepthought automount[1819]: st_expire(): state = 1
Nov 17 12:59:49 deepthought automount[14883]: expire_proc: 11 remaining in /net
Nov 17 12:59:49 deepthought automount[1819]: expire_proc: exp_proc=14883
Nov 17 12:59:49 deepthought automount[1819]: handle_child: got pid 14883, sig 0 (0), stat 1
Nov 17 12:59:49 deepthought automount[1819]: sigchld: exp 14883 finished, switching from 2 to 1
Nov 17 12:59:49 deepthought automount[1819]: get_pkt: state 2, next 1
Nov 17 12:59:49 deepthought automount[1819]: st_ready(): state = 2
Nov 17 12:59:51 deepthought automount[1819]: handle_packet: type = 0
Nov 17 12:59:51 deepthought automount[1819]: handle_packet_missing: token 78, name tachyon/aa
Nov 17 12:59:51 deepthought automount[1819]: attempting to mount entry /net/tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: lookup(program): looking up tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: >> /usr/sbin/showmount: can't get address for tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: lookup(program): lookup for tachyon/aa failed
Nov 17 12:59:51 deepthought automount[14885]: failed to mount /net/tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: umount_multi: path=/net/tachyon/aa incl=1
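
With a large debug log, a short filter makes the failures easier to find. The
sketch below assumes the syslog line layout seen in the excerpt above (with
wrapped lines rejoined) and simply collects every "failed to mount" message;
failed_mounts is a hypothetical helper.

```python
import re

# "Mon DD HH:MM:SS host automount[pid]: message"
LINE = re.compile(r"^(\w{3}\s+\d+ [\d:]{8}) (\S+) automount\[(\d+)\]: (.*)$")
PREFIX = "failed to mount "


def failed_mounts(log_text):
    """Return (timestamp, path) for each 'failed to mount' message."""
    hits = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m and m.group(4).startswith(PREFIX):
            hits.append((m.group(1), m.group(4)[len(PREFIX):]))
    return hits


# Three lines taken from the excerpt above.
SAMPLE = """\
Nov 17 12:59:51 deepthought automount[1819]: attempting to mount entry /net/tachyon/aa
Nov 17 12:59:51 deepthought automount[14885]: lookup(program): lookup for tachyon/aa failed
Nov 17 12:59:51 deepthought automount[14885]: failed to mount /net/tachyon/aa
"""

print(failed_mounts(SAMPLE))
```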


Comment 5 Jeff Moyer 2004-11-17 21:57:44 UTC
I would like to see more of the log file.  Please, if it isn't an inconvenience,
attach the log in its entirety.  If it is a big problem, then at least provide
the lines that show the initial access to /net/tachyon.  It looks like there
should be more info on this.

Thanks.

Comment 6 Ed Friedman 2004-11-17 22:46:43 UTC
Created attachment 106923
/var/log/debug file in gzip format

Here is the log you asked for.  Just to clarify, everything works fine
immediately after a reboot.  The hanging seems to occur sometime within 24
hours after each reboot.

Comment 7 Ed Friedman 2004-11-23 17:17:10 UTC
This problem is more extensive than just autofs.  Just having an NFS
partition mounted from a remote machine is sufficient for the computer
to hang within 24 hours. This is happening on 5 different machines
with 3 different motherboards.  I even tried turning on nscd, hoping
that by looking at cached host info it would not hang if it got a
timeout when querying DNS (we have heavy net traffic with occasional
server not responding messages), but that did not help any.  I'm
afraid I'll have to go back to Fedora 1 until you guys find a fix.

Comment 8 Mark Levitt 2004-12-04 12:15:56 UTC
I have a similar problem. My home directory is automounted from
another machine. Sometimes when I boot and try to login, gsm seems to
hang and /var/log/messages shows "kernel: RPC: error 5 connecting to
server [server]"

However, unlike the original report, I am not using /net. I have
/etc/auto.master configured to use /etc/auto.misc. My auto.misc file
mounts my home directory as:
home       -rw,soft,intr,rsize=8192,wsize=8192 blackbox:/exports/home

I have modified the auto.master file with auto.misc --debug and I will
attach a log the next time it happens. 

My motherboard is a Gigabyte PE667 (P4 845 chipset) with a built in
intel e100 eth0, if that helps. 

uname -a is:
Linux sheryl.levitt.org 2.6.9-1.681_FC3 #1 Thu Nov 18 15:10:10 EST 2004 i686 i686 i386 GNU/Linux

Comment 9 Mark Levitt 2004-12-12 14:28:45 UTC
Created attachment 108404
syslog with automount --debug option

Comment 10 Michal Jaegermann 2004-12-20 05:34:02 UTC
I think that I am seeing another manifestation of the same issue.
With an NFS server on an FC3 machine, the response time to automount
requests, either via autofs or amd and with clients running
various distros (I tried RH7.3 and FC4devel), is unpredictable.
Sometimes it is immediate, as expected, and sometimes it takes a long
time.  I measured waits of up to 30 seconds.

When I tried to strace what is happening, it looks like the whole
thing sits in a stat() call.  Like this (with a server called 'zeno'):

stat("/net/zeno", 

while one would expect

stat("/net/zeno", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 3
....

and so on.  Once past that point everything is normal. 'statd'
is fine on both ends.

OTOH, using 'strace' seems to make it _really_ hard to catch
that thing in the act.  Usually with strace it scrolls off the
screen immediately - as expected.
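
Since strace seems to disturb the timing, one lighter way to record the delay
is to time the stat() call directly. A minimal sketch, with timed_stat as a
hypothetical helper; note that if the stat() truly hangs, this call blocks
just as long as the original access would.

```python
import os
import time


def timed_stat(path):
    """Return (succeeded, seconds) for a single stat() of `path`."""
    start = time.monotonic()
    try:
        os.stat(path)
        ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


# Demo on the current directory; on an affected client the path of
# interest would be something like /net/zeno.
ok, elapsed = timed_stat(".")
print(ok, f"{elapsed:.6f}s")
```

Logging the result from a loop over a day would show whether the slow
responses cluster around automount expiry or appear at random.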

Comment 11 Mark Levitt 2004-12-20 09:31:26 UTC
I should have mentioned that my NFS client is FC3 but the server is FC2. 



Comment 12 Sepehr 2004-12-29 15:47:29 UTC
Hi, I have a similar problem on my laptop running FC2 (client).
The server is either Windows 2K (smb client) or Debian or Mandrake. I
float between multiple locations and usually suspend the computer
in between. My network is either a fixed or DHCP IP on the LAN or DHCP
on WLAN, depending on location. If I'm using an autofs mount and I
suspend/resume or switch from LAN to WLAN, the mount is locked up and I
can't even restart autofs. This hangs konqueror and nautilus
completely. The only safe way to work (a total pain) is, before a known
network change, to (1) get out of all the shares (shells,
browsers, apps, etc.), (2) issue a specific umount for each active
mount, (3) shut down autofs (/etc/init.d/autofs stop), and (4) restart
autofs once on the different network.

If I forget to do the above... I reach back into my extensive Windows 95
training and reboot the computer.
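
Steps (2) and (3) of that workaround could be scripted roughly as below. This
is only a sketch: it builds the command list from /proc/mounts-style text and
executes nothing, the sample mount table and server name are hypothetical, and
it assumes nobody still has files open on the shares (step 1).

```python
def teardown_commands(proc_mounts, automount_roots):
    """Build the umount/stop sequence; nothing is executed here."""
    targets = []
    for line in proc_mounts.splitlines():
        fields = line.split()
        # Skip malformed lines and the autofs trigger mounts themselves.
        if len(fields) < 3 or fields[2] == "autofs":
            continue
        if any(fields[1].startswith(root.rstrip("/") + "/")
               for root in automount_roots):
            targets.append(fields[1])
    # Deepest paths first so nested mounts are released before their parents.
    targets.sort(key=lambda p: p.count("/"), reverse=True)
    cmds = [["umount", t] for t in targets]
    cmds.append(["/etc/init.d/autofs", "stop"])
    return cmds


# Hypothetical /proc/mounts content.
SAMPLE = """\
/dev/hda2 / ext3 rw 0 0
automount(pid1819) /misc autofs rw 0 0
server:/share /misc/share nfs rw 0 0
server:/share/sub /misc/share/sub nfs rw 0 0
"""

for cmd in teardown_commands(SAMPLE, ["/misc"]):
    print(" ".join(cmd))
```

Each list could be handed to subprocess.run() as root before the network
change, with autofs started again once the new network is up.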

Comment 13 Mikko Huhtala 2005-01-05 22:34:55 UTC
This may be off topic. I have yet another and truly strange problem
with autofs. Except I'm not quite sure that autofs is the problem.

Is there a mechanism in autofs or kernel that might delete a symbolic
link to an autofs mount point? I see this happening, but it seems
random and happens infrequently, so it's hard to reproduce.

We have three dozen FC3 clients (2.6.9-1.681_FC3), where /opt is a
symbolic link to /misc/opt, which is an autofs mount point. Autofs is
configured entirely via LDAP. The LDAP and NFS server are the same
machine, which is running FC2.

There have been a handful of instances where the symbolic link
/opt->misc/opt *is deleted* automatically, right in the middle of a
user session. This seems to happen when /misc/opt is mounted and in
use. It has happened on four different machines maybe six or seven
times over the about six weeks that FC3 has been in use, so it's rare
enough to make it very difficult to get clues as to what the problem
is. It did happen to one user twice during one day. The logs do not
show anything that looks suspicious to me.
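
Because the deletion is so rare, a small poller that records the moment the
link changes might narrow down what to look for in the other logs. A sketch
only: link_state and watch are hypothetical helpers, the polling interval is
arbitrary, and the demo runs on a throwaway link rather than the real /opt.

```python
import os
import tempfile
import time


def link_state(path):
    """Return the symlink target, or None if `path` is missing or not a link."""
    try:
        return os.readlink(path)
    except OSError:
        return None


def watch(path, interval=60.0, checks=None, log=print):
    """Poll `path` and log a timestamped line whenever its link state changes."""
    last = link_state(path)
    done = 0
    while checks is None or done < checks:
        time.sleep(interval)
        now = link_state(path)
        if now != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log(f"{stamp} {path}: {last!r} -> {now!r}")
            last = now
        done += 1


# Self-contained demo; the target /misc/opt does not need to exist.
with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "opt")
    os.symlink("/misc/opt", link)
    print(link_state(link))
    os.remove(link)
    print(link_state(link))
```

On a client, something like watch("/opt") left running would give a timestamp
to compare against the automount debug output.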

Otherwise autofs seems to work correctly. The /opt->misc/opt link can
be put back after it has disappeared, and it will work without autofs
being restarted. NFS performance is ok as far as I can tell.

We have a largish number of non-RPM-packaged applications in /opt, so
it is going to be a major nuisance if the /opt->misc/opt link cannot
be relied on to stay put.

Has anyone seen anything like this? I posted a question on the Fedora
list, but there were no responses.

What would be the method of choice to mount /opt automatically over
NFS, if not a symbolic link to a subdirectory?

Mikko


Comment 14 Jeff Moyer 2005-01-05 22:43:30 UTC
Mikko, 

Enable debugging for the automounter.  When you notice the symbolic link has
been removed, then please open another bugzilla, and attach the logs there.

You can enable debugging for the automount daemon by modifying the
/etc/sysconfig/autofs file.  There is a line for DAEMONOPTIONS.  The default
looks like this:

DAEMONOPTIONS="--timeout=60"

Change that line to look like this:

DAEMONOPTIONS="--timeout=60 --debug"

Then, in /etc/syslog.conf, add a line that looks like so: 

*.*                                             /var/log/debug

Then restart the syslog daemon:

# service syslog restart

The output I'm looking for will be in /var/log/debug.
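
For three dozen clients it may be easier to apply the two edits with a script.
A rough sketch of both as text transforms; enable_debug and ensure_line are
hypothetical helpers, and the regular expression assumes DAEMONOPTIONS is a
single double-quoted value as in the default shown above.

```python
import re


def enable_debug(sysconfig_text):
    """Add --debug to the DAEMONOPTIONS value if it is not already present."""
    def patch(match):
        opts = match.group(2).split()
        if "--debug" not in opts:
            opts.append("--debug")
        return match.group(1) + '"' + " ".join(opts) + '"'
    return re.sub(r'^(DAEMONOPTIONS=)"([^"]*)"', patch,
                  sysconfig_text, flags=re.M)


def ensure_line(conf_text, line):
    """Append `line` unless an equivalent line (ignoring spacing) exists."""
    lines = conf_text.splitlines()
    if not any(existing.split() == line.split() for existing in lines):
        lines.append(line)
    return "\n".join(lines) + "\n"


print(enable_debug('DAEMONOPTIONS="--timeout=60"\n'), end="")
print(ensure_line("# syslog.conf\n", "*.*    /var/log/debug"), end="")
```

Both functions leave an already modified file unchanged, so they can be rerun
safely; the syslog daemon still has to be restarted afterwards.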

Thanks,

Jeff

Comment 15 Jeff Moyer 2005-04-05 20:12:59 UTC
Could you please try an rpm from here:

  http://people.redhat.com/jmoyer/autofs/fc4/4.1.3-123

I believe it will address your issues.

Thanks.

Comment 16 Ed Friedman 2005-04-18 16:29:59 UTC
This is a major improvement and things are now almost perfect.  I have noticed
that occasionally an automounted partition will unmount even when I have a shell
that is cd'ed to that directory.  In every case, I can remount it with ease (cd,
then cd to that directory).  However, shouldn't it stay mounted whenever any
user is cd'ed to that directory?

Comment 17 Jeff Moyer 2005-04-18 16:44:55 UTC
I'm guessing that you were cd'd into a "scaffolding" directory used for setting
up the directory hierarchy in multimount maps (such as /net).  So, if your cwd
is not an actual mounted directory, it can be unmounted out from under you.

I believe Ian Kent (the upstream maintainer) put together a patch to address
this.  I will be considering that patch for the next round of updates.
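
One way to check which case applies is to find the mount point that actually
holds the current directory, after resolving symbolic links. This is a sketch
under my reading of the above: if the nearest mount point turns out to be the
autofs root (such as /net) rather than the NFS mount itself, the shell is
sitting in a scaffolding directory. nearest_mount_point is a hypothetical
helper.

```python
import os


def nearest_mount_point(path):
    """Walk upward from the resolved `path` to the closest mount point."""
    path = os.path.realpath(path)  # follows symlinks to the real location
    while not os.path.ismount(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the top without finding a mount
            break
        path = parent
    return path


print(nearest_mount_point(os.getcwd()))
```

Comparing the result against the filesystem type listed for that path in the
'mount' output would show whether it is an NFS mount or an autofs one.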

Comment 18 Ed Friedman 2005-04-18 17:23:56 UTC
Well, I don't directly cd to /net, but I do cd to /partition which is a symbolic
link to /net/machinename/partition.


Comment 19 Jeff Moyer 2006-04-17 21:45:39 UTC
At least one problem reported here has been resolved.  Namely, the automount
daemon will not remove directories for which there is an active reference.  As
such, I'm closing this bug.

There were, I think, 3 distinct bugs reported here.  For the others, please file
separate bugzillas.  One bug per bugzilla, please.

Thanks.

