225215 – autofs hangs

Summary: autofs hangs

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	autofs
Sub Component:
Version:	6
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ian Kent
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:

Reported:	2007-01-29 19:53 UTC by william hanlon
Modified:	2008-05-06 19:07 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-05-06 19:07:17 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
/var/log/messages with mount errors and attempted to stop autofs (6.11 KB, text/plain) 2007-01-29 19:53 UTC, william hanlon	no flags
sysrq-t dump during autofs hang. (355.12 KB, text/plain) 2007-01-31 21:58 UTC, william hanlon	no flags

Description william hanlon 2007-01-29 19:53:52 UTC

Description of problem:
after variable amounts of time (a day or more) autofs hangs. this is indicated
by the following symptoms:

1) shells on the machine where autofs is running "hang", e.g. after issuing a
command, the command successfully runs but the command prompt does not return
unless control-c is issued.

2) the load on the machine goes up continually.

3) trying to "ls" the mounted directory returns nothing, the shell just hangs
until control-c is issued.


Version-Release number of selected component (if applicable):
rpm: autofs-5.0.1-0.rc3.2

How reproducible:
The time between hangs is indeterminate, but autofs always hangs eventually
if the machine runs long enough.


Steps to Reproduce:
I don't know. This last time the machine ran for 9 days before this happened. On
other attempts it only went a couple of days before hanging. Log is attached.
  
Actual results:
autofs stops responding. only a reboot gets things working correctly again.

Expected results:
autofs should not hang or should at least be able to be cleanly killed and
restarted.

Additional info:
First, the NFS server that autofs is talking to may be partly to blame. The
hanging is always associated with a certain server running nfs-utils-1.0.1-2.9
(Red Hat 9). I'm mounting the mail spool from that machine and one client
periodically probes the mounted directory (xbiff).

When I issue "service autofs stop" no errors are reported but the problem continues.

ps reveals that a mount process is still running:
root      9450     1  0 11:52 ?        00:00:00 /bin/mount -t nfs -s -o rw,no_root_squash bessie:/var/spool/mail /bessie/mail
root      9451  9450  0 11:52 ?        00:00:00 /sbin/mount.nfs bessie:/var/spool/mail /bessie/mail -s -o rw,no_root_squash

9450 can be killed. 9451 can not be killed even with SIGKILL. When attempting to
kill 9451 the shell hangs until control-c is issued.

After shutting down autofs /etc/mtab indicates all autofs files are unmounted
but /proc/mounts indicates otherwise.

cat /etc/mtab
/dev/mapper/VolGroup00-LogVol00 / ext3 rw 0 0
proc /proc proc rw 0 0
sysfs /sys sysfs rw 0 0
devpts /dev/pts devpts rw,gid=5,mode=620 0 0
/dev/hda1 /boot ext3 rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0

cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
none /selinux selinuxfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/hda1 /boot ext3 rw,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
/etc/auto.bessie /bessie autofs rw,fd=6,pgrp=2064,timeout=300,minproto=5,maxproto=5,indirect 0 0
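
The difference between the two listings above can also be checked mechanically. The following is a minimal Python sketch rather than anything autofs ships: it assumes both files use the usual whitespace-separated "device mountpoint fstype options" layout, and the names parse_mounts and leftover_mounts are hypothetical. It reports autofs or nfs mounts that /proc/mounts still lists but /etc/mtab does not, using a few lines from the listings above as sample input.

```python
def parse_mounts(text):
    """Return a set of (mountpoint, fstype) pairs from mtab-style text."""
    entries = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:  # skip blank or malformed lines
            entries.add((fields[1], fields[2]))
    return entries


def leftover_mounts(mtab_text, proc_mounts_text, fstypes=("autofs", "nfs")):
    """Mounts of the given types known to the kernel but absent from mtab."""
    kernel_only = parse_mounts(proc_mounts_text) - parse_mounts(mtab_text)
    return sorted(entry for entry in kernel_only if entry[1] in fstypes)


# Sample input: a subset of the two listings in this report.
MTAB = """\
/dev/hda1 /boot ext3 rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
"""

PROC_MOUNTS = """\
/dev/hda1 /boot ext3 rw,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
/etc/auto.bessie /bessie autofs rw,fd=6,pgrp=2064,timeout=300,minproto=5,maxproto=5,indirect 0 0
"""

if __name__ == "__main__":
    for mountpoint, fstype in leftover_mounts(MTAB, PROC_MOUNTS):
        print(mountpoint, fstype)
```

On a live system the two strings would be read from /etc/mtab and /proc/mounts instead of the embedded samples.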

Note that in the attached log the "mount still busy /bessie" messages start
occurring after 2.5 days of uptime, but no problems are noticed with the mount
until 9 days of uptime.

Comment 1 william hanlon 2007-01-29 19:53:52 UTC

Created attachment 146857
/var/log/messages with mount errors and attempted to stop autofs

Comment 2 Ian Kent 2007-01-30 03:07:12 UTC

(In reply to comment #0)
> Description of problem:
> after variable amounts of time (a day or more) autofs hangs. this is indicated
> by the following symptoms:
> 
> 1) shells on the machine where autofs is running "hang", e.g. after issuing a
> command, the command successfully runs but the command prompt does not return
> unless control-c is issued.
> 
> 2) the load on the machine goes up continually.

Is the automount process or the mount(8) process you
saw using a lot of CPU?

> 
> 3) trying to "ls" the mounted directory returns nothing, the shell just hangs
> until control-c is issued.

Yes, autofs can't send further notifications for that
directory until the previous one has completed.

> 
> 
> Version-Release number of selected component (if applicable):
> rpm: autofs-5.0.1-0.rc3.2

I don't think it will make a difference to this issue but
could you update autofs to revision 0.rc3.12 please. It
should be in updates testing.

> 
> How reproducible:
> The time between hangs is indeterminate, but autofs always hangs eventually
> if the machine runs long enough.
> 
> 
> Steps to Reproduce:
> I don't know. This last time the machine ran for 9 days before this happened. On
> other attempts it only went a couple of days before hanging. Log is attached.
>   
> Actual results:
> autofs stops responding. only a reboot gets things working correctly again.
> 
> Expected results:
> autofs should not hang or should at least be able to be cleanly killed and
> restarted.

If a mount process is in the D state then we have no choice
but to work out what is causing that. We can't do anything
with a process in the D state.

> 
> Additional info:
> First, the NFS server that autofs is talking to may be partly to blame. The
> hanging is always associated with a certain server running nfs-utils-1.0.1-2.9
> (Red Hat 9). I'm mounting the mail spool from that machine and one client
> periodically probes the mounted directory (xbiff).
> 
> When I issue "service autofs stop" no errors are reported but the problem
> continues.
> 
> ps reveals that a mount process is still running:
> root      9450     1  0 11:52 ?        00:00:00 /bin/mount -t nfs -s -o
> rw,no_root_squash bessie:/var/spool/mail /bessie/mail
> root      9451  9450  0 11:52 ?        00:00:00 /sbin/mount.nfs
> bessie:/var/spool/mail /bessie/mail -s -o rw,no_root_squash
> 
> 9450 can be killed. 9451 can not be killed even with SIGKILL. When attempting to
> kill 9451 the shell hangs until control-c is issued.

This is clearly a problem for autofs.
We've had a few problems with mount(8) and it is often difficult
to work out what is wrong. Unfortunately, autofs requires that
mount(8) work without problem. Even if there is a problem
mounting something it should return after a rather lengthy 
timeout and the problem should go away after that.

Is process 9451 in the D state when this happens (ps ax will
show it)?
 
> 
> After shutting down autofs /etc/mtab indicates all autofs files are unmounted
> but /proc/mounts indicates otherwise.

That is normal for autofs v5.
A restart in this case would work fine except that those
mount processes are probably preventing autofs from continuing
normally after the restart.

> 
> cat /etc/mtab
> /dev/mapper/VolGroup00-LogVol00 / ext3 rw 0 0
> proc /proc proc rw 0 0
> sysfs /sys sysfs rw 0 0
> devpts /dev/pts devpts rw,gid=5,mode=620 0 0
> /dev/hda1 /boot ext3 rw 0 0
> tmpfs /dev/shm tmpfs rw 0 0
> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
> 
> cat /proc/mounts
> rootfs / rootfs rw 0 0
> /dev/root / ext3 rw,data=ordered 0 0
> /dev /dev tmpfs rw 0 0
> /proc /proc proc rw 0 0
> /sys /sys sysfs rw 0 0
> none /selinux selinuxfs rw 0 0
> /proc/bus/usb /proc/bus/usb usbfs rw 0 0
> devpts /dev/pts devpts rw 0 0
> /dev/hda1 /boot ext3 rw,data=ordered 0 0
> tmpfs /dev/shm tmpfs rw 0 0
> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
> /etc/auto.bessie /bessie autofs
> rw,fd=6,pgrp=2064,timeout=300,minproto=5,maxproto=5,indirect 0 0

In itself this doesn't matter as autofs will "umount -l" any
pre-existing mounts at startup. The existing mount running
that can't be killed is an indication of a problem, perhaps
within the RPC subsystem. We've heard of this before and
it's always attributed to autofs but I'm not convinced that
is the case.

> 
> Note that in the attached log the "mount still busy /bessie" messages start
> occurring after 2.5 days of uptime, but no problems are noticed with the mount
> until 9 days of uptime.

Once again that may not be a problem as it is saying that
the mount can't be umounted yet because it is still busy
for some reason, which is often the case. However, it could
also be because when the mount first failed to complete the
process may (depending on how far it got with the mount)
still hold a reference to that directory so the containing
autofs mount will be seen as busy.

Ian

Comment 3 Ian Kent 2007-01-30 03:11:01 UTC

(In reply to comment #2)
> 
> If a mount process is in the D state then we have no choice
> but to work out what is causing that. We can't do anything
> with a process in the D state.

If this is the case would it be possible to get a sysrq-t
dump of the system please.

Ian
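
For reference, a sysrq-t style task dump can also be requested without the keyboard by writing the character t to /proc/sysrq-trigger as root; the kernel then writes the state of every task to its log. What follows is a minimal Python sketch, not something taken from this report: the helper name request_task_dump is hypothetical, and the path is exposed as a parameter so the function can be pointed at a scratch file for a dry run.

```python
def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Write 't' to the sysrq trigger file to ask the kernel for a task dump.

    Writing to the real trigger file needs root privileges. The dump itself
    appears in the kernel log, not in this function's return value.
    """
    with open(trigger_path, "w") as trigger:
        trigger.write("t")
```

Calling request_task_dump() with no argument as root should produce the same kind of output as the attached sysrq-t dump; the result can then be collected from the kernel log (for example with dmesg or from /var/log/messages).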

Comment 4 william hanlon 2007-01-31 21:58:16 UTC

Created attachment 147054
sysrq-t dump during autofs hang.

Comment 5 william hanlon 2007-01-31 22:01:58 UTC

It's been two days since my first report and autofs has hung again.

(In reply to comment #2)
> (In reply to comment #0)
> > 2) the load on the machine goes up continually.
> 
> Is the automount process or the mount(8) process you
> saw using a lot of CPU?

Neither process is using lots of cpu. I see this behavior whenever machines lose
NFS mounts. Top shows nothing unusual as far as a single process eating up CPU
time, but the load keeps increasing.

> > Version-Release number of selected component (if applicable):
> > rpm: autofs-5.0.1-0.rc3.2
> 
> I don't think it will make a difference to this issue but
> could you update autofs to revision 0.rc3.12 please. It
> should be in updates testing.

I haven't done it yet. I'll see what I can do. This is my workstation at work
and I'm reluctant to make it into a test platform.

> > How reproducible:
> > indeterminate amount of time occurs between hangs, but always eventually hangs
> > the longer the machine runs.

It's been two days since the last autofs hang.

> > Expected results:
> > autofs should not hang or should at least be able to be cleanly killed and
> > restarted.
> 
> If a mount process is in the D state then we have no choice
> but to work out what is causing that. We can't do anything
> with a process in the D state.
> 
> > ps reveals that a mount process is still running:
> > root      9450     1  0 11:52 ?        00:00:00 /bin/mount -t nfs -s -o
> > rw,no_root_squash bessie:/var/spool/mail /bessie/mail
> > root      9451  9450  0 11:52 ?        00:00:00 /sbin/mount.nfs
> > bessie:/var/spool/mail /bessie/mail -s -o rw,no_root_squash
> > 
> > 9450 can be killed. 9451 can not be killed even with SIGKILL. When attempting to
> > kill 9451 the shell hangs until control-c is issued.
> 
> This is clearly a problem for autofs.
> We've had a few problems with mount(8) and it is often difficult
> to work out what is wrong. Unfortunately, autofs requires that
> mount(8) work without problem. Even if there is a problem
> mounting something it should return after a rather lengthy 
> timeout and the problem should go away after that.
> 
> Is process 9451 in the D state when this happens (ps ax will
> show it)?

Here it is today in D state:

ps ax  | grep bessie
22099 ?        S      0:00 /bin/mount -t nfs -s -o rw,no_root_squash bessie:/var/spool/mail /bessie/mail
22100 ?        D      0:00 /sbin/mount.nfs bessie:/var/spool/mail /bessie/mail -s -o rw,no_root_squash

Included is the sysrq-t dump shortly after the hang occurred.
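
Spotting such processes can be scripted. On Linux, tasks in uninterruptible sleep (state D) count toward the load average even though they use no CPU, which would be consistent with the load climbing while top shows nothing busy. The sketch below is a hedged Python example, not an autofs or procps tool: it assumes the five-column "PID TTY STAT TIME COMMAND" layout that ps ax printed above, and the function name uninterruptible is hypothetical.

```python
def uninterruptible(ps_ax_output):
    """Return (pid, command) for each process whose STAT field starts with 'D'."""
    found = []
    for line in ps_ax_output.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header line or a wrapped continuation line
        pid, _tty, stat, _time, command = fields
        if stat.startswith("D"):
            found.append((int(pid), command.strip()))
    return found


# Sample input: the two processes shown in this comment, one per line.
SAMPLE = """\
  PID TTY      STAT   TIME COMMAND
22099 ?        S      0:00 /bin/mount -t nfs -s -o rw,no_root_squash bessie:/var/spool/mail /bessie/mail
22100 ?        D      0:00 /sbin/mount.nfs bessie:/var/spool/mail /bessie/mail -s -o rw,no_root_squash
"""

if __name__ == "__main__":
    for pid, command in uninterruptible(SAMPLE):
        print(pid, command)
```

With the sample above only PID 22100 (mount.nfs) is reported; the parent /bin/mount in state S is skipped.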

Comment 6 william hanlon 2007-01-31 22:36:31 UTC

(In reply to comment #2)
> > Version-Release number of selected component (if applicable):
> > rpm: autofs-5.0.1-0.rc3.2
> 
> I don't think it will make a difference to this issue but
> could you update autofs to revision 0.rc3.12 please. It
> should be in updates testing.
> 

I forgot to mention that this problem did not exist in Red Hat 9, which I ran for
several years. Then recently I did a fresh install of FC5 and the problem started
for me then. I did a fresh install of FC6 in the hopes this problem would go away.

Comment 7 Ian Kent 2007-02-01 04:04:29 UTC

(In reply to comment #6)
> (In reply to comment #2)
> > > Version-Release number of selected component (if applicable):
> > > rpm: autofs-5.0.1-0.rc3.2
> > 
> > I don't think it will make a difference to this issue but
> > could you update autofs to revision 0.rc3.12 please. It
> > should be in updates testing.
> > 
> 
> I forgot to mention that this problem did not exist in Red Hat 9, which I ran for
> several years. Then recently I did a fresh install of FC5 and the problem started
> for me then. I did a fresh install of FC6 in the hopes this problem would go away.

Yes, this problem has been around for a long time.
Perhaps the biggest problem is that people, including me, have
thought this may be autofs or mount or NFS when it's probably
none of these.

My recent investigation points me to the socket layer in the
kernel. I'm not sure how to proceed just yet. I'll keep looking
while I work out the best way to get help with this.

Ian

Comment 8 Bug Zapper 2008-04-04 05:47:13 UTC

Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 9 Bug Zapper 2008-05-06 19:07:15 UTC

This bug is open for a Fedora version that is no longer maintained and
will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
