151431 – automount hangs due to unsafe call in signal handler

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	autofs
Sub Component:
Version:	3.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jeff Moyer
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	154224
Depends On:
Blocks:	156321

Reported:	2005-03-17 20:39 UTC by Sev Binello
Modified:	2007-11-30 22:07 UTC
CC List:	3 users
Fixed In Version:	RHBA-2005-654
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-09-28 19:10:33 UTC
Target Upstream Version:
Embargoed:

Attachments
gzipped debug file for autofs (1.66 MB, text/plain) 2005-03-18 20:39 UTC, Sev Binello
gzipped autofs debug log file #2 (5.83 MB, text/plain) 2005-03-31 14:23 UTC, Sev Binello
tar file of autofs map files for problem system (10.00 KB, text/plain) 2005-03-31 16:50 UTC, Sev Binello
comment out syslogs in signal handler context (3.13 KB, patch) 2005-04-19 14:31 UTC, Jeff Moyer
rpm with syslog patch applied (193.82 KB, application/octet-stream) 2005-04-19 15:21 UTC, Jeff Moyer

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2005:654	0	qe-ready	SHIPPED_LIVE	autofs bug fix update	2005-09-28 04:00:00 UTC

Description Sev Binello 2005-03-17 20:39:10 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.2) Gecko/20040301

Description of problem:
The automount daemon seems to hang and will not mount (or expire) anything.
I attached strace to automount and saw the following:
[root@acnlin86 tmp]# strace -p 3719
Process 3719 attached - interrupt to quit
futex(0x24720c, FUTEX_WAIT, 2, NULL

Tried to mount filesystems while strace was attached,
saw absolutely no activity.



Version-Release number of selected component (if applicable):
autofs-4.1.3-47/2.4.21-27.0.2  autofs-4.1.3-104/2.4.21-27.0.1

How reproducible:
Couldn't Reproduce


Additional info:

Couldn't reproduce, but the problem seems to have started
when a file server exporting a filesystem went down.
Automount never recovered after that.

Comment 1 Jeff Moyer 2005-03-17 22:01:29 UTC

Is there a hung umount process?  Can you manually umount the filesystem that was
mounted from the server that went down?  What is the output of alt-sysrq-t when
this happens?  What do the logs show?

Comment 2 Sev Binello 2005-03-17 22:41:42 UTC

Unfortunately didn't try all the things you mentioned. 
Will keep this in mind next time.

But I could manually mount filesystems, 
though didn't try the one that had previously failed.

The machine was up and functioning so I didn't do an alt-sysrq-t.

Neither /proc/mounts nor /etc/mtab showed the filesystem that
had gone bad, nor any of the ones that couldn't be mounted.
So I didn't try unmounting it.

If they had been previously mounted then they were okay, but no
new ones could be mounted.

When we tried to reboot, we got a lot of these messages...
 NXNODE 1.3.2-25[28966]: ERROR: file match line: cannot open file
'/.nx/C-acnlin86.pbn.bnl.gov-1114-0BD1438F69351E511DE69789FE2A43B4/session': No
such file or directory 'main:nxnode_ee:4383'
kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice
day..

Followed by a kernel panic
Wrote down the following stack info...

eip @destroy_inode
dput
link_path_walk
default_do_nmi
path_lookup
open_namei
filp_open
sys_open

I can send the /var/log/message file if that helps

We actually had several machines crash with similar messages
(i.e. "Have a nice day") when the machine exporting the filesystem went bad.


We got messages like this:
MVFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice day...
automount[25838]: >> mount: RPC: Port mapper failure - RPC: Timed out
automount[25838]: mount(nfs): nfs: mount failure acnlin31.pbn.bnl.gov:/cfsi on
/cfs/i

Comment 3 Sev Binello 2005-03-18 15:36:39 UTC

We seem to be in the same state now, i.e. automount is not expiring or mounting
anything new. No problematic filesystems this time.
Unmounting any mounted filesystem sort of works,
i.e. it disappears from /proc/mounts but is still present in /etc/mtab
and the mount point still exists in the auto.xxx directory.
I see no umount messages in /var/log/messages.

Automount daemon looks hung in....
[root@acnlin86 root]# strace -p 3745
Process 3745 attached - interrupt to quit
futex(0x3f320c, FUTEX_WAIT, 2, NULL

We have 2 automount daemons, we are only having problems with one of them
and it's always for the same filesystem ??

Here are some of the results you asked for:
acnlin86 102:ps -elf | grep mount
1 S root      3743     1  0  75   0    -   441 -      Mar17 ?        00:00:02 /usr/sbin/automount --timeout=60 --debug /misc file /etc/auto.misc
1 S root      3877     1  0  85   0    -   438 -      Mar17 ?        00:00:00 rpc.mountd
1 S root      3745     1  0  75   0    -   440 -      Mar17 ?        00:00:00 /usr/sbin/automount --timeout=60 --debug /cfs file /etc/auto.cfs

Let me know what other info I can get to you while the machine is in this state.
Should we try restarting autofs ?

Comment 4 Jeff Moyer 2005-03-18 20:27:54 UTC

Debug logs.  I see you have debugging enabled.   Do you also send all messages
to a debug log?  Something like this in your syslog.conf would do the trick:

*.*    /var/log/debug

You mentioned 2 different versions of the kernel and automounter.  When you post
test results, please let me know which versions you are running.

The busy inodes after umount issue is being tracked in bz #124600.  You may want
to add yourself to the CC list there, though that isn't the main bug you are
running into.

So, in summary, please get me debug logs.

Thanks.

Comment 5 Sev Binello 2005-03-18 20:39:30 UTC

Created attachment 112138 [details]
gzipped debug file for autofs

debug file created by automount

Comment 6 Sev Binello 2005-03-18 20:49:06 UTC

The info I am (and have been) sending is for
kernel 2.4.21-27.0.2.EL
WS release 3 (Taroon Update 3). 

The first set of info I sent was for autofs-4.1.3-47
We then upgraded to autofs-4.1.3-104,
So the second set of info was for  autofs-4.1.3-104

Comment 7 Sev Binello 2005-03-31 14:23:43 UTC

Created attachment 112511 [details]
gzipped autofs debug log file #2

The problem is continuing and is consistent on only one of our machines.
Rebooting does not help, since it quickly reverts to the bad state,
where the daemon hangs in a futex wait and no longer expires or mounts
filesystems.
Even stranger is the fact that the problem seems to occur mostly with only one
automount daemon on this system.
I will attach the debug log in case anyone is still looking into this problem.


Currently, we have to manually mount the filesystems on this machine.

Comment 8 Jeff Moyer 2005-03-31 16:00:31 UTC

Yes, I'm still working on this.  Could you please try the following kernel:

http://people.redhat.com/dhoward/bz124600/

This will likely not resolve your autofs issues, but I would like to know if you
still get the panics and the busy inode after umount messages.

I'm looking at your logs now.

Comment 9 Jeff Moyer 2005-03-31 16:04:34 UTC

Could you post the map file for the troublesome automount?

Thanks.

Comment 10 Sev Binello 2005-03-31 16:50:49 UTC

Created attachment 112518 [details]
tar file of autofs map files for problem system

Attached is a tar file containing the map files for our problem system.
Not sure about the kernel upgrade; we can't reproduce the panics at will.

Comment 11 Sev Binello 2005-04-11 14:02:11 UTC

Jeff -

    I noticed some comments in issue 12 of autofs Digest about a hanging autofs
condition...

"It's possible for an event wait request to arive before the event
requestor. If this happens the daemon never gets notified and autofs
hangs."


Could this problem be behind our hanging autofs as well,
i.e. bug 151431?

Thanks
-Sev

Comment 12 Jeff Moyer 2005-04-11 17:52:22 UTC

I'm not sure.  I've requested more information on this specific patch.

Comment 13 Jeff Moyer 2005-04-11 22:22:37 UTC

This may be a duplicate of bz #144729.

Comment 14 Sev Binello 2005-04-12 14:13:05 UTC

The symptoms seem to be similar.
However, it mentions the problem went away when the --ghost option was removed.
We do not use that option, so that won't help.
It would have been interesting to see if the daemon in bz 144729 was stuck on a
futex, but I saw no mention.

Comment 15 Jeff Moyer 2005-04-12 14:19:09 UTC

Oh, duh!  The futex....  Thanks for mentioning that again.  It seems that autofs
will issue syslog(3) calls while in a signal handler.  This is a no-no, since
syslog(3) is not async-signal-safe, and can result in the automount process
hanging.

See bug 154224.  I put together this patch:

  https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=112984

But it is against 4.1.4_beta2.  I'll put together a patch against our package
and post it for you to try.

Comment 16 Sev Binello 2005-04-12 18:57:58 UTC

Ok, looking forward to the patched package.
I don't seem to have permission to view bug 154224, which you mentioned.

Comment 17 Sev Binello 2005-04-19 14:21:51 UTC

Hi Jeff-
   Since today is our maintenance day,
   I was wondering if you got around to putting together a patch for us. Thanks

Comment 18 Jeff Moyer 2005-04-19 14:31:56 UTC

Created attachment 113359 [details]
comment out syslogs in signal handler context

Dan Berrange put together this patch to verify the problem.  If you apply this,
the problem should go away, but we won't get any of the log information from
signal handlers.  In other words, this patch is by no means the solution, but
it should help to verify we are addressing the right problem in your
environment.

I'm currently working with upstream to resolve the problem in a more permanent
fashion.  The proper fix will take another week or two to hammer out.

Please try this patch, and let me know if it resolves your issues.

Thanks!

Comment 19 Sev Binello 2005-04-19 15:15:08 UTC

Would like to try it.
But we don't have source for autofs-4.1.3-104.
Would you happen to have an rpm package ready to go ?
Thanks

Comment 20 Jeff Moyer 2005-04-19 15:21:32 UTC

Created attachment 113364 [details]
rpm with syslog patch applied

Here is an i386 rpm, based on autofs-4.1.3-120, which includes the syslog
patch.  Please give this a try.

Thanks.

Comment 21 Sev Binello 2005-04-19 15:36:38 UTC

Will do, I'll keep you posted.
Thanks

Comment 22 Jeff Moyer 2005-05-12 19:27:59 UTC

Does this patch resolve your hangs?  Did you have a chance to try it?

Thanks.

Comment 23 Sev Binello 2005-05-12 20:04:54 UTC

Yes, it did.
Let me know when there is a permanent fix.
Thanks.

Comment 25 Sev Binello 2005-05-24 15:35:44 UTC

Jeff -
Can you tell me if the current release of autofs, 4.1.3-130,
contains a fix for this problem?
Thanks

Comment 26 Jeff Moyer 2005-05-24 15:56:59 UTC

autofs-4.1.3-130 does not contain the fix for this problem.

Comment 27 Sev Binello 2005-06-01 14:10:06 UTC

I was wondering if you could provide an rpm for 4.1.3-130
with the patch you sent us earlier.
This way we can upgrade autofs on some of our systems
experiencing mount problems. 
Thanks

Comment 28 Sev Binello 2005-06-06 19:41:26 UTC

Could you advise on the best course of action for us?
We have a large number of systems on which we need to upgrade autofs
to prevent failed mounts or hung daemons.
Any idea when the fix above will be released?
Should we upgrade with the patched version you gave us earlier?
Or is there a more recent version that we could use?

Thanks

Comment 29 Jeff Moyer 2005-06-07 13:59:29 UTC

I can't release anything in a supported fashion.  If you would like, I can apply
the patch I made for 4.1.3-120 to the 4.1.3-130 RPM.  Most likely, the solution
will be the reentrant syslog implementation that is being developed upstream.  I
am targeting U6 for that bug fix.  Can you wait that long?
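
As a rough idea of what logging that is safe in signal context can look like,
here is a minimal sketch that formats a message by hand and emits it with a
single write(2), which POSIX lists as async-signal-safe. This is not the
reentrant syslog implementation mentioned above, whose design is not shown in
this report; format_ulong() and safe_log_signal() are hypothetical names, and
the target file descriptor is assumed to be opened ahead of time.

```c
#include <stddef.h>
#include <unistd.h>

/* Write the decimal digits of value into buf (no terminating NUL).
 * Returns the number of characters written, or 0 if buflen is too small.
 * Uses no stdio and no heap, so it can run inside a signal handler. */
size_t format_ulong(unsigned long value, char *buf, size_t buflen)
{
    char tmp[3 * sizeof(unsigned long)]; /* room for every decimal digit */
    size_t n = 0;
    size_t i;

    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    if (n > buflen)
        return 0;
    for (i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];   /* digits were produced in reverse */
    return n;
}

/* Build "<prefix><signo>\n" in a stack buffer and emit it with one write().
 * signo is assumed to be non-negative. Returns what write() returns. */
ssize_t safe_log_signal(int fd, const char *prefix, int signo)
{
    char line[128];
    size_t len = 0;

    /* Copy the prefix by hand, leaving room for the number and newline. */
    while (prefix[len] != '\0' && len < sizeof(line) - 32) {
        line[len] = prefix[len];
        len++;
    }
    len += format_ulong((unsigned long)signo, line + len,
                        sizeof(line) - len - 1);
    line[len++] = '\n';
    return write(fd, line, len);
}
```

A handler could call safe_log_signal(fd, "caught signal ", signo) with a
descriptor opened at startup. The trade-off is that the message bypasses
syslogd, so timestamps and priorities would have to be added by hand.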

Comment 30 Sev Binello 2005-06-07 14:55:48 UTC

Any idea of a time frame for U6 ?
Is there an alternative to waiting ?
If not, then I guess trying the latest version 
with the patch applied makes some sense.
Thanks.

Comment 31 Jeff Moyer 2005-06-07 15:09:08 UTC

If you have a support contract with Red Hat, then the proper method for getting
this resolved more quickly is to go through Issue Tracker (or through your TAM).

I'll put together an RPM for you with the patch listed in this bugzilla.  This,
unfortunately, will not be a supported RPM.  What that means is that before you
report any bugs on autofs, you'll have to reproduce them on an unpatched
autofs-4.1.3-130.

Comment 32 Sev Binello 2005-06-07 15:26:42 UTC

Ok thanks.

Comment 33 Sev Binello 2005-06-07 15:37:56 UTC

Not familiar with Issue Tracker.
How do I get there?

Comment 34 Jeff Moyer 2005-06-07 16:21:17 UTC

https://enterprise.redhat.com/issue-tracker/

Comment 41 Jeff Moyer 2005-06-08 17:02:18 UTC

*** Bug 154224 has been marked as a duplicate of this bug. ***

Comment 51 Jeff Moyer 2005-07-12 15:30:58 UTC

A fix for this was built into autofs version 4.1.3-138.

Comment 58 Red Hat Bugzilla 2005-09-28 19:10:33 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-654.html
