Bug 151431
| Summary: | automount hangs due to unsafe call in signal handler | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Sev Binello <sev> |
| Component: | autofs | Assignee: | Jeff Moyer <jmoyer> |
| Status: | CLOSED ERRATA | QA Contact: | Brock Organ <borgan> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.0 | CC: | cfeist, rak, tao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2005-654 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2005-09-28 19:10:33 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 156321 | | |
| Attachments: | | | |
Description
Sev Binello
2005-03-17 20:39:10 UTC
Is there a hung umount process? Can you manually umount the filesystem that was mounted from the server that went down? What is the output of alt-sysrq-t when this happens? What do the logs show?

Unfortunately, I didn't try all the things you mentioned. I will keep this in mind next time. I could manually mount filesystems, though I didn't try the one that had previously failed. The machine was up and functioning, so I didn't do an alt-sysrq-t.

/proc/mounts and /etc/mtab didn't show the filesystem that had gone bad, nor did they show any of the ones that couldn't be mounted, so I didn't try unmounting it. Filesystems that had been mounted previously were okay, but no new ones could be mounted.

When we tried to reboot, we got a lot of these messages:

    NXNODE 1.3.2-25[28966]: ERROR: file match line: cannot open file '/.nx/C-acnlin86.pbn.bnl.gov-1114-0BD1438F69351E511DE69789FE2A43B4/session': No such file or directory 'main:nxnode_ee:4383'
    kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day..

This was followed by a kernel panic. I wrote down the following stack info:

    eip @destroy_inode
    dput
    link_path_walk
    default_do_nmi
    path_lookup
    open_namei
    filp_open
    sys_open

I can send the /var/log/messages file if that helps.

We actually had several machines crash with similar messages (i.e. "Have a nice day") when the machine exporting the filesystem went bad. We got messages like this:

    MVFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
    automount[25838]: >> mount: RPC: Port mapper failure - RPC: Timed out
    automount[25838]: mount(nfs): nfs: mount failure acnlin31.pbn.bnl.gov:/cfsi on /cfs/i

We seem to be in the same state now, i.e. automount is not expiring or mounting anything new. There are no problematic filesystems this time. Umounting any mounted filesystem sort of works: it disappears from /proc/mounts but is still present in /etc/mtab, and the mount point still exists in the auto.xxx directory.
I see no umount messages in /var/log/messages. The automount daemon looks hung in:

    [root@acnlin86 root]# strace -p 3745
    Process 3745 attached - interrupt to quit
    futex(0x3f320c, FUTEX_WAIT, 2, NULL

We have 2 automount daemons. We are only having problems with one of them, and it's always for the same filesystem??

Here are some results you asked for:

    acnlin86 102:ps -elf | grep mount
    1 S root 3743 1 0 75 0 - 441 - Mar17 ? 00:00:02 /usr/sbin/automount --timeout=60 --debug /misc file /etc/auto.misc
    1 S root 3877 1 0 85 0 - 438 - Mar17 ? 00:00:00 rpc.mountd
    1 S root 3745 1 0 75 0 - 440 - Mar17 ? 00:00:00 /usr/sbin/automount --timeout=60 --debug /cfs file /etc/auto.cfs

Let me know what other info I can get to you while the machine is in this state. Should we try restarting autofs?

Debug logs. I see you have debugging enabled. Do you also send all messages to a debug log? Something like this in your syslog.conf would do the trick:

    *.* /var/log/debug

You mentioned 2 different versions of the kernel and automounter. When you post test results, please let me know which versions you are running.

The busy inodes after umount issue is being tracked in bz #124600. You may want to add yourself to the CC list there, though that isn't the main bug you are running into.

So, in summary, please get me debug logs. Thanks.

Created attachment 112138 [details]
gzipped debug file for autofs
debug file created by automount
The info I am (and have been) sending is for kernel 2.4.21-27.0.2.EL WS release 3 (Taroon Update 3). The first set of info I sent was for autofs-4.1.3-47. We then upgraded to autofs-4.1.3-104, so the second set of info was for autofs-4.1.3-104.

Created attachment 112511 [details]
gzipped autofs debug log file #2
The problem is continuing and consistent on only one of our machines.
Rebooting does not help, since it quickly reverts to the bad state,
where the daemon hangs in a futex wait, and it no longer expires or mounts
filesystems.
Even stranger is the fact that the problem seems to occur mostly with only one
automount daemon on this system.
I will attach the debug log in case anyone is still looking into this problem.
Currently, we have to manually mount the filesystems on this machine.
Yes, I'm still working on this. Could you please try the following kernel:

http://people.redhat.com/dhoward/bz124600/

This will likely not resolve your autofs issues, but I would like to know if you still get the panics and the busy inode after umount messages.

I'm looking at your logs now. Could you post the map file for the troublesome automount? Thanks.

Created attachment 112518 [details]
tar file of autofs map files for problem system
Attached is a tar file containing the map files for our problem system.
Not sure about the kernel upgrade, can't reproduce the panics at will.
Jeff - I noticed some comments in issue 12 of autofs Digest about a hanging autofs condition:

"It's possible for an event wait request to arrive before the event requestor. If this happens the daemon never gets notified and autofs hangs."

Could this problem be behind our hanging autofs as well, i.e. bug 151431? Thanks -Sev

I'm not sure. I've requested more information on this specific patch.

This may be a duplicate of bz #144729.

The symptoms seem to be similar. However, it mentions the problem went away when the --ghost option was removed. We do not use that option, so that won't help. It would have been interesting to see if the daemon in bz 144729 was stuck on a futex, but I saw no mention.

Oh, duh! The futex.... Thanks for mentioning that again. It seems that autofs will issue syslog(3) calls while in a signal handler. This is a no no, and can result in the automount process hanging. See bug 154224. I put together this patch:

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=112984

But it is against 4.1.4_beta2. I'll put together a patch against our package and post it for you to try.

Ok, looking forward to the patched package. I didn't seem to have permission to view bug 154224, which you mentioned.

Hi Jeff - Since today is our maintenance day, I was wondering if you got around to putting together a patch for us. Thanks

Created attachment 113359 [details]
comment out syslogs in signal handler context
Dan Berrange put together this patch to verify the problem. If you apply this,
the problem should go away, but we won't get any of the log information from
signal handlers. In other words, this patch is by no means the solution, but
it should help to verify we are addressing the right problem in your
environment.
I'm currently working with upstream to resolve the problem in a more permanent
fashion. The proper fix will take another week or two to hammer out.
Please try this patch, and let me know if it resolves your issues.
Thanks!
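
The diagnosis in this thread is that automount calls syslog(3) from signal-handler context. syslog(3) is not on the POSIX list of async-signal-safe functions: if a signal arrives while the interrupted code is already inside syslog() holding its internal lock, the handler's own syslog() call can wait on that lock forever, which would be consistent with the futex(FUTEX_WAIT) hang seen in the strace output. The sketch below shows one common way to keep the log message without calling syslog() in the handler: the handler only records the signal number, and the main loop logs it later. It is an illustration only, not the autofs patch or the upstream fix. The names pending_signal, note_signal, install_deferred_handler, drain_pending_signal and DEFERRED_LOG_DEMO are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <syslog.h>

/* Hypothetical names throughout; none of this is taken from autofs. */

/* Written by the handler, read by the main loop. volatile sig_atomic_t
 * is the type the C standard allows a handler to store to safely. */
static volatile sig_atomic_t pending_signal = 0;

/* Handler: only records which signal arrived. No syslog() here. */
static void note_signal(int signo)
{
    pending_signal = signo;
}

/* Install note_signal for signo. Returns 0 on success, -1 on error. */
int install_deferred_handler(int signo)
{
    struct sigaction sa;

    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}

/* Call from the main loop, outside signal context. Logs and clears the
 * recorded signal. Returns its number, or 0 if nothing was pending.
 * Simplification: a signal that lands between the read and the clear
 * is lost, and only the most recent signal number is remembered. */
int drain_pending_signal(void)
{
    int signo = pending_signal;

    if (signo != 0) {
        pending_signal = 0;
        syslog(LOG_DEBUG, "caught signal %d", signo);
    }
    return signo;
}

#ifdef DEFERRED_LOG_DEMO /* build with -DDEFERRED_LOG_DEMO to run standalone */
int main(void)
{
    assert(install_deferred_handler(SIGUSR1) == 0);
    raise(SIGUSR1);                            /* handler only sets the flag */
    assert(drain_pending_signal() == SIGUSR1); /* logging happens here */
    return 0;
}
#endif
```

In a daemon, drain_pending_signal() would be called once per iteration of the main loop, so the message still reaches the log but is never written from inside the handler.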
Would like to try it, but we don't have source for autofs-4.1.3-104. Would you happen to have an rpm package ready to go? Thanks

Created attachment 113364 [details]
rpm with syslog patch applied
Here is an i386 rpm, based on autofs-4.1.3-120, which includes the syslog
patch. Please give this a try.
Thanks.
Will do, I'll keep you posted. Thanks

Does this patch resolve your hangs? Did you have a chance to try it? Thanks.

Yes, it did. Let me know when there is a permanent fix. Thanks.

Jeff - Can you tell me if the current release of autofs, 4.1.3-130, contains a fix for this problem? Thanks

autofs-4.1.3-130 does not contain the fix for this problem.

I was wondering if you could provide an rpm for 4.1.3-130 with the patch you sent us earlier. This way we can upgrade autofs on some of our systems experiencing mount problems. Thanks

Could you advise on the best course of action for us? We have a large number of systems on which we need to upgrade autofs to prevent failed mounts or hung daemons. Any idea when the fix above will be released? Should we upgrade with the patched version you gave us earlier, or is there a more recent version that we could use? Thanks

I can't release anything in a supported fashion. If you would like, I can apply the patch I made for 4.1.3-120 to the 4.1.3-130 RPM. Most likely, the solution will be the reentrant syslog implementation that is being developed upstream. I am targeting U6 for that bug fix. Can you wait that long?

Any idea of a time frame for U6? Is there an alternative to waiting? If not, then I guess trying the latest version with the patch applied makes some sense. Thanks.

If you have a support contract with Red Hat, then the proper method for getting this resolved more quickly is to go through Issue Tracker (or through your TAM). I'll put together an RPM for you with the patch listed in this bugzilla. This, unfortunately, will not be a supported RPM. What that means is that before you report any bugs on autofs, you'll have to reproduce them on an unpatched autofs-4.1.3-130.

Ok thanks. I'm not familiar with Issue Tracker. How do I get there?

*** Bug 154224 has been marked as a duplicate of this bug. ***

A fix for this was built into autofs version 4.1.3-138.
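
A reply in this thread expects the long-term fix to be a reentrant syslog implementation developed upstream; that implementation is not shown in this report. As a rough illustration of the general idea of keeping handler work async-signal-safe, the sketch below uses a different and widely used technique, the self-pipe: the handler calls only write(2), which POSIX lists as async-signal-safe, and the main loop reads the pipe and calls syslog(3) from normal context. All names (sig_pipe, forward_signal, signal_pipe_init, signal_pipe_drain, SIGNAL_PIPE_DEMO) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical names; this is not the upstream autofs implementation. */

static int sig_pipe[2] = { -1, -1 }; /* [0] = read end, [1] = write end */

/* Handler: write(2) is async-signal-safe, syslog(3) is not. */
static void forward_signal(int signo)
{
    int saved_errno = errno; /* write() may change errno */
    unsigned char byte = (unsigned char)signo;
    ssize_t n = write(sig_pipe[1], &byte, 1);

    (void)n; /* if the pipe is full the event is dropped */
    errno = saved_errno;
}

/* Create the non-blocking pipe and install the handler for signo.
 * Returns 0 on success, -1 on error. */
int signal_pipe_init(int signo)
{
    struct sigaction sa;
    int i;

    if (pipe(sig_pipe) != 0)
        return -1;
    for (i = 0; i < 2; i++) {
        int flags = fcntl(sig_pipe[i], F_GETFL);

        if (flags == -1 ||
            fcntl(sig_pipe[i], F_SETFL, flags | O_NONBLOCK) == -1)
            return -1;
    }
    sa.sa_handler = forward_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}

/* Call from the main loop: log every queued signal with syslog(3).
 * Returns the number of signals logged. */
int signal_pipe_drain(void)
{
    unsigned char byte;
    int count = 0;

    while (read(sig_pipe[0], &byte, 1) == 1) {
        syslog(LOG_DEBUG, "caught signal %d", (int)byte);
        count++;
    }
    return count;
}

#ifdef SIGNAL_PIPE_DEMO /* build with -DSIGNAL_PIPE_DEMO to run standalone */
int main(void)
{
    assert(signal_pipe_init(SIGUSR2) == 0);
    raise(SIGUSR2);                   /* handler writes one byte */
    assert(signal_pipe_drain() == 1); /* logged from normal context */
    return 0;
}
#endif
```

Unlike a single flag, the pipe keeps one byte per handler invocation, and its read end can be added to a select(2) or poll(2) set so that a daemon's main loop wakes up when a signal arrives.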
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-654.html