From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007

Description of problem:
We are currently seeing the following error messages appearing in the RHEL 3.0 system logs (/var/log/messages):

kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

These error messages appear to be related to automount unmounts; they appear to always be immediately preceded or followed by an automount expiration. E.g.:

May 18 02:26:50 mm2dev15 kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
May 18 02:26:50 mm2dev15 automount[27846]: expired /usr/prod/viewstore118

-or-

May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/phxscfm4
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/labspt
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/land25
May 14 09:51:29 mm2dev11 kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/ldbuomc3
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/fstest
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/ftwbtsrq

Further, we have turned off automount expirations on several systems (via '--timeout 0'), and these error messages appear to go away when we do so. (We are experiencing unexplained system instability on our RHEL 3.0-U1 systems and are trying to track down and resolve any unexpected error messages that may be contributing to this instability.)

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Configure and turn on the automounter
2. Automount file systems
3. Wait for those automounts to expire; sometimes this error message will accompany it

Additional info:
The following may be a useful/related reference:
http://groups.google.com/groups?q=%22Previously+anonymous+dentries+were+hashed%22&hl=en&lr=&ie=UTF-8&selm=nfs-valinux.15186.9973.696326.885764%40notabene.cse.unsw.edu.au&rnum=1
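For context, a minimal automounter configuration of the sort described in the steps above might look as follows. The map names, mount points, and server are hypothetical examples, not taken from the affected systems; a short timeout simply makes expirations (and therefore the window for this message) frequent:

```shell
# /etc/auto.master -- hypothetical indirect map with a short timeout,
# so mounts expire often (expirations are what precede the "Busy
# inodes" message); '--timeout 0' disables expiry entirely, which is
# the workaround described above:
#
#   /usr/test   /etc/auto.test  --timeout 60
#
# /etc/auto.test -- one entry per NFS-backed directory (hypothetical):
#
#   fstest      server1:/export/fstest
#
# Activate the maps and watch the logs for the VFS message:
service autofs reload
tail -f /var/log/messages
```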
What version of the automounter are you using? Could you also send the output from lsmod? Thanks.
(Note: I'm not entirely sure whether this is really an autofs issue; I have a feeling that it's more accurately a kernel issue that's just showing up due to the way that autofs works.)

% rpm -q autofs
autofs-3.1.7-41
% uname -r
2.4.21-9.ELsmp
% lsmod
Module                  Size  Used by    Tainted: PF
nfs                    96880  21 (autoclean)
lockd                  60624   1 (autoclean) [nfs]
mvfs                  309024 108
vnode                  76692 108 [mvfs]
sunrpc                 91996   1 [nfs lockd vnode]
autofs                 13780   7 (autoclean)
tg3                    57800   1
floppy                 59056   0 (autoclean)
sg                     38060   0 (autoclean)
microcode               5248   0 (autoclean)
keybdev                 2976   0 (unused)
mousedev                5688   0 (unused)
hid                    22404   0 (unused)
input                   6208   0 [keybdev mousedev hid]
usb-ohci               23688   0 (unused)
usbcore                83168   1 [hid usb-ohci]
ext3                   92360   2
jbd                    57016   2 [ext3]
mptscsih               42288   3
mptbase                44736   3 [mptscsih]
sd_mod                 13744   6
scsi_mod              117800   3 [sg mptscsih sd_mod]
Note: We are seeing this problem on both RHEL 3.0 Update 1 and Update 2 systems.
A fix for this problem has just been committed to the RHEL3 U3 patch pool this evening (in kernel version 2.4.21-15.11.EL).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-433.html
This has been seen again on the AS3 15-9 kernel. I will attach the oops text from the ticket. This is from issue #55076. Here's the description from the ticket:

There are some cases of kernel panic following "Busy inodes after unmount" in different functions, like destroy_inode, ext3_get_inode_loc, clear_inode, etc. Checking the oops data, all of them are due to a corrupted inode data structure; I think the root cause is that the superblock is freed even though there are still busy inodes hanging around. So the fix can be:

1. Have kill_super wait for busy inodes; but that may cause it to wait forever, so those busy inodes would have to be gotten rid of first. Yet, due to service failures or bugs or even hardware problems, the busy inodes may be there forever without outside help.

2. Fix the offending functions, but the locking and checking may cause performance problems for normal operations. And there can be quite a few potentially offending functions.

This issue is present in both the AS2.1 and AS3.x kernel series. Recently, it has been triggered a bit more frequently on AS3 hosts.
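As an aside on option 1 above, the processes pinning a mount's inodes can usually be identified from userspace before attempting the unmount. A rough sketch (the mount point below is a placeholder, not one of the affected paths):

```shell
# Show which processes have files open on the mount (placeholder path);
# these are the "busy inodes" an unmount would leave behind
fuser -vm /usr/test/fstest

# lsof lists the specific open files on that filesystem
lsof /usr/test/fstest
```

Note that this only helps when the busy reference comes from a live process; the race discussed in this bug can leave inodes busy with no process left to kill.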
Created attachment 107401 [details] oops output
Steve,

The fix was committed to 15.11. The fix also requires the autofs4 kernel module to be in use. You can add the following to /etc/modules.conf:

alias autofs autofs4

If at all possible, I would advise moving to an updated autofs package as well. Please let me know if the problem is reproducible on the kernel reported to fix the problem.
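Concretely, applying that alias might look like the following on a RHEL3-style 2.4 system. This is a sketch: it assumes the stock autofs init script, and you should verify the module configuration path on your own setup:

```shell
# Map the generic autofs module name onto autofs4
# (2.4-era module configuration lives in /etc/modules.conf)
echo "alias autofs autofs4" >> /etc/modules.conf

# Restart the automounter and confirm autofs4 is the module in use
service autofs restart
lsmod | grep autofs
```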
Created attachment 108737 [details]
Script that always reproduces the "Busy inodes" message and often reproduces the subsequent crash

To reproduce the bug with this script, customise the following variables according to the comments in the script itself:

NFSDEVICE1=server1:/path1
NFSDEVICE2=server2:/path2
NFSMOUNTPOINT1=/tmp/nfs1
NFSMOUNTPOINT2=/tmp/nfs2
NFSTARGET=$NFSMOUNTPOINT1/filename1
NFSLINK=$NFSMOUNTPOINT2/filename2

You will need a client to run this script on and two NFS servers to mount the shares from. Only the client will have problems and eventually crash; it is safe to use two production NFS servers.
The reproducer does not require autofs; autofs simply triggers the bug with higher probability because it auto-unmounts when it thinks that no processes are accessing the mounted file system.
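The race itself can be sketched without autofs. The following is a rough, hypothetical sketch of the kind of sequence the attached reproducer drives (the real script is attachment 108737; the server name, paths, and loop count here are invented). It defaults to a dry run that only prints the commands, so it is safe to inspect; set DRY_RUN=0 on a disposable client with real NFS mounts to actually exercise the race:

```shell
#!/bin/sh
# Hypothetical sketch of the reproducer's race: a symlink on one NFS
# mount points at a file on a second NFS mount, and the first mount is
# unmounted/remounted while the link is being followed. Defaults to a
# dry run (DRY_RUN=1) that only echoes the commands.
DRY_RUN=${DRY_RUN:-1}

NFSDEVICE1=${NFSDEVICE1:-server1:/path1}
NFSMOUNTPOINT1=${NFSMOUNTPOINT1:-/tmp/nfs1}
NFSMOUNTPOINT2=${NFSMOUNTPOINT2:-/tmp/nfs2}
NFSTARGET=$NFSMOUNTPOINT1/filename1
NFSLINK=$NFSMOUNTPOINT2/filename2

run() {
    # Print the command in dry-run mode, execute it otherwise
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

run ln -sf "$NFSTARGET" "$NFSLINK"
for i in 1 2 3; do
    run cat "$NFSLINK"               # follow the cross-mount symlink...
    run umount "$NFSMOUNTPOINT1"     # ...while its target filesystem goes away
    run mount "$NFSDEVICE1" "$NFSMOUNTPOINT1"
done
run dmesg                            # check for the "Busy inodes" message
```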
Created attachment 110974 [details] Netdump log from autofs 4 crash
Just FYI, we saw this error two days ago on one of our file servers, during the backup routine. The server kernel panicked with the usual:

VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

I've obtained the oops message and a memory dump of the kernel via netdump. I'll attach the oops message, but not the kernel memory dump, as it's a whopping 1 GB in size. If you need it, however, I should be able to provide it to you.

The system information is:

[root@atlantis root]# uname -a
Linux atlantis 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux
[root@atlantis root]# lsmod
Module                  Size  Used by    Not tainted
ide-cd                 34016   0 (autoclean)
cdrom                  32896   0 (autoclean) [ide-cd]
nfs                   100564  13 (autoclean)
nfsd                   86160  32 (autoclean)
lockd                  59600   1 (autoclean) [nfs nfsd]
sunrpc                 89244   1 (autoclean) [nfs nfsd lockd]
netconsole             16332   0 (unused)
autofs4                16984   4 (autoclean)
e1000                  77884   1
ipt_REJECT              4632   2 (autoclean)
ipt_state               1080   0 (autoclean)
ip_conntrack           29800   1 (autoclean) [ipt_state]
iptable_filter          2412   1 (autoclean)
ip_tables              16544   3 [ipt_REJECT ipt_state iptable_filter]
floppy                 57552   0 (autoclean)
sg                     37388   0 (autoclean)
microcode               6912   0 (autoclean)
ext3                   89992   9
jbd                    55092   9 [ext3]
dpt_i2o                30144   4
aic7xxx               163120   6
diskdumplib             5260   0 [dpt_i2o aic7xxx]
sd_mod                 13936  20
scsi_mod              115240   4 [sg dpt_i2o aic7xxx sd_mod]
[root@atlantis root]# rpm -q autofs
autofs-4.1.3-47
[root@atlantis root]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 3 (Taroon Update 4)

The server is fully up to date with all the available updates for Red Hat Advanced Server 3.
Kim B. Nielsen, Please make the vmcore image available. FTP would be ideal. Regards, Chris
Sure thing. I'll mail the login information to Chris vanhoof. Unfortunately, I'm only able to offer an HTTP download at this time.

MD5SUM of vmcore image:
f2c4ae2f4969c7fb8724d8cc944cb505  vmcore

Regards, Kim
Created attachment 111252 [details] Updated version of reproducer script Here is an updated version of the reproducer script that automatically brings the server down and then up. I was able to reproduce the oops every time with this script.
Created attachment 111253 [details]
A proposed patch

Here is a proposed patch that stops the oops from occurring. Unfortunately it does not directly address the race condition that is causing the oops, which appears to be more of a VFS issue than an NFS one and will (probably) need to be addressed at that layer.
What's the plan, if any, to incorporate this patch which is supposed to stop the problem?
We will take up the cause again in the U6 timeframe...
*** Bug 132322 has been marked as a duplicate of this bug. ***
*** Bug 143542 has been marked as a duplicate of this bug. ***
We are seeing this at Cisco too, running the 2.4.21-27.0.1.ELsmp kernel. I haven't been able to get a netdump yet because most of the hosts are running the x86_64 kernel. I have console logging on twenty hosts, and soon hope to have 100 or more. I'll update this bug if/when I get an oops trace or netdump.
Created attachment 112402 [details] Upstream patch to avoid follow_link()/unmount race (Greg Banks at SGI)
Created attachment 112403 [details] Cisco: Kernel oops subsequent to "Have a Nice Day" This is a console log showing both the VFS message and a subsequent kernel oops. I'm seeing the VFS message on most of the clients I'm monitoring. The system is RHEL3/U3 with the U4 kernel.
I have a customer who is experiencing this issue. Is there any suggested fix for this problem? Is this fix present in the beta kernel in RHN? Thank you.
A test kernel with the patch from comment #41 is available at http://people.redhat.com/dhoward/bz124600/ Could folks who are experiencing this problem test and provide feedback on this kernel?
Don, thanks for creating the kernel. I currently have 9 systems running it that see this bug fairly often. I want to give it ~2 weeks to see if I run into this issue or not. My company has well over 1000 systems affected by this bug, so it will be vital to us (and others from the looks of this thread) to include this patch in the next RHEL3 update if it ends up solving the problem.
Don posted the patch in comment #41 for internal review on 28-Mar-2005, and this patch is on track for inclusion within the first couple of U6 builds.
A fix for this problem has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-32.EL).
Excellent news on the inclusion in U5! If I want to load a test kernel with this patch, should I use the one referenced in this bug? Or is there a more recent one I should load in preference? Thanks,
I note also that the test kernels referenced here are all i686. Any chance we could get one for X86_64/SMP? (The majority of my problem systems are Opteron based, running the 64bit kernel.) Or could you point us at the SRPM so we could build our own? Thanks again.
Howard, the -32.EL kernel is available on the Red Hat Network in the beta channel for your architecture.
I've loaded the -32.EL kernel on six of my most problematic systems. These are all dual-Opteron boxes running the x86_64 kernel. (Half are Sun v20zs; the other half are from Rackable Systems.) I'm able to monitor the console on five of the boxes. Of those, four have stopped showing the "Busy inodes after unmount" message. The fifth, a Sun, is spewing the message almost continuously; however, it hasn't locked up as yet. The systems whose consoles are quiet now were seeing the busy inodes message quite a bit, so I consider this an improvement. I'm not sure what the difference in workload is between the quiet boxes and the noisy one. Other than that, they should be quite similar, all running hardware simulations in batch mode using ClearCase 6.
I now have a second Opteron system (out of five in test) running the test kernel and showing the "VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day..." message. Still no crashes.
Hi Howard,

Do these 5 machines use autofs? Are you using any third-party filesystems (and if so, which ones)? The fix in U5 is specifically for symlinks on NFS-mounted filesystems.
Yes and yes. The autofs is autofs4, and the filesystem is MVFS (ClearCase 6), running on top of NFS. Should I open a separate bug? This one seems to be where you are consolidating effort.
Howard - Would it be possible for you to collect a netdump for me to examine when your machine is reporting the busy inodes message? I've received a similar report where mvfs is in the mix, and I'd like to look for similarities.
They are all running the 64bit kernel, which doesn't have netdump support AFAIK.
netdump is supported on x86_64 as of RHEL3 U5. Any chance you can give it a try?
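For anyone setting this up, netdump client configuration on RHEL3 is roughly as follows. This is a sketch: the server address is a placeholder, and a reachable host running netdump-server is assumed:

```shell
# /etc/sysconfig/netdump on the client -- point at the netdump server
# (10.0.0.1 is a placeholder address):
#
#   NETDUMPADDR=10.0.0.1

# Exchange keys with the server, then enable and start the client side
service netdump propagate
chkconfig netdump on
service netdump start
```

Once a crash occurs, the oops text and vmcore should land on the server side for analysis.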
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html
I am seeing the same problem while running 2.4.21-32.0.1.EL. The system has ClearCase and the automounter with a 600-second timeout.

10.105.171.33-2005-09-28-08:08/log
::::::::::::::
VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
Can that be reproduced without ClearCase?
I don't know how to reproduce it at all; it just appears once in a while on many systems. However, I wasn't able to find an affected system without ClearCase. Regards, Roman
I tried using "reproducer.sh" to reproduce it. It worked on the machine that has ClearCase installed, but the kernel is 2.4.9-e.59 (AS2.1 - and I was told it is fixed in e.65). I used the same script on 2.4.21-32.0.1 but was unable to reproduce it. Maybe there's a different set of commands needed. I'll just keep watching it.
Tested and verified: the issue no longer exists in the RHEL 4.5 beta.