Bug 124600 - Unexpected error: VFS: Busy inodes after unmount. Self-destruct in 5 seconds.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386 Linux
Priority: high
Severity: medium
Assigned To: Don Howard
QA Contact: Brian Brock
Reported: 2004-05-27 16:33 EDT by Paul Waterman
Modified: 2008-04-13 23:05 EDT
Doc Type: Bug Fix
Last Closed: 2005-08-05 16:49:08 EDT
Attachments
oops output (9.25 KB, text/plain)
2004-11-24 11:28 EST, Steve Conklin
Script that always reproduces the "Busy inodes" message and often reproduces the subsequent crash (1.43 KB, text/plain)
2004-12-16 14:19 EST, Pancrazio `ezio' de Mauro
Netdump log from autofs 4 crash (4.61 KB, text/plain)
2005-02-11 08:27 EST, Kim B. Nielsen
Updated version of reproducer script (1.70 KB, text/plain)
2005-02-21 07:20 EST, Steve Dickson
A proposed patch (454 bytes, patch)
2005-02-21 07:35 EST, Steve Dickson
Upstream patch to avoid follow_link()/unmount race (Greg Banks at SGI) (1.28 KB, patch)
2005-03-28 18:42 EST, Don Howard
Cisco: Kernel oops subsequent to "Have a Nice Day" (4.73 KB, text/plain)
2005-03-28 18:58 EST, Howard Owen

Description Paul Waterman 2004-05-27 16:33:37 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5)
Gecko/20031007

Description of problem:
We are currently seeing the following error messages appearing in the
RHEL 3.0 system logs (/var/log/messages):

kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds. 
Have a nice day...

These error messages appear to be related to automount unmounts; they
appear to always be immediately preceded or followed by an automount
expiration. E.g.:

May 18 02:26:50 mm2dev15 kernel: VFS: Busy inodes after unmount.
Self-destruct in 5 seconds.  Have a nice day...
May 18 02:26:50 mm2dev15 automount[27846]: expired /usr/prod/viewstore118

-or-

May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/phxscfm4
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/labspt
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/land25
May 14 09:51:29 mm2dev11 kernel: VFS: Busy inodes after unmount.
Self-destruct in 5 seconds.  Have a nice day...
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/ldbuomc3
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/fstest
May 14 09:51:29 mm2dev11 automount[12358]: expired /usr/test/ftwbtsrq

Further, we have turned off automount expirations on several systems
(via '--timeout 0') and these error messages appear to go away when we
do so.
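
The correlation described above can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes standard syslog lines (timestamp in fields 1-3, hostname in field 4, each message on a single line), and the function name busy_inode_report is hypothetical. For every "Busy inodes" kernel line it reports how many automount "expired" lines share the same second and host.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): correlate "Busy inodes"
# kernel messages with automount expirations logged in the same second.
busy_inode_report() {
    logfile=$1
    awk '
        # Fields 1-3 are the syslog timestamp, field 4 is the hostname.
        { key = $1 " " $2 " " $3 " " $4 }
        /automount\[[0-9]+\]: expired/   { expired[key]++ }
        /VFS: Busy inodes after unmount/ { busy[key]++ }
        END {
            # Iteration order of "for (k in busy)" is unspecified in awk.
            for (k in busy)
                printf "%s busy=%d expired_same_second=%d\n", k, busy[k], expired[k]
        }
    ' "$logfile"
}
```

Typical use would be `busy_inode_report /var/log/messages`. A message with expired_same_second=0 may still follow an expiration logged one second earlier or later, so this only gives a rough indication.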

(We are experiencing unexplained system instability in our RHEL 3.0-U1
systems and are trying to track down and resolve any unexpected error
messages that may be contributing to this instability.)

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Configure and turn on the automounter
2. Automount file systems
3. Wait for those automounts to expire; sometimes this error message
will accompany it
    

Additional info:

The following may be a useful/related reference:

http://groups.google.com/groups?q=%22Previously+anonymous+dentries+were+hashed%22&hl=en&lr=&ie=UTF-8&selm=nfs-valinux.15186.9973.696326.885764%40notabene.cse.unsw.edu.au&rnum=1
Comment 1 Jeffrey Moyer 2004-05-27 16:39:20 EDT
What version of the automounter are you using?  Could you also send
the output from lsmod?

Thanks.
Comment 2 Paul Waterman 2004-05-27 16:45:43 EDT
(Note: I'm not entirely sure whether this is really an autofs issue; I
have a feeling that it's more accurately a kernel issue that's just
showing up due to the way that autofs works.)

% rpm -q autofs
autofs-3.1.7-41

% uname -r
2.4.21-9.ELsmp

% lsmod
Module                  Size  Used by    Tainted: PF 
nfs                    96880  21 (autoclean)
lockd                  60624   1 (autoclean) [nfs]
mvfs                  309024 108
vnode                  76692 108 [mvfs]
sunrpc                 91996   1 [nfs lockd vnode]
autofs                 13780   7 (autoclean)
tg3                    57800   1
floppy                 59056   0 (autoclean)
sg                     38060   0 (autoclean)
microcode               5248   0 (autoclean)
keybdev                 2976   0 (unused)
mousedev                5688   0 (unused)
hid                    22404   0 (unused)
input                   6208   0 [keybdev mousedev hid]
usb-ohci               23688   0 (unused)
usbcore                83168   1 [hid usb-ohci]
ext3                   92360   2
jbd                    57016   2 [ext3]
mptscsih               42288   3
mptbase                44736   3 [mptscsih]
sd_mod                 13744   6
scsi_mod              117800   3 [sg mptscsih sd_mod]
Comment 3 Paul Waterman 2004-06-14 19:15:55 EDT
Note: We are seeing this problem on both RHEL 3.0 Update 1 and Update
2 systems.
Comment 4 Ernie Petrides 2004-06-15 05:54:51 EDT
A fix for this problem has just been committed to the RHEL3 U3
patch pool this evening (in kernel version 2.4.21-15.11.EL).
Comment 5 John Flanagan 2004-09-02 00:31:41 EDT
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html
Comment 6 Steve Conklin 2004-11-24 11:27:10 EST
This has been seen again on the AS3 15-9 kernel. I will attach the
oops text from the ticket. This is from issue #55076.

Here's the description from the ticket:

There are some cases of kernel panic following "Busy inodes after
unmount" in different functions, such as destroy_inode,
ext3_get_inode_loc, clear_inode, etc.

Checking the oops data, all of them are due to a corrupted inode data
structure; I think the root cause is that the superblock is freed even
if there are still busy inodes hanging around.

So the fix can be:

1. Let kill_super wait for busy inodes; but that may cause it to wait
forever, so those busy inodes have to be gotten rid of first. Yet, due to
service failures, bugs, or even HW problems, the busy inodes may stay
there forever without outside help.
2. Fix the offending functions; but the added locking and checking may
cause performance problems for normal operations, and there can be quite
a few potentially offending functions.

This issue is present in the AS2.1 series and AS3.x kernels. Recently,
it has been triggered a bit more frequently on AS3 hosts.
Comment 7 Steve Conklin 2004-11-24 11:28:12 EST
Created attachment 107401
oops output
Comment 9 Jeffrey Moyer 2004-11-29 13:12:11 EST
Steve,

The fix was committed to 15.11.  The fix also requires the autofs4 kernel module
to be in use.  You can add the following to modules.conf:

alias autofs autofs4

If it is at all possible, I would advise moving to an updated autofs package, as
well.

Please let me know if the problem is reproducible on the kernel reported to fix
the problem.
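
To confirm which autofs kernel module a client actually has loaded, something like the following could be used. This is a minimal sketch: the function name autofs_module_in_use is hypothetical, and it assumes the lsmod column layout shown in comment 2 (module name in the first column).

```shell
#!/bin/sh
# Hypothetical helper: read lsmod output on stdin and print which autofs
# kernel module is loaded ("autofs4", "autofs", or "none").
autofs_module_in_use() {
    awk '
        $1 == "autofs4" { v4 = 1 }
        $1 == "autofs"  { v3 = 1 }
        END {
            if (v4)      print "autofs4"
            else if (v3) print "autofs"
            else         print "none"
        }
    '
}
```

Usage would be `lsmod | autofs_module_in_use`. An alias change in the module configuration only takes effect the next time the module is loaded, so the check is best run after the automounter has been restarted.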
Comment 12 Pancrazio `ezio' de Mauro 2004-12-16 14:19:03 EST
Created attachment 108737
Script that always reproduces the "Busy inodes" message and often reproduces the subsequent crash

To reproduce the bug with this script, customise the following variables
according to the comments in the script itself:

NFSDEVICE1=server1:/path1
NFSDEVICE2=server2:/path2
NFSMOUNTPOINT1=/tmp/nfs1
NFSMOUNTPOINT2=/tmp/nfs2
NFSTARGET=$NFSMOUNTPOINT1/filename1
NFSLINK=$NFSMOUNTPOINT2/filename2

You will need a client to run this script on and two NFS servers to mount the
shares from. Only the client will have problems and eventually crash, so it is
safe to use two production NFS servers.
Comment 13 Pancrazio `ezio' de Mauro 2004-12-16 14:24:46 EST
The reproducer does not require autofs; autofs simply triggers the bug
more often because it auto-unmounts when it thinks that no
processes are accessing the mounted file system.
Comment 23 Kim B. Nielsen 2005-02-11 08:27:59 EST
Created attachment 110974
Netdump log from autofs 4 crash
Comment 24 Kim B. Nielsen 2005-02-11 08:29:57 EST
Just FYI, we saw this error two days ago on one of our file
servers, during the backup routine. The server panicked
with the usual:

VFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a
nice day...

I've obtained the oops message and a memory dump of the kernel via
netdump. I'll attach the oops message, but not the kernel memory dump,
as it's a whopping 1 gig in size. If you need it, however, I should be
able to provide it to you

The system information is:

[root@atlantis root]# uname -a
Linux atlantis 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005
i686 i686 i386 GNU/Linux
[root@atlantis root]# lsmod
Module                  Size  Used by    Not tainted
ide-cd                 34016   0  (autoclean)
cdrom                  32896   0  (autoclean) [ide-cd]
nfs                   100564  13  (autoclean)
nfsd                   86160  32  (autoclean)
lockd                  59600   1  (autoclean) [nfs nfsd]
sunrpc                 89244   1  (autoclean) [nfs nfsd lockd]
netconsole             16332   0  (unused)
autofs4                16984   4  (autoclean)
e1000                  77884   1
ipt_REJECT              4632   2  (autoclean)
ipt_state               1080   0  (autoclean)
ip_conntrack           29800   1  (autoclean) [ipt_state]
iptable_filter          2412   1  (autoclean)
ip_tables              16544   3  [ipt_REJECT ipt_state iptable_filter]
floppy                 57552   0  (autoclean)
sg                     37388   0  (autoclean)
microcode               6912   0  (autoclean)
ext3                   89992   9
jbd                    55092   9  [ext3]
dpt_i2o                30144   4
aic7xxx               163120   6
diskdumplib             5260   0  [dpt_i2o aic7xxx]
sd_mod                 13936  20
scsi_mod              115240   4  [sg dpt_i2o aic7xxx sd_mod]
[root@atlantis root]#
[root@atlantis root]# rpm -q autofs
autofs-4.1.3-47
[root@atlantis root]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 3 (Taroon Update 4)

The server is fully updated with all the available updates for RedHat
Advanced Server 3
Comment 26 Chris Van Hoof 2005-02-18 11:42:32 EST
Kim B. Nielsen,
Please make the vmcore image available.  FTP would be ideal.  

Regards,
Chris
Comment 28 Kim B. Nielsen 2005-02-19 11:16:36 EST
Sure thing. I'll mail the login information to Chris Van Hoof.

Unfortunately, I'm only able to offer an HTTP download at this time.

MD5SUM of vmcore image:
f2c4ae2f4969c7fb8724d8cc944cb505  vmcore

Regards,
Kim
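
For anyone downloading the image, the posted checksum can be verified before loading the dump into an analysis tool. This is a minimal sketch assuming GNU md5sum is available; the function name verify_md5 is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file's MD5 digest matches the
# expected hex string.
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}
```

For the image above this would be `verify_md5 vmcore f2c4ae2f4969c7fb8724d8cc944cb505 && echo "vmcore OK"`. A mismatch most likely means a truncated or corrupted download.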
Comment 29 Steve Dickson 2005-02-21 07:20:17 EST
Created attachment 111252
Updated version of reproducer script

Here is an updated version of the reproducer script that
automatically brings the server down and then up.
I was able to reproduce the oops every time with this script.
Comment 30 Steve Dickson 2005-02-21 07:35:27 EST
Created attachment 111253
A proposed patch

Here is a proposed patch that stops the oops from occurring.
Unfortunately it does not directly address the race condition
that is causing the oops, which appears to be more of a VFS
issue than an NFS one and will (probably) need to be addressed
at that layer.
Comment 34 Eric Hagberg 2005-03-09 17:38:40 EST
What's the plan, if any, to incorporate this patch which is supposed
to stop the problem?
Comment 35 Steve Dickson 2005-03-09 20:28:11 EST
We will take up the cause again in the U6 timeframe...
Comment 36 Ernie Petrides 2005-03-11 18:42:50 EST
*** Bug 132322 has been marked as a duplicate of this bug. ***
Comment 37 Ernie Petrides 2005-03-11 18:47:54 EST
*** Bug 143542 has been marked as a duplicate of this bug. ***
Comment 39 Howard Owen 2005-03-25 19:59:41 EST
We are seeing this at Cisco too, running the 2.4.21-27.0.1.ELsmp kernel. I
haven't been able to get a netdump yet because most of the hosts are running the
x86_64 kernel. I have console logging on twenty hosts, and soon hope to have 100
or more. I'll update this bug if/when I get an oops trace or netdump.
Comment 41 Don Howard 2005-03-28 18:42:23 EST
Created attachment 112402
Upstream patch to avoid follow_link()/unmount race (Greg Banks at SGI)
Comment 42 Howard Owen 2005-03-28 18:58:44 EST
Created attachment 112403
Cisco: Kernel oops subsequent to "Have a Nice Day"

This is a console log showing both the VFS message and a subsequent kernel
oops. I'm seeing the VFS message on most of the clients I'm monitoring.

The system is RHEL3/U3 with the U4 kernel.
Comment 43 Josef Bacik 2005-03-29 13:58:01 EST
I have a customer who is experiencing this issue.  Is there any suggested fix
for this problem?  Is this fix present in the beta kernel in RHN?  Thank you.
Comment 44 Don Howard 2005-03-29 14:30:57 EST
A test kernel with the patch from comment #41 is available at 
http://people.redhat.com/dhoward/bz124600/

Could folks who are experiencing this problem test and provide feedback on this
kernel?

Comment 49 Martin Bowers 2005-04-05 17:51:55 EDT
Don, thanks for creating the kernel.  I currently have 9 systems running it that
see this bug fairly often.  I want to give it ~2 weeks to see if I run into this
issue or not.  

My company has well over 1000 systems affected by this bug, so it will be vital
to us (and others from the looks of this thread) to include this patch in the
next RHEL3 update if it ends up solving the problem.
Comment 53 Ernie Petrides 2005-04-12 21:35:19 EDT
Don posted the patch in comment #41 for internal review on 28-Mar-2005, and
this patch is on track for inclusion within the first couple of U6 builds.
Comment 55 Ernie Petrides 2005-04-15 21:10:01 EDT
A fix for this problem has just been committed to the RHEL3 U5
patch pool this evening (in kernel version 2.4.21-32.EL).
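
When triaging many hosts, it can help to flag kernels older than the build named above. The following is a minimal sketch: the function name kernel_has_symlink_fix is hypothetical, and it assumes RHEL 3 release strings of the form 2.4.21-<build>... as printed by uname -r. It only compares build numbers; it does not prove the patch is present, nor that the message cannot have another cause.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a RHEL 3 kernel release string is at
# build 2.4.21-32 or later, 1 if it is older, 2 if the string is not a
# 2.4.21 release this check understands.
kernel_has_symlink_fix() {
    release=$1
    case $release in
        2.4.21-*) ;;
        *) return 2 ;;
    esac
    build=${release#2.4.21-}   # e.g. "27.0.2.ELsmp"
    build=${build%%.*}         # e.g. "27"
    case $build in
        ''|*[!0-9]*) return 2 ;;   # leading build field is not a plain number
    esac
    [ "$build" -ge 32 ]
}
```

A typical call would be `kernel_has_symlink_fix "$(uname -r)" || echo "kernel predates 2.4.21-32.EL"`.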
Comment 56 Howard Owen 2005-04-26 11:49:26 EDT
Excellent news on the inclusion in U5! If I want to load a test kernel with this
patch, should I use the one referenced in this bug? Or is there a more recent
one I should load in preference?

Thanks,
Comment 57 Howard Owen 2005-04-26 12:00:13 EDT
I note also that the test kernels referenced here are all i686. Any chance we
could get one for X86_64/SMP? (The majority of my problem systems are Opteron
based, running the 64bit kernel.) Or could you point us at the SRPM so we could
build our own?

Thanks again.
Comment 58 Bastien Nocera 2005-04-26 12:39:50 EDT
Howard, the -32.EL kernel is available on the Red Hat Network in the beta
channel for your architecture.
Comment 59 Howard Owen 2005-05-02 17:34:45 EDT
I've loaded the -32.EL kernel on six of my most problematic systems. These are
all dual Opteron boxes running the x86_64 kernel. (Half are Sun v20zs; the other
half are from Rackable Systems.) I'm able to monitor the console on five of the
boxes. Of those, four have stopped showing the "Busy inodes after unmount"
message. The fifth, a Sun, is spewing the message almost continuously. However,
it hasn't locked up as yet.

The systems whose consoles are quiet now were seeing the busy inodes message
quite a bit, so I consider this an improvement. I'm not sure what the difference
in workload is between the quiet boxes and the noisy one.  Other than that, they
should be quite similar, all running hardware simulations in batch mode using
Clearcase 6.
Comment 60 Howard Owen 2005-05-04 16:36:47 EDT
I now have a second Opteron system (out of five in test) running the test kernel
and showing the 
"VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day..."
message. Still no crashes.
Comment 61 Don Howard 2005-05-04 16:56:35 EDT
Hi Howard-

Do these 5 machines use autofs?  
Are you using any 3rd party filesystems (and if so, which ones)?

The fix in U5 is specifically for symlinks on NFS-mounted filesystems.
Comment 62 Howard Owen 2005-05-04 17:42:28 EDT
Yes and yes. The autofs is autofs4, and the filesystem is MVFS, Clearcase 6,
running on top of NFS.

Should I open a separate bug? This one seems to be where you are consolidating
effort.
Comment 63 Don Howard 2005-05-05 20:19:21 EDT
Howard -

Would it be possible for you to collect a netdump for me to examine when your
machine is reporting the busy inodes message?

I've received a similar report where mvfs is in the mix and I'd like to look for
similarities.
Comment 64 Howard Owen 2005-05-05 20:46:52 EDT
They are all running the 64bit kernel, which doesn't have netdump support AFAIK.
Comment 65 Don Howard 2005-05-06 13:24:09 EDT
netdump is supported on x86_64 as of RHEL3 U5.  Any chance you can give it a try?
Comment 66 Tim Powers 2005-05-18 09:27:37 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html
Comment 82 Roman Lazarev 2005-09-28 11:43:51 EDT
I am seeing the same problem while running 2.4.21-32.0.1EL. The system has 
ClearCase and the automounter with a 600-second timeout.

10.105.171.33-2005-09-28-08:08/log
::::::::::::::
VFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice day...
Comment 84 Alexander Viro 2005-09-28 12:12:24 EDT
can that be reproduced without clearcase?
Comment 85 Roman Lazarev 2005-09-28 14:11:19 EDT
I don't know how to reproduce it at all, it just appears once in a while on 
many systems. However I wasn't able to find one without ClearCase.

Regards,
Roman
Comment 86 Roman Lazarev 2005-09-28 14:45:44 EDT
I tried using "reproducer.sh" to reproduce it. It worked on the machine that 
has ClearCase installed, but the kernel is 2.4.9-e.59 (AS2.1 - and I was told 
it is fixed in e.65). I used the same script on 2.4.21-32.0.1 but was unable 
to reproduce it. Maybe there's a different set of commands needed. I'll just 
keep watching it.
Comment 89 Wayne Berthiaume 2007-03-21 12:12:52 EDT
Tested and verified that the issue no longer exists in RHEL 4.5 beta.
