Bug 234587

Summary: VFS: Busy inodes at umount on NFSv4 filesystem
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Jeff Layton <jlayton>
Assignee: Jeff Layton <jlayton>
QA Contact: Martin Jenner <mjenner>
CC: aviro, jbaron, lwoodman, mflint, rgautier, sfolkwil, staubach, steved
Doc Type: Bug Fix
Fixed In Version: RHBA-2007-0791
Last Closed: 2007-11-15 16:24:19 UTC
Bug Depends On: 209419, 228292
Attachments:
  patch -- if setattr call errors out, attempt to clean up incomplete open
  patch -- track incomplete opens and clean up in setattr
  patch -- track incomplete opens and clean them up if a setattr errors out
  patch -- track incomplete opens and clean up in setattr
  patch -- have open context keep a reference to the vfsmount

Description Jeff Layton 2007-03-30 12:05:52 UTC
+++ This bug was initially created as a clone of Bug #209419 +++

Description of problem:

Kernel crash, see the log from /var/log/messages below

Oct  5 04:09:56 fieldarcher kernel: ------------[ cut here ]------------
Oct  5 04:09:56 fieldarcher kernel: kernel BUG at fs/nfs/inode.c:1650!
Oct  5 04:09:56 fieldarcher kernel: invalid operand: 0000 [#1]
Oct  5 04:09:56 fieldarcher kernel: SMP 
Oct  5 04:09:56 fieldarcher kernel: Modules linked in: nfs ide_scsi nfsd exportfs lockd nfs_acl md5 ipv6 i2c_dev i2c_core sunrpc ipt_LOG ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac pcnet32 mii floppy ext3 jbd mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Oct  5 04:09:56 fieldarcher kernel: CPU:    0
Oct  5 04:09:56 fieldarcher kernel: EIP:    0060:[<f89ac6d8>]    Not tainted VLI
Oct  5 04:09:56 fieldarcher kernel: EFLAGS: 00010202   (2.6.9-42.0.2.ELsmp) 
Oct  5 04:09:56 fieldarcher kernel: EIP is at nfs4_clear_inode+0x6b/0x90 [nfs]
Oct  5 04:09:56 fieldarcher kernel: eax: 00000001   ebx: ceafbe5c   ecx: 00000000   edx: ceafbec4
Oct  5 04:09:56 fieldarcher kernel: esi: f7daf980   edi: ceafbd1c   ebp: c032de78   esp: f7cd7eb4
Oct  5 04:09:56 fieldarcher kernel: ds: 007b   es: 007b   ss: 0068
Oct  5 04:09:56 fieldarcher kernel: Process kswapd0 (pid: 41, threadinfo=f7cd7000 task=f7d076f0)
Oct  5 04:09:56 fieldarcher kernel: Stack: ceafbe5c f7cd7ee8 0000000e c0170d78
ceafbe5c c0170df5 ede74164 ede7416c 
Oct  5 04:09:56 fieldarcher kernel:        00000054 c017117a 00000080 00000080
00000080 e23b9404 e4e8dbcc 00000000 
Oct  5 04:09:56 fieldarcher kernel:        00000083 00000000 f7ffe9a0 c01711e7
c01497f8 006aef00 00000000 0000002c 
Oct  5 04:09:56 fieldarcher kernel: Call Trace:
Oct  5 04:09:56 fieldarcher kernel:  [<c0170d78>] clear_inode+0xcc/0x102
Oct  5 04:09:56 fieldarcher kernel:  [<c0170df5>] dispose_list+0x47/0x6d
Oct  5 04:09:56 fieldarcher kernel:  [<c017117a>] prune_icache+0x193/0x1ec
Oct  5 04:09:56 fieldarcher kernel:  [<c01711e7>] shrink_icache_memory+0x14/0x2b
Oct  5 04:09:56 fieldarcher kernel:  [<c01497f8>] shrink_slab+0xf8/0x161
Oct  5 04:09:56 fieldarcher kernel:  [<c014aa93>] balance_pgdat+0x1e1/0x30e
Oct  5 04:09:56 fieldarcher kernel:  [<c02d26b1>] schedule+0x86d/0x8db
Oct  5 04:09:56 fieldarcher kernel:  [<c0120420>] prepare_to_wait+0x12/0x4c
Oct  5 04:09:56 fieldarcher kernel:  [<c014ac8a>] kswapd+0xca/0xcc
Oct  5 04:09:56 fieldarcher kernel:  [<c01204f5>] autoremove_wake_function+0x0/0x2d
Oct  5 04:09:56 fieldarcher kernel:  [<c02d46da>] ret_from_fork+0x6/0x14
Oct  5 04:09:56 fieldarcher kernel:  [<c01204f5>] autoremove_wake_function+0x0/0x2d
Oct  5 04:09:56 fieldarcher kernel:  [<c014abc0>] kswapd+0x0/0xcc
Oct  5 04:09:56 fieldarcher kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Oct  5 04:09:56 fieldarcher kernel: Code: 00 ff b3 c4 fe ff ff 05 60 01 00 00 ff
b3 c0 fe ff ff 50 68 e5 84 9c f8 68 e2 98 9c f8 e8 
cf 61 77 c7 83 c4 18 8b 46 60 48 74 08 <0f> 0b 72 06 74 94 9c f8 0f b7 56 5c 89
f0 e8 cd 7c 01 00 8d 87 
Oct  5 04:09:56 fieldarcher kernel:  <0>Fatal exception: panic in 5 seconds

Version-Release number of selected component (if applicable):


How reproducible:

The machine has an NFSv4-over-TCP mount from a NetApp filer. The kernel has
crashed twice with the same message. After the machine is rebooted, the problem
recurs after approximately a week.

Steps to Reproduce:
1. 
2.
3.
  
Actual results:


Expected results:


Additional info:

-- Additional comment from staubach on 2006-12-08 09:07 EST --
I don't suppose that you have any idea what the machine was doing
when it crashed these two times, do you?

-- Additional comment from rene on 2006-12-08 09:21 EST --
The machine is running in VMware Server; it runs mainly Apache with PHP. The two
crashes were not at the same time of day. We switched to NFSv3, and the server
has been running stably since.

-- Additional comment from staubach on 2006-12-08 09:35 EST --
Well, I wouldn't know how to approach this without more information,
so I will close it.  If it reoccurs and there is more information for
me to use, then please reopen this bz and I will look at it again then.

-- Additional comment from tao on 2007-03-26 08:57 EST --
What happened was that the kernel hit this BUG() in nfs4_clear_inode:

     BUG_ON(atomic_read(&state->count) != 1);

...this corresponds to this section of code:

0xf96486d2 <nfs4_clear_inode+101>:      mov    0x60(%esi),%eax
0xf96486d5 <nfs4_clear_inode+104>:      dec    %eax
0xf96486d6 <nfs4_clear_inode+105>:      je     0xf96486e0
0xf96486d8 <nfs4_clear_inode+107>:      ud2a                    <<<< CRASH HERE

...so we load %eax with the contents of state->count, then decrement it and
compare to 0. The count at the time was therefore 3 (since %eax is 2). %esi
holds the nfs4_state address, and it bears this out:

  nreaders = 3, 
  nwriters = 0, 
  state = 1, 
  count = {
    counter = 3
  }

The inode associated with this nfs4_state has a 0 reference count,
however...

  i_count = {
    counter = 0
  }, 

...there are no remaining dentries attached to it, so we can't tell any of
its filenames. The mode is:

  i_mode = 33279 = 0100777

...a regular file but with 777 perms. The inode appears to be freeable and
unused (from its reference counts), but the state owner lists 3 processes
as still having a reading reference count against it. crash also confirms
that the inode is being held open by no processes:

crash> fuser 0xe53953fc
No users of e53953fc found

Essentially, this reference count gets increased when the inode is opened
for reading, and decremented on close (it's a bit more complex than that,
but that's the basic idea). So we're getting too many increments or not
enough decrements (or both). Are they able to reproduce this or (even
better) describe to us how to reproduce it?
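For context, the check that fires sits in nfs4_clear_inode, which runs when the
inode is finally torn down. A minimal sketch of that check follows; the loop over
the inode's open states and the open_states/inode_states field names are
assumptions based on the 2.6.9-era NFSv4 code, not a quote of the RHEL4 source:

static void nfs4_clear_inode(struct inode *inode)
{
        struct nfs_inode *nfsi = NFS_I(inode);

        /* by this point, every open state hanging off the inode should
         * be down to its final reference */
        while (!list_empty(&nfsi->open_states)) {
                struct nfs4_state *state;

                state = list_entry(nfsi->open_states.next,
                                   struct nfs4_state, inode_states);
                /* an open that never got its matching close leaves
                 * count > 1 here, and we panic */
                BUG_ON(atomic_read(&state->count) != 1);
                nfs4_put_open_state(state);
        }
}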

I see this as well running on cpu3 at the time:

crash> bt
PID: 585    TASK: f74699b0  CPU: 3   COMMAND: "vethrpc"
 #0 [f74d6f18] schedule at c02d268d
 #1 [f74d6f7c] veth_rpc_thread at f8e91d6a
 #2 [f74d6ff0] kernel_thread_helper at c01041f3

...I'm presuming that this has nothing to do with ONC-RPC (which NFS is
layered upon), but it would be good to have Egenera confirm this.



This event sent from IssueTracker by jlayton 
 issue 117012

-- Additional comment from jlayton on 2007-03-26 09:07 EST --
Reopening bug...Egenera has gotten a couple of these crashes and we have a core
for it. Unfortunately, it doesn't tell us much about what happened leading up to
the event.

Peter, I'll go ahead and grab this since I've been doing a bit of nfs4 work
anyway. Let me know if you'd rather work it, though...


-- Additional comment from rene on 2007-03-26 09:10 EST --
I moved from nfs4 to nfs3, and the problem hasn't occurred again.

-- Additional comment from jlayton on 2007-03-26 09:18 EST --
Thanks Rene. We've had another person report this and I'm hoping they have a
reproducer, so I'm reopening this bug for that reason.

-- Additional comment from jlayton on 2007-03-27 08:14 EST --
I've been chasing a not too dissimilar race in bug 228292. Now that I have a
patch for that deadlock, it seems to have brought out at least one case of a
crash similar to this one, and a "VFS: Busy inodes after umount" problem as
well. My guess is that this is not due to the patch itself, but rather that
clearing the deadlock there has made the race occur with more frequency.

I'm going to keep pursuing that case for now, with the hope that I'll be able to
track down the race that's occurring here as well.


-- Additional comment from jlayton on 2007-03-27 14:14 EST --
This is the post I just made in bug 228292. My thinking is that the patch for
that BZ just happens to make this bug occur more frequently, but is not
responsible for it. I'll continue work on this race here. Note that I've seen
what appears to be a number of reference-count-related races with that
reproducer, and not all of them are alike. Still, I think they may have a common
root...

I think I'm starting to see the problem and it is related to signal handling. In
the most common case that I've seen it appears that open_namei ends up returning
-ERESTARTSYS, likely due to the fact that I'm killing the process.

I think that the way this is supposed to work is that the lookup ends up calling
nfs4_do_open, and then later __dentry_open is supposed to initiate a sequence of
calls that does a nfs4_proc_file_open. That does a nfs4_close_state and would
likely clean up the i_count, etc.

Because open_namei exits with an error, however, this is not occurring, and that's
causing this particular reference count leak. That said, I'm thinking this is a
different bug and one that exists with or without the workqueue patch. I'll
continue work on that problem in bug 209419.

It's not yet clear to me what the correct solution for this problem is...


-- Additional comment from jlayton on 2007-03-27 17:22 EST --
Replicating the post I just made to bug 228292:

----------[snip]-----------
Actually, after looking over the code, the "VFS Busy inodes" problem may be
introduced by the first patch in this set (the one to convert nfs4_close_state
to do an async call). That patch adds an igrab/iput in nfs4_get/put_open_state.

Essentially, an nfs4 open seems to be a 2 stage process. The first part is
handled by a chain like this:

filp_open
  open_namei
    __lookup_hash
      i_op->lookup (nfs_atomic_lookup)
         nfs4_atomic_open
           nfs4_do_open
              _nfs4_do_open
                   nfs4_get_open_state

nfs4_get_open_state does an igrab, and that bumps i_count by 1. _nfs4_do_open
increments the nreaders and nwriters for the state as well (as appropriate for
the flags it's passed).

The second part of the open has a call chain something like this:

filp_open
  __dentry_open
     f_op->open  (nfs_file_open)
       rpc_ops->file_open (nfs4_proc_file_open)
          nfs4_close_state
            nfs4_do_close
               nfs4_close_done
                  nfs4_put_open_state

nfs4_put_open_state does an iput and decrements the i_count. nfs4_close_done
decrements the nreaders/nwriters as appropriate as well.

I believe the problem is that if open_namei returns an error to filp_open, then
the second call chain never happens and the i_count and nreaders/nwriters remain
artificially high. I think this problem may explain bug 209419 as well.

When we send a signal to these processes (most likely, ones in the middle of an
open() call), we sometimes make open_namei return -ERESTARTSYS.

I still need to do a bit of investigation to see where the ERESTARTSYS is
ultimately coming from. This seems like a sticky problem, however. It seems like
we need a hook to do an nfs4_close_state if open_namei is going to return an
error after doing its lookup, but we don't have anything like that. We may have
to do something to force it in the dput, or find a way to make sure that we
don't return -ERESTARTSYS at all.
----------[snip]-----------
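To make the pairing concrete, here is a simplified model of the two stages and of
how an early error leaks references. The function names and structure here are
illustrative only (they are not the RHEL4 source); the igrab/iput pair is the one
added by the bug 209419 patch:

/* stage 1: done from the lookup (nfs_atomic_lookup -> ... -> nfs4_get_open_state) */
static void open_stage1(struct inode *inode, struct nfs4_state *state, int mode)
{
        igrab(inode);                   /* i_count++ (from the 209419 patch) */
        atomic_inc(&state->count);
        if (mode & FMODE_READ)
                state->nreaders++;
}

/* stage 2: done from __dentry_open -> nfs4_proc_file_open -> nfs4_close_state */
static void open_stage2(struct inode *inode, struct nfs4_state *state, int mode)
{
        if (mode & FMODE_READ)
                state->nreaders--;
        atomic_dec(&state->count);
        iput(inode);                    /* i_count-- */
}

/*
 * filp_open roughly does:
 *
 *      err = open_namei(...);          // runs stage 1 via the lookup
 *      if (err)
 *              return ERR_PTR(err);    // stage 2 never runs -> refs leak
 *      return __dentry_open(...);      // runs stage 2
 *
 * so an -ERESTARTSYS out of open_namei after a successful lookup leaves
 * i_count and nreaders/nwriters bumped forever.
 */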

So, for anyone who has seen this particular crash: could you let us know
definitively whether you had processes doing I/O on the filesystem that may have
caught a signal? That may help confirm whether my speculation about the above
situation is true. In the meantime, I'm doing more investigation to see if I can
tell where the -ERESTARTSYS is ultimately coming from.


-- Additional comment from mflint on 2007-03-28 08:31 EST --
It seems that no I/O was occurring in our case. The customer was running the
automounter, and it was umounting a filesystem at the time. We have the following
messages prior to the panic:

NFS: mount path /auto/home/oracle does not exist!
NFS: suggestion: try mounting '/' instead.
------------[ cut here ]------------
kernel BUG at fs/nfs/inode.c:1650!
invalid operand: 0000 [#1]
SMP 
Modules linked in: iptable_filter ip_tables nfsd exportfs nfs lockd md5 ipv6
parport_pc lp parport autofs4 i2c_dev i2c_core hw_random nfs_acl sunrpc
egenera_nmi(U) egenera_veth(U) sd_mod egenera_vscsi(U) scsi_mod
egenera_vmdump(U) egenera_dumpdev(U) egenera_ipmi(U) egenera_base(U)
egenera_virtual_bus(U) egenera_fs(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
CPU:    0
EIP:    0060:[<f96486d8>]    Tainted: PF     VLI
EFLAGS: 00010202   (2.6.9-42.0.3.ELsmp) 
EIP is at nfs4_clear_inode+0x6b/0x90 [nfs]
eax: 00000002   ebx: e53953fc   ecx: 000001c7   edx: 00000000
esi: f38f9d80   edi: e53952bc   ebp: c032de78   esp: e7a67ee4
ds: 007b   es: 007b   ss: 0068
Process umount (pid: 20118, threadinfo=e7a67000 task=f74683b0)
Stack: e53953fc e7a67f0c 0000002c c0170d78 e53953fc c0170df5 00000000 f00f7400 
       f00f7400 c0170f5e e539569c f1433e64 e7a67000 f00f7400 00000000 f9674a60 
       c0160396 00000011 f028d400 f9674ac0 e7a67000 c0160c13 f00f7400 f9648ea2 
Call Trace:
 [<c0170d78>] clear_inode+0xcc/0x102
 [<c0170df5>] dispose_list+0x47/0x6d
 [<c0170f5e>] invalidate_inodes+0x96/0xae
 [<c0160396>] generic_shutdown_super+0x88/0x150
 [<c0160c13>] kill_anon_super+0x9/0x30
 [<f9648ea2>] nfs4_kill_super+0x16/0x50 [nfs]
 [<c016025c>] deactivate_super+0x5b/0x70
 [<c017389f>] sys_umount+0x65/0x6c
 [<c016314a>] sys_stat64+0xf/0x23
 [<c01507f5>] unmap_vma_list+0xe/0x17
 [<c0150b2f>] do_munmap+0x129/0x137
 [<c01738b1>] sys_oldumount+0xb/0xe
 [<c02d47cb>] syscall_call+0x7/0xb


-- Additional comment from jlayton on 2007-03-29 11:49 EST --
I wasn't so much concerned with I/O happening at the time of the umount (by
definition there wouldn't be any, or you wouldn't be able to umount). My question
was more along the lines of "did any application that was doing I/O to this
filesystem prior to the unmount end up being killed by a signal?" Unfortunately,
it may be hard to tell.

I have a reproducer of sorts for this issue, but it involves a much larger NFSv4
patch that's still in progress. Essentially I run a hefty amount of filesystem
activity via several concurrent threads and then kill off the processes doing it
and unmount. After several attempts at this I end up with a panic that's similar
to the one above. I believe the issue there is that the i_count for the inode is
bad at the time that the iput is done, and we end up trying to delete an inode
that's still being actively used.

There's a later patch that adds an igrab/iput in nfs4_get/put_open_state. That
seems to correct the problems of panicing in nfs4_clear_inode for me, but I
still see a "VFS: Busy inodes at umount" error, and some other panics that are
related to reference counting. So the problem is not completely fixed by that.

There seems to be a reference count leak that's occurring and I believe it's
tied up with signal handling while in the middle of an open call, but tracking
this down is tricky since it's pretty racy.


-- Additional comment from jlayton on 2007-03-29 14:48 EST --
I instrumented a kernel to watch for the case where may_open/open_namei returns
an error and to print the inode address it had been working with at the time, and
also to dump out the inode address that was busy at umount time. In my
reproducer, it appears that may_open returning an error (-ERESTARTSYS in this
case) is causing the problem:

open_namei: returning -512, inode ffffff803baf4a20 left dangling
invalidate_list2[inode_in_use]: busy inode=ffffff803baf4a20 i_count=1
i_state=0x0 i_debug=0x33, i_mode=0100666 i_dentry=empty
VFS: [ffffff8000981400:10] 1 busy inodes after unmount. Self-destruct in 5
seconds.  Have a nice day...

...I've yet to come up with an elegant way to "clean up" in this situation,
however. We really need some way to call down into NFS from the VFS and clean up
the state, but there are no such hooks that are suitable for that...


-- Additional comment from jlayton on 2007-03-30 07:45 EST --
Created an attachment (id=151265)
patch -- increment and decrement i_count when getting and putting open state

I feel pretty confident that this patch will correct this panic. It's a
backported piece of an upstream patch that converts the NFSv4 close call to be
asynchronous:

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=09ca6dde884139015027ea102cbb560de5836eec;hp=cedf3820493f7fefec00da3699507bf7cdfd6544


This piece simply makes sure that we increase the inode reference count when we
do a get on the open state, and decrement it when we put the open state. I
believe this will prevent the panic, since it should prevent the code from ever
calling clear_inode on the inode until the nfs4_state has been put for the last time.

Unfortunately, in the testing I've done, this seems to simply trade this
particular panic for a "VFS: Busy inodes at unmount" problem when the fs is
eventually unmounted. I think that there are still some problems with reference
counting with the nfs4_state, but this does cure the immediate problem and
seems to be the right thing to do.

Comment 1 Jeff Layton 2007-03-30 12:21:51 UTC
The patch for BZ 209419 seems to have traded the nfs4_clear_inode BUG() for a
VFS busy inodes at umount problem. My guess is that these are due to the same
root cause (or causes) -- bad nfs4_state handling in the nfsv4 code. I'm opening
this BZ to track down these issues (having looked over this, I think there's
more than one).


Comment 2 Jeff Layton 2007-03-30 17:30:42 UTC
Created attachment 151296 [details]
patch -- if setattr call errors out, attempt to clean up incomplete open

This patch seems to fix up one of the races here. This is a rambly explanation,
but I hope it will explain it...

The issue here is that nfs4 open calls are done in more than one stage. The
first part of the open call is actually done as part of the lookup (via
nfs_atomic_lookup, which calls nfs4_do_open). This is when the actual open call
goes out on the wire and when the nfs4_get_open_state is done (which does the
igrab, due to the patch for bug 209419). The second part of the open is actually
done as a result of __dentry_open, which "puts" all of the dangling "gets" from
the lookup. The problem there is that open_namei can do a successful lookup
but may still return an error. If this occurs, then the __dentry_open doesn't get
done and these "gets" are left "unput".

If a file is opened with O_CREAT, then open_namei will do the lookup (which is
actually an open call), and then (via a series of intervening functions) will
try to truncate the file. This involves doing a setattr call. If that setattr
call is interrupted (perhaps because the process was killed), then it will pass
an error back to open_namei, and filp_open will never call __dentry_open.

This patch attempts to catch this problem. If a setattr call returns an error,
then we'll check the nfs4 state to see if an nfs4_do_open call was made and
left "dangling". If so, then we do an extra nfs4_close_state with the flags
that were passed to the last open.
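The shape of that error path, as a sketch only (the helper used to find the
dangling open is hypothetical, and the attached patch may structure this
differently):

        /* status is what the SETATTR RPC just returned */
        if (status < 0) {
                /* __dentry_open will never run now, so the open that was done
                 * during the lookup would be left dangling -- close its state
                 * ourselves, with the flags that were passed to that open */
                struct nfs4_state *dangling;

                dangling = nfs4_find_dangling_open(inode);      /* hypothetical helper */
                if (dangling != NULL)
                        nfs4_close_state(dangling, dangling->open_flags);  /* field name assumed */
        }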

I'm not completely thrilled with this fix. There may be situations where it
falls down, but it does seem to fix this problem as far as I can tell and
doesn't involve any VFS layer changes.

A look at the upstream code seems to indicate that this problem may be there as
well. I've not yet tested it on a more recent kernel in order to be sure.

Unfortunately, there seems to be a different "busy inodes" race that is
unrelated to this one. It seems to be less easy to trigger than this one and
I'm still working on tracking it down.

Comment 3 Jeff Layton 2007-03-30 17:54:29 UTC
For the record. My reproducer is this:

-------[snip]-------
#!/bin/sh

FSSTRESS=/usr/lib/ltp/testcases/kernel/fs/fsstress/fsstress
TESTDIR=/mnt/dantu/testdir

while true; do
        mount /mnt/dantu
        nohup $FSSTRESS -d $TESTDIR/NFS1 -l 0 -n 1000 -p 4 -r &
        nohup $FSSTRESS -d $TESTDIR/NFS2 -l 0 -n 1000 -p 4 -r &
        sleep 3
        fuser -k -m /mnt/dantu
        sleep 3
        umount /mnt/dantu
        sleep 1
done
-------[snip]-------

fsstress is the one that comes with the ltp package and /mnt/dantu is a nfs4
mount (with default mount options). My test box is a RHEL4 paravirt xen guest.



Comment 4 Jeff Layton 2007-03-30 17:56:36 UTC
Note, however, that this reproducer will quickly deadlock with any stock RHEL4
kernel. In order to make it run for any reasonable length of time, you need the
patches from bug 228292.


Comment 5 Jeff Layton 2007-03-31 01:05:07 UTC
Reproduced the second problem with the debug patch (thanks again, Dave
Anderson!) and got a core. The debug patch forces a panic when there are busy
inodes and dumps some info about each busy inode. This seems to be the typical
situation for the second problem:

invalidate_list2[inode_in_use]: busy inode=ffffff803b436aa0 i_count=1
i_state=0x0 i_debug=0x0, i_mode=0100666 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b75dba0 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b1da480 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b125810 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b6bb690 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b5ddba0 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803daead20 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040755 i_dentry=not_empty
VFS: [ffffff8000992c00:10] 7 busy inodes after unmount. Self-destruct in 5
seconds.  Have a nice day...

So we have 7 busy inodes: one is a regular file and the rest are directories.
I've not yet backtracked all of them, but the one with the 040755 mode is the
root of the filesystem. I'm guessing that the busy file is nested within a set
of directories whose inodes are also busy here.

In any case, on the file, there is one dentry attached to the inode; here are
some interesting bits:

struct dentry {
  d_count = {
    counter = 0
  }, 
  d_flags = 8, 
  d_lock = {
    lock = 1, 
    magic = 3735899821
  }, 
...
  d_lru = {
    next = 0xffffff803bbc40c0, 
    prev = 0xffffffff803351a0
  }, 


...also...

$4 = (struct list_head *) 0xffffffff803351a0

...so the d_count is 0, the DCACHE_REFERENCED flag is set, and (I think) the lock
is unlocked. So it looks like this dentry is ready to be reaped, but for some
reason it wasn't. Also, this is apparently at the very top of the dentry_unused
list.

It sort of looks like a dput must have occurred just after shrink_dcache_anon.

Comment 6 Jeff Layton 2007-03-31 13:44:35 UTC
Yes, in looking at the code, this is in dput:

        if (dentry->d_op && dentry->d_op->d_delete) {
                if (dentry->d_op->d_delete(dentry))
                        goto unhash_it;
        }

...but the d_delete function for nfs (nfs_dentry_delete) has this:

        if (!(dentry->d_sb->s_flags & MS_ACTIVE)) {
                /* Unhash it, so that ancestors of killed async unlink
                 * files will be cleaned up during umount */
                return 1;
        }
        return 0;

...but generic_shutdown_super doesn't clear the MS_ACTIVE flag until well after
shrink_dcache_parent/shrink_dcache_anon have run. So there might be a race here,
if a dput is done after the shrink_dcache functions and before the MS_ACTIVE
flag is cleared. I need to consider this more closely, but I think something
like this could explain it. Interestingly, this does not seem to be nfs4-specific.
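For reference, the ordering in question inside generic_shutdown_super looks
roughly like this (paraphrased from the call traces in this bug and my reading of
the 2.6.9-era code, not quoted from the RHEL4 source):

        shrink_dcache_parent(root);       /* (1) reap dentries with d_count == 0 */
        shrink_dcache_anon(&sb->s_anon);
        dput(root);
        fsync_super(sb);                  /* (2) writeback can do a late dput here... */
        lock_super(sb);
        sb->s_flags &= ~MS_ACTIVE;        /* (3) ...but MS_ACTIVE is still set at (2), so
                                           *     nfs_dentry_delete() returns 0 and the dentry
                                           *     just goes back on the unused list */
        invalidate_inodes(sb);            /* (4) which then finds the inode still busy */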


Comment 7 Jeff Layton 2007-03-31 14:03:39 UTC
Except that I don't seem to be able to trigger this on NFSv3, so it's either
nfs4 specific, or the timing is such that v3 doesn't hit it.


Comment 8 Jeff Layton 2007-04-02 13:06:58 UTC
I've been taking a look at this on RHEL5. I don't seem to be able to reproduce
the problem there, though my systemtap script seems to indicate that open_namei
does occasionally return -ERESTARTSYS when the process is killed.

So there seems to be some mechanism cleaning things up on RHEL5 that's not
present on RHEL4.


Comment 9 Jeff Layton 2007-04-02 13:57:50 UTC
I think the fix that went into RHEL5 was due to this patch:

commit 02a913a73b52071e93f4b76db3e86138d19efffd
Author: Trond Myklebust <Trond.Myklebust>
Date:   Tue Oct 18 14:20:17 2005 -0700

    NFSv4: Eliminate nfsv4 open race...
    
     Make NFSv4 return the fully initialized file pointer with the
     stateid that it created in the lookup w/intent.
    
     Signed-off-by: Trond Myklebust <Trond.Myklebust>


...

and I think that that patch is probably too much for RHEL4 (looks like it
requires some generic VFS layer changes as well). I think we're probably stuck
with fixing this up in a different way than upstream did.




Comment 10 Jeff Layton 2007-04-02 20:02:07 UTC
The latest race seems to be due to a dput that occurs after shrink_dcache_parent
and shrink_dcache_anon calls have already completed. I set up a debug kernel to
catch a stack trace when this happens:

dput: [umount:2484] dput while UMOUNT_IN_PROGRESS dentry=ffffff801bbdde98 inode=ffffff80191b7280 s_debug=f

Call Trace:
       <ffffffff8018aaaf>{dput+242}
       <ffffffffa015be2e>{:nfs:put_nfs_open_context+54}
       <ffffffffa01600a0>{:nfs:nfs_release_request+68}
       <ffffffffa01638ef>{:nfs:nfs_wait_on_requests+131}
       <ffffffffa01653ea>{:nfs:nfs_writepages+218}
       <ffffffff80193e75>{__writeback_single_inode+252}
       <ffffffff80194502>{sync_sb_inodes+467}
       <ffffffff80194673>{sync_inodes_sb+135}
       <ffffffff801759d2>{fsync_super+159}
       <ffffffff8017a04e>{generic_shutdown_super+106}
       <ffffffff8017ad73>{kill_anon_super+9}
       <ffffffffa015e659>{:nfs:nfs4_kill_super+26}
       <ffffffff80179fcb>{deactivate_super+95}
       <ffffffff801901b9>{sys_umount+940}
       <ffffffff8017d4af>{sys_newstat+17}
       <ffffffff8010ddc3>{error_exit+0}
       <ffffffff8010d66e>{system_call+134}
       <ffffffff8010d5e8>{system_call+0}
invalidate_list2[inode_in_use]: busy inode=ffffff80191b7280 i_count=1
i_state=0x0 i_debug=0x0, i_mode=0100666 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80192dcb20 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019056b20 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80197c7400 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019a50400 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019af3d20 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019618690 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019aee710 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80196b6690 i_count=1
i_state=0x0 i_debug=0x0, i_mode=040755 i_dentry=not_empty
VFS: [ffffff80008b5c00:10] 9 busy inodes after unmount. Self-destruct in 5
seconds.  Have a nice day...


Comment 11 Jeff Layton 2007-04-02 21:05:02 UTC
What I think is happening is that we still have some outstanding writes when
sys_umount is called. We do_umount, etc. Eventually a _mntput gets called from
path_release_on_umount, which eventually calls generic_shutdown_super. The
shrink_dcache functions get called, but d_count is still not 0 on this dentry,
so it doesn't get touched. Eventually fsync_super gets called which does the
final dput on this dentry, but it's too late. The shrink_dcache functions have
already run, and the dentry can no longer be reaped even though the d_count is 1.

I did a Q&D patch that just adds an extra fsync_super() before the shrink_dcache
functions. This seems to work around the problem, but is obviously less than
ideal since the fix is not specific to NFS4. I'll look at some less invasive
possibilities tomorrow.

I'm still not clear on why we don't see this on nfsv3. It doesn't seem very
different from v4 in this regard...

Comment 12 Jeff Layton 2007-04-03 13:51:23 UTC
Created attachment 151557 [details]
patch -- track incomplete opens and clean up in setattr

I believe this patch is better than the other one I posted earlier. With the
other patch, it's possible that other tasks could race in and clear the
incomplete_open info before the setattr could error out.

This patch adds a new list_head to the nfs4_state and a new nfs4_inc_open
struct. When an nfs4_atomic_open is done, we record the info that was used to
do the open (including the current task pointer) and add it to the list. When
nfs4_proc_file_open is done, we remove it from the list. If a setattr returns an
error, then we try to clean up a dangling open for the task; if one is found, we
remove it and then do an extra nfs4_close_state to clean things up. The list
handling is all done under the i_lock for the inode. My thinking is that there
will rarely be more than a few dangling opens at a time, so any extra time spent
under the i_lock should be minimal.
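A sketch of the data structure and the tracking step described above; the struct
layout, list name, and helper are assumptions for illustration, not the attached
patch:

struct nfs4_inc_open {
        struct list_head        list;           /* linked into the state's incomplete-open list */
        struct task_struct      *task;          /* who did the nfs4_atomic_open */
        int                     open_flags;     /* flags passed to that open */
};

/* called from nfs4_atomic_open, once the OPEN has gone out on the wire */
static void nfs4_track_incomplete_open(struct inode *inode,
                                       struct nfs4_state *state, int open_flags)
{
        struct nfs4_inc_open *io;

        io = kmalloc(sizeof(*io), GFP_KERNEL);
        if (io == NULL)
                return;                         /* best-effort in this sketch */
        io->task = current;
        io->open_flags = open_flags;

        spin_lock(&inode->i_lock);
        list_add(&io->list, &state->inc_open_list);     /* list name assumed */
        spin_unlock(&inode->i_lock);
}

/* nfs4_proc_file_open removes the entry again (the open completed normally);
 * the setattr error path looks it up instead and calls nfs4_close_state()
 * with the recorded open_flags. */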

Comment 13 Jeff Layton 2007-04-03 15:16:34 UTC
With this latest patch, I've not been able to generate a "busy inodes" panic at
all, and the "bad stateid/seqid" messages that I was occasionally getting are
also gone. So I'm thinking the "second" variety that I was seeing was due to
races between open calls and the error handling in setattrs. I'll let this test
run for several hours and see if I can generate another busy inodes message, and
if not I'll plan to propose this patch.


Comment 14 Jeff Layton 2007-04-03 15:34:23 UTC
Created attachment 151576 [details]
patch -- track incomplete opens and clean them up if a setattr errors out

This patch is identical to the last one, but removes an extraneous fsync_super
that I had added when testing before. It seems that with this patch, that is not
needed.

Comment 16 Jeff Layton 2007-04-10 19:38:26 UTC
I've withdrawn the latest patch. It has a misplaced semicolon that makes all
opens error out silently. I've got an updated patch, but I think I need to
change a bit of the nfs4_proc_setattr function to make sure we don't miss any
dangling opens. I'll be posting that fairly soon...


Comment 17 Jeff Layton 2007-04-10 21:06:05 UTC
Created attachment 152181 [details]
patch -- track incomplete opens and clean up in setattr

This patch replaces the previous patch. It fixes a misplaced semicolon and also
should make sure that *any* incomplete open gets cleaned up. The previous patch
had some possible corner cases that it might have missed.

With this patch, I can still generate the occasional "busy inodes" problem, so
we'll need a second patch as well...

Comment 18 Jeff Layton 2007-04-10 21:08:29 UTC
Created attachment 152182 [details]
patch -- have open context keep a reference to the vfsmount

This patch makes it so that the open context carries a reference to the mount.
The open context can outlive the filp associated with it. This should prevent
the umount from occurring prematurely.
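The core of the idea, as a sketch (the field name on the open context is an
assumption; the attached patch is authoritative): grab a reference to the
vfsmount when the open context is created, and drop it when the context is
finally released, so the final mntput -- and with it the superblock teardown --
is deferred until no open context is still alive.

        /* when the open context is set up from the struct file */
        ctx->vfsmnt = mntget(filp->f_vfsmnt);   /* field name assumed */

        /* ...and in put_nfs_open_context(), once the last reference is gone */
        mntput(ctx->vfsmnt);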

With both patches in place, I've not been able to reproduce a "busy inodes"
problem using my reproducer.

Comment 20 RHEL Program Management 2007-04-12 13:25:45 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 21 RHEL Program Management 2007-04-18 22:24:45 UTC
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 22 Don Howard 2007-07-17 19:53:08 UTC
A patch addressing this issue has been included in kernel-2.6.9-55.19.EL.

Comment 25 errata-xmlrpc 2007-11-15 16:24:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html


Comment 26 Sam Knuth 2008-02-28 21:52:33 UTC
Jeff - can you confirm whether in fact this is fixed in RHEL 5? I'm not seeing
the exact bug in the changelog, though you mention a patch in comment 9.