+++ This bug was initially created as a clone of Bug #209419 +++

Description of problem:
Kernel crash; see the log from /var/log/messages below:

Oct 5 04:09:56 fieldarcher kernel: ------------[ cut here ]------------
Oct 5 04:09:56 fieldarcher kernel: kernel BUG at fs/nfs/inode.c:1650!
Oct 5 04:09:56 fieldarcher kernel: invalid operand: 0000 [#1]
Oct 5 04:09:56 fieldarcher kernel: SMP
Oct 5 04:09:56 fieldarcher kernel: Modules linked in: nfs ide_scsi nfsd exportfs lockd nfs_acl md5 ipv6 i2c_dev i2c_core sunrpc ipt_LOG ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac pcnet32 mii floppy ext3 jbd mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Oct 5 04:09:56 fieldarcher kernel: CPU: 0
Oct 5 04:09:56 fieldarcher kernel: EIP: 0060:[<f89ac6d8>] Not tainted VLI
Oct 5 04:09:56 fieldarcher kernel: EFLAGS: 00010202 (2.6.9-42.0.2.ELsmp)
Oct 5 04:09:56 fieldarcher kernel: EIP is at nfs4_clear_inode+0x6b/0x90 [nfs]
Oct 5 04:09:56 fieldarcher kernel: eax: 00000001 ebx: ceafbe5c ecx: 00000000 edx: ceafbec4
Oct 5 04:09:56 fieldarcher kernel: esi: f7daf980 edi: ceafbd1c ebp: c032de78 esp: f7cd7eb4
Oct 5 04:09:56 fieldarcher kernel: ds: 007b es: 007b ss: 0068
Oct 5 04:09:56 fieldarcher kernel: Process kswapd0 (pid: 41, threadinfo=f7cd7000 task=f7d076f0)
Oct 5 04:09:56 fieldarcher kernel: Stack: ceafbe5c f7cd7ee8 0000000e c0170d78 ceafbe5c c0170df5 ede74164 ede7416c
Oct 5 04:09:56 fieldarcher kernel:        00000054 c017117a 00000080 00000080 00000080 e23b9404 e4e8dbcc 00000000
Oct 5 04:09:56 fieldarcher kernel:        00000083 00000000 f7ffe9a0 c01711e7 c01497f8 006aef00 00000000 0000002c
Oct 5 04:09:56 fieldarcher kernel: Call Trace:
Oct 5 04:09:56 fieldarcher kernel:  [<c0170d78>] clear_inode+0xcc/0x102
Oct 5 04:09:56 fieldarcher kernel:  [<c0170df5>] dispose_list+0x47/0x6d
Oct 5 04:09:56 fieldarcher kernel:  [<c017117a>] prune_icache+0x193/0x1ec
Oct 5 04:09:56 fieldarcher kernel:  [<c01711e7>] shrink_icache_memory+0x14/0x2b
Oct 5 04:09:56 fieldarcher kernel:  [<c01497f8>] shrink_slab+0xf8/0x161
Oct 5 04:09:56 fieldarcher kernel:  [<c014aa93>] balance_pgdat+0x1e1/0x30e
Oct 5 04:09:56 fieldarcher kernel:  [<c02d26b1>] schedule+0x86d/0x8db
Oct 5 04:09:56 fieldarcher kernel:  [<c0120420>] prepare_to_wait+0x12/0x4c
Oct 5 04:09:56 fieldarcher kernel:  [<c014ac8a>] kswapd+0xca/0xcc
Oct 5 04:09:56 fieldarcher kernel:  [<c01204f5>] autoremove_wake_function+0x0/0x2d
Oct 5 04:09:56 fieldarcher kernel:  [<c02d46da>] ret_from_fork+0x6/0x14
Oct 5 04:09:56 fieldarcher kernel:  [<c01204f5>] autoremove_wake_function+0x0/0x2d
Oct 5 04:09:56 fieldarcher kernel:  [<c014abc0>] kswapd+0x0/0xcc
Oct 5 04:09:56 fieldarcher kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Oct 5 04:09:56 fieldarcher kernel: Code: 00 ff b3 c4 fe ff ff 05 60 01 00 00 ff b3 c0 fe ff ff 50 68 e5 84 9c f8 68 e2 98 9c f8 e8 cf 61 77 c7 83 c4 18 8b 46 60 48 74 08 <0f> 0b 72 06 74 94 9c f8 0f b7 56 5c 89 f0 e8 cd 7c 01 00 8d 87
Oct 5 04:09:56 fieldarcher kernel: <0>Fatal exception: panic in 5 seconds

Version-Release number of selected component (if applicable):

How reproducible:
The machine has an NFSv4-over-TCP mount from a NetApp filer. The kernel has crashed twice with the same message. After the machine is rebooted, the problem recurs after approximately a week.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

-- Additional comment from staubach on 2006-12-08 09:07 EST --

I don't suppose that you have any idea what the machine was doing when it crashed these two times, do you?

-- Additional comment from rene on 2006-12-08 09:21 EST --

The machine is running in VMware Server; it runs mainly Apache with PHP. The two crashes were not at the same time of day. We switched to NFSv3, and the server runs stably now.

-- Additional comment from staubach on 2006-12-08 09:35 EST --

Well, I wouldn't know how to approach this without more information, so I will close it.
If it reoccurs and there is more information for me to use, then please reopen this bz and I will look at it again then.

-- Additional comment from tao on 2007-03-26 08:57 EST --

What happened was that the kernel hit this BUG() in nfs4_clear_inode:

	BUG_ON(atomic_read(&state->count) != 1);

...this corresponds to this section of code:

0xf96486d2 <nfs4_clear_inode+101>:  mov  0x60(%esi),%eax
0xf96486d5 <nfs4_clear_inode+104>:  dec  %eax
0xf96486d6 <nfs4_clear_inode+105>:  je   0xf96486e0
0xf96486d8 <nfs4_clear_inode+107>:  ud2a        <<<< CRASH HERE

...so we load %eax with the contents of &state->count, then decrement it and compare to 0. So the count at the time here was 3 (since %eax is 2). %esi has the nfs4_state address, and it bears this out:

	nreaders = 3,
	nwriters = 0,
	state = 1,
	count = { counter = 3 }

The inode associated with this nfs4_state has a zero reference count, however...

	i_count = { counter = 0 },

...there are no remaining dentries attached to it, so we can't tell any of its filenames. The mode is:

	i_mode = 33279 = 0100777

...a regular file, but with 777 perms. The inode appears to be freeable and unused (from its reference counts), but the state owner lists 3 processes as still having a reading reference count against it. crash also confirms that the inode is being held open by no processes:

crash> fuser 0xe53953fc
No users of e53953fc found

Essentially, this reference count gets incremented when the inode is opened for reading and decremented on close (it's a bit more complex than that, but that's the basic idea). So we're getting too many increments or not enough decrements (or both). Are they able to reproduce this, or (even better) describe to us how to reproduce it?
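To make that reference counting concrete, here is a minimal userspace sketch (not kernel code — the struct and function names are made up to mirror the description above). A leaked "get" is all it takes to make the count land on something other than 1 by the time the inode is cleared, which is exactly what the BUG_ON catches:

```c
/* Hypothetical userspace model of the nfs4_state reference counting
 * described above. Opening for read bumps both counters; the close
 * path is supposed to drop them. If a close is skipped, the count
 * never returns to 1 and nfs4_clear_inode's BUG_ON fires. */
#include <assert.h>

struct nfs4_state_model {
    int count;      /* models state->count; 1 == only the owner's ref */
    int nreaders;   /* models state->nreaders */
};

/* First stage of an open for read: take the references. */
void model_open_for_read(struct nfs4_state_model *st)
{
    st->count++;
    st->nreaders++;
}

/* The balancing close: drop the references taken above. */
void model_close_state(struct nfs4_state_model *st)
{
    st->nreaders--;
    st->count--;
}

/* Returns 1 if clearing the inode now would trip
 * BUG_ON(atomic_read(&state->count) != 1). */
int model_would_bug(const struct nfs4_state_model *st)
{
    return st->count != 1;
}
```

In the crash above, three reading references were never put back, leaving count at 3 when the inode was otherwise free.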
I see this as well, running on cpu3 at the time:

crash> bt
PID: 585  TASK: f74699b0  CPU: 3  COMMAND: "vethrpc"
 #0 [f74d6f18] schedule at c02d268d
 #1 [f74d6f7c] veth_rpc_thread at f8e91d6a
 #2 [f74d6ff0] kernel_thread_helper at c01041f3

...I'm presuming that this has nothing to do with ONC-RPC (which NFS is layered upon), but it would be good to have Egenera confirm this.

This event sent from IssueTracker by jlayton
issue 117012

-- Additional comment from jlayton on 2007-03-26 09:07 EST --

Reopening bug... Egenera has gotten a couple of these crashes and we have a core for it. Unfortunately, it doesn't tell us much about what happened leading up to the event.

Peter, I'll go ahead and grab this since I've been doing a bit of nfs4 work anyway. Let me know if you'd rather work it, though...

-- Additional comment from rene on 2007-03-26 09:10 EST --

I moved from nfs4 to nfs3; the problem hasn't occurred again.

-- Additional comment from jlayton on 2007-03-26 09:18 EST --

Thanks Rene. We've had another person report this, and I'm hoping they have a reproducer, so I'm reopening this bug for that reason.

-- Additional comment from jlayton on 2007-03-27 08:14 EST --

I've been chasing a not-too-dissimilar race in bug 228292. Now that I have a patch for that deadlock, it seems to have brought out at least one case of a crash similar to this one, and a "VFS: Busy inodes after umount" problem as well. My guess is that this is not due to the patch itself, but rather that clearing the deadlock there has made the race occur more frequently. I'm going to keep pursuing that case for now, with the hope that I'll be able to track down the race that's occurring here as well.

-- Additional comment from jlayton on 2007-03-27 14:14 EST --

This is the post I just made in bug 228292. My thinking is that the patch for that BZ just happens to make this bug occur more frequently, but is not responsible for it. I'll continue work on this race here.
Note that I've seen what appears to be a number of reference-count-related races with that reproducer, and not all of them are alike. Still, I think they may have a common root...

I think I'm starting to see the problem, and it is related to signal handling. In the most common case I've seen, it appears that open_namei ends up returning -ERESTARTSYS, likely because I'm killing the process. The way this is supposed to work is that the lookup ends up calling nfs4_do_open, and then later __dentry_open is supposed to initiate a sequence of calls that does an nfs4_proc_file_open. That does an nfs4_close_state and would likely clean up the i_count, etc. Because open_namei exits with an error, however, this is not occurring, and that's causing this particular reference count leak.

That said, I'm thinking this is a different bug, and one that exists with or without the workqueue patch. I'll continue work on that problem in bug 209419. It's not yet clear to me what the correct solution for this problem is...

-- Additional comment from jlayton on 2007-03-27 17:22 EST --

Replicating the post I just made to bug 228292:

----------[snip]-----------
Actually, after looking over the code, the "VFS Busy inodes" problem may be introduced by the first patch in this set (the one to convert nfs4_close_state to do an async call). That patch adds an igrab/iput in nfs4_get/put_open_state.

Essentially, an nfs4 open seems to be a 2-stage process. The first part is handled by a chain like this:

filp_open
  open_namei
    __lookup_hash
      i_op->lookup (nfs_atomic_lookup)
        nfs4_atomic_open
          nfs4_do_open
            _nfs4_do_open
              nfs4_get_open_state

nfs4_get_open_state does an igrab, and that bumps i_count by 1. _nfs4_do_open increments the nreaders and nwriters for the state as well (as appropriate for the flags it's passed).
The second part of the open has a call chain something like this:

filp_open
  __dentry_open
    f_op->open (nfs_file_open)
      rpc_ops->file_open (nfs4_proc_file_open)
        nfs4_close_state
          nfs4_do_close
            nfs4_close_done
              nfs4_put_open_state

nfs4_put_open_state does an iput and decrements the i_count. nfs4_close_done decrements the nreaders/nwriters as appropriate as well.

I believe the problem is that if open_namei returns an error to filp_open, then the second call chain never happens, and the i_count and nreaders/nwriters remain artificially high.

I think this problem may also explain bug 209419. When we send a signal to these processes (most likely ones in the middle of an open() call), we sometimes make open_namei return -ERESTARTSYS. I still need to do a bit of investigation to see where the -ERESTARTSYS is ultimately coming from.

This seems like a sticky problem, however. It seems like we need a hook to get in an nfs4_close_state if open_namei is going to return an error after doing its lookup, but we don't have anything like that. We may have to do something to force it in the dput, or find a way to make sure that we don't return -ERESTARTSYS at all.
----------[snip]-----------

So, anyone who's seen this particular crash: could you let us know definitively whether you had processes doing I/O on the filesystem that may have caught a signal? That may help confirm whether my speculation about the above situation is true. In the meantime, I'm doing more investigation to see if I can tell where the -ERESTARTSYS is ultimately coming from.

-- Additional comment from mflint on 2007-03-28 08:31 EST --

It seems that no I/O was occurring in our case. The customer was running automounter, and it was unmounting a filesystem at the time. We have the following messages prior to the panic:

NFS: mount path /auto/home/oracle does not exist!
NFS: suggestion: try mounting '/' instead.
------------[ cut here ]------------
kernel BUG at fs/nfs/inode.c:1650!
invalid operand: 0000 [#1]
SMP
Modules linked in: iptable_filter ip_tables nfsd exportfs nfs lockd md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core hw_random nfs_acl sunrpc egenera_nmi(U) egenera_veth(U) sd_mod egenera_vscsi(U) scsi_mod egenera_vmdump(U) egenera_dumpdev(U) egenera_ipmi(U) egenera_base(U) egenera_virtual_bus(U) egenera_fs(U) dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
CPU: 0
EIP: 0060:[<f96486d8>] Tainted: PF VLI
EFLAGS: 00010202 (2.6.9-42.0.3.ELsmp)
EIP is at nfs4_clear_inode+0x6b/0x90 [nfs]
eax: 00000002 ebx: e53953fc ecx: 000001c7 edx: 00000000
esi: f38f9d80 edi: e53952bc ebp: c032de78 esp: e7a67ee4
ds: 007b es: 007b ss: 0068
Process umount (pid: 20118, threadinfo=e7a67000 task=f74683b0)
Stack: e53953fc e7a67f0c 0000002c c0170d78 e53953fc c0170df5 00000000 f00f7400
       f00f7400 c0170f5e e539569c f1433e64 e7a67000 f00f7400 00000000 f9674a60
       c0160396 00000011 f028d400 f9674ac0 e7a67000 c0160c13 f00f7400 f9648ea2
Call Trace:
 [<c0170d78>] clear_inode+0xcc/0x102
 [<c0170df5>] dispose_list+0x47/0x6d
 [<c0170f5e>] invalidate_inodes+0x96/0xae
 [<c0160396>] generic_shutdown_super+0x88/0x150
 [<c0160c13>] kill_anon_super+0x9/0x30
 [<f9648ea2>] nfs4_kill_super+0x16/0x50 [nfs]
 [<c016025c>] deactivate_super+0x5b/0x70
 [<c017389f>] sys_umount+0x65/0x6c
 [<c016314a>] sys_stat64+0xf/0x23
 [<c01507f5>] unmap_vma_list+0xe/0x17
 [<c0150b2f>] do_munmap+0x129/0x137
 [<c01738b1>] sys_oldumount+0xb/0xe
 [<c02d47cb>] syscall_call+0x7/0xb

-- Additional comment from jlayton on 2007-03-29 11:49 EST --

I wasn't so much concerned with I/O happening at the time of the umount (by definition there wouldn't be any, or you wouldn't be able to umount). My question was more along the lines of: did any application that was doing I/O to this filesystem prior to the unmount end up being killed by a signal? Unfortunately, that may be hard to tell.

I have a reproducer of sorts for this issue, but it involves a much larger NFSv4 patch that's still in progress.
Essentially, I run a hefty amount of filesystem activity via several concurrent threads, then kill off the processes doing it and unmount. After several attempts at this, I end up with a panic similar to the one above. I believe the issue there is that the i_count for the inode is bad at the time an iput is done, and we end up trying to delete an inode that's still being actively worked with.

There's a later patch that adds an igrab/iput in nfs4_get/put_open_state. That seems to correct the problem of panicking in nfs4_clear_inode for me, but I still see a "VFS: Busy inodes at umount" error and some other panics related to reference counting. So the problem is not completely fixed by that. There seems to be a reference count leak occurring, and I believe it's tied up with signal handling in the middle of an open call, but tracking this down is tricky since it's pretty racy.

-- Additional comment from jlayton on 2007-03-29 14:48 EST --

Instrumented a kernel to watch for the case where may_open/open_namei returns with an error, and to print the inode address it had been working with at the time. I also made it dump out the inode address that had been busy at the time. So in my reproducer, it appears that may_open returning with an error (-ERESTARTSYS in this case) is causing the problem:

open_namei: returning -512, inode ffffff803baf4a20 left dangling
invalidate_list2[inode_in_use]: busy inode=ffffff803baf4a20 i_count=1 i_state=0x0 i_debug=0x33, i_mode=0100666 i_dentry=empty
VFS: [ffffff8000981400:10] 1 busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

...I've yet to come up with an elegant way to "clean up" in this situation, however. We really need some way to call down into NFS from the VFS and clean up the state, but there are no such hooks that are suitable for that...
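The two-stage open and the leak when open_namei errors out can be sketched in userspace (this is an illustration, not kernel code: the counters and function names are stand-ins, and ERESTARTSYS is defined here by hand since it is kernel-internal; its value of 512 matches the -512 in the instrumented output above):

```c
/* Userspace model of the two-stage NFSv4 open described above:
 * stage one (the lookup) takes references, stage two (__dentry_open
 * -> nfs4_proc_file_open -> nfs4_close_state) releases them. If the
 * open fails between the two stages, the references leak. */
#include <assert.h>

#define ERESTARTSYS 512  /* kernel-internal errno; -512 in the logs */

int i_count;    /* models inode->i_count */
int nreaders;   /* models state->nreaders */

void stage1_lookup_open(void)   /* nfs4_do_open side of the open */
{
    i_count++;                  /* igrab in nfs4_get_open_state */
    nreaders++;                 /* bumped by _nfs4_do_open */
}

void stage2_file_open(void)     /* nfs4_proc_file_open side */
{
    nreaders--;                 /* dropped by nfs4_close_done */
    i_count--;                  /* iput in nfs4_put_open_state */
}

/* Models filp_open: returns 0 on success, or -ERESTARTSYS if a
 * signal interrupts things after the lookup but before
 * __dentry_open ever runs -- in which case stage two is skipped. */
int model_filp_open(int interrupted)
{
    stage1_lookup_open();
    if (interrupted)
        return -ERESTARTSYS;    /* stage two never runs: refs leak */
    stage2_file_open();
    return 0;
}
```

A clean open leaves the counters balanced; an interrupted one leaves i_count stuck high, which is the "left dangling" inode in the debug output.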
-- Additional comment from jlayton on 2007-03-30 07:45 EST --

Created an attachment (id=151265)
patch -- increment and decrement i_count when getting and putting open state

I feel pretty confident that this patch will correct this panic. It's a backported piece of an upstream patch that converts the NFSv4 close call to be asynchronous:

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=09ca6dde884139015027ea102cbb560de5836eec;hp=cedf3820493f7fefec00da3699507bf7cdfd6544

This piece simply makes sure that we increment the inode reference count when we do a get on the open state and decrement it when we put the open state. I believe this will prevent the panic, since it should prevent the code from ever doing a clear_inode on the inode until the nfs4_state has been put for the last time.

Unfortunately, in the testing I've done, this seems to simply trade this particular panic for a "VFS: Busy inodes at unmount" problem when the fs is eventually unmounted. I think there are still some problems with the reference counting on the nfs4_state, but this does cure the immediate problem and seems to be the right thing to do.
The patch for BZ 209419 seems to have traded the nfs4_clear_inode BUG() for a "VFS: Busy inodes at umount" problem. My guess is that these are due to the same root cause (or causes) -- bad nfs4_state handling in the nfsv4 code. I'm opening this BZ for tracking down these issues (having looked over this, I think there's more than one).
Created attachment 151296 [details]
patch -- if setattr call errors out, attempt to clean up incomplete open

This patch seems to fix up one of the races here. This is a rambly explanation, but I hope it will explain it...

The issue is that nfs4 open calls are done in more than one stage. The first part of the open call is actually done as part of the lookup (via nfs_atomic_lookup, which calls nfs4_do_open). This is when the actual open call goes out on the wire and when the nfs4_get_open_state is done (which does the igrab, due to the patch for bug 209419). The second part of the open is actually done as a result of __dentry_open, which "puts" all of the dangling "gets" from the lookup. The problem there is that open_namei can do a successful lookup but may still return an error. If that occurs, then the __dentry_open doesn't get done and those "gets" are left "unput".

If a file is opened with O_CREAT, then open_namei will do the lookup (which is actually an open call), and then (via a series of intervening functions) will try to truncate the file. This involves doing a setattr call. If that setattr call is interrupted (perhaps because the process was killed), then it will pass an error back to open_namei, and filp_open will never call __dentry_open.

This patch attempts to catch this problem. If a setattr call returns an error, then we check the nfs4 state to see if an nfs4_do_open call was made and left "dangling". If so, then we do an extra nfs4_close_state with the flags that were passed to the last open.

I'm not completely thrilled with this fix. There may be situations where it falls down, but it does seem to fix this problem as far as I can tell, and it doesn't involve any VFS layer changes. A look at the upstream code seems to indicate that this problem may be there as well; I've not yet tested it on a more recent kernel to be sure.

Unfortunately, there seems to be a different "busy inodes" race that is unrelated to this one.
It seems to be less easy to trigger than this one and I'm still working on tracking it down.
For the record, my reproducer is this:

-------[snip]-------
#!/bin/sh

FSSTRESS=/usr/lib/ltp/testcases/kernel/fs/fsstress/fsstress
TESTDIR=/mnt/dantu/testdir

while true; do
    mount /mnt/dantu
    nohup $FSSTRESS -d $TESTDIR/NFS1 -l 0 -n 1000 -p 4 -r &
    nohup $FSSTRESS -d $TESTDIR/NFS2 -l 0 -n 1000 -p 4 -r &
    sleep 3
    fuser -k -m /mnt/dantu
    sleep 3
    umount /mnt/dantu
    sleep 1
done
-------[snip]-------

fsstress is the one that comes with the ltp package, and /mnt/dantu is an nfs4 mount (with default mount options). My test box is a RHEL4 paravirt xen guest.
Note, however, that this reproducer will quickly deadlock with any stock RHEL4 kernel. In order to even make it run for any reasonable length of time, you need the patches from bug 228292.
Reproduced the second problem with the debug patch (thanks again, Dave Anderson!), and got a core. The debug patch forces a panic when there are busy inodes and dumps some info about each of the busy inodes. This seems to be the typical situation for the second problem:

invalidate_list2[inode_in_use]: busy inode=ffffff803b436aa0 i_count=1 i_state=0x0 i_debug=0x0, i_mode=0100666 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b75dba0 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b1da480 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b125810 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b6bb690 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b5ddba0 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803daead20 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040755 i_dentry=not_empty
VFS: [ffffff8000992c00:10] 7 busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

So we have 7 busy inodes: one is a regular file and the rest are directories. I've not yet backtracked all of them, but the one with the 040755 mode is the root of the filesystem. I'm guessing that the busy file is nested within a set of directories whose inodes are also busy here.

In any case, on the file, there is one dentry attached to the inode. Here are some interesting bits:

struct dentry {
  d_count = { counter = 0 },
  d_flags = 8,
  d_lock = { lock = 1, magic = 3735899821 },
  ...
  d_lru = { next = 0xffffff803bbc40c0, prev = 0xffffffff803351a0 },

...also...

$4 = (struct list_head *) 0xffffffff803351a0

...so the d_count is 0, the DCACHE_REFERENCED flag is set, and (I think) the lock is unlocked.
So it looks like this dentry is ready to be reaped, but for some reason, it wasn't. Also, this is apparently at the very top of the dentry_unused list. It sort of looks like a dput must have occurred just after the shrink_dcache_anon.
Yes, in looking at the code, this is in dput:

	if (dentry->d_op && dentry->d_op->d_delete) {
		if (dentry->d_op->d_delete(dentry))
			goto unhash_it;
	}

...but the d_delete function for nfs (nfs_dentry_delete) has this:

	if (!(dentry->d_sb->s_flags & MS_ACTIVE)) {
		/* Unhash it, so that ancestors of killed async unlink
		 * files will be cleaned up during umount */
		return 1;
	}
	return 0;

...but generic_shutdown_super doesn't clear the MS_ACTIVE flag until well after shrink_dcache_parent/shrink_dcache_anon has run. So there might be a race here, if a dput is done after the shrink_dcache functions and before the MS_ACTIVE flag is cleared. I need to consider this more closely, but I think something like this could explain it. Interestingly, this does not seem to be nfs4-specific.
Except that I don't seem to be able to trigger this on NFSv3, so it's either nfs4 specific, or the timing is such that v3 doesn't hit it.
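The MS_ACTIVE ordering in the dput/nfs_dentry_delete code quoted above can be boiled down to a toy userspace model (everything here is illustrative: the struct, the MS_ACTIVE value, and the dput behavior are all simplified stand-ins, not the real VFS):

```c
/* Toy model of the race: nfs_dentry_delete only asks for the dentry
 * to be unhashed once MS_ACTIVE is clear, but generic_shutdown_super
 * clears MS_ACTIVE only after shrink_dcache_* has already run. A dput
 * landing in that window leaves the dentry (and its inode) busy at
 * unmount. */
#include <assert.h>

#define MS_ACTIVE 0x4000   /* placeholder bit; the value is irrelevant */

struct sb_model {
    int s_flags;           /* models sb->s_flags */
    int busy_dentries;     /* dentries still pinning inodes */
};

/* Models nfs_dentry_delete(): unhash only once MS_ACTIVE is gone. */
int model_d_delete(const struct sb_model *sb)
{
    return !(sb->s_flags & MS_ACTIVE);
}

/* Models the final dput() on a dentry: it is only torn down right
 * away if d_delete says to unhash it; otherwise it lingers on the
 * LRU, which is too late once shrink_dcache_* has already run. */
void model_final_dput(struct sb_model *sb)
{
    if (model_d_delete(sb))
        sb->busy_dentries--;
}
```

With MS_ACTIVE still set, the dput leaves the dentry behind (the busy-inodes case); once the flag is cleared, the same dput cleans it up.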
Been taking a look at this on RHEL5. I don't seem to be able to reproduce this problem there, though my systemtap script seems to indicate that open_namei does occasionally return with a -ERESTARTSYS when the process is killed. So there seems to be some mechanism that's cleaning things up on RHEL5 that's not present on RHEL4.
I think the fix that went into RHEL5 was due to this patch:

commit 02a913a73b52071e93f4b76db3e86138d19efffd
Author: Trond Myklebust <Trond.Myklebust>
Date:   Tue Oct 18 14:20:17 2005 -0700

    NFSv4: Eliminate nfsv4 open race...

    Make NFSv4 return the fully initialized file pointer with the stateid
    that it created in the lookup w/intent.

    Signed-off-by: Trond Myklebust <Trond.Myklebust>

...and I think that patch is probably too much for RHEL4 (it looks like it requires some generic VFS layer changes as well). I think we're probably stuck with fixing this up in a different way than upstream did.
The latest race seems to be due to a dput that occurs after the shrink_dcache_parent and shrink_dcache_anon calls have already completed. I set up a debug kernel to catch a stack trace when this happens:

dput: [umount:2484] dput while UMOUNT_IN_PROGRESS dentry=ffffff801bbdde98 inode=ffffff80191b7280 s_debug=f
Call Trace:
 <ffffffff8018aaaf>{dput+242}
 <ffffffffa015be2e>{:nfs:put_nfs_open_context+54}
 <ffffffffa01600a0>{:nfs:nfs_release_request+68}
 <ffffffffa01638ef>{:nfs:nfs_wait_on_requests+131}
 <ffffffffa01653ea>{:nfs:nfs_writepages+218}
 <ffffffff80193e75>{__writeback_single_inode+252}
 <ffffffff80194502>{sync_sb_inodes+467}
 <ffffffff80194673>{sync_inodes_sb+135}
 <ffffffff801759d2>{fsync_super+159}
 <ffffffff8017a04e>{generic_shutdown_super+106}
 <ffffffff8017ad73>{kill_anon_super+9}
 <ffffffffa015e659>{:nfs:nfs4_kill_super+26}
 <ffffffff80179fcb>{deactivate_super+95}
 <ffffffff801901b9>{sys_umount+940}
 <ffffffff8017d4af>{sys_newstat+17}
 <ffffffff8010ddc3>{error_exit+0}
 <ffffffff8010d66e>{system_call+134}
 <ffffffff8010d5e8>{system_call+0}

invalidate_list2[inode_in_use]: busy inode=ffffff80191b7280 i_count=1 i_state=0x0 i_debug=0x0, i_mode=0100666 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80192dcb20 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019056b20 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80197c7400 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019a50400 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019af3d20 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019618690 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019aee710 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80196b6690 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040755 i_dentry=not_empty
VFS: [ffffff80008b5c00:10] 9 busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
What I think is happening is that we still have some outstanding writes when sys_umount is called. We do_umount, etc. Eventually a _mntput gets called from path_release_on_umount, which eventually calls generic_shutdown_super. The shrink_dcache functions get called, but d_count is still not 0 on this dentry, so it doesn't get touched. Eventually fsync_super gets called which does the final dput on this dentry, but it's too late. The shrink_dcache functions have already run, and the dentry can no longer be reaped even though the d_count is 1. I did a Q&D patch that just adds an extra fsync_super() before the shrink_dcache functions. This seems to work around the problem, but is obviously less than ideal since the fix is not specific to NFS4. I'll look at some less invasive possibilities tomorrow. I'm still not clear on why we don't see this on nfsv3. It doesn't seem very different from v4 in this regard...
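The ordering problem in that paragraph, and why syncing before the shrink works around it, can be sketched in userspace (a rough model under stated assumptions: one dentry whose last reference is held by a pending write; the function names come from the trace above, but the logic is heavily simplified):

```c
/* Rough model of the umount ordering race: shrink_dcache_* runs while
 * the dentry still holds its last reference (a pending write), and
 * the dput from fsync_super's writeback comes too late to reap it.
 * Doing the sync first (the Q&D workaround) drains the reference in
 * time. */
#include <assert.h>

struct umount_model {
    int d_count;   /* last ref, held by the outstanding write */
    int reaped;    /* did the dentry get pruned? */
};

void model_shrink_dcache(struct umount_model *m)
{
    if (m->d_count == 0)
        m->reaped = 1;   /* only zero-count dentries are pruned here */
}

void model_fsync_super(struct umount_model *m)
{
    if (m->d_count > 0)
        m->d_count--;    /* writeback completes; the final dput lands */
}

/* Returns the number of busy inodes left after the unmount. */
int model_umount(struct umount_model *m, int sync_first)
{
    if (sync_first)
        model_fsync_super(m);   /* workaround: flush before shrinking */
    model_shrink_dcache(m);
    if (!sync_first)
        model_fsync_super(m);   /* stock order: the dput arrives late */
    return m->reaped ? 0 : 1;
}
```

In the stock ordering the model leaves one busy inode behind; with the sync hoisted before the shrink, it leaves none.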
Created attachment 151557 [details]
patch -- track incomplete opens and clean up in setattr

I believe this patch is better than the one I posted earlier. With the other patch, it's possible that other tasks could race in and clear the incomplete_open info before the setattr could error out.

This patch adds a new list_head to the nfs4_state and a new nfs4_inc_open struct. When an nfs4_atomic_open is done, we track the info that was used to do the open (including the current task pointer) and add it to the list. When nfs4_proc_file_open is done, we remove it from the list. If a setattr returns an error, then we try to clean up a dangling open for the task; if one is found, we remove it and then do an extra nfs4_close_state to clean things up.

The list handling is all done under the i_lock for the inode. My thinking is that there will rarely be more than a few dangling opens at a time, so any extra time spent under the i_lock should be minimal.
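A minimal userspace model of that bookkeeping (not the patch itself: the struct and function names below are made up, the task pointer is reduced to an integer id, and the i_lock is omitted since the model is single-threaded):

```c
/* Model of the incomplete-open tracking described above: each
 * stage-one open records an entry keyed by the opening task,
 * the normal file_open path removes it, and a failed setattr sweeps
 * the list for the current task and closes anything left dangling. */
#include <assert.h>
#include <stdlib.h>

struct inc_open {
    int task_id;               /* stand-in for the task pointer */
    struct inc_open *next;
};

struct state_model {
    struct inc_open *incomplete;  /* under i_lock in the real patch */
    int open_refs;                /* outstanding open references */
};

/* Stage one (nfs4_atomic_open): take a ref and record the open. */
void track_open(struct state_model *st, int task_id)
{
    struct inc_open *io = malloc(sizeof(*io));
    io->task_id = task_id;
    io->next = st->incomplete;
    st->incomplete = io;
    st->open_refs++;
}

/* Remove this task's entry; returns 1 if one was found. */
static int remove_entry(struct state_model *st, int task_id)
{
    struct inc_open **p;
    for (p = &st->incomplete; *p; p = &(*p)->next) {
        if ((*p)->task_id == task_id) {
            struct inc_open *io = *p;
            *p = io->next;
            free(io);
            return 1;
        }
    }
    return 0;
}

/* Stage two completed normally (nfs4_proc_file_open): the usual
 * close path balances the ref, so just drop the tracking entry. */
void complete_open(struct state_model *st, int task_id)
{
    if (remove_entry(st, task_id))
        st->open_refs--;   /* balanced by the normal nfs4_close_state */
}

/* setattr errored out: close whatever this task left dangling. */
void setattr_error_cleanup(struct state_model *st, int task_id)
{
    if (remove_entry(st, task_id))
        st->open_refs--;   /* the extra nfs4_close_state */
}
```

The point of keying by task is that a setattr failure only sweeps up the opens belonging to the task that hit the error, so concurrent opens by other tasks are left alone.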
With this latest patch, I've not been able to generate a "busy inodes" panic at all, and the "bad stateid/seqid" messages that I was occasionally getting are also gone. So I'm thinking the "second" variety that I was seeing was due to races between open calls and the error handling in setattrs. I'll let this test run for several hours and see if I can generate another busy inodes message, and if not I'll plan to propose this patch.
Created attachment 151576 [details] patch -- track incomplete opens and clean them up if a setattr errors out This patch is identical to the last one, but removes an extraneous fsync_super that I had added when testing before. It seems with this patch that that is not needed.
I've withdrawn the latest patch. It has a misplaced semicolon that makes all opens error out silently. I've got an updated patch, but I think I need to change a bit of the nfs4_proc_setattr function to make sure we don't miss any dangling opens. I'll be posting that fairly soon...
Created attachment 152181 [details] patch -- track incomplete opens and clean up in setattr This patch replaces the previous patch. It fixes a misplaced semicolon and also should make sure that *any* incomplete open gets cleaned up. The previous patch had some possible corner cases that it might have missed. With this patch, I can still generate the occasional "busy inodes" problem, so we'll need a second patch as well...
Created attachment 152182 [details] patch -- have open context keep a reference to the vfsmount This patch makes it so that the open context carries a reference to the mount. The open context can outlive the filp associated with it. This should prevent the umount from occurring prematurely. With both patches in place, I've not been able to reproduce a "busy inodes" problem using my reproducer.
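A sketch of what this second patch does, as a userspace model (the names are illustrative, not the patch's; the idea is simply that the open context takes a mount reference for its whole lifetime, mirroring mntget/mntput):

```c
/* Model of an open context pinning the vfsmount: because the context
 * can outlive the struct file it belongs to, it holds its own mount
 * reference, and the unmount cannot proceed until every context has
 * been put. */
#include <assert.h>

struct vfsmnt_model { int mnt_count; };   /* models the mount refcount */
struct open_ctx     { struct vfsmnt_model *mnt; };

void get_context(struct open_ctx *ctx, struct vfsmnt_model *mnt)
{
    mnt->mnt_count++;        /* mntget() when the context is created */
    ctx->mnt = mnt;
}

void put_context(struct open_ctx *ctx)
{
    ctx->mnt->mnt_count--;   /* mntput() on the final context put */
    ctx->mnt = 0;
}

/* The unmount may only tear things down once nothing pins the mount. */
int can_unmount(const struct vfsmnt_model *mnt)
{
    return mnt->mnt_count == 0;
}
```

With the reference in place, a context that outlives its filp simply delays the unmount instead of leaving busy inodes behind.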
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
A patch addressing this issue has been included in kernel-2.6.9-55.19.EL.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html
Jeff - can you confirm if in fact this is fixed in RHEL 5? I'm not seeing the exact bug in the changelog, though you mention a patch in comment 9