Bug 234587
| Summary: | VFS: Busy inodes at umount on NFSv4 filesystem | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Jeff Layton <jlayton> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.4 | CC: | aviro, jbaron, lwoodman, mflint, rgautier, sfolkwil, staubach, steved |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2007-0791 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-11-15 16:24:19 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| Bug Depends On: | 209419, 228292 | | |
| Bug Blocks: | | | |
Description (Jeff Layton, 2007-03-30 12:05:52 UTC):
The patch for BZ 209419 seems to have traded the nfs4_clear_inode BUG() for a "VFS: Busy inodes at umount" problem. My guess is that these have the same root cause (or causes): bad nfs4_state handling in the NFSv4 code. I'm opening this BZ for tracking down these issues (having looked this over, I think there's more than one).

Created attachment 151296 [details]
patch -- if setattr call errors out, attempt to clean up incomplete open

This patch seems to fix one of the races here. This is a rambling explanation, but I hope it will explain it.

The issue is that NFSv4 open calls are done in more than one stage. The first part of the open is actually done as part of the lookup (via nfs_atomic_lookup, which calls nfs4_do_open). This is when the actual open call goes out on the wire, and when the nfs4_get_open_state is done (which does the igrab, due to the patch for bug 209419). The second part of the open is done as a result of __dentry_open, which "puts" all of the dangling "gets" from the lookup.

The problem is that open_namei can do a successful lookup but still return an error. If this occurs, __dentry_open doesn't get done and those "gets" are left "unput". If a file is opened with O_CREAT, open_namei will do the lookup (which is actually an open call), and then (via a series of intervening functions) will try to truncate the file. This involves doing a setattr call. If that setattr call is interrupted (perhaps because the process was killed), it will pass an error back to open_namei, and filp_open will never call __dentry_open. A sketch of this leak path follows the reproducer below.

This patch attempts to catch this problem. If a setattr call returns an error, we check the nfs4 state to see if an nfs4_do_open call was made and left "dangling". If so, we do an extra nfs4_close_state with the flags that were passed to the last open.

I'm not completely thrilled with this fix. There may be situations where it falls down, but it does seem to fix this problem as far as I can tell, and it doesn't involve any VFS-layer changes. A look at the upstream code seems to indicate that this problem may be there as well; I've not yet tested a more recent kernel in order to be sure.

Unfortunately, there seems to be a different "busy inodes" race that is unrelated to this one. It is less easy to trigger than this one, and I'm still working on tracking it down.

For the record, my reproducer is this:

-------[snip]-------
#!/bin/sh
FSSTRESS=/usr/lib/ltp/testcases/kernel/fs/fsstress/fsstress
TESTDIR=/mnt/dantu/testdir

while true; do
	mount /mnt/dantu
	nohup $FSSTRESS -d $TESTDIR/NFS1 -l 0 -n 1000 -p 4 -r &
	nohup $FSSTRESS -d $TESTDIR/NFS2 -l 0 -n 1000 -p 4 -r &
	sleep 3
	fuser -k -m /mnt/dantu
	sleep 3
	umount /mnt/dantu
	sleep 1
done
-------[snip]-------

fsstress is the one that comes with the ltp package, and /mnt/dantu is an NFSv4 mount (with default mount options). My test box is a RHEL4 paravirt xen guest. Note, however, that this reproducer will quickly deadlock on any stock RHEL4 kernel. In order to even make it run for any reasonable length of time, you need the patches from bug 228292.
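To make the first race concrete, here is a small runnable user-space simulation of the leak path. This is a sketch only: the stub functions stand in for the RHEL4 call chain named above, and the counter stands in for the igrab'd inode reference.

-------[snip]-------
/* leak_path.c -- simulation of the refcount imbalance; not kernel code */
#include <stdio.h>

static int refs;			/* stands in for the igrab'd inode ref */

static int nfs4_do_open_stub(void)	/* stage 1: lookup w/intent */
{
	refs++;				/* nfs4_get_open_state() -> igrab() */
	return 0;			/* the OPEN succeeded on the wire */
}

static int setattr_stub(int interrupted) /* O_CREAT open -> truncate -> setattr */
{
	return interrupted ? -1 : 0;	/* error if the task was killed mid-call */
}

static void dentry_open_stub(void)	/* stage 2: __dentry_open() */
{
	refs--;				/* nfs4_close_state() puts the ref */
}

static int open_namei_stub(int interrupted)
{
	if (nfs4_do_open_stub())
		return -1;
	if (setattr_stub(interrupted))
		return -1;		/* error path: stage 2 never runs */
	dentry_open_stub();
	return 0;
}

int main(void)
{
	open_namei_stub(0);
	printf("clean open:       refs=%d\n", refs);	/* 0: balanced */
	open_namei_stub(1);
	printf("interrupted open: refs=%d\n", refs);	/* 1: leaked -> busy inode at umount */
	return 0;
}
-------[snip]-------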
Reproduced the second problem with the debug patch (thanks again, Dave Anderson!), and got a core. The debug patch forces a panic when there are busy inodes and dumps some info about each of them. This seems to be the typical situation for the second problem:

invalidate_list2[inode_in_use]: busy inode=ffffff803b436aa0 i_count=1 i_state=0x0 i_debug=0x0, i_mode=0100666 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b75dba0 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b1da480 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b125810 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b6bb690 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803b5ddba0 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff803daead20 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040755 i_dentry=not_empty
VFS: [ffffff8000992c00:10] 7 busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice day...

So we have 7 busy inodes: one is a regular file and the rest are directories. I've not yet backtracked all of them, but the one with the 040755 mode is the root of the filesystem. I'm guessing that the busy file is nested within a set of directories whose inodes are also busy here.

In any case, the file has one dentry attached to the inode. Here are some interesting bits:

	struct dentry {
	    d_count = { counter = 0 },
	    d_flags = 8,
	    d_lock = { lock = 1, magic = 3735899821 },
	    ...
	    d_lru = { next = 0xffffff803bbc40c0, prev = 0xffffffff803351a0 },

...also...

	$4 = (struct list_head *) 0xffffffff803351a0

...so the d_count is 0, the DCACHE_REFERENCED flag is set, and (I think) the lock is unlocked. So it looks like this dentry is ready to be reaped, but for some reason it wasn't. Also, it is apparently at the very top of the dentry_unused list.

It sort of looks like a dput must have occurred just after the shrink_dcache_anon. Yes, in looking at the code, this is in dput:

	if (dentry->d_op && dentry->d_op->d_delete) {
		if (dentry->d_op->d_delete(dentry))
			goto unhash_it;
	}

...but the d_delete function for nfs (nfs_dentry_delete) has this:

	if (!(dentry->d_sb->s_flags & MS_ACTIVE)) {
		/* Unhash it, so that ancestors of killed async unlink
		 * files will be cleaned up during umount */
		return 1;
	}
	return 0;

...and generic_shutdown_super doesn't clear the MS_ACTIVE flag until well after shrink_dcache_parent/shrink_dcache_anon have run. So there might be a race here: if a dput is done after the shrink_dcache functions but before the MS_ACTIVE flag is cleared, the dentry is neither reaped nor unhashed. I need to consider this more closely, but I think something like this could explain it.

Interestingly, this does not seem to be NFSv4-specific. Except that I can't seem to trigger it on NFSv3, so either it is nfs4-specific after all, or the timing is such that v3 doesn't hit it.

I've been taking a look at this on RHEL5. I don't seem to be able to reproduce this problem there, though my systemtap script indicates that open_namei does occasionally return -ERESTARTSYS when the process is killed. So there seems to be some mechanism that's cleaning things up on RHEL5 that's not present in RHEL4. I think the fix that went into RHEL5 was this patch:

commit 02a913a73b52071e93f4b76db3e86138d19efffd
Author: Trond Myklebust <Trond.Myklebust>
Date:   Tue Oct 18 14:20:17 2005 -0700

    NFSv4: Eliminate nfsv4 open race...
    Make NFSv4 return the fully initialized file pointer with the
    stateid that it created in the lookup w/intent.

    Signed-off-by: Trond Myklebust <Trond.Myklebust>

...and I think that patch is probably too much for RHEL4 (it looks like it requires some generic VFS-layer changes as well). I think we're probably stuck with fixing this up in a different way than upstream did.

The latest race seems to be due to a dput that occurs after the shrink_dcache_parent and shrink_dcache_anon calls have already completed. I set up a debug kernel to catch a stack trace when this happens:

dput: [umount:2484] dput while UMOUNT_IN_PROGRESS dentry=ffffff801bbdde98 inode=ffffff80191b7280 s_debug=f
Call Trace:
 <ffffffff8018aaaf>{dput+242}
 <ffffffffa015be2e>{:nfs:put_nfs_open_context+54}
 <ffffffffa01600a0>{:nfs:nfs_release_request+68}
 <ffffffffa01638ef>{:nfs:nfs_wait_on_requests+131}
 <ffffffffa01653ea>{:nfs:nfs_writepages+218}
 <ffffffff80193e75>{__writeback_single_inode+252}
 <ffffffff80194502>{sync_sb_inodes+467}
 <ffffffff80194673>{sync_inodes_sb+135}
 <ffffffff801759d2>{fsync_super+159}
 <ffffffff8017a04e>{generic_shutdown_super+106}
 <ffffffff8017ad73>{kill_anon_super+9}
 <ffffffffa015e659>{:nfs:nfs4_kill_super+26}
 <ffffffff80179fcb>{deactivate_super+95}
 <ffffffff801901b9>{sys_umount+940}
 <ffffffff8017d4af>{sys_newstat+17}
 <ffffffff8010ddc3>{error_exit+0}
 <ffffffff8010d66e>{system_call+134}
 <ffffffff8010d5e8>{system_call+0}

invalidate_list2[inode_in_use]: busy inode=ffffff80191b7280 i_count=1 i_state=0x0 i_debug=0x0, i_mode=0100666 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80192dcb20 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019056b20 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80197c7400 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019a50400 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019af3d20 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019618690 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff8019aee710 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040777 i_dentry=not_empty
invalidate_list2[inode_in_use]: busy inode=ffffff80196b6690 i_count=1 i_state=0x0 i_debug=0x0, i_mode=040755 i_dentry=not_empty
VFS: [ffffff80008b5c00:10] 9 busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice day...

What I think is happening is that we still have some outstanding writes when sys_umount is called. We do_umount, etc., and eventually a _mntput gets called from path_release_on_umount, which eventually calls generic_shutdown_super. The shrink_dcache functions get called, but d_count is still not 0 on this dentry, so it doesn't get touched. Eventually fsync_super gets called, which does the final dput on this dentry, but it's too late: the shrink_dcache functions have already run, and the dentry can no longer be reaped even though its d_count has now dropped to 0. A sketch of the ordering follows.

I did a Q&D patch that just adds an extra fsync_super() before the shrink_dcache functions. This seems to work around the problem, but it is obviously less than ideal, since the fix is not specific to NFSv4. I'll look at some less invasive possibilities tomorrow.

I'm still not clear on why we don't see this on NFSv3; it doesn't seem very different from v4 in this regard.
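For reference, here is a simplified outline of the 2.6.9-era generic_shutdown_super() ordering that this race depends on, annotated with where the window falls and where the quick-and-dirty workaround goes. This is a sketch of the sequence as described above, not verbatim kernel source.

-------[snip]-------
/* Simplified outline of generic_shutdown_super() (2.6.9-era), annotated.
 * Not verbatim kernel source. */
void generic_shutdown_super(struct super_block *sb)
{
	/*
	 * Q&D workaround goes here: an extra fsync_super(sb) before the
	 * shrink_dcache calls flushes the outstanding writes, so the
	 * final dput happens while the dentry can still be reaped below.
	 */
	shrink_dcache_parent(sb->s_root);	/* reaps dentries with d_count == 0 */
	shrink_dcache_anon(&sb->s_anon);

	fsync_super(sb);	/* RACE WINDOW: writeback does
				 * nfs_release_request -> put_nfs_open_context
				 * -> the final dput(). Too late to be reaped,
				 * and MS_ACTIVE is still set, so
				 * nfs_dentry_delete() won't unhash it either. */

	sb->s_flags &= ~MS_ACTIVE;	/* cleared only after all of the above */

	invalidate_inodes(sb);		/* -> "N busy inodes after unmount" */
}
-------[snip]-------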
Created attachment 151557 [details]
patch -- track incomplete opens and clean up in setattr
I believe this patch is better than the one I posted earlier. With the other patch, it's possible that other tasks could race in and clear the incomplete_open info before the setattr could error out.

This patch adds a new list_head to the nfs4_state and a new nfs4_inc_open struct. When an nfs4_atomic_open is done, we track the info that was used to do the open (including the current task pointer) and add it to the list. When nfs4_proc_file_open is done, we remove it from the list. If a setattr returns an error, we look for a dangling open belonging to the task; if one is found, we remove it and do an extra nfs4_close_state to clean things up. The list handling is all done under the inode's i_lock. My thinking is that there will rarely be more than a few dangling opens at a time, so any extra time spent under the i_lock should be minimal. A sketch of this bookkeeping follows.
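To illustrate the bookkeeping described above, here is a rough sketch. This is not the attached patch: the nfs4_inc_open name comes from the description, but the field names, the inc_open_list member, and the helper names are invented for illustration, and error handling is elided.

-------[snip]-------
/* Illustrative sketch only -- not the attached patch. */
struct nfs4_inc_open {
	struct list_head	list;		/* linked into the nfs4_state */
	struct task_struct	*task;		/* task that did the atomic open */
	int			open_flags;	/* flags passed to the open */
};

/* nfs4_atomic_open path: record the in-progress open, under i_lock */
static void track_incomplete_open(struct inode *inode,
				  struct nfs4_state *state, int open_flags)
{
	struct nfs4_inc_open *inc = kmalloc(sizeof(*inc), GFP_KERNEL);

	inc->task = current;
	inc->open_flags = open_flags;
	spin_lock(&inode->i_lock);
	list_add(&inc->list, &state->inc_open_list);	/* assumed new list_head */
	spin_unlock(&inode->i_lock);
}

/*
 * Called from the nfs4_proc_file_open path (open completed normally,
 * just drop the record) and from the setattr error path (also do the
 * extra nfs4_close_state for the dangling open).
 */
static void cleanup_incomplete_open(struct inode *inode,
				    struct nfs4_state *state, int close)
{
	struct nfs4_inc_open *inc, *found = NULL;

	spin_lock(&inode->i_lock);
	list_for_each_entry(inc, &state->inc_open_list, list) {
		if (inc->task == current) {
			list_del(&inc->list);
			found = inc;
			break;
		}
	}
	spin_unlock(&inode->i_lock);

	if (found) {
		if (close)	/* setattr errored out: put the dangling ref */
			nfs4_close_state(state, found->open_flags);
		kfree(found);
	}
}
-------[snip]-------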
With this latest patch, I've not been able to generate a "busy inodes" panic at all, and the "bad stateid/seqid" messages that I was occasionally getting are also gone. So I'm thinking the "second" variety that I was seeing was due to races between open calls and the error handling in setattrs. I'll let this test run for several hours and see if I can generate another busy inodes message; if not, I'll plan to propose this patch.

Created attachment 151576 [details]
patch -- track incomplete opens and clean them up if a setattr errors out
This patch is identical to the last one, but removes an extraneous fsync_super that I had added during earlier testing. With this patch, that no longer appears to be needed.
I've withdrawn the latest patch. It has a misplaced semicolon that makes all opens error out silently. I've got an updated patch, but I think I need to change a bit of the nfs4_proc_setattr function to make sure we don't miss any dangling opens. I'll be posting that fairly soon...

Created attachment 152181 [details]
patch -- track incomplete opens and clean up in setattr
This patch replaces the previous one. It fixes the misplaced semicolon and should also make sure that *any* incomplete open gets cleaned up; the previous patch had some corner cases that it might have missed.

With this patch I can still generate the occasional "busy inodes" problem, so we'll need a second patch as well...
Created attachment 152182 [details]
patch -- have open context keep a reference to the vfsmount
This patch makes the open context carry a reference to the vfsmount. The open context can outlive the filp associated with it, so pinning the mount should prevent the umount from proceeding prematurely.

With both patches in place, I've not been able to reproduce a "busy inodes" problem using my reproducer. A minimal sketch of the vfsmount-pinning idea follows.
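This is not the attached patch: the nfs_open_context layout and the helper signatures are simplified for illustration; the point is the mntget()/mntput() pairing over the context's lifetime.

-------[snip]-------
/* Illustrative sketch only -- not the attached patch. */
struct nfs_open_context {
	/* ... existing fields (dentry, cred, state, ...) ... */
	struct vfsmount *mnt;	/* new: pin the mount for the context's lifetime */
};

static struct nfs_open_context *alloc_nfs_open_context(struct vfsmount *mnt)
{
	struct nfs_open_context *ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);

	if (ctx)
		ctx->mnt = mntget(mnt);	/* held even after the filp is gone */
	return ctx;
}

static void put_nfs_open_context(struct nfs_open_context *ctx)
{
	/* ... drop the dentry/state references (this may be the late dput
	 * seen in the stack trace above) ... */
	mntput(ctx->mnt);	/* dropped last: generic_shutdown_super()
				 * cannot run while any open context still
				 * holds the mount */
	kfree(ctx);
}
-------[snip]-------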
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

This request was evaluated by the Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.

A patch addressing this issue has been included in kernel-2.6.9-55.19.EL.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html

Jeff - can you confirm whether this is in fact fixed in RHEL 5? I'm not seeing the exact bug in the changelog, though you mention a patch in comment 9.