+++ This bug was initially created as a clone of Bug #189918 +++
A: calls ext3_new_inode(), blocks in e.g. new_inode()
knfsd: gets an fhandle with the inumber that ext3_new_inode() will pick
knfsd: calls iget(sb, ino)
knfsd: allocates and hashes inode (locked, new), calls ext3_read_inode(), blocks
A: allocates inode, fills it, hashes it, has it written to cache
knfsd: comes back, gets the data left by A, happily fills its struct inode
=> we have two in-core inodes with the same inode number, both in use (one by the
normal dcache and one by an anon dentry held by nfsd).
Same as above, except that A fails in xattr allocation (after having the inode
inserted into the hash) and knfsd comes along and finds the inode the normal
way - via the icache. The inode is pinned down by nfsd (and has no ACL or
selinux label, BTW).
knfsd: does iget(), gets preempted before it can call make_bad_inode()
A: allocates inode with the same inumber, hashes it
the latter gets evicted (not too hard - just rmdir on non-empty parent),
later a normal lookup finds the half-done one from knfsd and blocks (it's still locked).
knfsd: comes back, marks that puppy bad, unhashes and unlocks
the normal lookup gets a pile of crap instead of a valid inode.
Similar fun exists for other exportable filesystems. On top of that, failure
exits in ext3_new_inode() are leaking like hell - block quota is not freed if
we'd allocated an xattr block for the ACL, etc. - but that's ext3-specific
(ext2 has a smaller-scale analog of that).
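For context, the old-style iget() path that all three scenarios go through
looked roughly like this (a from-memory paraphrase of the include/linux/fs.h
inline of that era, not the exact RHEL source):

/*
 * Sketch only. The inode returned by iget_locked() is already hashed but
 * stays locked (I_NEW) while ->read_inode() runs, and ->read_inode() has
 * no way to return an error - those are the windows exploited above.
 */
static inline struct inode *iget(struct super_block *sb, unsigned long ino)
{
        struct inode *inode = iget_locked(sb, ino);

        if (inode && (inode->i_state & I_NEW)) {
                sb->s_op->read_inode(inode);    /* may block, cannot fail cleanly */
                unlock_new_inode(inode);
        }
        return inode;
}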
--- Additional comment from email@example.com on 2006-04-28 05:51:32 EDT ---
Proposed fix from Al Viro:
* have callers of find_inode()/find_inode_fast() check if the inode they've got
is still in the hash after they've finished wait_on_inode(). If it isn't (i.e. we
raced with somebody whose ->read_inode() had unhashed the inode before us) -
act as if we hadn't found it at all.
* have iget() check if after ->read_inode() the sucker is unhashed, iput() and
return NULL if it is (that, BTW, simplifies life for export_iget()).
* have foo_new_inode() on affected filesystems use iget_locked() after they
figure out the inode number. That shall give us a (new, locked) in-core inode
_and_ guarantee that there won't be aliasing issues. Then we fill it and, instead
of insert_inode_hash(), do unlock_new_inode() at the very end. Or
make_bad_inode()/unlock_new_inode()/iput() on the failure exit (with explicit
cleanup rather than relying on foo_delete_inode(); needed anyway since we need
to do cleanups after a halfway-failed inode creation). A rough sketch of the
resulting shape follows this list.
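To make the last point concrete, here is a minimal sketch of the proposed
foo_new_inode() shape. This is not the actual ext2/ext3 patch; foo_alloc_ino(),
foo_fill_new_inode() and foo_release_ino() are hypothetical placeholders for
the real bitmap/quota/xattr work.

#include <linux/fs.h>
#include <linux/errno.h>
#include <linux/err.h>

static struct inode *foo_new_inode(struct super_block *sb)
{
        unsigned long ino;
        struct inode *inode;
        int err;

        ino = foo_alloc_ino(sb);           /* hypothetical: reserve the inumber on disk */
        inode = iget_locked(sb, ino);      /* (new, locked) in-core inode, already hashed */
        if (!inode)
                return ERR_PTR(-ENOMEM);
        if (!(inode->i_state & I_NEW)) {
                /* somebody already holds an in-core inode with this inumber - bail */
                iput(inode);
                return ERR_PTR(-EIO);
        }

        err = foo_fill_new_inode(inode);   /* hypothetical: mode, quota, xattr/ACL, ... */
        if (err) {
                foo_release_ino(sb, ino);  /* hypothetical: explicit cleanup of on-disk state */
                make_bad_inode(inode);
                unlock_new_inode(inode);
                iput(inode);
                return ERR_PTR(err);
        }

        unlock_new_inode(inode);           /* replaces the old insert_inode_hash() at the end */
        return inode;
}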
Luis, it looks like these patches can be applied to the rt kernel directly.
[PATCH] Add an ERR_CAST() function to complement ERR_PTR and co.
[PATCH] bugfix: two read_inode() calls without clear_inode() call between
[PATCH] igrab() should check for I_CLEAR
[PATCH] iget: stop EXT2 from using iget() and read_inode()
[PATCH] iget: stop EXT3 from using iget() and read_inode()
[PATCH] iget: stop FAT from using iget() and read_inode()
[PATCH] iget: stop EXT4 from using iget() and read_inode()
[PATCH] iget: stop ISOFS from using read_inode
[PATCH] iget: stop JFFS2 from using iget() and read_inode()
[PATCH] preparation to jfs iget sanitizing
?? see rhel-4 patch
[PATCH] iget: stop JFS from using iget
[PATCH] iget: stop FUSE from using iget() and read_inode()
[PATCH] iget: stop PROCFS from using iget() and read_inode()
[PATCH] nfsd/create race fixes, infrastructure
[PATCH] fs: make sure data stored into inode is properly seen before unlocking new inode
[PATCH] ext3/4 with synchronous writes gets wedged by Postfix
[PATCH] nfsd race fixes: ext2
[PATCH] nfsd race fixes: ext3
[PATCH] nfsd race fixes: ext4
[PATCH] nfsd race fixes: reiserfs
[PATCH] nfsd race fixes: jfs
Verified by code review. Found the patch mentioned in comment #1 applied to the
22.214.171.124-149 src.rpm. Started a 96-hour load and stress test on the -149 kernel. Awaiting final results from this test before concluding.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.