A: calls ext3_new_inode(), blocks in e.g. new_inode()
knfsd: gets an fhandle with inumber ext3_new_inode() will pick
knfsd: calls iget(sb, ino)
knfsd: allocates and hashes inode (locked, new), calls ext3_read_inode(), blocks
A: allocates inode, fills it, hashes, has it written to cache
knfsd: comes back, gets the data left by A, happily fills its struct inode
=> we have two in-core inodes with the same inode number, both in use (by normal
dcache and by anon dentry held by nfsd).
same, except that A fails in xattr allocation (after having inserted the inode
into the hash) and knfsd comes and finds the inode in the normal fashion, via
icache. The inode is pinned down by nfsd (and has no ACL or SELinux label, BTW).
knfsd: does iget(), gets preempted before it can call make_bad_inode()
A: allocates inode with the same inumber, hashes it
the latter gets evicted (not too hard: just rmdir on a non-empty parent); a
later normal lookup finds the half-done one from knfsd and blocks (it's still locked).
knfsd: comes back, marks that puppy bad, unhashes and unlocks
normal lookup gets a pile of crap instead of an inode.
Similar fun exists for other exportable filesystems. On top of that, failure
exits in ext3_new_inode() are leaking like hell - block quota not freed if we'd
allocated xattr block for ACL, etc., but that's ext3-specific (ext2 has
smaller-scale analog of that).
Proposed fix from Al Viro:
* have callers of find_inode()/find_inode_fast() check if the inode they've got
is still in hash after they'd finished wait_on_inode(). If it isn't (i.e. we'd
raced with ->read_inode() called by somebody before us unhashing the inode) -
act as if we hadn't found it at all.
* have iget() check if after ->read_inode() the sucker is unhashed, iput() and
return NULL if it is (that, BTW, simplifies life for export_iget()).
* have foo_new_inode() on affected filesystems use iget_locked() after they
figure out the inode number. That shall give us a (new,locked) in-core inode
_and_ guarantee that there won't be aliasing issues. Then we fill it and instead
of insert_inode_hash() do unlock_new_inode() in the very end. Or
make_bad_inode()/unlock_new_inode()/iput() on failure exit (with explicit
cleanup rather than relying on foo_delete_inode(); needed anyway since we need
to do cleanups after halfway-failed inode creation).
[PATCH] preparation to jfs iget sanitizing
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Committed in 89.45.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Verified that patch linux-2.6.9-fs-a-bunch-of-patches-to-fix-various-nfsd-iget-races.patch
applies sanely, and ran fs_mark on i386 and x86_64 on a local fs and on NFS-exportable filesystems:
localhost:/tmp /mnt nfs rw,v3,rsize=32768,wsize=32768,hard,lock,proto=tcp,tcp,timeo=600,retrans=5,addr=localhost 0 0
[root@hp-xw9400-01 fs_mark]# ./fs_mark -d /mnt -s 51200 -n 4096 -l fill.log
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.