+++ This bug was initially created as a clone of Bug #189918 +++ Scenario 1: A: calls ext3_new_inode(), blocks in e.g. new_inode() knfsd: gets an fhandle with inumber ext3_new_inode() will pick knfsd: calls iget(sb, ino) knfsd: allocates and hashes inode (locked, new), calls ext3_read_inode(), blocks A: allocates inode, fills it, hashes, has it written to cache knfsd: comes back, gets the data left by A, happily fills its struct inode => we have two in-core inodes with the same inode number, both in use (by normal dcache and by anon dentry held by nfsd). Scenario 2: same, except that A fails in xattr allocation (after having inode inserted into hash) and knfsd comes and finds the inode in a normal fashion - via icache. inode is pinned down by nfsd (and has no ACL or selinux label, BTW). Scenario 3: knfsd: does iget(), gets preempted before it can call make_bad_inode() A: allocates inode with the same inumber, hashes it the latter gets evicted (not too hard - just rmdir on non-empty parent), later normal lookup finds half-done one from knfsd and blocks (it's still locked). knfsd: comes back, marks that puppy bad, unhashes and unlocks normal lookup gets a pile of crap instead of inode. Similar fun exists for other exportable filesystems. On top of that, failure exits in ext3_new_inode() are leaking like hell - block quota not freed if we'd allocated xattr block for ACL, etc., but that's ext3-specific (ext2 has smaller-scale analog of that). --- Additional comment from holtmann on 2006-04-28 05:51:32 EDT --- Proposed fix from Al Viro: * have callers of find_inode()/find_inode_fast() check if the inode they've got is still in hash after they'd finished wait_on_inode(). If it isn't (i.e. we'd raced with ->read_inode() called by somebody before us unhashing the inode) - act as if we hadn't found it at all. * have iget() check if after ->read_inode() the sucker is unhashed, iput() and return NULL if it is (that, BTW, simplifies life for export_iget()). * have foo_new_inode() on affected filesystems use iget_locked() after they figure out the inode number. That shall give us a (new,locked) in-core inode _and_ guarantee that there won't be aliasing issues. Then we fill it and instead of insert_inode_hash() do unlock_new_inode() in the very end. Or make_bad_inode()/unlock_new_inode()/iput() on failure exit (with explicit cleanup rather than relying of foo_delete_inode(); needed anyway since we need to do cleanups after halfway-failed inode creation).
Luis, looks like these patches can be applied onto rt kernel directly. [PATCH] Add an ERR_CAST() function to complement ERR_PTR and co. d1bc8e95445224276d7896b8b08cbb0b28a0ca80 v2.6.25-rc1~388 [PATCH] bugfix: two read_inode() calls without clear_inode() call between 4120db47198d21d8cd3b2cdbbe1ea6118a50bcd4 v2.6.13-rc3~53 [PATCH] igrab() should check for I_CLEAR 4a3b0a490d49ada8bbf3f426be1a0ace4dcd0a55 v2.6.21-rc1~91^2~363 [PATCH] iget: stop EXT2 from using iget() and read_inode() 52fcf7032935b33158e3998ed399cac97447ab8d v2.6.25-rc1~377 [PATCH] iget: stop EXT3 from using iget() and read_inode() 473043dcee1874aab99f66b0362b344618eb3790 v2.6.25-rc1~376 [PATCH] iget: stop FAT from using iget() and read_inode() 17f95a7b4416a2c61e35f51b29eaaf1818fb5d7d v2.6.25-rc1~374 [PATCH] iget: stop EXT4 from using iget() and read_inode() 1d1fe1ee02b9ac2660995b10e35dd41448fef011 v2.6.25-rc1~375 [PATCH] iget: stop ISOFS from using read_inode c4386c83bf849c56b1f49951595aeb7c9a719d21 v2.6.25-rc1~370 [PATCH] iget: stop JFFS2 from using iget() and read_inode() 5451f79f5f817880958ed063864ad268d94ccd1f v2.6.25-rc1~369 [PATCH] preparation to jfs iget sanitizing ?? see rhel-4 patch [PATCH] iget: stop JFS from using iget eab1df71a0ef6d333b9b826deaa0d0eb4b4f69dc v2.6.25-rc1~368 [PATCH] iget: stop FUSE from using iget() and read_inode() fa300b1914f892196acb385677047bc978466de7 v2.6.25-rc1~372 iget: stop PROCFS from using iget() and read_inode() a1d4aebbfa91c55a6b0c629a9ccf6369be0c6e95 v2.6.25-rc1~366 [PATCH] nfsd/create race fixes, infrastructure 261bca86ed4f7f391d1938167624e78da61dcc6b v2.6.29-rc1~549 [PATCH] fs: make sure data stored into inode is properly seen before unlocking new inode 580be0837a7a59b207c3d5c661d044d8dd0a6a30 v2.6.32-rc1~608 [PATCH] ext3/4 with synchronous writes gets wedged by Postfix 72a43d63cb51057393edfbcfc4596066205ad15d v2.6.30~18^2 [PATCH] nfsd race fixes: ext2 41080b5a240113328c607f22b849f653373db0ce v2.6.29-rc1~549 [PATCH] nfsd race fixes: ext3 c38012daa7ad902a39a4213ba2b3fe50e81157ea v2.6.29-rc1~549 nfsd race fixes: ext4 6b38e842bb832a3dbeb17e382404aef3c40ac5f9 v2.6.29-rc1~549 [PATCH] nfsd race fixes: reiserfs c1eaa26b671299b3ec01d40c6c71ee19a4f81517 v2.6.29-rc1~549 [PATCH] nfsd race fixes: jfs 1f3403fa640f9f7b135dee79f2d39d01c8ad4a08 v2.6.29-rc1~549
Verified by code review. Found patch mentioned in comment #1 applied to the 2.6.24.7-149 src.rpm. Started a 96 hour load and stress test in the -149 kernel. Awaiting final results from this test to conclude.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0161.html