Scenario 1:
A:     calls ext3_new_inode(), blocks in e.g. new_inode()
knfsd: gets an fhandle with inumber ext3_new_inode() will pick
knfsd: calls iget(sb, ino)
knfsd: allocates and hashes inode (locked, new), calls ext3_read_inode(), blocks
A:     allocates inode, fills it, hashes, has it written to cache
knfsd: comes back, gets the data left by A, happily fills its struct inode
=> we have two in-core inodes with the same inode number, both in use (by normal dcache and by anon dentry held by nfsd).

Scenario 2:
same, except that A fails in xattr allocation (after having inode inserted into hash) and knfsd comes and finds the inode in a normal fashion - via icache. inode is pinned down by nfsd (and has no ACL or selinux label, BTW).

Scenario 3:
knfsd: does iget(), gets preempted before it can call make_bad_inode()
A:     allocates inode with the same inumber, hashes it
       the latter gets evicted (not too hard - just rmdir on non-empty parent),
       later normal lookup finds half-done one from knfsd and blocks (it's still locked).
knfsd: comes back, marks that puppy bad, unhashes and unlocks
       normal lookup gets a pile of crap instead of inode.

Similar fun exists for other exportable filesystems.

On top of that, failure exits in ext3_new_inode() are leaking like hell - block quota not freed if we'd allocated xattr block for ACL, etc., but that's ext3-specific (ext2 has smaller-scale analog of that).
Proposed fix from Al Viro:

* have callers of find_inode()/find_inode_fast() check if the inode they've got is still in hash after they'd finished wait_on_inode(). If it isn't (i.e. we'd raced with ->read_inode() called by somebody before us unhashing the inode) - act as if we hadn't found it at all.
* have iget() check if after ->read_inode() the sucker is unhashed, iput() and return NULL if it is (that, BTW, simplifies life for export_iget()).
* have foo_new_inode() on affected filesystems use iget_locked() after they figure out the inode number. That shall give us a (new,locked) in-core inode _and_ guarantee that there won't be aliasing issues. Then we fill it and instead of insert_inode_hash() do unlock_new_inode() in the very end. Or make_bad_inode()/unlock_new_inode()/iput() on failure exit (with explicit cleanup rather than relying on foo_delete_inode(); needed anyway since we need to do cleanups after halfway-failed inode creation).
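For illustration only, the proposed rules can be modelled in a small single-threaded userspace sketch. This is not kernel code and not taken from any of the patches: every `toy_*` name, the fixed-size slot array, and the `hashed`/`locked`/`bad` flags are hypothetical stand-ins for the icache, I_LOCK|I_NEW, and make_bad_inode(). Blocking in wait_on_inode() is not simulated; the caller simply runs the steps in the interleaving it wants to replay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Hypothetical stand-in for an in-core inode. */
struct toy_inode {
    unsigned long ino;
    int count;    /* reference count */
    bool hashed;  /* still present in the toy icache */
    bool locked;  /* models (new, locked): still being filled in */
    bool bad;     /* models the effect of make_bad_inode() */
};

/* Toy icache: a flat slot array scanned linearly, not a real hash. */
static struct toy_inode *toy_icache[TOY_SLOTS];

static struct toy_inode *toy_find(unsigned long ino)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        if (toy_icache[i] && toy_icache[i]->ino == ino)
            return toy_icache[i];
    return NULL;
}

/* Models iget_locked(): return the inode already cached for this number,
 * or insert `fresh` as a (new, locked) inode. Because lookup and insert
 * are a single step, two in-core inodes can never share one number. */
static struct toy_inode *toy_iget_locked(struct toy_inode *fresh,
                                         unsigned long ino)
{
    struct toy_inode *inode = toy_find(ino);

    if (inode) {
        inode->count++;
        return inode;
    }
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (!toy_icache[i]) {
            *fresh = (struct toy_inode){ .ino = ino, .count = 1,
                                         .hashed = true, .locked = true };
            toy_icache[i] = fresh;
            return fresh;
        }
    }
    return NULL; /* toy cache is full */
}

static void toy_unhash(struct toy_inode *inode)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        if (toy_icache[i] == inode)
            toy_icache[i] = NULL;
    inode->hashed = false;
}

/* Success exit: models unlock_new_inode(). */
static void toy_unlock_new(struct toy_inode *inode)
{
    inode->locked = false;
}

/* Failure exit: mark bad, unhash, unlock. */
static void toy_fail_new(struct toy_inode *inode)
{
    inode->bad = true;
    toy_unhash(inode);
    toy_unlock_new(inode);
}

/* The check proposed for callers of find_inode(): once the wait is over,
 * an inode that is no longer hashed is treated as "not found". */
static struct toy_inode *toy_after_wait(struct toy_inode *inode)
{
    assert(!inode->locked); /* caller must only resume after the unlock */
    if (!inode->hashed) {
        inode->count--;     /* drop the reference taken by the lookup */
        return NULL;
    }
    return inode;
}

/* Replays scenario 3: the lookup must not be handed the bad inode. */
static bool toy_replay_scenario3(void)
{
    struct toy_inode knfsd_slot, lookup_slot;
    struct toy_inode *half_done = toy_iget_locked(&knfsd_slot, 12);
    struct toy_inode *seen = toy_iget_locked(&lookup_slot, 12);

    if (!half_done || seen != half_done)
        return false;
    toy_fail_new(half_done);            /* knfsd gives up on the inode */
    return toy_after_wait(seen) == NULL; /* lookup acts as "not found" */
}
```

In this model `toy_replay_scenario3()` returns true: the lookup that found the half-done inode ends up with NULL rather than the bad inode, and a creator that goes through `toy_iget_locked()` instead of a blind hash insert shares the single in-core inode with any concurrent lookup.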
Upstream patches:

d1bc8e95445224276d7896b8b08cbb0b28a0ca80
4120db47198d21d8cd3b2cdbbe1ea6118a50bcd4
4a3b0a490d49ada8bbf3f426be1a0ace4dcd0a55
52fcf7032935b33158e3998ed399cac97447ab8d
473043dcee1874aab99f66b0362b344618eb3790
17f95a7b4416a2c61e35f51b29eaaf1818fb5d7d
1d1fe1ee02b9ac2660995b10e35dd41448fef011
c4386c83bf849c56b1f49951595aeb7c9a719d21
5451f79f5f817880958ed063864ad268d94ccd1f [PATCH] preparation to jfs iget sanitizing
eab1df71a0ef6d333b9b826deaa0d0eb4b4f69dc
fa300b1914f892196acb385677047bc978466de7
a1d4aebbfa91c55a6b0c629a9ccf6369be0c6e95
261bca86ed4f7f391d1938167624e78da61dcc6b
580be0837a7a59b207c3d5c661d044d8dd0a6a30
72a43d63cb51057393edfbcfc4596066205ad15d
41080b5a240113328c607f22b849f653373db0ce
c38012daa7ad902a39a4213ba2b3fe50e81157ea
6b38e842bb832a3dbeb17e382404aef3c40ac5f9
c1eaa26b671299b3ec01d40c6c71ee19a4f81517
1f3403fa640f9f7b135dee79f2d39d01c8ad4a08
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.45.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Verified that the patch linux-2.6.9-fs-a-bunch-of-patches-to-fix-various-nfsd-iget-races.patch applies and is sane, and ran fs_mark on i386 and x86_64 against local and NFS-exportable filesystems:

i386: https://beaker.engineering.redhat.com/jobs/45969
x86_64: https://beaker.engineering.redhat.com/jobs/45972

localhost:/tmp /mnt nfs rw,v3,rsize=32768,wsize=32768,hard,lock,proto=tcp,tcp,timeo=600,retrans=5,addr=localhost 0 0

[root@hp-xw9400-01 fs_mark]# ./fs_mark -d /mnt -s 51200 -n 4096 -l fill.log
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0263.html