Escalated to Bugzilla from IssueTracker
From User-Agent: XML-RPC

I thought I found a race between iput() and prune_icache(). Maybe it's a race?

Processor 0 - in iput():
  - refiles the inode onto the inode_unused_pagecache list, does some other
    work, then unlocks inode_lock.
  - calls clear_inode(). clear_inode() tries to wait_on_inode(), but doesn't
    need to wait because I_LOCK is not set.

Processor 1 - in prune_icache(), in the CONFIG_HIGHMEM loop:
  - locks inode_lock, grabs the inode, and sets I_LOCK.

Processor 0 - in clear_inode():
  - still sets inode->i_state = I_CLEAR, so the I_LOCK bit set by Processor 1
    is now gone.

Processor 1 - in prune_icache(), in the CONFIG_HIGHMEM loop:
  - calls invalidate_inode_pages(), which eventually calls __refile_inode().
  - because i_state is I_CLEAR (and not I_FREEING), __refile_inode() goes
    ahead and refiles the inode.
  - back in prune_icache(), because i_state is I_CLEAR, __refile_inode() is
    called again.

Based on the above, we put together a debug kernel with the attached patch
and ran both dbench and cerebrus consecutively. After 9 hours, we hit the
prune_icache() hang. Almost immediately before the hang, we saw this printk
on the console:

2005-04-16 01:49:33 VFS: __refile_inode() called with i_state=I_CLEAR.
2005-04-16 01:49:33 inode=cd6dfa80, inode->i_ino=4324, inode->i_state=0x20

Note that we typoed the printk message; it should say "i_state has I_CLEAR
set", since we're not checking for equality in the patch.

The additional printks we put in the patch weren't output to the screen, so
that's an indication the race I described above may not be what is causing
our hangs/crashes. But we believe this does indicate a race condition
involving I_CLEAR. As far as we can tell, a cleared inode should never be
put onto an inode list. The fact that __refile_inode() has this check

    if (inode->i_state & I_FREEING)
        return;

right at the beginning of the function seems to support this belief.

Another debug kernel is on the way. Please let me know any comments/ideas
the SEG folks have on this.

Al

File uploaded: prune_icache_debug.patch

This event sent from IssueTracker by peterm
issue 70528
it_file 38555
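For readers following the interleaving above, here is a simplified sketch of the clear_inode() tail the race goes through. It is reconstructed from the description above and from the patch context later in this report, not a verbatim copy of the 2.4.21 source, so surrounding details may differ.

void clear_inode(struct inode *inode)
{
	/* ... earlier teardown elided ... */

	/*
	 * Only blocks while I_LOCK is set.  In the interleaving above the
	 * flag is clear at this point, so Processor 0 falls straight through.
	 */
	wait_on_inode(inode);

	if (inode->i_cdev) {
		cdput(inode->i_cdev);
		inode->i_cdev = NULL;
	}

	/*
	 * Whole-state assignment, done WITHOUT inode_lock held: it silently
	 * wipes out an I_LOCK that prune_icache()/__sync_one() may have set
	 * under inode_lock in the meantime.
	 */
	inode->i_state = I_CLEAR;
}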
From User-Agent: XML-RPC

We added the following into a test kernel in __refile_inode():

    if (inode->i_state & I_CLEAR) {
        printk(KERN_ERR
               "VFS: __refile_inode() called with inode->i_state=I_CLEAR,\n"
               "     returning to avoid a race with clear_inode().  Please\n"
               "     inform a CHAOS kernel developer.\n"
               "     inode=%p, inode->i_ino=%lu, inode->i_state=0x%x\n\n",
               inode, inode->i_ino, inode->i_state);
        dump_stack();
        return;
    }

We saw the following on 6 nodes last night:

2005-04-19 06:02:50 VFS: __refile_inode() called with inode->i_state=I_CLEAR,
2005-04-19 06:02:50 returning to avoid a race with clear_inode(). Please
2005-04-19 06:02:50 inform a CHAOS kernel developer.
2005-04-19 06:02:50 inode=ed27cd00, inode->i_ino=4330, inode->i_state=0x20
2005-04-19 06:02:50
2005-04-19 06:02:50 c587ff68 c587ffa0 c0182057 ed27cd00 00489148 c729b000 c587c55d c587ffc0
2005-04-19 06:02:50 c0134176 c587ff98 f6bbaf00 c587c000 c03f1240 c587c55d c587ffc0 c016a346
2005-04-19 06:02:50 c0134103 00000000 00000000 00000000 c587c000 c03f1240 c587ffec c016a8cd
2005-04-19 06:02:50 Call Trace:
2005-04-19 06:02:50 (0xc587ff68, 0xc587ffc8, 0xc587ff60, 0xc010c75d)
2005-04-19 06:02:50 [<c010c6be>] show_trace [kernel] 0x4c (0xc587ff3c)
2005-04-19 06:02:50 (0xc587ff68, 0xc016a8cd, 0x3, 0xed27cd00)
2005-04-19 06:02:50 [<c010c75d>] show_stack [kernel] 0x6c (0xc587ff4c)
2005-04-19 06:02:50 (0xc587ff68, 0xc587ffa0, 0xc0182057, 0xed27cd00)
2005-04-19 06:02:50 [<c010c784>] dump_stack [kernel] 0x12 (0xc587ff64)
2005-04-19 06:02:50 (0xed27cd00, 0x489148, 0xc729b000, 0xc587c55d)
2005-04-19 06:02:50 [<c0182057>] sync_unlocked_inodes [kernel] 0x2b4 (0xc587ff70)
2005-04-19 06:02:50 (0xc0134103, 0x0, 0x0, 0x0)
2005-04-19 06:02:50 [<c016a346>] sync_old_buffers [kernel] 0x54 (0xc587ffa4)
2005-04-19 06:02:50 (0xc0425940, 0x1, 0x0, 0x0)
2005-04-19 06:02:50 [<c016a8cd>] kupdate [kernel] 0x18a (0xc587ffc4)

This test kernel has both the U4 prune_icache patch and the inode_unused race/deletion patch for U5.

This event sent from IssueTracker by kbaxley
issue 70528
Can someone try out this patch to prune_icache()?  It holds the inode's
I_LOCK bit over the call to __refile_inode() so that clear_inode() will be
forced to wait until __refile_inode() returns and prune_icache() wakes it up.

--------------------------------------------------------------
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -892,11 +892,11 @@ void prune_icache(int goal)
 	}
 	/* release inode */
 	spin_lock(&inode_lock);
-	inode->i_state &= ~I_LOCK;
 	if (invalidated_inode && !(inode->i_state & I_FREEING)) {
 		/* wasn't called earlier because I_LOCK was set */
 		__refile_inode(inode);
 	}
+	inode->i_state &= ~I_LOCK;
 	/* wake up any potential waiters in wait_on_inode() */
 	wake_up(&inode->i_wait);
 }
----------------------------------------------------------------------------

Thanks, I don't have a way of testing this myself.

Larry Woodman
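For context, here is roughly what the waiter side looks like in the 2.4 tree (a from-memory sketch of wait_on_inode() from include/linux/fs.h, not a verbatim copy). clear_inode() only blocks while I_LOCK is set, so keeping I_LOCK set across __refile_inode(), as the patch above does, is what forces clear_inode() to wait for the wake_up() at the end of prune_icache().

static inline void wait_on_inode(struct inode *inode)
{
	if (inode->i_state & I_LOCK)
		__wait_on_inode(inode);	/* sleeps on inode->i_wait until I_LOCK is cleared */
}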
Our stack traces show the call to __refile_inode() coming from __sync_one()
and not prune_icache(). (That's the only path our traces have documented so
far, though; it could be coming from prune_icache() too.) Would the same
style of fix make sense for __sync_one()?

static inline void __sync_one(struct inode *inode, int sync)
{
	unsigned dirty;

	list_del(&inode->i_list);
	list_add(&inode->i_list, &inode->i_sb->s_locked_inodes);

	if (inode->i_state & (I_LOCK|I_FREEING))
		BUG();

	/* Set I_LOCK, reset I_DIRTY */
	dirty = inode->i_state & I_DIRTY;
	inode->i_state |= I_LOCK;
	inode->i_state &= ~I_DIRTY;
	spin_unlock(&inode_lock);

	filemap_fdatasync(inode->i_mapping);

	/* Don't write the inode if only I_DIRTY_PAGES was set */
	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
		write_inode(inode, sync);

	filemap_fdatawait(inode->i_mapping);

	spin_lock(&inode_lock);
-	inode->i_state &= ~I_LOCK;
	if (!(inode->i_state & I_FREEING))
		__refile_inode(inode);
+	inode->i_state &= ~I_LOCK;
	wake_up(&inode->i_wait);
}
Thinking about this a tad more, can this patch still leave a destroyed inode
on the inode_unused list?

proc 0: iput(): removes the inode from inode->i_list
proc 1: sets I_LOCK, calls __refile_inode()
proc 0: iput() calls clear_inode()
        clear_inode() waits on I_LOCK
proc 1: clears I_LOCK
proc 0: finishes up in clear_inode()
        iput() calls destroy_inode()
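The scenario above, written out as an annotated sketch of the iput() tail. This is reconstructed from the descriptions in this bug, not actual 2.4.21 source, so the exact calls and ordering are approximate; the point is the window between dropping inode_lock and destroy_inode().

	/* proc 0, under inode_lock: take the inode off its list */
	list_del_init(&inode->i_list);
	spin_unlock(&inode_lock);

	/*
	 * <-- window: proc 1 can set I_LOCK here and later call
	 *     __refile_inode(), re-adding the inode to inode_unused.
	 */

	clear_inode(inode);	/* proc 0: waits on I_LOCK, then sets i_state = I_CLEAR */
	destroy_inode(inode);	/* proc 0: frees the inode, which may still be linked
				 * on inode_unused -- a dangling list entry */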
Using the following additional patch in __sync_one():

@@ -360,6 +366,12 @@
 	filemap_fdatawait(inode->i_mapping);
 
 	spin_lock(&inode_lock);
+	if (!(inode->i_state & (I_LOCK)))
+		printk(KERN_ERR
+		       "VFS: __sync_one() inode->i_state!=I_LOCK.\n"
+		       "     inode=%p, inode->i_ino=%lu, inode->i_state=0x%lx\n\n",
+		       inode, inode->i_ino, inode->i_state);
+
 	inode->i_state &= ~I_LOCK;
 	if (!(inode->i_state & I_FREEING))
 		__refile_inode(inode);

we've gotten the following on the console:

2005-04-20 22:19:02 VFS: __sync_one() inode->i_state!=I_LOCK.
2005-04-20 22:19:02 inode=e93ff800, inode->i_ino=4322, inode->i_state=0x20
2005-04-20 22:19:02
2005-04-20 22:19:02 VFS: __refile_inode() called with inode->i_state=I_CLEAR,
2005-04-20 22:19:02 returning to avoid a race with clear_inode(). Please
2005-04-20 22:19:02 inform a CHAOS kernel developer.
2005-04-20 22:19:02 inode=e93ff800, inode->i_ino=4322, inode->i_state=0x20
2005-04-20 22:19:02
2005-04-20 22:19:02 c587ff60 c587ffa0 c018216f e93ff800 e93ff800 000010e2 00000020 c729b000
2005-04-20 22:19:02 c587c55d c587ffc0 c0134176 c587ff98 00000296 c587c000 c03f1240 c587c55d
2005-04-20 22:19:02 c587ffc0 c016a346 c0134103 00000000 00000000 00000000 c587c000 c03f1240
2005-04-20 22:19:02 Call Trace:
2005-04-20 22:19:02 (0xc587ff60, 0xc587ffc0, 0xc587ff58, 0xc010c75d)
2005-04-20 22:19:03 [<c010c6be>] show_trace [kernel] 0x4c (0xc587ff34)
2005-04-20 22:19:03 (0xc587ff60, 0xc03f1240, 0x0, 0xe93ff800)
2005-04-20 22:19:03 [<c010c75d>] show_stack [kernel] 0x6c (0xc587ff44)
2005-04-20 22:19:03 (0xc587ff60, 0xc587ffa0, 0xc018216f, 0xe93ff800)
2005-04-20 22:19:03 [<c010c784>] dump_stack [kernel] 0x12 (0xc587ff5c)
2005-04-20 22:19:03 (0xe93ff800, 0xe93ff800, 0x10e2, 0x20)
2005-04-20 22:19:03 [<c018216f>] sync_unlocked_inodes [kernel] 0x2cc (0xc587ff68)
2005-04-20 22:19:03 (0xc0134103, 0x0, 0x0, 0x0)
2005-04-20 22:19:03 [<c016a346>] sync_old_buffers [kernel] 0x54 (0xc587ffa4)
2005-04-20 22:19:03 (0xc0425940, 0x1, 0x0, 0x0)
2005-04-20 22:19:03 [<c016a8cd>] kupdate [kernel] 0x18a (0xc587ffc4)

So I_LOCK is definitely being cleared and i_state set to I_CLEAR, because
i_state == 0x20 == I_CLEAR. The race described above does seem to be
happening, but it is happening in __sync_one() instead of prune_icache().

The test kernel this was run on also had the inode symlink reference count
patch planned for U5.
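For reference, the i_state bit values that make 0x20 read as I_CLEAR. These are the 2.4-era definitions from include/linux/fs.h as I remember them; worth double-checking against the exact CHAOS source tree.

/* inode state bits, 2.4-era include/linux/fs.h (from memory) */
#define I_DIRTY_SYNC		1	/* 0x01 */
#define I_DIRTY_DATASYNC	2	/* 0x02 */
#define I_DIRTY_PAGES		4	/* 0x04 */
#define I_LOCK			8	/* 0x08 */
#define I_FREEING		16	/* 0x10 */
#define I_CLEAR			32	/* 0x20  <- matches i_state=0x20 above */

So i_state == 0x20 means only I_CLEAR is set: I_LOCK has been wiped out by the whole-state assignment in clear_inode().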
At this point I think we plan on including the following two patches to
eliminate the race conditions seen while freeing and using inodes within the
inode cache. Can you verify that these do fix the races seen at LLNL?

Thanks, Larry Woodman

-----------------------------------------------------------
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
 		cdput(inode->i_cdev);
 		inode->i_cdev = NULL;
 	}
+	spin_lock(&inode_lock);
 	inode->i_state = I_CLEAR;
+	spin_unlock(&inode_lock);
 }
 
 /*

--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -296,7 +296,7 @@ static inline void __refile_inode(struct
 {
 	struct list_head *to;
 
-	if (inode->i_state & I_FREEING)
+	if (inode->i_state & (I_FREEING|I_CLEAR))
 		return;
 	if (list_empty(&inode->i_hash))
 		return;
----------------------------------------------------------------
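As a sanity check, here is roughly how the top of __refile_inode() reads once the second patch is applied (a sketch based on the diff context above; the list-selection logic is elided). The callers seen in this report, prune_icache() and __sync_one(), invoke it with inode_lock held, and the first patch makes clear_inode() write I_CLEAR under that same lock, so this check can no longer race with the whole-state assignment in clear_inode().

static inline void __refile_inode(struct inode *inode)
{
	struct list_head *to;

	/* never put a freeing or cleared inode back onto any inode list */
	if (inode->i_state & (I_FREEING|I_CLEAR))
		return;
	if (list_empty(&inode->i_hash))
		return;

	/* ... pick the target list and move inode->i_list onto it ... */
}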
We've been testing with the logically equivalent patch posted above on
4-19-05 for over two weeks, and that has solved our problem.

We've also been trying out the

--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
 		cdput(inode->i_cdev);
 		inode->i_cdev = NULL;
 	}
+	spin_lock(&inode_lock);
 	inode->i_state = I_CLEAR;
+	spin_unlock(&inode_lock);
 }

patch for 3-4 days now, and it hasn't caused any additional problems.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.4.EL).
*** Bug 149636 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html
*** Bug 167839 has been marked as a duplicate of this bug. ***