Bug 155289
| Summary: | [RHEL 3 U6] inode_lock deadlock/race? | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Issue Tracker <tao> |
| Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.0 | CC: | petrides, poelstra, sev, tao, villapla |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | RHSA-2005-663 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2005-09-28 14:58:10 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 156320 | ||
Description
Issue Tracker
2005-04-18 19:52:39 UTC
From User-Agent: XML-RPC
I thought I found a race between iput() and prune_icache(). Maybe it's a
race??
Processor 0 - in iput():
  __refile's the inode onto the inode_unused_pagecache list,
  does some stuff, then unlocks inode_lock and calls clear_inode().
  clear_inode() tries to wait_on_inode(), but doesn't need to because I_LOCK
  is not set.
Processor 1 - in the prune_icache() CONFIG_HIGHMEM loop:
  locks inode_lock, grabs the inode, and sets I_LOCK.
Processor 0 - still in clear_inode():
  sets inode->i_state = I_CLEAR; I_LOCK is now gone.
Processor 1 - in the prune_icache() CONFIG_HIGHMEM loop:
  calls invalidate_inode_pages(), which eventually calls refile_inode();
  because inode->i_state has I_CLEAR set, refile_inode() refiles the inode.
  Back in prune_icache(), because inode->i_state has I_CLEAR set,
  __refile_inode() is called again.
Based on the above, we put together a debug kernel with the attached patch
and ran both dbench and cerebrus consecutively. After 9 hours, we hit the
prune_icache() hang. Pretty much immediately before the hang, we saw this
printk on the console:
2005-04-16 01:49:33 VFS: __refile_inode() called with i_state=I_CLEAR.
2005-04-16 01:49:33 inode=cd6dfa80, inode->i_ino=4324, inode->i_state=0x20
Note that we typoed the printk message; it should say "i_state has I_CLEAR
set" -- we're not checking for equality in the patch.
The additional printk's we put in the patch weren't output to the screen,
which is an indication that the possible race I described above may not be
causing our hangs/crashes. But we believe this indicates there is a race
condition involving I_CLEAR?? As far as we can tell, a cleared inode
should never be put onto an inode list.
The fact that this check is in __refile_inode()
if (inode->i_state & I_FREEING)
        return;
right at the beginning of the function seems to support this belief??
Another debug kernel is on the way. Please let me know any comments/ideas
the SEG folks have on this.
Al
File uploaded: prune_icache_debug.patch
This event sent from IssueTracker by peterm
issue 70528
it_file 38555
From User-Agent: XML-RPC
We added the following into a test kernel in __refile_inode():
if (inode->i_state & I_CLEAR) {
	printk(KERN_ERR
	       "VFS: __refile_inode() called with inode->i_state=I_CLEAR,\n"
	       "     returning to avoid a race with clear_inode(). Please\n"
	       "     inform a CHAOS kernel developer.\n"
	       "     inode=%p, inode->i_ino=%lu, inode->i_state=0x%x\n\n",
	       inode, inode->i_ino, inode->i_state);
	dump_stack();
	return;
}
We saw the following on 6 nodes last night:
2005-04-19 06:02:50 VFS: __refile_inode() called with inode->i_state=I_CLEAR,
2005-04-19 06:02:50 returning to avoid a race with clear_inode(). Please
2005-04-19 06:02:50 inform a CHAOS kernel developer.
2005-04-19 06:02:50 inode=ed27cd00, inode->i_ino=4330, inode->i_state=0x20
2005-04-19 06:02:50
2005-04-19 06:02:50 c587ff68 c587ffa0 c0182057 ed27cd00 00489148 c729b000 c587c55d c587ffc0
2005-04-19 06:02:50 c0134176 c587ff98 f6bbaf00 c587c000 c03f1240 c587c55d c587ffc0 c016a346
2005-04-19 06:02:50 c0134103 00000000 00000000 00000000 c587c000 c03f1240 c587ffec c016a8cd
2005-04-19 06:02:50 Call Trace:
2005-04-19 06:02:50 (0xc587ff68, 0xc587ffc8, 0xc587ff60, 0xc010c75d)
2005-04-19 06:02:50 [<c010c6be>] show_trace [kernel] 0x4c (0xc587ff3c)
2005-04-19 06:02:50 (0xc587ff68, 0xc016a8cd, 0x3, 0xed27cd00)
2005-04-19 06:02:50 [<c010c75d>] show_stack [kernel] 0x6c (0xc587ff4c)
2005-04-19 06:02:50 (0xc587ff68, 0xc587ffa0, 0xc0182057, 0xed27cd00)
2005-04-19 06:02:50 [<c010c784>] dump_stack [kernel] 0x12 (0xc587ff64)
2005-04-19 06:02:50 (0xed27cd00, 0x489148, 0xc729b000, 0xc587c55d)
2005-04-19 06:02:50 [<c0182057>] sync_unlocked_inodes [kernel] 0x2b4 (0xc587ff70)
2005-04-19 06:02:50 (0xc0134103, 0x0, 0x0, 0x0)
2005-04-19 06:02:50 [<c016a346>] sync_old_buffers [kernel] 0x54 (0xc587ffa4)
2005-04-19 06:02:50 (0xc0425940, 0x1, 0x0, 0x0)
2005-04-19 06:02:50 [<c016a8cd>] kupdate [kernel] 0x18a (0xc587ffc4)
This test kernel has both the U4 prune_icache patch and the inode_unused
race/deletion patch for U5.
This event sent from IssueTracker by kbaxley
issue 70528
Can someone try out this patch to prune_icache()? It holds the inode's I_LOCK
bit over the call to __refile_inode() so that clear_inode() will be forced to
wait until __refile_inode() returns and prune_icache() wakes it up.
--------------------------------------------------------------
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -892,11 +892,11 @@ void prune_icache(int goal)
}
/* release inode */
spin_lock(&inode_lock);
- inode->i_state &= ~I_LOCK;
if (invalidated_inode && !(inode->i_state & I_FREEING)) {
/* wasn't called earlier because I_LOCK was set */
__refile_inode(inode);
}
+ inode->i_state &= ~I_LOCK;
/* wake up any potential waiters in wait_on_inode() */
wake_up(&inode->i_wait);
}
----------------------------------------------------------------------------
Thanks, I don't have a way of testing this myself.
Larry Woodman
Our stack traces show the call to __refile_inode() coming from __sync_one() and
not prune_icache(). (The former is the only one documented, though; it could be
coming from prune_icache() too.) Would the same style of fix make sense for
__sync_one()??
static inline void __sync_one(struct inode *inode, int sync)
{
unsigned dirty;
list_del(&inode->i_list);
list_add(&inode->i_list, &inode->i_sb->s_locked_inodes);
if (inode->i_state & (I_LOCK|I_FREEING))
BUG();
/* Set I_LOCK, reset I_DIRTY */
dirty = inode->i_state & I_DIRTY;
inode->i_state |= I_LOCK;
inode->i_state &= ~I_DIRTY;
spin_unlock(&inode_lock);
filemap_fdatasync(inode->i_mapping);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
write_inode(inode, sync);
filemap_fdatawait(inode->i_mapping);
spin_lock(&inode_lock);
- inode->i_state &= ~I_LOCK;
if (!(inode->i_state & I_FREEING))
__refile_inode(inode);
+ inode->i_state &= ~I_LOCK;
wake_up(&inode->i_wait);
}
Thinking about this a tad more, can this patch still leave a destroyed inode on
the inode_unused list??
proc 0: iput(): remove inode off of inode->i_list
proc 1: set I_LOCK, call __refile_inode
proc 0: iput calls clear_inode()
        clear_inode() waits on I_LOCK
proc 1: clear I_LOCK
proc 0: finish up in clear_inode()
        iput calls destroy_inode()
Using the following additional patch in __sync_one():
@@ -360,6 +366,12 @@
filemap_fdatawait(inode->i_mapping);
spin_lock(&inode_lock);
+ if (!(inode->i_state & (I_LOCK)))
+ printk(KERN_ERR
+ "VFS: __sync_one() inode->i_state!=I_LOCK.\n"
+ " inode=%p, inode->i_ino=%lu, inode->i_state=0x%lx\n\n",
+ inode, inode->i_ino, inode->i_state);
+
inode->i_state &= ~I_LOCK;
if (!(inode->i_state & I_FREEING))
__refile_inode(inode);
We've gotten the following on the console:
2005-04-20 22:19:02 VFS: __sync_one() inode->i_state!=I_LOCK.
2005-04-20 22:19:02 inode=e93ff800, inode->i_ino=4322, inode->i_state=0x20
2005-04-20 22:19:02
2005-04-20 22:19:02 VFS: __refile_inode() called with inode->i_state=I_CLEAR,
2005-04-20 22:19:02 returning to avoid a race with clear_inode(). Please
2005-04-20 22:19:02 inform a CHAOS kernel developer.
2005-04-20 22:19:02 inode=e93ff800, inode->i_ino=4322, inode->i_state=0x20
2005-04-20 22:19:02
2005-04-20 22:19:02 c587ff60 c587ffa0 c018216f e93ff800 e93ff800 000010e2 00000020 c729b000
2005-04-20 22:19:02 c587c55d c587ffc0 c0134176 c587ff98 00000296 c587c000 c03f1240 c587c55d
2005-04-20 22:19:02 c587ffc0 c016a346 c0134103 00000000 00000000 00000000 c587c000 c03f1240
2005-04-20 22:19:02 Call Trace:
2005-04-20 22:19:02 (0xc587ff60, 0xc587ffc0, 0xc587ff58, 0xc010c75d)
2005-04-20 22:19:03 [<c010c6be>] show_trace [kernel] 0x4c (0xc587ff34)
2005-04-20 22:19:03 (0xc587ff60, 0xc03f1240, 0x0, 0xe93ff800)
2005-04-20 22:19:03 [<c010c75d>] show_stack [kernel] 0x6c (0xc587ff44)
2005-04-20 22:19:03 (0xc587ff60, 0xc587ffa0, 0xc018216f, 0xe93ff800)
2005-04-20 22:19:03 [<c010c784>] dump_stack [kernel] 0x12 (0xc587ff5c)
2005-04-20 22:19:03 (0xe93ff800, 0xe93ff800, 0x10e2, 0x20)
2005-04-20 22:19:03 [<c018216f>] sync_unlocked_inodes [kernel] 0x2cc (0xc587ff68)
2005-04-20 22:19:03 (0xc0134103, 0x0, 0x0, 0x0)
2005-04-20 22:19:03 [<c016a346>] sync_old_buffers [kernel] 0x54 (0xc587ffa4)
2005-04-20 22:19:03 (0xc0425940, 0x1, 0x0, 0x0)
2005-04-20 22:19:03 [<c016a8cd>] kupdate [kernel] 0x18a (0xc587ffc4)
So I_LOCK is definitely being cleared and i_state set to I_CLEAR, because
i_state == 0x20 == I_CLEAR.
So the race described above does seem to be happening, but with __sync_one()
instead of prune_icache().
The test kernel this was run on also had the inode symlink reference count
patch planned for U5.
At this point I think we plan on including the following two patches to
eliminate the race conditions seen while freeing and using inodes within the
inode cache. Can you verify that these do fix the races seen at LLNL?
Thanks, Larry Woodman
-----------------------------------------------------------
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
cdput(inode->i_cdev);
inode->i_cdev = NULL;
}
+ spin_lock(&inode_lock);
inode->i_state = I_CLEAR;
+ spin_unlock(&inode_lock);
}
/*
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -296,7 +296,7 @@ static inline void __refile_inode(struct
{
struct list_head *to;
- if (inode->i_state & I_FREEING)
+ if (inode->i_state & (I_FREEING|I_CLEAR))
return;
if (list_empty(&inode->i_hash))
return;
----------------------------------------------------------------
We've been testing with the logically equivalent patch posted above on 4-19-05
for over two weeks and that's solved our problem.
We tried out the
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
cdput(inode->i_cdev);
inode->i_cdev = NULL;
}
+ spin_lock(&inode_lock);
inode->i_state = I_CLEAR;
+ spin_unlock(&inode_lock);
}
for 3-4 days now and it hasn't caused any additional problems.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this
evening (in kernel version 2.4.21-32.4.EL).
*** Bug 149636 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug
report. This report is therefore being closed with a resolution of ERRATA. For
more information on the solution and/or where to find the updated files, please
follow the link below. You may reopen this bug report if the solution does not
work for you.
http://rhn.redhat.com/errata/RHSA-2005-663.html
*** Bug 167839 has been marked as a duplicate of this bug. ***