Bug 155289 - [RHEL 3 U6] inode_lock deadlock/race?
Summary: [RHEL 3 U6] inode_lock deadlock/race?
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Larry Woodman
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 156320
 
Reported: 2005-04-18 19:52 UTC by Issue Tracker
Modified: 2007-11-30 22:07 UTC
CC List: 5 users

Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-09-28 14:58:10 UTC
Target Upstream Version:
Embargoed:




Links:
  Red Hat Product Errata RHSA-2005:663 (SHIPPED_LIVE) - Important: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 6 (2005-09-28 04:00:00 UTC)

Description Issue Tracker 2005-04-18 19:52:39 UTC
Escalated to Bugzilla from IssueTracker

Comment 2 Issue Tracker 2005-04-18 19:52:54 UTC
From User-Agent: XML-RPC

I thought I found a race between iput() and prune_icache().  Maybe it's a
race??



Processor 0 - in iput():
- __refile_inode() puts the inode onto the inode_unused_pagecache list.
- does some other work, then unlocks inode_lock.
- calls clear_inode().
- clear_inode() calls wait_on_inode(), but doesn't need to wait because I_LOCK is not set.

Processor 1 - in prune_icache(), in the CONFIG_HIGHMEM loop:
- locks inode_lock, grabs the inode, and sets I_LOCK.

Processor 0 - still in clear_inode():
- sets inode->i_state = I_CLEAR; I_LOCK is now gone.

Processor 1 - in prune_icache(), in the CONFIG_HIGHMEM loop:
- calls invalidate_inode_pages(), which eventually calls refile_inode().
- because inode->i_state is now I_CLEAR (not I_FREEING), refile_inode() goes ahead and refiles the inode.
- back in prune_icache(), because inode->i_state is I_CLEAR, __refile_inode() is called again.
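
For reference, here is a trimmed sketch of the clear_inode() steps involved, as we read the 2.4.21-era code (abbreviated from memory, not the literal RHEL3 source); the point is that the final i_state assignment is a plain store made without inode_lock held, so it can wipe out an I_LOCK bit that prune_icache() set after processor 0 dropped the lock:

/* Sketch of the suspect path, assuming 2.4.21-era VFS code (abbreviated,
 * not the exact RHEL3 source). */
void clear_inode(struct inode *inode)
{
        /* ... invalidate buffers, sanity checks ... */
        wait_on_inode(inode);           /* returns immediately if I_LOCK is not set */
        /* ... drop quota, bdev/cdev references ... */
        inode->i_state = I_CLEAR;       /* plain store, no inode_lock held: wipes I_LOCK */
}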



Based on the above, we put together a debug kernel with the attached patch
and ran both dbench and cerebrus consecutively.  After 9 hours, we hit the
prune_icache() hang.  Pretty much immediately before the hang, we saw this
printk on the console:



2005-04-16 01:49:33 VFS: __refile_inode() called with i_state=I_CLEAR.
2005-04-16 01:49:33       inode=cd6dfa80, inode->i_ino=4324, inode->i_state=0x20



Note that we typoed the printk message; it should say "i_state has I_CLEAR
set", since we're not checking for equality in the patch.



The additional printks we put in the patch weren't output to the screen,
which is an indication that the race I described above may not be what is
causing our hangs/crashes.  But we believe this still indicates there is a
race condition involving I_CLEAR.  As far as we can tell, a cleared inode
should never be put onto an inode list.

The fact that this check is in __refile_inode()

        if (inode->i_state & I_FREEING)
                return;

right at the beginning of the function seems to support this belief.



Another debug kernel is on the way.  Please let me know about any
comments/ideas the SEG folks have on this.



Al









File uploaded: prune_icache_debug.patch

This event sent from IssueTracker by peterm
 issue 70528
it_file 38555

Comment 4 Issue Tracker 2005-04-19 18:10:13 UTC
From User-Agent: XML-RPC

We added the following to __refile_inode() in a test kernel:



       if (inode->i_state & I_CLEAR) {
               printk(KERN_ERR
               "VFS: __refile_inode() called with inode->i_state=I_CLEAR,\n"
               "     returning to avoid a race with clear_inode().  Please\n"
               "     inform a CHAOS kernel developer.\n"
               "     inode=%p, inode->i_ino=%lu, inode->i_state=0x%x\n\n",
               inode, inode->i_ino, inode->i_state);
               dump_stack();
               return;
       }

We saw the following on 6 nodes last night



2005-04-19 06:02:50 VFS: __refile_inode() called with inode->i_state=I_CLEAR,
2005-04-19 06:02:50      returning to avoid a race with clear_inode().  Please
2005-04-19 06:02:50      inform a CHAOS kernel developer.
2005-04-19 06:02:50      inode=ed27cd00, inode->i_ino=4330, inode->i_state=0x20
2005-04-19 06:02:50
2005-04-19 06:02:50 c587ff68 c587ffa0 c0182057 ed27cd00 00489148 c729b000 c587c55d c587ffc0
2005-04-19 06:02:50        c0134176 c587ff98 f6bbaf00 c587c000 c03f1240 c587c55d c587ffc0 c016a346
2005-04-19 06:02:50        c0134103 00000000 00000000 00000000 c587c000 c03f1240 c587ffec c016a8cd
2005-04-19 06:02:50 Call Trace:
2005-04-19 06:02:50              (0xc587ff68, 0xc587ffc8, 0xc587ff60, 0xc010c75d)
2005-04-19 06:02:50 [<c010c6be>] show_trace [kernel] 0x4c (0xc587ff3c)
2005-04-19 06:02:50              (0xc587ff68, 0xc016a8cd, 0x3, 0xed27cd00)
2005-04-19 06:02:50 [<c010c75d>] show_stack [kernel] 0x6c (0xc587ff4c)
2005-04-19 06:02:50              (0xc587ff68, 0xc587ffa0, 0xc0182057, 0xed27cd00)
2005-04-19 06:02:50 [<c010c784>] dump_stack [kernel] 0x12 (0xc587ff64)
2005-04-19 06:02:50              (0xed27cd00, 0x489148, 0xc729b000, 0xc587c55d)
2005-04-19 06:02:50 [<c0182057>] sync_unlocked_inodes [kernel] 0x2b4 (0xc587ff70)
2005-04-19 06:02:50              (0xc0134103, 0x0, 0x0, 0x0)
2005-04-19 06:02:50 [<c016a346>] sync_old_buffers [kernel] 0x54 (0xc587ffa4)
2005-04-19 06:02:50              (0xc0425940, 0x1, 0x0, 0x0)
2005-04-19 06:02:50 [<c016a8cd>] kupdate [kernel] 0x18a (0xc587ffc4)


This test kernel has both the U4 prune_icache patch and the inode_unused
race/deletion patch for U5.


This event sent from IssueTracker by kbaxley
 issue 70528

Comment 5 Larry Woodman 2005-04-20 14:39:14 UTC
Can someone try out this patch to prune_icache()?  It holds the inode I_LOCK
bit over the call to __refile_inode() so that clear_inode() will be forced to
wait until __refile_inode() returns and prune_icache() wakes it up.

--------------------------------------------------------------
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -892,11 +892,11 @@ void prune_icache(int goal)
                }
                /* release inode */
                spin_lock(&inode_lock);
-               inode->i_state &= ~I_LOCK;
                if (invalidated_inode && !(inode->i_state & I_FREEING)) {
                        /* wasn't called earlier because I_LOCK was set */
                        __refile_inode(inode);
                }
+               inode->i_state &= ~I_LOCK;
                /* wake up any potential waiters in wait_on_inode() */
                wake_up(&inode->i_wait);
        }
----------------------------------------------------------------------------
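
For context, this reordering relies on the 2.4-style wait path: clear_inode() calls wait_on_inode(), which sleeps on i_wait while I_LOCK is set, so holding I_LOCK across __refile_inode() keeps clear_inode() blocked until the wake_up() below runs.  Roughly (sketched from memory, worth verifying against fs/inode.c and include/linux/fs.h in this tree):

static inline void wait_on_inode(struct inode *inode)
{
        if (inode->i_state & I_LOCK)
                __wait_on_inode(inode); /* sleeps on inode->i_wait until I_LOCK is cleared */
}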


Thanks, I don't have a way of testing this myself.

Larry Woodman


Comment 6 Albert Chu 2005-04-20 15:57:50 UTC
Our stack traces show the call to __refile_inode() coming from __sync_one() and
not prune_icache().  (__sync_one() is just the only caller our traces have
documented so far, though; it could be coming from prune_icache() too.)  Would
the same style of fix make sense for __sync_one()??

static inline void __sync_one(struct inode *inode, int sync)
{
        unsigned dirty;

        list_del(&inode->i_list);
        list_add(&inode->i_list, &inode->i_sb->s_locked_inodes);

        if (inode->i_state & (I_LOCK|I_FREEING))
                BUG();

        /* Set I_LOCK, reset I_DIRTY */
        dirty = inode->i_state & I_DIRTY;
        inode->i_state |= I_LOCK;
        inode->i_state &= ~I_DIRTY;
        spin_unlock(&inode_lock);

        filemap_fdatasync(inode->i_mapping);

        /* Don't write the inode if only I_DIRTY_PAGES was set */
        if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
                write_inode(inode, sync);

        filemap_fdatawait(inode->i_mapping);

        spin_lock(&inode_lock);
-       inode->i_state &= ~I_LOCK;
        if (!(inode->i_state & I_FREEING))
                __refile_inode(inode);
+       inode->i_state &= ~I_LOCK;
        wake_up(&inode->i_wait);
}


Comment 7 Albert Chu 2005-04-20 16:48:31 UTC
Thinking about this a tad more, can this patch still leave a destroyed inode on
the inode_unused list??

proc 0:
iput(): removes the inode from inode->i_list

proc 1:
sets I_LOCK, calls __refile_inode()

proc 0:
iput() calls clear_inode()
clear_inode() waits on I_LOCK

proc 1:
clears I_LOCK

proc 0:
finishes up in clear_inode()
iput() calls destroy_inode()
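
To make the worry concrete, here is a rough sketch of the iput() teardown tail as we understand the 2.4-era ordering (abbreviated and from memory, not the literal RHEL3 source); if proc 1 refiles the inode in the marked window, nothing takes it back off the list before destroy_inode() frees it:

        /* proc 0: tail of iput() for an inode being torn down (sketch) */
        list_del_init(&inode->i_list);          /* taken off its list */
        inode->i_state |= I_FREEING;
        spin_unlock(&inode_lock);
        /* <-- window: proc 1 can set I_LOCK and later call __refile_inode(),
         *     putting the inode back on inode_unused */
        clear_inode(inode);                     /* waits on I_LOCK, then sets I_CLEAR */
        destroy_inode(inode);                   /* inode memory freed while still on the list */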

Comment 13 Albert Chu 2005-04-21 15:55:41 UTC
Using the following additional patch in __sync_one():

@@ -360,6 +366,12 @@
        filemap_fdatawait(inode->i_mapping);

        spin_lock(&inode_lock);
+        if (!(inode->i_state & (I_LOCK)))
+                printk(KERN_ERR
+               "VFS: __sync_one() inode->i_state!=I_LOCK.\n"
+               "     inode=%p, inode->i_ino=%lu, inode->i_state=0x%lx\n\n",
+               inode, inode->i_ino, inode->i_state);
+
        inode->i_state &= ~I_LOCK;
        if (!(inode->i_state & I_FREEING))
                __refile_inode(inode);

We've gotten the following on the console:

2005-04-20 22:19:02 VFS: __sync_one() inode->i_state!=I_LOCK.
2005-04-20 22:19:02      inode=e93ff800, inode->i_ino=4322, inode->i_state=0x20
2005-04-20 22:19:02
2005-04-20 22:19:02 VFS: __refile_inode() called with inode->i_state=I_CLEAR,
2005-04-20 22:19:02      returning to avoid a race with clear_inode().  Please
2005-04-20 22:19:02      inform a CHAOS kernel developer.
2005-04-20 22:19:02      inode=e93ff800, inode->i_ino=4322, inode->i_state=0x20
2005-04-20 22:19:02
2005-04-20 22:19:02 c587ff60 c587ffa0 c018216f e93ff800 e93ff800 000010e2 00000020 c729b000
2005-04-20 22:19:02        c587c55d c587ffc0 c0134176 c587ff98 00000296 c587c000 c03f1240 c587c55d
2005-04-20 22:19:02        c587ffc0 c016a346 c0134103 00000000 00000000 00000000 c587c000 c03f1240
2005-04-20 22:19:02 Call Trace:
2005-04-20 22:19:02              (0xc587ff60, 0xc587ffc0, 0xc587ff58, 0xc010c75d)
2005-04-20 22:19:03 [<c010c6be>] show_trace [kernel] 0x4c (0xc587ff34)
2005-04-20 22:19:03              (0xc587ff60, 0xc03f1240, 0x0, 0xe93ff800)
2005-04-20 22:19:03 [<c010c75d>] show_stack [kernel] 0x6c (0xc587ff44)
2005-04-20 22:19:03              (0xc587ff60, 0xc587ffa0, 0xc018216f, 0xe93ff800)
2005-04-20 22:19:03 [<c010c784>] dump_stack [kernel] 0x12 (0xc587ff5c)
2005-04-20 22:19:03              (0xe93ff800, 0xe93ff800, 0x10e2, 0x20)
2005-04-20 22:19:03 [<c018216f>] sync_unlocked_inodes [kernel] 0x2cc (0xc587ff68)
2005-04-20 22:19:03              (0xc0134103, 0x0, 0x0, 0x0)
2005-04-20 22:19:03 [<c016a346>] sync_old_buffers [kernel] 0x54 (0xc587ffa4)
2005-04-20 22:19:03              (0xc0425940, 0x1, 0x0, 0x0)
2005-04-20 22:19:03 [<c016a8cd>] kupdate [kernel] 0x18a (0xc587ffc4)

So I_LOCK is definitely being cleared and the state set to I_CLEAR, because
i_state == 0x20 == I_CLEAR.
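
For reference, the i_state bit values as defined in the 2.4 headers (listed from memory; worth double-checking against include/linux/fs.h in this tree), which is how 0x20 decodes to I_CLEAR with no other bits set:

/* i_state bits, 2.4 include/linux/fs.h (from memory; verify in this tree) */
#define I_DIRTY_SYNC            1       /* 0x01 */
#define I_DIRTY_DATASYNC        2       /* 0x02 */
#define I_DIRTY_PAGES           4       /* 0x04 */
#define I_LOCK                  8       /* 0x08 */
#define I_FREEING               16      /* 0x10 */
#define I_CLEAR                 32      /* 0x20 - the only bit set in i_state=0x20 */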

So the race described above does seem to be happening, but it is happening in
__sync_one() instead of prune_icache().

The test kernel this was run on also had the inode symlink reference count
patch planned for U5.

Comment 16 Larry Woodman 2005-05-03 14:12:56 UTC
At this point I think we plan on including the following two patches to
eliminate the race conditions seen while freeing and using inodes within the
inode cache.  Can you verify that these do fix the races seen at LLNL?

Thanks, Larry Woodman

-----------------------------------------------------------
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
                cdput(inode->i_cdev);
                inode->i_cdev = NULL;
        }
+       spin_lock(&inode_lock);
        inode->i_state = I_CLEAR;
+       spin_unlock(&inode_lock);
 }

 /*
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -296,7 +296,7 @@ static inline void __refile_inode(struct
 {
        struct list_head *to;

-       if (inode->i_state & I_FREEING)
+       if (inode->i_state & (I_FREEING|I_CLEAR))
                return;
        if (list_empty(&inode->i_hash))
                return;
----------------------------------------------------------------
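
To spell out the intent, this is roughly what the two patched paths look like with both hunks applied (reconstructed from the hunks above, not copied from the resulting tree): the I_CLEAR store is now serialized by inode_lock against the refile paths, and __refile_inode() refuses to refile a cleared inode even if it does see one.

/* Sketch of the patched paths, reconstructed from the hunks above. */
void clear_inode(struct inode *inode)
{
        /* ... existing teardown ... */
        spin_lock(&inode_lock);         /* serialize against __refile_inode() callers */
        inode->i_state = I_CLEAR;
        spin_unlock(&inode_lock);
}

static inline void __refile_inode(struct inode *inode)
{
        struct list_head *to;

        if (inode->i_state & (I_FREEING|I_CLEAR))       /* never refile a dying or cleared inode */
                return;
        if (list_empty(&inode->i_hash))
                return;
        /* ... choose a list and move the inode ... */
}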

Comment 18 Albert Chu 2005-05-03 15:10:26 UTC
We've been testing with the logically equivalent patch posted above on 4-19-05
for over two weeks and that's solved our problem.

We tried out the

--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
                cdput(inode->i_cdev);
                inode->i_cdev = NULL;
        }
+       spin_lock(&inode_lock);
        inode->i_state = I_CLEAR;
+       spin_unlock(&inode_lock);
 }

for 3-4 days now and it hasn't caused any additional problems.


Comment 20 Ernie Petrides 2005-05-14 05:22:38 UTC
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.4.EL).


Comment 23 Ernie Petrides 2005-05-26 22:12:38 UTC
*** Bug 149636 has been marked as a duplicate of this bug. ***

Comment 26 Ernie Petrides 2005-07-22 02:14:03 UTC
*** Bug 149636 has been marked as a duplicate of this bug. ***

Comment 31 Red Hat Bugzilla 2005-09-28 14:58:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html


Comment 32 Ernie Petrides 2005-10-10 21:52:56 UTC
*** Bug 167839 has been marked as a duplicate of this bug. ***

