Description of problem: After I operated "echo 1 > /proc/sys/vm/drop_caches" on the system which works ext3 filesystem, I experienced a deadlock because of the race between spin_lock(&inode_lock) and spin_lock(&journal->j_list_lock). The deadlock flow is the following: -------------------------------------------------------------------------------------------------- [process A] | [process B] | | | kjournald() | drop_pagecache() | | | | | V | V | journal_commit_transaction() | drop_pagecache_sb() | spin_lock(&journal->j_list_lock) *1 | | | | | V | V | invalidate_inodes_and_pages() | __journal_refile_buffer() | spin_lock(&inode_lock) *2 | | | | | V | V | __journal_unfile_buffer() | invalidate_list() | | | | | V | V | __journal_temp_unlink_buffer() | invalidate_all_mapping_pages() | | | | | V | V | mark_buffer_dirty() | invalidate_complete_page() | | | | | V | V | __set_page_dirty_nobuffers() | try_to_release_page() | | | | | V | V | __mark_inode_dirty() | ext3_releasepage() | spin_lock(&inode_lock) *2 | | | | V | | journal_try_to_free_buffers | | | | | V | | __journal_try_to_free_buffer | | spin_lock(&journal->j_list_lock) *1 V | (time) -------------------------------------------------------------------------------------------------- The filesystem which Process A does kjournald() with is the same as the one which Process B does drop_pagecache_sb() with. Therefore, the deadlock happens as follows: [Process A] Get j_list_lock (*1) -> Wait for inode_lock (*2) [Process B] Get inode_lock (*2) -> Wait for j_list_lock (*1) the backtrace of the process which caused this deadlock is the following: -------------------------------------------------------------------------------------------------- PID: 2453 TASK: f7346930 CPU: 0 COMMAND: "kjournald" #0 [f71bcd70] smp_call_function_interrupt at c0116cad #1 [f71bcd78] call_function_interrupt at c02de445 EAX: c03390f0 EBX: c03390f0 ECX: cf17a900 EDX: 00000004 EBP: f720c974 DS: 007b ESI: f720ca24 ES: 007b EDI: c10ea1a0 CS: 0060 EIP: c02dc54d ERR: fffffffb EFLAGS: 00000282 #2 [f71bcdac] _spin_lock at c02dc54d #3 [f71bcdb4] __mark_inode_dirty at c017a9d5 #4 [f71bcdd4] __set_page_dirty_nobuffers at c0146fab #5 [f71bcde4] __journal_unfile_buffer at f889c86c #6 [f71bcdec] journal_commit_transaction at f889e038 #7 [f71bcf68] kjournald at f889ff6c #8 [f71bcff0] kernel_thread_helper at c01041f3 PID: 5167 TASK: f73479b0 CPU: 2 COMMAND: "fuka.sh" #0 [cf015e10] smp_call_function_interrupt at c0116cad #1 [cf015e18] call_function_interrupt at c02de445 EAX: f7e424e4 EBX: f7e424e4 ECX: f31031b4 EDX: d363c424 EBP: d363c424 DS: 007b ESI: d363c424 ES: 007b EDI: f7e42400 CS: 0060 EIP: c02dc54a ERR: fffffffb EFLAGS: 00000286 #2 [cf015e4c] _spin_lock at c02dc54a #3 [cf015e54] __journal_try_to_free_buffer at f889c8f2 #4 [cf015e64] journal_try_to_free_buffers at f889c9da #5 [cf015e80] try_to_release_page at c015f92a #6 [cf015e8c] invalidate_complete_page at c014af96 #7 [cf015e98] invalidate_all_mapping_pages at c014b272 #8 [cf015eec] invalidate_list at c01742ff #9 [cf015f10] invalidate_inodes_and_pages at c01743ef #10 [cf015f38] drop_pagecache at c01617b7 #11 [cf015f44] drop_caches_sysctl_handler at c016182e #12 [cf015f4c] do_rw_proc at c0128b1c #13 [cf015f7c] proc_writesys at c0128b86 #14 [cf015f88] vfs_write at c015d59c #15 [cf015fa4] sys_write at c015d664 #16 [cf015fc0] system_call at c02dda08 EAX: 00000004 EBX: 00000001 ECX: b7cce000 EDX: 00000002 DS: 007b ESI: 00000002 ES: 007b EDI: b7cce000 SS: 007b ESP: bfec2970 EBP: bfec2990 CS: 0073 EIP: 00a627a2 ERR: 00000004 EFLAGS: 00000246 -------------------------------------------------------------------------------------------------- Version-Release number of selected component (if applicable): - Red Hat Enterprise Linux Version Number: 4 - Release Number: 8 beta - Architecture: x86 - Kernel Version: 2.6.9-82.ELsmp - Related Package Version: None - Related Middleware / Application: None How reproducible: Sometimes Steps to Reproduce: 1. Put the loads of memory and I/O on the system which works ext3 filesystem. 2. Execute "echo 1 > /proc/sys/vm/drop_caches". Actual results: deadlock Expected results: Not deadlock Additional info: * The above details were provided by Fujitsu. * This bug was seen in RHEL5.1 and fixed in RHEL5.3 (bug 444961) * The upstream patch can not be applied directly because the relevant functions are in a different file in RHEL4.8. I will attach a backported patch. * The upstream patch is: ------------------------------------------------------------------------ commit eccb95cee4f0d56faa46ef22fb94dd4a3578d3eb Author: Jan Kara <jack> Date: Tue Apr 29 00:59:37 2008 -0700 vfs: fix lock inversion in drop_pagecache_sb() Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock before calling __invalidate_mapping_pages(). We just have to make sure inode won't go away from under us by keeping reference to it and putting the reference only after we have safely resumed the scan of the inode list. A bit tricky but not too bad... Signed-off-by: Jan Kara <jack> Cc: Fengguang Wu <wfg.edu.cn> Cc: David Chinner <dgc> Signed-off-by: Andrew Morton <akpm> Signed-off-by: Linus Torvalds <torvalds> diff --git a/fs/drop_caches.c b/fs/drop_caches.c index e2c6b65..50f9087 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -14,15 +14,21 @@ int sysctl_drop_caches; static void drop_pagecache_sb(struct super_block *sb) { - struct inode *inode; + struct inode *inode, *toput_inode = NULL; spin_lock(&inode_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { if (inode->i_state & (I_FREEING|I_WILL_FREE)) continue; + __iget(inode); + spin_unlock(&inode_lock); __invalidate_mapping_pages(inode->i_mapping, 0, -1, true); + iput(toput_inode); + toput_inode = inode; + spin_lock(&inode_lock); } spin_unlock(&inode_lock); + iput(toput_inode); } static void drop_pagecache(void) ------------------------------------------------------------------------
I can reproduce the hang in RHEL4-U8 without this patch by placing artificial stalls in invalidate_all_mapping_pages(). This happens when kjournald gets to acquire the journal->j_list_lock and then spins on the global inode_lock and once invalidate_all_mapping_pages continues it later spins on the journal->j_list_lock, deadlocking the system. With this patch applied I can no longer get that deadlock becauce we drop the inode_lock before calling invalidate_all_mapping_pages(). --- linux-2.6.9/fs/inode.c.orig 2009-04-16 11:43:56.000000000 -0400 +++ linux-2.6.9/fs/inode.c 2009-04-16 12:00:25.000000000 -0400 @@ -314,9 +314,13 @@ static int invalidate_list(struct list_h inode = list_entry(tmp, struct inode, i_list); if (inode->i_sb != sb) continue; - if (pages) + if (pages) { + __iget(inode); + spin_unlock(&inode_lock); invalidate_all_mapping_pages(inode->i_mapping); - else { + iput(inode); + spin_lock(&inode_lock); + } else { invalidate_inode_buffers(inode); if (!atomic_read(&inode->i_count)) { hlist_del_init(&inode->i_hash);
Hi, We need to hold a reference to the inode till we can safely get the next inode. So the patch should look something like: ------------------------------------------------------------------- --- fs/inode.c.orig 2009-04-14 15:18:22.000000000 +1000 +++ fs/inode.c 2009-04-21 22:04:28.000000000 +1000 @@ -301,12 +301,13 @@ static int invalidate_list(struct list_head *head, struct super_block * sb, struct list_head * dispose, int pages) { struct list_head *next; + struct inode * inode; + struct inode * toput_inode = NULL; int busy = 0, count = 0; next = head->next; for (;;) { struct list_head * tmp = next; - struct inode * inode; next = next->next; if (tmp == head) @@ -314,9 +315,14 @@ inode = list_entry(tmp, struct inode, i_list); if (inode->i_sb != sb) continue; - if (pages) + if (pages) { + __iget(inode); + spin_unlock(&inode_lock); invalidate_all_mapping_pages(inode->i_mapping); - else { + iput(toput_inode); + toput_inode = inode; + spin_lock(&inode_lock); + } else { invalidate_inode_buffers(inode); if (!atomic_read(&inode->i_count)) { hlist_del_init(&inode->i_hash); @@ -328,6 +334,7 @@ } busy = 1; } + iput(toput_inode); /* only unused inodes may be cached with i_count zero */ inodes_stat.nr_unused -= count; return busy; ------------------------------------------------------------------- NOTE: this patch has not been tested nor compiled.
In my patch above, the last iput(toput_inode) is called whilst holding inode_lock. However, iput() needs to be called while inode_lock is *not* held. Unfortunately the caller of invalidate_list() first grabs the inode_lock and then drops it after invalidate_list() calls return. One option is to push the initial inode_lock grab and the final drop down into invalidate_list(). Then the iput(toput_inode) can be called immediately after inode_lock is dropped: invalidate_list(): ... spin_lock(&inode_lock); loop-start ... loop-finish ... spin_unlock(&inode_lock); iput(toput_inode); ...
Created attachment 345792 [details] Patch that fixed this problem The attached patch fixed this problem:
Committed in 89.2.EL
Event posted on 2009-06-10 18:24 EST by hjayasur Hi, I haven't looked at the core dump, but the log output is quite clear. The panic is due to the inode's i_state being I_CLEAR: ----------------------------------------------------------- 1136 void iput(struct inode *inode) 1137 { 1138 if (inode) { 1139 struct super_operations *op = inode->i_sb->s_op; 1140 1141 if (inode->i_state == I_CLEAR) => 1142 BUG(); ----------------------------------------------------------- invalidate_list() walks the inode list and will do an iput() without skipping inodes with an i_state of I_CLEAR: ----------------------------------------------------------- 301 static int invalidate_list(struct list_head *head, struct super_block * sb, struct list_head * dispose, int pages) 302 { 303 struct list_head *next; 304 int busy = 0, count = 0; 305 struct inode * inode; 306 struct inode * toput_inode = NULL; 307 308 next = head->next; 309 for (;;) { 310 struct list_head * tmp = next; 311 312 next = next->next; 313 if (tmp == head) 314 break; 315 inode = list_entry(tmp, struct inode, i_list); 316 if (inode->i_sb != sb) 317 continue; 318 if (pages) { 319 __iget(inode); 320 spin_unlock(&inode_lock); 321 invalidate_all_mapping_pages(inode->i_mapping); => 322 iput(toput_inode); ----------------------------------------------------------- RHEL 5.3 and TOT skip inodes with an i_state of I_CLEAR. Regards, Harshula This event sent from IssueTracker by hjayasur issue 282073
Event posted on 2009-06-10 18:31 EST by hjayasur Correction, RHEL 5.3 contains this bug (bz 500164), but it is fixed in kernel-2.6.18-152.el5 (RHEL 5.4). This event sent from IssueTracker by hjayasur issue 282073
Agreed, we need to skip the inodes that the i_state is I_CLEAR. I can not reproduce this problem, can you try out this patch and see if the probme still occurs??? Thanks, Larry Woodman ------------------------------------------------------------------------------ --- linux-2.6.9/fs/inode.c.orig 2009-06-11 14:54:05.000000000 -0400 +++ linux-2.6.9/fs/inode.c 2009-06-11 14:54:10.000000000 -0400 @@ -313,7 +313,7 @@ static int invalidate_list(struct list_h if (tmp == head) break; inode = list_entry(tmp, struct inode, i_list); - if (inode->i_sb != sb) + if (inode->i_sb != sb || inode->i_state == I_CLEAR) continue; if (pages) { __iget(inode); ------------------------------------------------------------------------------
Actually, we should emulate the upstream code and skip indoes that are I_FREEING or I_CLEAR or I_WILL_FREE. I'll build a kernel with this patch and place it on my people page so you can test it, OK? Larry Woodman ------------------------------------------------------------------------- --- linux-2.6.9/fs/inode.c.orig 2009-06-11 14:54:05.000000000 -0400 +++ linux-2.6.9/fs/inode.c 2009-06-11 15:01:53.000000000 -0400 @@ -313,7 +313,7 @@ static int invalidate_list(struct list_h if (tmp == head) break; inode = list_entry(tmp, struct inode, i_list); - if (inode->i_sb != sb) + if (inode->i_sb != sb || inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) continue; if (pages) { __iget(inode); -------------------------------------------------------------------------
Can someone try the kernel located here: >>>http://people.redhat.com/~lwoodman/RHEL4/ Larry Woodman
Hi Larry, This new RHEL 4 crash the customer has seen is more related to RHEL 5 bug 500164. I have requested that they open a new bug for this new crash. Can we please use this bug only for the original bug? Regards, Harshula
I made another quick change based on BZ500164 and rebuilt the kernel. Everything needed is located in http://people.redhat.com/~lwoodman/RHEL4/ Larry Woodman
Sorry, I forgot to show you the patch: ----------------------------------------------------------------------- --- linux-2.6.9/fs/inode.c.orig 2009-06-11 14:54:05.000000000 -0400 +++ linux-2.6.9/fs/inode.c 2009-06-12 10:09:23.000000000 -0400 @@ -316,6 +316,8 @@ static int invalidate_list(struct list_h if (inode->i_sb != sb) continue; if (pages) { + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW)) + continue; __iget(inode); spin_unlock(&inode_lock); invalidate_all_mapping_pages(inode->i_mapping); ------------------------------------------------------------------------
Ignore my Comment #15 regarding using a separate bug, we are only using this bug.
Hi, Since we can't backport the patch safely, feel free to close/wontfix this bug. Regards, Harshula
Cant safely fix this BUG. I backported the dropcaches functionality into RHEL4 but after several attemts gave up trying to get it totally safe from deadlocks with kjournald. This would require additional fields in the inode struct and therefore break the RHEL4 kABI. Larry Woodman