Bug 495861

Summary: A deadlock can occur between kjournald and drop_pagecache
Product: Red Hat Enterprise Linux 4 Reporter: Harshula Jayasuriya <harshula>
Component: kernelAssignee: Larry Woodman <lwoodman>
Status: CLOSED WONTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: urgent    
Version: 4.8CC: dhoward, jolsa, ktokunag, mgahagan, moshiro, sghosh, tao, vmayatsk
Target Milestone: rcKeywords: ZStream
Target Release: 4.9   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-02-08 14:31:34 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 505649    
Attachments:
Description Flags
Patch that fixed this problem none

Description Harshula Jayasuriya 2009-04-15 08:04:55 UTC
Description of problem:

After I operated "echo 1 > /proc/sys/vm/drop_caches" on the system which works
ext3 filesystem, I experienced a deadlock because of the race between
spin_lock(&inode_lock) and spin_lock(&journal->j_list_lock).


The deadlock flow is the following:
--------------------------------------------------------------------------------------------------
               [process A]                    |                       [process B]
|                                             |
|  kjournald()                                | drop_pagecache()
|      |                                      |     |
|      V                                      |     V
|    journal_commit_transaction()             |   drop_pagecache_sb()
|      spin_lock(&journal->j_list_lock) *1    |        |
|          |                                  |        V
|          V                                  |      invalidate_inodes_and_pages()
|      __journal_refile_buffer()              |        spin_lock(&inode_lock) *2
|            |                                |          |
|            V                                |          V
|        __journal_unfile_buffer()            |        invalidate_list()
|              |                              |            |
|              V                              |            V
|          __journal_temp_unlink_buffer()     |        invalidate_all_mapping_pages()
|                |                            |              |
|                V                            |              V
|             mark_buffer_dirty()             |         invalidate_complete_page()
|                  |                          |                |
|                  V                          |                V
|            __set_page_dirty_nobuffers()     |           try_to_release_page()
|                    |                        |                  |
|                    V                        |                  V
|               __mark_inode_dirty()          |             ext3_releasepage()
|                   spin_lock(&inode_lock) *2 |                    |
|                                             |                    V
|                                             |               journal_try_to_free_buffers
|                                             |                      |
|                                             |                      V
|                                             |                 __journal_try_to_free_buffer
|                                             |                     spin_lock(&journal->j_list_lock) *1
V                                             |
(time)
--------------------------------------------------------------------------------------------------
The filesystem which Process A does kjournald() with is the same as the one
which Process B does drop_pagecache_sb() with. Therefore, the deadlock
happens as follows:

[Process A]
  Get j_list_lock (*1) -> Wait for inode_lock (*2)

[Process B]
  Get inode_lock (*2)  -> Wait for j_list_lock (*1)

the backtrace of the process which caused this deadlock is the following:
--------------------------------------------------------------------------------------------------
PID: 2453   TASK: f7346930  CPU: 0   COMMAND: "kjournald"
#0 [f71bcd70] smp_call_function_interrupt at c0116cad
#1 [f71bcd78] call_function_interrupt at c02de445
   EAX: c03390f0  EBX: c03390f0  ECX: cf17a900  EDX: 00000004  EBP: f720c974
   DS:  007b      ESI: f720ca24  ES:  007b      EDI: c10ea1a0
   CS:  0060      EIP: c02dc54d  ERR: fffffffb  EFLAGS: 00000282
#2 [f71bcdac] _spin_lock at c02dc54d
#3 [f71bcdb4] __mark_inode_dirty at c017a9d5
#4 [f71bcdd4] __set_page_dirty_nobuffers at c0146fab
#5 [f71bcde4] __journal_unfile_buffer at f889c86c
#6 [f71bcdec] journal_commit_transaction at f889e038
#7 [f71bcf68] kjournald at f889ff6c
#8 [f71bcff0] kernel_thread_helper at c01041f3

PID: 5167   TASK: f73479b0  CPU: 2   COMMAND: "fuka.sh"
#0 [cf015e10] smp_call_function_interrupt at c0116cad
#1 [cf015e18] call_function_interrupt at c02de445
   EAX: f7e424e4  EBX: f7e424e4  ECX: f31031b4  EDX: d363c424  EBP: d363c424
   DS:  007b      ESI: d363c424  ES:  007b      EDI: f7e42400
   CS:  0060      EIP: c02dc54a  ERR: fffffffb  EFLAGS: 00000286
#2 [cf015e4c] _spin_lock at c02dc54a
#3 [cf015e54] __journal_try_to_free_buffer at f889c8f2
#4 [cf015e64] journal_try_to_free_buffers at f889c9da
#5 [cf015e80] try_to_release_page at c015f92a
#6 [cf015e8c] invalidate_complete_page at c014af96
#7 [cf015e98] invalidate_all_mapping_pages at c014b272
#8 [cf015eec] invalidate_list at c01742ff
#9 [cf015f10] invalidate_inodes_and_pages at c01743ef
#10 [cf015f38] drop_pagecache at c01617b7
#11 [cf015f44] drop_caches_sysctl_handler at c016182e
#12 [cf015f4c] do_rw_proc at c0128b1c
#13 [cf015f7c] proc_writesys at c0128b86
#14 [cf015f88] vfs_write at c015d59c
#15 [cf015fa4] sys_write at c015d664
#16 [cf015fc0] system_call at c02dda08
   EAX: 00000004  EBX: 00000001  ECX: b7cce000  EDX: 00000002
   DS:  007b      ESI: 00000002  ES:  007b      EDI: b7cce000
   SS:  007b      ESP: bfec2970  EBP: bfec2990
   CS:  0073      EIP: 00a627a2  ERR: 00000004  EFLAGS: 00000246
--------------------------------------------------------------------------------------------------


Version-Release number of selected component (if applicable):
- Red Hat Enterprise Linux Version Number: 4
- Release Number: 8 beta
- Architecture: x86
- Kernel Version: 2.6.9-82.ELsmp
- Related Package Version: None
- Related Middleware / Application: None

How reproducible:
Sometimes

Steps to Reproduce:
1. Put the loads of memory and I/O on the system which works ext3 filesystem.
2. Execute "echo 1 > /proc/sys/vm/drop_caches".  

Actual results:
deadlock

Expected results:
Not deadlock

Additional info:
* The above details were provided by Fujitsu.
* This bug was seen in RHEL5.1 and fixed in RHEL5.3 (bug 444961)
* The upstream patch can not be applied directly because the relevant functions are in a different file in RHEL4.8. I will attach a backported patch.
* The upstream patch is:
------------------------------------------------------------------------
commit eccb95cee4f0d56faa46ef22fb94dd4a3578d3eb
Author: Jan Kara <jack>
Date:   Tue Apr 29 00:59:37 2008 -0700

    vfs: fix lock inversion in drop_pagecache_sb()
    
    Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock
    before calling __invalidate_mapping_pages().  We just have to make sure inode
    won't go away from under us by keeping reference to it and putting the
    reference only after we have safely resumed the scan of the inode list.  A bit
    tricky but not too bad...
    
    Signed-off-by: Jan Kara <jack>
    Cc: Fengguang Wu <wfg.edu.cn>
    Cc: David Chinner <dgc>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index e2c6b65..50f9087 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -14,15 +14,21 @@ int sysctl_drop_caches;
 
 static void drop_pagecache_sb(struct super_block *sb)
 {
-	struct inode *inode;
+	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE))
 			continue;
+		__iget(inode);
+		spin_unlock(&inode_lock);
 		__invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
+		iput(toput_inode);
+		toput_inode = inode;
+		spin_lock(&inode_lock);
 	}
 	spin_unlock(&inode_lock);
+	iput(toput_inode);
 }
 
 static void drop_pagecache(void)
------------------------------------------------------------------------

Comment 2 Larry Woodman 2009-04-16 18:58:24 UTC
I can reproduce the hang in RHEL4-U8 without this patch by placing artificial stalls in invalidate_all_mapping_pages(). This happens when kjournald gets to acquire the journal->j_list_lock and then spins on the global inode_lock and once invalidate_all_mapping_pages continues it later spins on the 
journal->j_list_lock, deadlocking the system.  With this patch applied I can no longer get that deadlock becauce we drop the inode_lock before calling invalidate_all_mapping_pages().

--- linux-2.6.9/fs/inode.c.orig 2009-04-16 11:43:56.000000000 -0400
+++ linux-2.6.9/fs/inode.c      2009-04-16 12:00:25.000000000 -0400
@@ -314,9 +314,13 @@ static int invalidate_list(struct list_h
                inode = list_entry(tmp, struct inode, i_list);
                if (inode->i_sb != sb)
                        continue;
-               if (pages)
+               if (pages) {
+                       __iget(inode);
+                       spin_unlock(&inode_lock);
                        invalidate_all_mapping_pages(inode->i_mapping);
-               else {
+                       iput(inode);
+                       spin_lock(&inode_lock);
+               } else {
                        invalidate_inode_buffers(inode);
                        if (!atomic_read(&inode->i_count)) {
                                hlist_del_init(&inode->i_hash);

Comment 3 Harshula Jayasuriya 2009-04-21 14:33:02 UTC
Hi,

We need to hold a reference to the inode till we can safely get the next inode. So the patch should look something like:
-------------------------------------------------------------------
--- fs/inode.c.orig	2009-04-14 15:18:22.000000000 +1000
+++ fs/inode.c	2009-04-21 22:04:28.000000000 +1000
@@ -301,12 +301,13 @@
 static int invalidate_list(struct list_head *head, struct super_block * sb, struct list_head * dispose, int pages)
 {
 	struct list_head *next;
+	struct inode * inode;
+	struct inode * toput_inode = NULL;
 	int busy = 0, count = 0;
 
 	next = head->next;
 	for (;;) {
 		struct list_head * tmp = next;
-		struct inode * inode;
 
 		next = next->next;
 		if (tmp == head)
@@ -314,9 +315,14 @@
 		inode = list_entry(tmp, struct inode, i_list);
 		if (inode->i_sb != sb)
 			continue;
-		if (pages)
+		if (pages) {
+			__iget(inode);
+			spin_unlock(&inode_lock);
 			invalidate_all_mapping_pages(inode->i_mapping);
-		else {
+			iput(toput_inode);
+			toput_inode = inode;
+			spin_lock(&inode_lock);
+		} else {
 			invalidate_inode_buffers(inode);
 			if (!atomic_read(&inode->i_count)) {
 				hlist_del_init(&inode->i_hash);
@@ -328,6 +334,7 @@
 		}
 		busy = 1;
 	}
+	iput(toput_inode);
 	/* only unused inodes may be cached with i_count zero */
 	inodes_stat.nr_unused -= count;
 	return busy;
-------------------------------------------------------------------

NOTE: this patch has not been tested nor compiled.

Comment 4 Harshula Jayasuriya 2009-04-22 02:06:14 UTC
In my patch above, the last iput(toput_inode) is called whilst holding inode_lock. However, iput() needs to be called while inode_lock is *not* held.

Unfortunately the caller of invalidate_list() first grabs the inode_lock and then drops it after invalidate_list() calls return. One option is to push the initial inode_lock grab and the final drop down into invalidate_list(). Then the iput(toput_inode) can be called immediately after inode_lock is dropped:

invalidate_list():
...
  spin_lock(&inode_lock);
  loop-start
...
  loop-finish
...
  spin_unlock(&inode_lock);
  iput(toput_inode);
...

Comment 7 Larry Woodman 2009-05-28 16:53:41 UTC
Created attachment 345792 [details]
Patch that fixed this problem


The attached patch fixed this problem:

Comment 8 Vivek Goyal 2009-06-03 13:57:05 UTC
Committed in 89.2.EL

Comment 10 Issue Tracker 2009-06-10 08:24:35 UTC
Event posted on 2009-06-10 18:24 EST by hjayasur

Hi,

I haven't looked at the core dump, but the log output is quite clear.

The panic is due to the inode's i_state being I_CLEAR:
-----------------------------------------------------------
1136 void iput(struct inode *inode)
1137 {  
1138         if (inode) {
1139                 struct super_operations *op = inode->i_sb->s_op;
1140            
1141                 if (inode->i_state == I_CLEAR)
=> 1142                         BUG();
-----------------------------------------------------------

invalidate_list() walks the inode list and will do an iput() without
skipping inodes with an i_state of I_CLEAR:
-----------------------------------------------------------
 301 static int invalidate_list(struct list_head *head, struct super_block
* sb,      struct list_head * dispose, int pages)
 302 {
 303         struct list_head *next;
 304         int busy = 0, count = 0;
 305         struct inode * inode;
 306         struct inode * toput_inode = NULL;
 307 
 308         next = head->next;
 309         for (;;) {
 310                 struct list_head * tmp = next;
 311 
 312                 next = next->next;
 313                 if (tmp == head)
 314                         break;
 315                 inode = list_entry(tmp, struct inode, i_list);
 316                 if (inode->i_sb != sb)
 317                         continue;
 318                 if (pages) {
 319                         __iget(inode);
 320                         spin_unlock(&inode_lock);
 321                        
invalidate_all_mapping_pages(inode->i_mapping);
=> 322                         iput(toput_inode);
-----------------------------------------------------------

RHEL 5.3 and TOT skip inodes with an i_state of I_CLEAR.

Regards,
Harshula


This event sent from IssueTracker by hjayasur 
 issue 282073

Comment 11 Issue Tracker 2009-06-10 08:31:54 UTC
Event posted on 2009-06-10 18:31 EST by hjayasur

Correction, RHEL 5.3 contains this bug (bz 500164), but it is fixed in
kernel-2.6.18-152.el5 (RHEL 5.4).


This event sent from IssueTracker by hjayasur 
 issue 282073

Comment 12 Larry Woodman 2009-06-11 18:56:19 UTC

Agreed, we need to skip the inodes that the i_state is I_CLEAR.  I can not reproduce this problem, can you try out this patch and see if the probme still occurs???

Thanks, Larry Woodman

------------------------------------------------------------------------------
--- linux-2.6.9/fs/inode.c.orig 2009-06-11 14:54:05.000000000 -0400
+++ linux-2.6.9/fs/inode.c      2009-06-11 14:54:10.000000000 -0400
@@ -313,7 +313,7 @@ static int invalidate_list(struct list_h
                if (tmp == head)
                        break;
                inode = list_entry(tmp, struct inode, i_list);
-               if (inode->i_sb != sb)
+               if (inode->i_sb != sb || inode->i_state == I_CLEAR)
                        continue;
                if (pages) {
                        __iget(inode);
------------------------------------------------------------------------------

Comment 13 Larry Woodman 2009-06-11 19:03:44 UTC
Actually, we should emulate the upstream code and skip indoes that are I_FREEING or I_CLEAR or I_WILL_FREE.  I'll build a kernel with this patch and place it on my  people page so you can test it, OK?

Larry Woodman

-------------------------------------------------------------------------
--- linux-2.6.9/fs/inode.c.orig 2009-06-11 14:54:05.000000000 -0400
+++ linux-2.6.9/fs/inode.c      2009-06-11 15:01:53.000000000 -0400
@@ -313,7 +313,7 @@ static int invalidate_list(struct list_h
                if (tmp == head)
                        break;
                inode = list_entry(tmp, struct inode, i_list);
-               if (inode->i_sb != sb)
+               if (inode->i_sb != sb || inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
                        continue;
                if (pages) {
                        __iget(inode);
-------------------------------------------------------------------------

Comment 14 Larry Woodman 2009-06-11 20:42:17 UTC
Can someone try the kernel located here:

>>>http://people.redhat.com/~lwoodman/RHEL4/


Larry Woodman

Comment 15 Harshula Jayasuriya 2009-06-12 05:20:00 UTC
Hi Larry,

This new RHEL 4 crash the customer has seen is more related to RHEL 5 bug 500164. I have requested that they open a new bug for this new crash.

Can we please use this bug only for the original bug?

Regards,
Harshula

Comment 16 Larry Woodman 2009-06-12 16:01:38 UTC
I made another quick change based on BZ500164 and rebuilt the kernel.  Everything needed is located in http://people.redhat.com/~lwoodman/RHEL4/

Larry Woodman

Comment 17 Larry Woodman 2009-06-12 16:03:36 UTC

Sorry, I forgot to show you the patch:

-----------------------------------------------------------------------
--- linux-2.6.9/fs/inode.c.orig 2009-06-11 14:54:05.000000000 -0400
+++ linux-2.6.9/fs/inode.c      2009-06-12 10:09:23.000000000 -0400
@@ -316,6 +316,8 @@ static int invalidate_list(struct list_h
                if (inode->i_sb != sb)
                        continue;
                if (pages) {
+                       if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+                               continue;
                        __iget(inode);
                        spin_unlock(&inode_lock);
                        invalidate_all_mapping_pages(inode->i_mapping);
------------------------------------------------------------------------

Comment 19 Harshula Jayasuriya 2009-06-15 15:06:15 UTC
Ignore my Comment #15 regarding using a separate bug, we are only using this bug.

Comment 30 Harshula Jayasuriya 2010-02-08 06:23:56 UTC
Hi,

Since we can't backport the patch safely, feel free to close/wontfix this bug.

Regards,
Harshula

Comment 31 Larry Woodman 2010-02-08 14:31:34 UTC
Cant safely fix this BUG.  I backported the dropcaches functionality into RHEL4 but after several attemts gave up trying to get it totally safe from deadlocks with kjournald.  This would require additional fields in the inode struct and therefore break the RHEL4 kABI.

Larry Woodman