Bug 699549 - Deadlock due to race condition between rotate_reclaimable_page() and __page_cache_release()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jerome Marchand
QA Contact: Zhouping Liu
URL:
Whiteboard:
Duplicates: 837883
Depends On:
Blocks: 836232
 
Reported: 2011-04-25 22:28 UTC by Rafael Aquini
Modified: 2018-11-29 19:20 UTC (History)
7 users

Fixed In Version: kernel-2.6.18-286.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-02-21 03:46:23 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
backport patch (2.65 KB, patch)
2011-04-25 22:40 UTC, Rafael Aquini
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 151193 0 None None None Never
Red Hat Product Errata RHSA-2012:0150 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux 5.8 kernel update 2012-02-21 07:35:24 UTC

Description Rafael Aquini 2011-04-25 22:28:14 UTC
The symptoms observed in the captured vmcore suggest that a race occurred
between the threads on CPU 1 and CPU 11, which caused an AB <-> BA deadlock.

The threads are:
A)
PID: 1243   TASK: ffff81181fac1100  CPU: 1   COMMAND: "kswapd0"
 #0 [ffff810c1fc4af20] crash_nmi_callback at ffffffff8007a3bf
 #1 [ffff810c1fc4af40] do_nmi at ffffffff8006585a
 #2 [ffff810c1fc4af50] nmi at ffffffff80064ebf
    [exception RIP: __write_lock_failed+15]
    RIP: ffffffff80062157  RSP: ffff81181ffc7910  RFLAGS: 00000087
    RAX: 0000000000000082  RBX: ffff81010de45b70  RCX: ffff811809b57340
    RDX: ffff810088bd2000  RSI: ffff81010d5bb8d0  RDI: ffff8111a5f78698
    RBP: ffff8111a5f78680   R8: ffff811809b57340   R9: ffffffffffffffff
    R10: ffff8117f8848420  R11: ffffffff80044f50  R12: 0000000000000002
    R13: 0000000000000000  R14: 0000000000008000  R15: 0000000000008000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff81181ffc7910] __write_lock_failed at ffffffff80062157
 #4 [ffff81181ffc7910] _write_lock_irqsave at ffffffff80064b73
 #5 [ffff81181ffc7918] test_clear_page_writeback at ffffffff8001ac99
 #6 [ffff81181ffc7948] rotate_reclaimable_page at ffffffff80010955
 #7 [ffff81181ffc7968] end_page_writeback at ffffffff8003a0d8
 #8 [ffff81181ffc7978] vx_async_pvn_pldone at ffffffff884f8525
 #9 [ffff81181ffc79a8] vx_bc_biodone at ffffffff883d90e8
....


B)
PID: 18928  TASK: ffff810c00b56820  CPU: 11  COMMAND: "db2sysc"
.....
 #6 [ffff8100bd25ba20] .text.lock.spinlock at ffffffff80064bd8 (via _spin_lock_irqsave)
 #7 [ffff8100bd25ba20] __page_cache_release at ffffffff800c8658
 #8 [ffff8100bd25ba40] find_lock_page at ffffffff80013a1c
 #9 [ffff8100bd25ba70] find_or_create_page at ffffffff800259e7
#10 [ffff8100bd25bab0] vx_segmap_getmap at ffffffff884f7fb7
#11 [ffff8100bd25bb70] vx_write1_fast at ffffffff885158d2
#12 [ffff8100bd25bc40] vx_write_common_fast at ffffffff88516891
#13 [ffff8100bd25bc70] vx_write_common at ffffffff88516c2a
#14 [ffff8100bd25bd60] vx_vop_write at ffffffff884cf935
#15 [ffff8100bd25bd80] vx_write at ffffffff884d274e
#16 [ffff8100bd25be40] vx_naio_write at ffffffff884bb3e9
#17 [ffff8100bd25be80] aio_pwrite at ffffffff800ed1c8
#18 [ffff8100bd25bea0] aio_run_iocb at ffffffff800edaac
#19 [ffff8100bd25bec0] io_submit_one at ffffffff800ee581
#20 [ffff8100bd25bef0] sys_io_submit at ffffffff800eeac4


Scenario:

 thread on CPU 1 acquires lock A -   - thread on CPU 11 acquires lock B
                                  \ /
                                   X
                                  / \
 thread on CPU 1    wants lock B -   - thread on CPU 11    wants lock A


 The thread on CPU 1 cannot release lock A because it is spinning on
 lock B which is being held by the thread on CPU 11.

 The thread on CPU 11 cannot release lock B because it is spinning on
 lock A which is being held by the thread on CPU 1.


Here, lock A is the lru lock of the 'normal' zone (spinlock_t) and
      lock B is the 'tree lock' (rwlock_t) of a struct address_space.


'kswapd0' on CPU 1                  | 'db2sysc' on CPU 11
------------------------------------+------------------------------------
                                    |find_or_create_page()
rotate_reclaimable_page()           |
                                    | find_lock_page
 // acquire lru lock                |
 // of 'normal' zone                |  // acquire tree lock (shared)
 zone = page_zone(page)             |  read_lock_irq
 spin_lock_irqsave(&zone->lru_lock) |   (&mapping->tree_lock)
                                    |
  ...                               |   ...
                                    |
 test_clear_page_writeback          |  // page_cache_release/put_page
                                    |  __page_cache_release
  // try to acquire tree lock       |
  // (exclusive)                    |   // try to acquire lru lock
  write_lock_irqsave                |   // of 'normal' zone
   (&mapping->tree_lock)            |   zone = page_zone(page)
                                    |   spin_lock_irqsave(&zone->lru_lock)

==============

Comment 2 Rafael Aquini 2011-04-25 22:40:34 UTC
Created attachment 494773 [details]
backport patch

Comment 5 RHEL Program Management 2011-08-16 13:10:11 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Jerome Marchand 2011-08-29 12:55:23 UTC
-int rotate_reclaimable_page(struct page *page)
+void rotate_reclaimable_page(struct page *page)
 {
 	struct zone *zone;
 	unsigned long flags;
 
-	if (PageLocked(page))
-		return 1;
-	if (PageDirty(page))
-		return 1;
-	if (PageActive(page))
-		return 1;
-	if (!PageLRU(page))
-		return 1;
-
-	zone = page_zone(page);
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	if (PageLRU(page) && !PageActive(page)) {
         ^^^^
Here, the old code tests PageLRU and PageActive again after the lock has been taken.
The new code doesn't. Why is that? It looks racy.

+	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	    PageLRU(page)) {
+		zone = page_zone(page);
+		spin_lock_irqsave(&zone->lru_lock, flags);
 		list_move_tail(&page->lru, &zone->inactive_list);
 		__count_vm_event(PGROTATED);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
-	if (!test_clear_page_writeback(page))
-		BUG();
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-	return 0;
 }
 
 /*

Comment 7 Rafael Aquini 2011-08-29 14:05:11 UTC
(In reply to comment #6)
Howdy Jerome,

While simplifying the function code, the newly introduced if-clause preserves the same logic as the prior (eliminated) series of if-clause-return-1 tests, and zone->lru_lock is only grabbed if all of the following conditions are satisfied:
a. the page is not locked;
b. the page is not dirty;
c. the page is not active; and
d. the page is on the LRU

The source of the race described in this BZ is the call to test_clear_page_writeback() from rotate_reclaimable_page() while still holding the zone->lru_lock.

This call to test_clear_page_writeback() was also eliminated from rotate_reclaimable_page() is only called from end_page_writeback, which also performs a call to test_clear_page_writeback().

Cheers!
--aquini

Comment 8 Rafael Aquini 2011-08-29 14:14:51 UTC
(In reply to comment #7)
The last paragraph, from my last update, should be fixed to:

This call to test_clear_page_writeback() was also eliminated from rotate_reclaimable_page(), because this later is only called from end_page_writeback() which also performs a call to test_clear_page_writeback().

Comment 9 Jerome Marchand 2011-08-29 14:32:01 UTC
(In reply to comment #7)
> (In reply to comment #6)
> Howdy Jerome,
> 
> While simplifying the function code, the new introduced if-clause respects the
> same logic shown by the prior (eliminated) if-clause-return-1 series, and
> zone->lru_lock is only grabbed if the following condition is satisfied:
> a. page is not locked;
> b. page is not dirty;
> c. page is not active; and
> d. page is LRU

Yes. But then, after the lock has been taken, the old code retested points c. and d., which makes me think
that the state of the page may have changed in the meantime (i.e. the page is now active or no longer on the LRU).

The original patch applies on top of 902aaed0d983dfd459fcb2b678608d4584782200 which changed the locking mechanism in rotate_reclaimable_page().

> 
> The source of the race described on this BZ is the call to
> test_clear_page_writeback(), while still holding the zone->lru_lock, from
> rotate_reclaimable_page().
> 
> This call to test_clear_page_writeback() was also eliminated from
> rotate_reclaimable_page() is only called from end_page_writeback, which also
> performs a call to test_clear_page_writeback().
> 
> Cheers!
> --aquini

Comment 10 Rafael Aquini 2011-08-29 17:22:09 UTC
(In reply to comment #9)
Howdy Jerome,

I see your point. Thanks for providing such enlightenment.

So perhaps a better patch here would just be:

diff --git a/mm/swap.c b/mm/swap.c
index 6898238..f17e5b4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -118,8 +118,6 @@ int rotate_reclaimable_page(struct page *page)
                list_move_tail(&page->lru, &zone->inactive_list);
                __count_vm_event(PGROTATED);
        }
-       if (!test_clear_page_writeback(page))
-               BUG();
        spin_unlock_irqrestore(&zone->lru_lock, flags);
        return 0;
 }


I think this way we will not make a mess of RHEL5's rotate_reclaimable_page() -- and we will avoid introducing another probable race. At the same time, we still avoid the lock nesting condition which leads to the deadlock described in this BZ.

What do you think?

Cheers!
--aquini

Comment 11 Rik van Riel 2011-08-29 18:03:18 UTC
We do want to clear PG_writeback on writeback pages. Simply removing the call does not seem like a great idea.

I see Jerome's point, but that should have a much easier solution.

+ if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+     PageLRU(page)) {
+  zone = page_zone(page);
+  spin_lock_irqsave(&zone->lru_lock, flags);
   list_move_tail(&page->lru, &zone->inactive_list);
   __count_vm_event(PGROTATED);
+  spin_unlock_irqrestore(&zone->lru_lock, flags);
  }

We can simply move the spin_lock & spin_unlock to outside the if condition and test.  In other words:

zone = page_zone(page);
spin_lock_irqsave(&zone->lru_lock, flags);
if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) && PageLRU(page)) {
        list_move_tail(&page->lru, &zone->inactive_list);
        __count_vm_event(PGROTATED);
}
spin_unlock_irqrestore(&zone->lru_lock, flags);

Jerome, does that look better?

Comment 12 Rafael Aquini 2011-08-29 19:37:45 UTC
(In reply to comment #11)
Rik,

After reading your last update, I realize my last change proposal would have failed to clear PG_writeback. Thanks for pointing that out.


As we have discussed a little bit over IRC, I'm a little concerned that just moving the spinlock outside the if-clause might cause unnecessary zone->lru_lock contention, so I'd like to propose that we keep the original backport patch, but reinstate the retest of !PageActive(page) && PageLRU(page) inside the if-clause after we grab the lock:


+       if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+           PageLRU(page)) {
+               zone = page_zone(page);
+               spin_lock_irqsave(&zone->lru_lock, flags);
+               /* retest page flags, to avoid races while trying to grab
+                * zone->lru_lock
+                */
+               if (PageLRU(page) && !PageActive(page)) {
+                       list_move_tail(&page->lru, &zone->inactive_list);
+                       __count_vm_event(PGROTATED);
+               }
+               spin_unlock_irqrestore(&zone->lru_lock, flags);
+       }


This way, besides (sort of) replicating what pagevec_move_tail() does to avoid the race Jerome has pointed out, we will be performing the function cleanup and avoiding the lock nesting, I think.


What do you guys think?

Cheers!
--aquini

Comment 13 Zhouping Liu 2011-08-30 01:46:36 UTC
hi, guys

Do you know how to reproduce the deadlock?

Thanks,
Zhouping Liu

Comment 14 Jerome Marchand 2011-08-30 09:10:29 UTC
(In reply to comment #12)
> (In reply to comment #11)
> Rik,
> 
> After reading your last update, I realize my last change proposal would have
> failed to clean PG_writeback. Thanks for pointing that out.
> 
> 
> As we have discussed a little bit over IRC, I'm a little concerned that just
> moving the spinlock outside the if-clause might cause unnecessary
> zone->lru_lock contention,

Me too. I assume that if it was done that way upstream, it is for a reason.

> so I'd like to propose we keep the original backport
> patch, but reinstate the if-clause retesting for !PageActive(page) &&
> PageLRU(page) after we grab the lock:
> 
> 
> +       if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> +           PageLRU(page)) {
> +               zone = page_zone(page);
> +               spin_lock_irqsave(&zone->lru_lock, flags);
> +               /* retest page flags, to avoid races while trying to grab
> +                * zone->lru_lock
> +                */
> +               if (PageLRU(page) && !PageActive(page)) {
> +                       list_move_tail(&page->lru, &zone->inactive_list);
> +                       __count_vm_event(PGROTATED);
> +               }
> +               spin_unlock_irqrestore(&zone->lru_lock, flags);
> 
> 
> This way, besides being (sort of) replicating what pagevec_move_tail() does to
> avoid the race Jerome has pointed out, we will be performing the function
> cleanup and avoiding the lock nesting thing, I think.
> 

That's what I had in mind.

> 
> What do you guys think?
> 
> Cheers!
> --aquini

Comment 15 Rafael Aquini 2011-09-06 21:15:11 UTC
(In reply to comment #13)
> hi, guys
> 
> Do you know how to reproduce the deadlock?

Howdy Zhouping,

Sorry for this late reply. I got swamped with some hot tickets and ended up not paying enough attention to this question of yours.

This seems to be one of those "rare" races, and it is quite difficult to reproduce. The customer who reported it could not reproduce it at will, and only got this single hit. I'm still trying to find a way to reproduce it in the lab, so far without success (if I manage to come up with a reproducer, I'll post it here).

Originally, it took place on a VxFS cluster node, under heavy memory and I/O (write) pressure.

> 
> Thanks,
> Zhouping Liu

Cheers!
--aquini

Comment 19 Jarod Wilson 2011-10-03 15:16:18 UTC
Patch(es) available in kernel-2.6.18-286.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 21 Zhouping Liu 2011-10-19 07:11:21 UTC
There is no reproducer, so I ran a heavy memory stress through the v7 test suite on kernel-2.6.18-290.el5.
I installed the v7 test suite and ran the memory stress test:
# yum install v7
# v7 --test memory run
I also ran an I/O stress test:
# dd if=/dev/zero of=/tmp/tmp.img
until it used up all free disk space, and I couldn't find any errors.

Based on this, I'll move the status to VERIFIED. Does anyone have any comments?

Comment 22 Zhouping Liu 2011-12-05 08:18:06 UTC
Based on comment 21, moving the status to VERIFIED.

Comment 23 errata-xmlrpc 2012-02-21 03:46:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html

Comment 24 Jacob Tanenbaum 2013-05-21 20:38:50 UTC
*** Bug 837883 has been marked as a duplicate of this bug. ***

