The symptoms in the captured vmcore suggest that a race occurred between the threads on CPU 1 and CPU 11, causing an AB <-> BA deadlock. The threads are:

A) PID: 1243   TASK: ffff81181fac1100  CPU: 1   COMMAND: "kswapd0"
 #0 [ffff810c1fc4af20] crash_nmi_callback at ffffffff8007a3bf
 #1 [ffff810c1fc4af40] do_nmi at ffffffff8006585a
 #2 [ffff810c1fc4af50] nmi at ffffffff80064ebf
    [exception RIP: __write_lock_failed+15]
    RIP: ffffffff80062157  RSP: ffff81181ffc7910  RFLAGS: 00000087
    RAX: 0000000000000082  RBX: ffff81010de45b70  RCX: ffff811809b57340
    RDX: ffff810088bd2000  RSI: ffff81010d5bb8d0  RDI: ffff8111a5f78698
    RBP: ffff8111a5f78680   R8: ffff811809b57340   R9: ffffffffffffffff
    R10: ffff8117f8848420  R11: ffffffff80044f50  R12: 0000000000000002
    R13: 0000000000000000  R14: 0000000000008000  R15: 0000000000008000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff81181ffc7910] __write_lock_failed at ffffffff80062157
 #4 [ffff81181ffc7910] _write_lock_irqsave at ffffffff80064b73
 #5 [ffff81181ffc7918] test_clear_page_writeback at ffffffff8001ac99
 #6 [ffff81181ffc7948] rotate_reclaimable_page at ffffffff80010955
 #7 [ffff81181ffc7968] end_page_writeback at ffffffff8003a0d8
 #8 [ffff81181ffc7978] vx_async_pvn_pldone at ffffffff884f8525
 #9 [ffff81181ffc79a8] vx_bc_biodone at ffffffff883d90e8
....

B) PID: 18928  TASK: ffff810c00b56820  CPU: 11  COMMAND: "db2sysc"
.....
 #6 [ffff8100bd25ba20] .text.lock.spinlock at ffffffff80064bd8 (via _spin_lock_irqsave)
 #7 [ffff8100bd25ba20] __page_cache_release at ffffffff800c8658
 #8 [ffff8100bd25ba40] find_lock_page at ffffffff80013a1c
 #9 [ffff8100bd25ba70] find_or_create_page at ffffffff800259e7
#10 [ffff8100bd25bab0] vx_segmap_getmap at ffffffff884f7fb7
#11 [ffff8100bd25bb70] vx_write1_fast at ffffffff885158d2
#12 [ffff8100bd25bc40] vx_write_common_fast at ffffffff88516891
#13 [ffff8100bd25bc70] vx_write_common at ffffffff88516c2a
#14 [ffff8100bd25bd60] vx_vop_write at ffffffff884cf935
#15 [ffff8100bd25bd80] vx_write at ffffffff884d274e
#16 [ffff8100bd25be40] vx_naio_write at ffffffff884bb3e9
#17 [ffff8100bd25be80] aio_pwrite at ffffffff800ed1c8
#18 [ffff8100bd25bea0] aio_run_iocb at ffffffff800edaac
#19 [ffff8100bd25bec0] io_submit_one at ffffffff800ee581
#20 [ffff8100bd25bef0] sys_io_submit at ffffffff800eeac4

Scenario:

  thread on CPU 1 acquires lock A -   - thread on CPU 11 acquires lock B
                                   \ /
                                    X
                                   / \
  thread on CPU 1 wants lock B    -   - thread on CPU 11 wants lock A

The thread on CPU 1 cannot release lock A because it is spinning on lock B, which is being held by the thread on CPU 11. The thread on CPU 11 cannot release lock B because it is spinning on lock A, which is being held by the thread on CPU 1.

Here, lock A is the lru lock of the 'normal' zone (spinlock_t) and lock B is the 'tree lock' (rwlock_t) of a struct address_space.

'kswapd0' on CPU 1                  | 'db2sysc' on CPU 11
------------------------------------+------------------------------------
                                    | find_or_create_page()
rotate_reclaimable_page()           |   find_lock_page
  // acquire lru lock               |
  // of 'normal' zone               |     // acquire tree lock (shared)
  zone = page_zone(page)            |     read_lock_irq
  spin_lock_irqsave(&zone->lru_lock)|       (&mapping->tree_lock)
  ...                               |     ...
  test_clear_page_writeback         |     // page_cache_release/put_page
                                    |     __page_cache_release
    // try to acquire tree lock     |
    // (exclusive)                  |       // try to acquire lru lock
    write_lock_irqsave              |       // of 'normal' zone
      (&mapping->tree_lock)         |       zone = page_zone(page)
                                    |       spin_lock_irqsave(&zone->lru_lock)
==============
Created attachment 494773 [details] backport patch
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
-int rotate_reclaimable_page(struct page *page)
+void rotate_reclaimable_page(struct page *page)
 {
 	struct zone *zone;
 	unsigned long flags;

-	if (PageLocked(page))
-		return 1;
-	if (PageDirty(page))
-		return 1;
-	if (PageActive(page))
-		return 1;
-	if (!PageLRU(page))
-		return 1;
-
-	zone = page_zone(page);
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	if (PageLRU(page) && !PageActive(page)) {
	^^^^ Here, the old code tests PageLRU and PageActive again after the
	lock has been taken; the new code doesn't. Why is that? It looks racy.
+	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	    PageLRU(page)) {
+		zone = page_zone(page);
+		spin_lock_irqsave(&zone->lru_lock, flags);
 		list_move_tail(&page->lru, &zone->inactive_list);
 		__count_vm_event(PGROTATED);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
-	if (!test_clear_page_writeback(page))
-		BUG();
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
-	return 0;
 }

 /*
(In reply to comment #6)
Howdy Jerome,

While simplifying the function code, the new introduced if-clause respects the same logic shown by the prior (eliminated) if-clause-return-1 series, and zone->lru_lock is only grabbed if the following conditions are satisfied:
a. page is not locked;
b. page is not dirty;
c. page is not active; and
d. page is LRU

The source of the race described on this BZ is the call to test_clear_page_writeback(), while still holding the zone->lru_lock, from rotate_reclaimable_page().

This call to test_clear_page_writeback() was also eliminated from rotate_reclaimable_page() is only called from end_page_writeback, which also performs a call to test_clear_page_writeback().

Cheers!
--aquini
(In reply to comment #7) The last paragraph, from my last update, should be fixed to: This call to test_clear_page_writeback() was also eliminated from rotate_reclaimable_page(), because this later is only called from end_page_writeback() which also performs a call to test_clear_page_writeback().
(In reply to comment #7)
> (In reply to comment #6)
> Howdy Jerome,
>
> While simplifying the function code, the new introduced if-clause respects the
> same logic shown by the prior (eliminated) if-clause-return-1 series, and
> zone->lru_lock is only grabbed if the following condition is satisfied:
> a. page is not locked;
> b. page is not dirty;
> c. page is not active; and
> d. page is LRU

Yes. But then, after the lock has been taken, the old code retested points c. and d., which makes me think that the state of the page may have changed in the meantime (i.e. page now active or not LRU anymore).

The original patch applies on top of 902aaed0d983dfd459fcb2b678608d4584782200, which changed the locking mechanism in rotate_reclaimable_page().

> The source of the race described on this BZ is the call to
> test_clear_page_writeback(), while still holding the zone->lru_lock, from
> rotate_reclaimable_page().
>
> This call to test_clear_page_writeback() was also eliminated from
> rotate_reclaimable_page() is only called from end_page_writeback, which also
> performs a call to test_clear_page_writeback().
>
> Cheers!
> --aquini
(In reply to comment #9)
Howdy Jerome,

I see your point. Thanks for providing me such enlightenment. So, perhaps a better patch here would just be:

diff --git a/mm/swap.c b/mm/swap.c
index 6898238..f17e5b4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -118,8 +118,6 @@ int rotate_reclaimable_page(struct page *page)
 		list_move_tail(&page->lru, &zone->inactive_list);
 		__count_vm_event(PGROTATED);
 	}
-	if (!test_clear_page_writeback(page))
-		BUG();
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
 	return 0;
 }

I think this way we will not make a lot of mess of RHEL5's rotate_reclaimable_page(), and we avoid another probable race. Moreover, we avoid the nesting condition which leads to the deadlock described on this BZ.

What do you think?

Cheers!
--aquini
We do want to clear PG_writeback on writeback pages. Simply removing the call does not seem like a great idea. I see Jerome's point, but that should have a much easier solution.

+	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	    PageLRU(page)) {
+		zone = page_zone(page);
+		spin_lock_irqsave(&zone->lru_lock, flags);
 		list_move_tail(&page->lru, &zone->inactive_list);
 		__count_vm_event(PGROTATED);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}

We can simply move the spin_lock & spin_unlock to outside the if condition and test. In other words:

	zone = page_zone(page);
	spin_lock_irqsave(&zone->lru_lock, flags);
	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
	    PageLRU(page)) {
		list_move_tail(&page->lru, &zone->inactive_list);
		__count_vm_event(PGROTATED);
	}
	spin_unlock_irqrestore(&zone->lru_lock, flags);

Jerome, does that look better?
(In reply to comment #11)
Rik,

After reading your last update, I realize my last change proposal would have failed to clear PG_writeback. Thanks for pointing that out.

As we have discussed a little bit over IRC, I'm a little concerned that just moving the spinlock outside the if-clause might cause unnecessary zone->lru_lock contention, so I'd like to propose we keep the original backport patch, but reinstate the if-clause retesting for !PageActive(page) && PageLRU(page) after we grab the lock:

+	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	    PageLRU(page)) {
+		zone = page_zone(page);
+		spin_lock_irqsave(&zone->lru_lock, flags);
+		/* retest page flags, to avoid races while trying to grab
+		 * zone->lru_lock
+		 */
+		if (PageLRU(page) && !PageActive(page)) {
+			list_move_tail(&page->lru, &zone->inactive_list);
+			__count_vm_event(PGROTATED);
+		}
+		spin_unlock_irqrestore(&zone->lru_lock, flags);

This way, besides (sort of) replicating what pagevec_move_tail() does to avoid the race Jerome has pointed out, we will be performing the function cleanup and avoiding the lock nesting thing, I think.

What do you guys think?

Cheers!
--aquini
hi, guys

Do you know how to reproduce the deadlock?

Thanks,
Zhouping Liu
(In reply to comment #12)
> (In reply to comment #11)
> Rik,
>
> After reading your last update, I realize my last change proposal would have
> failed to clean PG_writeback. Thanks for pointing that out.
>
> As we have discussed a little bit over IRC, I'm a little concerned that just
> moving the spinlock outside the if-clause might cause unnecessary
> zone->lru_lock contention,

Me too. I assume that if it was done that way upstream, it is for a reason.

> so I'd like to propose we keep the original backport
> patch, but reinstate the if-clause retesting for !PageActive(page) &&
> PageLRU(page) after we grab the lock:
>
> +	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> +	    PageLRU(page)) {
> +		zone = page_zone(page);
> +		spin_lock_irqsave(&zone->lru_lock, flags);
> +		/* retest page flags, to avoid races while trying to grab
> +		 * zone->lru_lock
> +		 */
> +		if (PageLRU(page) && !PageActive(page)) {
> +			list_move_tail(&page->lru, &zone->inactive_list);
> +			__count_vm_event(PGROTATED);
> +		}
> +		spin_unlock_irqrestore(&zone->lru_lock, flags);
>
> This way, besides being (sort of) replicating what pagevec_move_tail() does to
> avoid the race Jerome has pointed out, we will be performing the function
> cleanup and avoiding the lock nesting thing, I think.

That's what I had in mind.

> What do you guys think?
>
> Cheers!
> --aquini
(In reply to comment #13)
> hi, guys
>
> Do you know how to reproduce the deadlock?

Howdy Zhouping,

Sorry for this late reply. I got myself swamped with some hot tickets and ended up not paying enough attention to this question of yours.

This seems to be one of those "rare" races, quite difficult to reproduce. The customer who reported it could not reproduce it at will and only got this sole hit. I'm still trying to find a way to reproduce it in the lab, so far without success (if I manage to come up with a reproducer, I'll post it here).

Originally, it took place on a VxFS cluster node under heavy memory and I/O (write) pressure.

> Thanks,
> Zhouping Liu

Cheers!
--aquini
Posted:
http://post-office.corp.redhat.com/archives/rhkernel-list/2011-September/msg00208.html
Patch(es) available in kernel-2.6.18-286.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
No reproducer; I ran heavy memory stress through the v7 test suite on kernel-2.6.18-290.el5.

Installed the v7 test suite and ran the memory stress test:
# yum install v7
# v7 --test memory run

I also ran an I/O stress test:
# dd if=/dev/zero of=/tmp/tmp.img
until it used up all the free disk space, and I couldn't find any error.

Based on this, I'll move the status to VERIFIED. Does anyone have any comments?
Based on comment 21, moving the status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0150.html
*** Bug 837883 has been marked as a duplicate of this bug. ***