Bug 463504

Summary:

Hang in shrink_zone during swap pressure, due to direct reclaim threads

Product:

Red Hat Enterprise Linux 5

Reporter:

John Sobecki <john.sobecki>

Component:

kernel

Assignee:

Larry Woodman <lwoodman>

Status:

CLOSED WONTFIX

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

medium

Priority:

medium

Version:

5.2

CC:

chris.mason, greg.marsden, john.sobecki, lwoodman, mdavis, riel

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2014-06-02 13:23:51 UTC

Attachments:

Description	Flags
2.6.18 proposed patch (backported from mainline)	none
Run N+1 threads (where N=#CPUs) using ./bigmalloc 2000 &	none
2.6.9 patch	none

Description John Sobecki 2008-09-23 18:54:56 UTC

Description of problem:

During periods of heavy memory load that forces the system to swap, all CPUs will become consumed by shrink_zone/shrink_inactive_list processing, and the machine becomes unresponsive.  

This results in RAC evictions when the machine comes out of the hang (sometimes hours later).  Usually the customer forces a vmcore, and if you look at the CPU stacks you'll see something like this (induced by ocfs2 file truncation, in this example):

 All 4 CPUs are contending in the VM in shrink_inactive_list().
@ .
@ PID: 3746   TASK: ffff81023d0d6080  CPU: 0   COMMAND: "sshd"
@ --- <exception stack> ---
@  #3 [ffff81022dd79a18] shrink_inactive_list at ffffffff800bdb1a
@  #4 [ffff81022dd79b60] ktime_get_ts at ffffffff8009d66f
@  #5 [ffff81022dd79b80] delayacct_end at ffffffff800b5a24
@  #6 [ffff81022dd79c00] shrink_zone at ffffffff80012912
@  #7 [ffff81022dd79c40] try_to_free_pages at ffffffff800be2e5
@  #8 [ffff81022dd79cc0] __alloc_pages at ffffffff8000ef9a
@  #9 [ffff81022dd79d20] find_or_create_page at ffffffff80025636
@ #10 [ffff81022dd79d50] cont_prepare_write at ffffffff800d1f4f
@ #11 [ffff81022dd79db0] ocfs2_prepare_write at ffffffff8843709d
@ #12 [ffff81022dd79df0] ocfs2_zero_extend at ffffffff88443855
@ #13 [ffff81022dd79e20] ocfs2_setattr at ffffffff884469e0
@ #14 [ffff81022dd79e80] notify_change at ffffffff8002c4b5
@ #15 [ffff81022dd79ee0] do_truncate at ffffffff800d0326
@ #16 [ffff81022dd79ef0] audit_syscall_entry at ffffffff800b1ce9
@ #17 [ffff81022dd79f50] sys_ftruncate at ffffffff8004a699
@ #18 [ffff81022dd79f80] tracesys at ffffffff8005b2c1
@ .
@ PID: 3134   TASK: ffff810233ef80c0  CPU: 1   COMMAND: "crond"
@ --- <exception stack> ---
@  #3 [ffff810231ce3a18] unlock_page at ffffffff80017818
@  #4 [ffff810231ce3a20] shrink_inactive_list at ffffffff800bd97d
@  #5 [ffff810231ce3c10] shrink_zone at ffffffff80012912
@  #6 [ffff810231ce3c50] try_to_free_pages at ffffffff800be2e5
@  #7 [ffff810231ce3cd0] __alloc_pages at ffffffff8000ef9a
@  #8 [ffff810231ce3d30] read_swap_cache_async at ffffffff80031f66
@  #9 [ffff810231ce3d70] swapin_readahead at ffffffff800bf661
@ #10 [ffff810231ce3dc0] __handle_mm_fault at ffffffff80008f3a
@ #11 [ffff810231ce3e60] do_page_fault at ffffffff800645a7
@ .
@ PID: 2989   TASK: ffff81023df297a0  CPU: 2   COMMAND: "automount"
@ --- <exception stack> ---
@  #3 [ffff8102328d7968] shrink_inactive_list at ffffffff800bdb1a
@  #4 [ffff8102328d7a10] isolate_lru_pages at ffffffff800bcf21
@  #5 [ffff8102328d7b50] shrink_zone at ffffffff80012912
@  #6 [ffff8102328d7b90] try_to_free_pages at ffffffff800be2e5
@  #7 [ffff8102328d7c10] __alloc_pages at ffffffff8000ef9a
@  #8 [ffff8102328d7c70] __do_page_cache_readahead at ffffffff80012685
@  #9 [ffff8102328d7cb0] getnstimeofday at ffffffff80058bf0
@ #10 [ffff8102328d7cd0] ktime_get_ts at ffffffff8009d66f
@ #11 [ffff8102328d7cf0] delayacct_end at ffffffff800b5a24
@ #12 [ffff8102328d7d60] filemap_nopage at ffffffff80013000
@ #13 [ffff8102328d7dc0] __handle_mm_fault at ffffffff800087e0
@ #14 [ffff8102328d7e60] do_page_fault at ffffffff800645a7
@ #15 [ffff8102328d7f50] error_exit at ffffffff8005be1d
@ .
@ PID: 3825   TASK: ffff81022dcf97e0  CPU: 3   COMMAND: "fio"
@ --- <exception stack> ---
@  #3 [ffff81022dd79a18] shrink_inactive_list at ffffffff800bdb1a
@  #4 [ffff81022dd79b60] ktime_get_ts at ffffffff8009d66f
@  #5 [ffff81022dd79b80] delayacct_end at ffffffff800b5a24
@  #6 [ffff81022dd79c00] shrink_zone at ffffffff80012912
@  #7 [ffff81022dd79c40] try_to_free_pages at ffffffff800be2e5
@  #8 [ffff81022dd79cc0] __alloc_pages at ffffffff8000ef9a
@  #9 [ffff81022dd79d20] find_or_create_page at ffffffff80025636
@ #10 [ffff81022dd79d50] cont_prepare_write at ffffffff800d1f4f
@ #11 [ffff81022dd79db0] ocfs2_prepare_write at ffffffff8843709d
@ #12 [ffff81022dd79df0] ocfs2_zero_extend at ffffffff88443855
@ #13 [ffff81022dd79e20] ocfs2_setattr at ffffffff884469e0
@ #14 [ffff81022dd79e80] notify_change at ffffffff8002c4b5
@ #15 [ffff81022dd79ee0] do_truncate at ffffffff800d0326

Version-Release number of selected component (if applicable):

RHEL 5.2

How reproducible:

100%

Steps to Reproduce:
1.  Load a machine with N threads of a small C program that uses 2000MB
2.  Make sure N > #CPUs 
3.  Monitor with top for a hang
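
For readers without access to the attachment, the load generator described in these steps can be approximated as follows. This is a hypothetical sketch, not the attached bigmalloc source: the names touch_pages and run_bigmalloc are made up for illustration, and the 600-second hold is an assumption based on the 10-minute release mentioned under Actual results.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the attached bigmalloc test; not the real source. */

/* Write one byte per page so the kernel has to back the whole buffer. */
static size_t touch_pages(char *buf, size_t len, size_t page_size)
{
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page_size) {
        buf[off] = 1;
        touched++;
    }
    return touched;
}

/*
 * Allocate `mb` megabytes, touch every page, hold the memory for
 * `hold_secs` seconds, then free it.
 * Returns the number of pages touched, or 0 if the allocation failed.
 */
static size_t run_bigmalloc(size_t mb, unsigned int hold_secs)
{
    size_t len = mb * 1024 * 1024;
    long ps = sysconf(_SC_PAGESIZE);
    size_t page_size = ps > 0 ? (size_t)ps : 4096;
    char *buf = malloc(len);

    if (buf == NULL) {
        perror("malloc");
        return 0;
    }

    size_t touched = touch_pages(buf, len, page_size);

    sleep(hold_secs);   /* keep the pages resident (or swapped) for a while */
    free(buf);
    return touched;
}
```

A main() that reads the size in MB from argv and calls run_bigmalloc(2000, 600) would correspond to one "./bigmalloc 2000 &" instance; per the attachment description, N+1 such instances are started on a machine with N CPUs.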
  
Actual results:

Machine hangs and only recovers because the testcase frees the memory it
allocated after 10 minutes.  Swap usage rarely exceeds 30%. 

Expected results:

Machine should be able to swap to 100% before hanging.  The attached patch,
backported from mainline, allows 100% swap utilization and the machine
stays very responsive under heavy load. 
  
Additional info:

Comment 1 John Sobecki 2008-09-23 18:56:08 UTC

Created attachment 317511 [details]
2.6.18 proposed patch (backported from mainline)

Comment 2 John Sobecki 2008-09-23 20:14:21 UTC

Created attachment 317531 [details]
Run N+1 threads (where N=#CPUs) using ./bigmalloc 2000 &

Comment 3 John Sobecki 2008-10-03 14:10:06 UTC

Created attachment 319367 [details]
2.6.9 patch

Tested hard with both Oracle DB and non-DB stress loads, and is now in production on Oracle Global Email.

Comment 4 Larry Woodman 2008-10-03 15:32:45 UTC

Thanks John, will take care of this.

One quick question: the comment says "direct reclaiming for contiguous pages". Does this mean order > 0?

+		/*
+		 * If we are direct reclaiming for contiguous pages and we do
+		 * not reclaim everything in the list, try again and wait
+		 * for IO to complete. This will stall high-order allocations
+		 * but that should be acceptable to the caller
+		 */


Larry

Comment 5 John Sobecki 2009-05-08 20:43:30 UTC

Hi Larry,

No check for order, just for nr_freed < nr_taken.  

It seems a similar change was backported from mainline in BZ 495442. Check that out and see whether a mainline patch similar to the one I posted in https://bugzilla.redhat.com/attachment.cgi?id=317511 is being used.

Thanks,
John

Comment 6 Rik van Riel 2009-07-10 20:45:55 UTC

2.6.18 does not have lumpy reclaim, so the comment in the patch makes little sense.

John, what exactly are you trying to achieve with this patch?

Also, how does the patch achieve what you want to achieve?

Comment 7 Rik van Riel 2009-07-10 20:50:33 UTC

Also, why are you forcefully deactivating pages that were activated by shrink_page_list?  FIFO page replacement was proven to be a bad idea in the 1960s and is not a mistake to repeat 40 years later.

Obviously your patch fixes something and achieves it in some way.  Let's get to the bottom of what it really does, so we can get the bug fixed without the bad side effects.

Comment 8 Rik van Riel 2009-07-10 20:55:14 UTC

Btw, I suspect the bug may already have disappeared in RHEL 5.4, due to never reclaiming more than 32 pages in direct reclaim - that should get the worst excesses of parallel direct reclaim out of the picture altogether.

Comment 9 John Sobecki 2009-07-15 20:10:44 UTC

5.4 beta 2.6.18-155 is doing:

                if (nr_reclaimed > swap_cluster_max &&
                        priority < DEF_PRIORITY && !current_is_kswapd())
                        break;

The break doesn't relieve the machine from experiencing 'scheduling brownouts' as seen by other, timing-sensitive software.

My patch was a backport from mainline, where they are calling congestion_wait to force the direct reclaim threads to come up for air, and hence let some other processes get a bit of CPU time.  The box is already under heavy mem/swap pressure, so punishing the direct reclaimers seemed to be a fair way to keep the machine somewhat responsive, and it avoids clusterware evictions.

From 2.6.30.1:

                if (nr_freed < nr_taken && !current_is_kswapd() &&
                                        sc->order > PAGE_ALLOC_COSTLY_ORDER) {
                        congestion_wait(WRITE, HZ/10);

Not perfect, by any means.  
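
To make the difference concrete, here is a small userspace model of the two throttling conditions: the mainline one quoted above, and the backport as described in Comment 5 (no order check, only nr_freed < nr_taken). It is an illustrative sketch only; struct reclaim_state and both function names are hypothetical and do not exist in the kernel, and the model ignores everything else shrink_inactive_list does.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3   /* value defined by mainline kernels */

/* Hypothetical snapshot of one reclaim pass; not a kernel structure. */
struct reclaim_state {
    unsigned long nr_taken;   /* pages isolated from the inactive list */
    unsigned long nr_freed;   /* pages actually reclaimed from that batch */
    int order;                /* order of the allocation being satisfied */
    bool is_kswapd;           /* true in kswapd, false in direct reclaim */
};

/* Mainline condition: throttle only costly high-order direct reclaim. */
static bool mainline_should_wait(const struct reclaim_state *s)
{
    return s->nr_freed < s->nr_taken && !s->is_kswapd &&
           s->order > PAGE_ALLOC_COSTLY_ORDER;
}

/* Backport as described in Comment 5: same test without the order check. */
static bool backport_should_wait(const struct reclaim_state *s)
{
    return s->nr_freed < s->nr_taken && !s->is_kswapd;
}
```

Under this model, an order-0 direct reclaimer that frees fewer pages than it isolated is throttled by the backport but not by mainline, which is the case the bigmalloc test exercises.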

Thanks, John

Comment 10 Rik van Riel 2009-07-15 21:12:19 UTC

Your backport is not only "not perfect", it is also totally unacceptable because it would cause the code to fall through to always reclaiming any page (regardless of whether it is recently referenced), not just for higher order allocations like upstream.

If doing just the congestion_wait helps things, that is something worth considering for a RHEL backport.

However, I believe that such a congestion wait should only be done if we are already at priority < DEF_PRIORITY, because it is normal that not all pages are reclaimed - we _want_ recently referenced pages to be retained, not reclaimed.
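
One possible reading of this suggestion, written as a userspace model: only direct reclaimers wait, and only once the scan priority has already dropped below DEF_PRIORITY. This is a sketch under assumptions, not a proposed patch; the function name is hypothetical, and combining the priority test with the nr_freed < nr_taken and !kswapd tests from the earlier comments is an interpretation.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12   /* initial scan priority in the kernel; lower means more pressure */

/*
 * Hypothetical gate for congestion_wait(): skip the wait on the first,
 * least aggressive pass (priority == DEF_PRIORITY), where leaving
 * recently referenced pages unreclaimed is expected behaviour.
 */
static bool suggested_should_wait(unsigned long nr_freed,
                                  unsigned long nr_taken,
                                  int priority, bool is_kswapd)
{
    return nr_freed < nr_taken && !is_kswapd && priority < DEF_PRIORITY;
}
```

With this gate, a first pass that retains referenced pages proceeds without sleeping, while later passes under sustained pressure give up the CPU.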

Comment 11 RHEL Program Management 2014-03-07 13:56:52 UTC

This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.

Comment 12 RHEL Program Management 2014-06-02 13:23:51 UTC

Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).

Comment 13 John Sobecki 2014-12-08 16:44:23 UTC

This is already closed, and this fix most likely addressed the contention
problem: 

- [mm] vmscan: bail out of direct reclaim after max pages (Rik van Riel) [495442]

Fixed in 2.6.18-371 and higher.