Description of problem:

During periods of heavy memory load that force the system to swap, all CPUs become consumed by shrink_zone/shrink_inactive_list processing and the machine becomes unresponsive. This results in RAC evictions when the machine comes out of the hang (sometimes hours later). Usually the customer forces a vmcore; if you look at the CPU stacks, you'll see something like this (induced by ocfs2 file truncation, in this example). All 4 CPUs are contending in the VM in shrink_inactive_list():

PID: 3746  TASK: ffff81023d0d6080  CPU: 0  COMMAND: "sshd"
 --- <exception stack> ---
 #3 [ffff81022dd79a18] shrink_inactive_list at ffffffff800bdb1a
 #4 [ffff81022dd79b60] ktime_get_ts at ffffffff8009d66f
 #5 [ffff81022dd79b80] delayacct_end at ffffffff800b5a24
 #6 [ffff81022dd79c00] shrink_zone at ffffffff80012912
 #7 [ffff81022dd79c40] try_to_free_pages at ffffffff800be2e5
 #8 [ffff81022dd79cc0] __alloc_pages at ffffffff8000ef9a
 #9 [ffff81022dd79d20] find_or_create_page at ffffffff80025636
#10 [ffff81022dd79d50] cont_prepare_write at ffffffff800d1f4f
#11 [ffff81022dd79db0] ocfs2_prepare_write at ffffffff8843709d
#12 [ffff81022dd79df0] ocfs2_zero_extend at ffffffff88443855
#13 [ffff81022dd79e20] ocfs2_setattr at ffffffff884469e0
#14 [ffff81022dd79e80] notify_change at ffffffff8002c4b5
#15 [ffff81022dd79ee0] do_truncate at ffffffff800d0326
#16 [ffff81022dd79ef0] audit_syscall_entry at ffffffff800b1ce9
#17 [ffff81022dd79f50] sys_ftruncate at ffffffff8004a699
#18 [ffff81022dd79f80] tracesys at ffffffff8005b2c1

PID: 3134  TASK: ffff810233ef80c0  CPU: 1  COMMAND: "crond"
 --- <exception stack> ---
 #3 [ffff810231ce3a18] unlock_page at ffffffff80017818
 #4 [ffff810231ce3a20] shrink_inactive_list at ffffffff800bd97d
 #5 [ffff810231ce3c10] shrink_zone at ffffffff80012912
 #6 [ffff810231ce3c50] try_to_free_pages at ffffffff800be2e5
 #7 [ffff810231ce3cd0] __alloc_pages at ffffffff8000ef9a
 #8 [ffff810231ce3d30] read_swap_cache_async at ffffffff80031f66
 #9 [ffff810231ce3d70] swapin_readahead at ffffffff800bf661
#10 [ffff810231ce3dc0] __handle_mm_fault at ffffffff80008f3a
#11 [ffff810231ce3e60] do_page_fault at ffffffff800645a7

PID: 2989  TASK: ffff81023df297a0  CPU: 2  COMMAND: "automount"
 --- <exception stack> ---
 #3 [ffff8102328d7968] shrink_inactive_list at ffffffff800bdb1a
 #4 [ffff8102328d7a10] isolate_lru_pages at ffffffff800bcf21
 #5 [ffff8102328d7b50] shrink_zone at ffffffff80012912
 #6 [ffff8102328d7b90] try_to_free_pages at ffffffff800be2e5
 #7 [ffff8102328d7c10] __alloc_pages at ffffffff8000ef9a
 #8 [ffff8102328d7c70] __do_page_cache_readahead at ffffffff80012685
 #9 [ffff8102328d7cb0] getnstimeofday at ffffffff80058bf0
#10 [ffff8102328d7cd0] ktime_get_ts at ffffffff8009d66f
#11 [ffff8102328d7cf0] delayacct_end at ffffffff800b5a24
#12 [ffff8102328d7d60] filemap_nopage at ffffffff80013000
#13 [ffff8102328d7dc0] __handle_mm_fault at ffffffff800087e0
#14 [ffff8102328d7e60] do_page_fault at ffffffff800645a7
#15 [ffff8102328d7f50] error_exit at ffffffff8005be1d

PID: 3825  TASK: ffff81022dcf97e0  CPU: 3  COMMAND: "fio"
 --- <exception stack> ---
 #3 [ffff81022dd79a18] shrink_inactive_list at ffffffff800bdb1a
 #4 [ffff81022dd79b60] ktime_get_ts at ffffffff8009d66f
 #5 [ffff81022dd79b80] delayacct_end at ffffffff800b5a24
 #6 [ffff81022dd79c00] shrink_zone at ffffffff80012912
 #7 [ffff81022dd79c40] try_to_free_pages at ffffffff800be2e5
 #8 [ffff81022dd79cc0] __alloc_pages at ffffffff8000ef9a
 #9 [ffff81022dd79d20] find_or_create_page at ffffffff80025636
#10 [ffff81022dd79d50] cont_prepare_write at ffffffff800d1f4f
#11 [ffff81022dd79db0] ocfs2_prepare_write at ffffffff8843709d
#12 [ffff81022dd79df0] ocfs2_zero_extend at ffffffff88443855
#13 [ffff81022dd79e20] ocfs2_setattr at ffffffff884469e0
#14 [ffff81022dd79e80] notify_change at ffffffff8002c4b5
#15 [ffff81022dd79ee0] do_truncate at ffffffff800d0326

Version-Release number of selected component (if applicable):
RHEL 5.2

How reproducible:
100%

Steps to Reproduce:
1. Load a machine with N threads of a small C program that uses 2000MB
2. Make sure N > #CPUs
3. Monitor with top for a hang

Actual results:
The machine hangs and recovers only because the testcase frees the memory it allocated after 10 minutes. Swap usage rarely exceeds 30%.

Expected results:
The machine should be able to swap to 100% before hanging. The attached patch, backported from mainline, allows 100% swap utilization, and the machine stays very responsive under heavy load.

Additional info:
Created attachment 317511 [details] 2.6.18 proposed patch (backported from mainline)
Created attachment 317531 [details] Run N+1 threads (where N=#CPUs) using ./bigmalloc 2000 &
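For reference, the core of such a testcase might look like the following. This is a hypothetical reconstruction, not the actual attachment 317531; hold_memory and its parameters are names I made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Allocate mb megabytes, dirty every page so the kernel must back
 * them with RAM or swap, hold them for hold_secs seconds, then free.
 * Returns 0 on success, -1 if the allocation failed.
 */
int hold_memory(size_t mb, unsigned int hold_secs)
{
	size_t len = mb << 20;
	char *buf = malloc(len);

	if (!buf) {
		perror("malloc");
		return -1;
	}
	memset(buf, 0xa5, len);	/* touch (and dirty) every page */
	if (hold_secs)
		sleep(hold_secs);
	free(buf);
	return 0;
}
```

A trivial main() would call hold_memory(2000, 600); launching N+1 copies in the background (./bigmalloc 2000 &) then overcommits RAM and keeps the direct-reclaim paths busy for the 10-minute hold period.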
Created attachment 319367 [details]
2.6.9 patch

Tested hard with both Oracle DB and non-DB stress loads; it is now in production on Oracle Global Email.
Thanks, John, will take care of this. One quick question: the comment says "direct reclaiming for contiguous pages" - does this mean order > 0?

	+	/*
	+	 * If we are direct reclaiming for contiguous pages and we do
	+	 * not reclaim everything in the list, try again and wait
	+	 * for IO to complete. This will stall high-order allocations
	+	 * but that should be acceptable to the caller
	+	 */

Larry
Hi Larry,

No, there is no check for order, just for nr_freed < nr_taken. It seems something similar was backported from mainline in BZ 495442 - check that out and see whether a mainline patch like the one I posted in https://bugzilla.redhat.com/attachment.cgi?id=317511 is being used.

Thanks,
John
2.6.18 does not have lumpy reclaim, so the comment in the patch makes little sense. John, what exactly are you trying to achieve with this patch? Also, how does the patch achieve what you want to achieve?
Also, why are you forcefully deactivating pages that were activated by shrink_page_list? FIFO page replacement was proven to be a bad idea in the 1960s, and it is not a mistake to repeat four decades later. Obviously your patch fixes something and achieves it in some way. Let's get to the bottom of what it really does, so we can get the bug fixed without the bad side effects.
Btw, I suspect the bug may already have disappeared in RHEL 5.4, due to never reclaiming more than 32 pages in direct reclaim - that should take the worst excesses of parallel direct reclaim out of the picture altogether.
5.4 beta (2.6.18-155) is doing:

	if (nr_reclaimed > swap_cluster_max &&
	    priority < DEF_PRIORITY && !current_is_kswapd())
		break;

The break doesn't relieve the machine from experiencing 'scheduling brownouts' as seen by other, timing-sensitive software. My patch was a backport from mainline, where congestion_wait is called to force the direct-reclaim threads to come up for air, and hence let some other processes get a bit of CPU time. The box is already under heavy memory/swap pressure, so punishing the direct reclaimers seemed like a fair way to keep the machine somewhat responsive, and it avoids clusterware evictions. From 2.6.30.1:

	if (nr_freed < nr_taken && !current_is_kswapd() &&
	    sc->order > PAGE_ALLOC_COSTLY_ORDER) {
		congestion_wait(WRITE, HZ/10);
		...

Not perfect, by any means.

Thanks,
John
Your backport is not only "not perfect", it is also totally unacceptable, because it causes the code to fall through to reclaiming any page (regardless of whether it was recently referenced), not just for higher-order allocations as upstream does. If doing just the congestion_wait helps things, that is something worth considering for a RHEL backport. However, I believe such a congestion wait should only be done if we are already at priority < DEF_PRIORITY, because it is normal that not all pages are reclaimed - we _want_ recently referenced pages to be retained, not reclaimed.
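For concreteness, the throttling described above might look roughly like the following in shrink_inactive_list(). This is an illustrative sketch only, not a reviewed patch; it assumes the reclaim priority is plumbed down into shrink_inactive_list(), which the stock 2.6.18 code does not do.

	/*
	 * Parallel direct reclaimers that failed to free everything they
	 * isolated are likely thrashing each other over the same LRU
	 * pages.  Throttle them with congestion_wait() - but only below
	 * DEF_PRIORITY, since at DEF_PRIORITY it is normal (and wanted)
	 * that recently referenced pages survive the scan unreclaimed.
	 */
	if (nr_freed < nr_taken && !current_is_kswapd() &&
	    priority < DEF_PRIORITY)
		congestion_wait(WRITE, HZ/10);

Unlike the posted backport, this leaves the reclaim decision for individual pages untouched; it only slows down the callers that are contending.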
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
This is already closed, and this fix most likely addressed the contention problem: - [mm] vmscan: bail out of direct reclaim after max pages (Rik van Riel ) [495442] Fixed in 2.6.18-371 and higher.