kswapd appears to enter an infinite loop around blk_congestion_wait() when memory is fragmented and a large-order allocation is pending. This was seen on a 64GB machine with 57GB free, swap usage at 7%, and pagecache at 14MB. The machine was running a 3rd-party filesystem and was cored:

crash> bt
PID: 849  TASK: ffff810fefaed7a0  CPU: 5  COMMAND: "kswapd0"
 #0 [ffff810fee559c30] schedule at ffffffff80063f7f
 #1 [ffff810fee559d08] schedule_timeout at ffffffff800648a6
 #2 [ffff810fee559d58] io_schedule_timeout at ffffffff80064230
 #3 [ffff810fee559d88] blk_congestion_wait at ffffffff8003bcec
 #4 [ffff810fee559dd8] kswapd at ffffffff8005866d
 #5 [ffff810fee559ee8] kthread at ffffffff8003353e
 #6 [ffff810fee559f48] kernel_thread at ffffffff8005efb1

Examining the core revealed kswapd stuck in a loop attempting to satisfy an order-10 allocation in the Normal zone. It appears we are hitting the same issue discussed in this thread:

http://kerneltrap.org/mailarchive/linux-kernel/2008/12/31/4392777/thread

We put together a reproducer that replicates the fragmentation and performs order-10 allocations, and we are able to reproduce this consistently. The patch we have tested is a combination of:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=73ce02e96fe34a983199a9855b2ae738f960a6ee
http://lkml.org/lkml/2007/9/5/289

The second has a minor change in it.
Created attachment 428080 [details] reproducer
Created attachment 428082 [details] patch
There are 2 parts to this patch:

------------------------------------------------------------------------
+	/*
+	 * We put equal pressure on every zone, unless one
+	 * zone has way too many pages free already.
+	 */
+	if (!zone_watermark_ok(zone, order,
+				8*zone->pages_high, end_zone, 0))
+		shrink_zone(priority, zone, &sc);
------------------------------------------------------------------------

AND

------------------------------------------------------------------------
+
+		/*
+		 * Fragmentation may mean that the system cannot be
+		 * rebalanced for high-order allocations in all zones.
+		 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
+		 * it means the zones have been fully scanned and are still
+		 * not balanced. For high-order allocations, there is
+		 * little point trying all over again as kswapd may
+		 * infinite loop.
+		 *
+		 * Instead, recheck all watermarks at order-0 as they
+		 * are the most important. If watermarks are ok, kswapd will go
+		 * back to sleep. High-order users can still perform direct
+		 * reclaim if they wish.
+		 */
+		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
+			order = 0;
+
------------------------------------------------------------------------

I already added the first part in RHEL5:

------------------------------------------------------------------------
* Thu Jul 22 2010 Jarod Wilson <jarod> [2.6.18-208.el5]
...
- [mm] fix excessive memory reclaim from zones w/lots free (Larry Woodman) [604779]
------------------------------------------------------------------------

I will add the second part now.

Larry
...
Posted this patch to rhkernel-list:

-------------------------------------------------------------
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 517023a..42aa6b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1278,6 +1278,24 @@ out:
 	}
 	if (!all_zones_ok) {
 		cond_resched();
+
+		/*
+		 * Fragmentation may mean that the system cannot be
+		 * rebalanced for high-order allocations in all zones.
+		 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
+		 * it means the zones have been fully scanned and are still
+		 * not balanced. For high-order allocations, there is
+		 * little point trying all over again as kswapd may
+		 * infinite loop.
+		 *
+		 * Instead, recheck all watermarks at order-0 as they
+		 * are the most important. If watermarks are ok, kswapd will go
+		 * back to sleep. High-order users can still perform direct
+		 * reclaim if they wish.
+		 */
+		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
+			order = 0;
+
 		goto loop_again;
 	}
-------------------------------------------------------------
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This is fixed in kernel-2.6.18-227.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
An advisory has been issued which should resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html