Bug 609668 - kswapd hung in D state with fragmented memory and large order allocations
Summary: kswapd hung in D state with fragmented memory and large order allocations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Larry Woodman
QA Contact: Zhouping Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-06-30 19:48 UTC by Jon Thomas
Modified: 2018-11-14 19:01 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 21:39:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer (20.00 KB, application/octet-stream), 2010-06-30 19:54 UTC, Jon Thomas
patch (1.53 KB, patch), 2010-06-30 19:56 UTC, Jon Thomas


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Jon Thomas 2010-06-30 19:48:17 UTC
kswapd appears to enter an infinite loop around blk_congestion_wait() when memory is fragmented and a large order allocation is pending. This was seen on a 64GB machine running a 3rd party filesystem, with 57GB free, swap usage at 7%, and pagecache at 14MB.
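
For scale (a hypothetical back-of-the-envelope illustration, not data from this bug beyond the 57GB figure): an order 10 allocation on x86_64 with 4KB pages needs 2^10 = 1024 physically contiguous pages, i.e. 4MB, and a zone can report tens of gigabytes free while containing no such block if the free memory is scattered low-order fragments.

------------------------------------------------------------------------
#include <stdio.h>

/* Illustration only: relates "order 10" to contiguous bytes and shows
 * that a large free-page count says nothing about contiguity. */
int main(void)
{
	const long page_size = 4096;	/* 4KB pages on x86_64 */
	const int order = 10;
	const long pages = 1L << order;	/* 1024 contiguous pages */

	printf("order-%d block = %ld pages = %ld MB contiguous\n",
	       order, pages, pages * page_size >> 20);
	printf("57GB free is roughly %ld 4KB pages, none of which need be adjacent\n",
	       57L << 30 >> 12);
	return 0;
}
------------------------------------------------------------------------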

The machine was cored:

crash> bt
PID: 849    TASK: ffff810fefaed7a0  CPU: 5   COMMAND: "kswapd0"
#0 [ffff810fee559c30] schedule at ffffffff80063f7f
#1 [ffff810fee559d08] schedule_timeout at ffffffff800648a6
#2 [ffff810fee559d58] io_schedule_timeout at ffffffff80064230
#3 [ffff810fee559d88] blk_congestion_wait at ffffffff8003bcec
#4 [ffff810fee559dd8] kswapd at ffffffff8005866d
#5 [ffff810fee559ee8] kthread at ffffffff8003353e
#6 [ffff810fee559f48] kernel_thread at ffffffff8005efb1

Looking at the core revealed kswapd stuck in a loop attempting to satisfy an order 10 allocation in zone Normal.
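
The analysis here was done against the vmcore with crash; on a live system the same fragmentation signature can be observed from /proc/buddyinfo, whose columns are the per-order free block counts for each zone. A small hypothetical helper (names and output format are mine, not from the bug) might look like:

------------------------------------------------------------------------
#include <stdio.h>
#include <string.h>

#define NR_ORDERS 11	/* orders 0..10 on x86_64 */

/* Hypothetical helper: summarize /proc/buddyinfo per zone and report
 * how many order-10 blocks remain; not part of this bug's attachments. */
int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512];

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char node[32], zone[32];
		long nr[NR_ORDERS];
		long free_pages = 0;
		int n, o;

		/* e.g. "Node 0, zone   Normal   120   80 ...   0" */
		n = sscanf(line,
			   "Node %31[^,], zone %31s %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld",
			   node, zone, &nr[0], &nr[1], &nr[2], &nr[3], &nr[4],
			   &nr[5], &nr[6], &nr[7], &nr[8], &nr[9], &nr[10]);
		if (n != 2 + NR_ORDERS)
			continue;
		for (o = 0; o < NR_ORDERS; o++)
			free_pages += nr[o] << o;
		printf("node %s zone %-8s: %8ld pages free, order-10 blocks: %ld\n",
		       node, zone, free_pages, nr[10]);
	}
	fclose(f);
	return 0;
}
------------------------------------------------------------------------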

It appears we are hitting the same issue discussed in this thread:
http://kerneltrap.org/mailarchive/linux-kernel/2008/12/31/4392777/thread

We put together a reproducer that replicates the fragmentation and performs order 10 allocations, and with it we can reproduce this consistently (a rough sketch of the allocation side is included below). The patch we have tested is a combination of:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=73ce02e96fe34a983199a9855b2ae738f960a6ee

http://lkml.org/lkml/2007/9/5/289

The second has a minor change in it.
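
For reference, the allocation side of such a test can be as simple as a module that repeatedly asks for order 10 pages. This is a minimal hypothetical sketch, not the attached reproducer (attachment 428080), and it does not generate the fragmentation itself (that would have to be done separately, e.g. by filling and partially freeing pagecache or slab):

------------------------------------------------------------------------
/* Hypothetical sketch, not the attached reproducer: repeatedly request
 * order-10 (4MB contiguous) blocks so kswapd is asked to rebalance the
 * zones for a high-order allocation. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/gfp.h>

static int __init order10_init(void)
{
	int i;

	for (i = 0; i < 16; i++) {
		struct page *page = alloc_pages(GFP_KERNEL, 10);

		if (!page) {
			printk(KERN_INFO "order-10 alloc %d failed\n", i);
			continue;
		}
		printk(KERN_INFO "order-10 alloc %d succeeded\n", i);
		__free_pages(page, 10);
	}
	return 0;
}

static void __exit order10_exit(void)
{
}

module_init(order10_init);
module_exit(order10_exit);
MODULE_LICENSE("GPL");
------------------------------------------------------------------------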

Comment 1 Jon Thomas 2010-06-30 19:54:58 UTC
Created attachment 428080 [details]
reproducer

Comment 2 Jon Thomas 2010-06-30 19:56:10 UTC
Created attachment 428082 [details]
patch

Comment 9 Larry Woodman 2010-09-29 19:46:30 UTC
There are 2 parts to this patch:

------------------------------------------------------------------------
+			/*
+			* We put equal pressure on every zone, unless one
+			* zone has way too many pages free already.
+			*/
+			if (!zone_watermark_ok(zone, order,
+					8*zone->pages_high, end_zone, 0))
+				shrink_zone(priority, zone, &sc);
-----------------------------------------------------------------------

AND

-----------------------------------------------------------------------
+
+		/*
+		* Fragmentation may mean that the system cannot be
+		* rebalanced for high-order allocations in all zones.
+		* At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
+		* it means the zones have been fully scanned and are still
+		* not balanced. For high-order allocations, there is
+		* little point trying all over again as kswapd may
+		* infinite loop.
+		*
+		* Instead, recheck all watermarks at order-0 as they
+		* are the most important. If watermarks are ok, kswapd will go
+		* back to sleep. High-order users can still perform direct 
+		* reclaim if they wish.
+		*/
+		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
+			order = 0;
+
-----------------------------------------------------------------------

I already added the first part in RHEL5:

-----------------------------------------------------------------------
* Thu Jul 22 2010 Jarod Wilson <jarod> [2.6.18-208.el5]
...
- [mm] fix excessive memory reclaim from zones w/lots free (Larry Woodman) [604779]
-----------------------------------------------------------------------

I will add the second part now.

Larry

...

Comment 10 Larry Woodman 2010-09-29 20:13:16 UTC
Posted this patch to rhkernel-list:

-------------------------------------------------------------
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 517023a..42aa6b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1278,6 +1278,24 @@ out:
        }
        if (!all_zones_ok) {
                cond_resched();
+
+               /*
+                * Fragmentation may mean that the system cannot be
+                * rebalanced for high-order allocations in all zones.
+                * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
+                * it means the zones have been fully scanned and are still
+                * not balanced. For high-order allocations, there is
+                * little point trying all over again as kswapd may
+                * infinite loop.
+                *
+                * Instead, recheck all watermarks at order-0 as they
+                * are the most important. If watermarks are ok, kswapd will go
+                * back to sleep. High-order users can still perform direct
+                * reclaim if they wish.
+                */
+               if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
+                       order = 0;
+
                goto loop_again;
        }

-------------------------------------------------------------------
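
The reason the order = 0 fallback lets kswapd go back to sleep is that zone_watermark_ok() discounts all free pages of lower order when checking a high-order watermark and halves the requirement at each step, so a badly fragmented zone can fail the order 10 check indefinitely while passing easily at order 0. The following is a simplified userspace paraphrase of the 2.6.18-era logic (lowmem_reserve and the ALLOC_* adjustments are omitted, and the zone contents are made up for illustration):

------------------------------------------------------------------------
#include <stdio.h>

#define NR_ORDERS 11

/* Toy zone: nr_free[o] = number of free blocks of 2^o pages. */
struct toy_zone {
	long nr_free[NR_ORDERS];
	long pages_high;	/* "high" watermark, in pages */
};

/*
 * Simplified paraphrase of the 2.6.18-era zone_watermark_ok(): start
 * from the total free page count, then for each order below the
 * requested one discard that order's pages (they cannot satisfy the
 * request) and halve the requirement.
 */
static int toy_watermark_ok(struct toy_zone *z, int order, long mark)
{
	long free_pages = 0, min = mark;
	int o;

	for (o = 0; o < NR_ORDERS; o++)
		free_pages += z->nr_free[o] << o;

	if (free_pages <= min)
		return 0;
	for (o = 0; o < order; o++) {
		free_pages -= z->nr_free[o] << o;
		min >>= 1;
		if (free_pages <= min)
			return 0;
	}
	return 1;
}

int main(void)
{
	/* Badly fragmented zone: plenty of memory free, all of it order 0/1. */
	struct toy_zone z = { .pages_high = 4096 };

	z.nr_free[0] = 10L * 1024 * 1024;	/* ~40GB of single pages  */
	z.nr_free[1] = 1024 * 1024;		/* ~8GB of order-1 pairs  */

	printf("order-0  watermark ok: %d\n", toy_watermark_ok(&z, 0, z.pages_high));
	printf("order-10 watermark ok: %d\n", toy_watermark_ok(&z, 10, z.pages_high));
	return 0;
}
------------------------------------------------------------------------

With these toy numbers the order 0 check passes while the order 10 check fails as soon as the remaining free pages drop below the halved requirement, which is the condition kswapd was looping on; dropping back to order 0 lets it sleep and leaves high-order callers to direct reclaim.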

Comment 12 RHEL Program Management 2010-10-11 18:50:02 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 14 Jarod Wilson 2010-10-14 14:02:31 UTC
in kernel-2.6.18-227.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 19 errata-xmlrpc 2011-01-13 21:39:59 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

