Bug 609668 - kswapd hung in D state with fragmented memory and large order allocations
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Larry Woodman
QA Contact: Zhouping Liu
Depends On:
Blocks:
Reported: 2010-06-30 15:48 EDT by Jon Thomas
Modified: 2014-01-12 19:00 EST

Doc Type: Bug Fix
Last Closed: 2011-01-13 16:39:59 EST

Attachments
reproducer (20.00 KB, application/octet-stream)
2010-06-30 15:54 EDT, Jon Thomas
patch (1.53 KB, patch)
2010-06-30 15:56 EDT, Jon Thomas


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:0017
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 05:37:42 EST

Description Jon Thomas 2010-06-30 15:48:17 EDT
kswapd appears to enter an infinite loop around blk_congestion_wait() when memory is fragmented and a large-order allocation is pending. This was seen on a 64GB machine with 57GB free, swap usage at 7%, and pagecache at 14MB. The machine was running a third-party filesystem.

The machine was cored:

crash> bt
PID: 849    TASK: ffff810fefaed7a0  CPU: 5   COMMAND: "kswapd0"
#0 [ffff810fee559c30] schedule at ffffffff80063f7f
#1 [ffff810fee559d08] schedule_timeout at ffffffff800648a6
#2 [ffff810fee559d58] io_schedule_timeout at ffffffff80064230
#3 [ffff810fee559d88] blk_congestion_wait at ffffffff8003bcec
#4 [ffff810fee559dd8] kswapd at ffffffff8005866d
#5 [ffff810fee559ee8] kthread at ffffffff8003353e
#6 [ffff810fee559f48] kernel_thread at ffffffff8005efb1

Examining the core showed kswapd stuck in a loop attempting to satisfy an order-10 allocation in ZONE_NORMAL.
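
Why a zone with tens of gigabytes free can still fail an order-10 check is easier to see in code. The following is a minimal userspace sketch, modeled on the high-order loop in the 2.6-era zone_watermark_ok(); it is not the RHEL5 source. The names zone_model, total_free_pages, watermark_ok_model and MODEL_MAX_ORDER are hypothetical, and the alloc_flags and lowmem_reserve adjustments are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: orders 0..10, as implied by the order-10 request in this report. */
#define MODEL_MAX_ORDER 11

/* Hypothetical stand-in for the per-order free lists of one zone. */
struct zone_model {
	unsigned long nr_free[MODEL_MAX_ORDER]; /* free blocks at each order */
};

/* Total free pages: a block of order o contributes 2^o pages. */
static long total_free_pages(const struct zone_model *z)
{
	long pages = 0;

	for (int o = 0; o < MODEL_MAX_ORDER; o++)
		pages += (long)(z->nr_free[o] << o);
	return pages;
}

/*
 * Simplified watermark check. For a request of the given order, free
 * blocks smaller than that order cannot satisfy it, so they are
 * discounted one order at a time while the required reserve is halved.
 */
static bool watermark_ok_model(const struct zone_model *z, int order, long mark)
{
	assert(order >= 0 && order < MODEL_MAX_ORDER);

	long min = mark;
	long free_pages = total_free_pages(z) - (1L << order) + 1;

	if (free_pages <= min)
		return false;

	for (int o = 0; o < order; o++) {
		/* Pages held in order-o blocks are unusable for this request. */
		free_pages -= (long)(z->nr_free[o] << o);
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}
```

In this model, a zone whose free memory sits entirely in blocks smaller than order 10 fails the order-10 check no matter how many pages are free in total, while the same zone passes at order 0 as long as the total exceeds the mark. That matches the reported state: plenty of free memory, yet kswapd never considers the zone balanced.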

It appears we are hitting the same issue discussed in this thread:
http://kerneltrap.org/mailarchive/linux-kernel/2008/12/31/4392777/thread

We put together a reproducer that replicates the fragmentation and performs order-10 allocations; with it we can reproduce the hang consistently. The patch we have tested is a combination of:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=73ce02e96fe34a983199a9855b2ae738f960a6ee

http://lkml.org/lkml/2007/9/5/289

The second has a minor change in it.
Comment 1 Jon Thomas 2010-06-30 15:54:58 EDT
Created attachment 428080
reproducer
Comment 2 Jon Thomas 2010-06-30 15:56:10 EDT
Created attachment 428082
patch
Comment 8 Larry Woodman 2010-09-29 15:46:04 EDT
There are 2 parts to this patch:

------------------------------------------------------------------------
+			/*
+			* We put equal pressure on every zone, unless one
+			* zone has way too many pages free already.
+			*/
+			if (!zone_watermark_ok(zone, order,
+					8*zone->pages_high, end_zone, 0))
+				shrink_zone(priority, zone, &sc);
-----------------------------------------------------------------------

AND

-----------------------------------------------------------------------
+
+		/*
+		* Fragmentation may mean that the system cannot be
+		* rebalanced for high-order allocations in all zones.
+		* At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
+		* it means the zones have been fully scanned and are still
+		* not balanced. For high-order allocations, there is
+		* little point trying all over again as kswapd may
+		* infinite loop.
+		*
+		* Instead, recheck all watermarks at order-0 as they
+		* are the most important. If watermarks are ok, kswapd will go
+		* back to sleep. High-order users can still perform direct 
+		* reclaim if they wish.
+		*/
+		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
+			order = 0;
+
-----------------------------------------------------------------------

I already added the first part in RHEL5:

-----------------------------------------------------------------------
* Thu Jul 22 2010 Jarod Wilson <jarod@redhat.com> [2.6.18-208.el5]
...
- [mm] fix excessive memory reclaim from zones w/lots free (Larry Woodman) [604779]
-----------------------------------------------------------------------

I will add the second part now.
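
For illustration only, the first part can be reduced to a single predicate in the order-0 case. This is a hedged sketch, not the RHEL5 code: zone_needs_pressure is a hypothetical name, and it assumes order 0 with no allocation flags, where the watermark check amounts to comparing the zone's free pages against the mark plus the lowmem reserve.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical order-0 reduction of the first hunk: keep shrinking a
 * zone unless it already has "way too many" pages free, i.e. more than
 * eight times its high watermark (plus the lowmem reserve).
 */
static bool zone_needs_pressure(unsigned long free_pages,
				unsigned long pages_high,
				unsigned long lowmem_reserve)
{
	/* Guard the multiplication and addition below against overflow. */
	assert(pages_high <= (~0UL - lowmem_reserve) / 8);

	return free_pages <= 8 * pages_high + lowmem_reserve;
}
```

At order 10 the real check also discounts free blocks smaller than the request, so a badly fragmented zone can still look short of memory however many pages it has free. That is presumably why the first part alone does not stop the loop and the second part is needed.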

Larry

...
Comment 10 Larry Woodman 2010-09-29 16:13:16 EDT
Posted this patch to rhkernel-list:

-------------------------------------------------------------
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 517023a..42aa6b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1278,6 +1278,24 @@ out:
        }
        if (!all_zones_ok) {
                cond_resched();
+
+               /*
+                * Fragmentation may mean that the system cannot be
+                * rebalanced for high-order allocations in all zones.
+                * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
+                * it means the zones have been fully scanned and are still
+                * not balanced. For high-order allocations, there is
+                * little point trying all over again as kswapd may
+                * infinite loop.
+                *
+                * Instead, recheck all watermarks at order-0 as they
+                * are the most important. If watermarks are ok, kswapd will go
+                * back to sleep. High-order users can still perform direct
+                * reclaim if they wish.
+                */
+               if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
+                       order = 0;
+
                goto loop_again;
        }

-------------------------------------------------------------------
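
The effect of the hunk on the loop_again path can be sketched as a small userspace simulation. This is an illustrative model, not the kernel's balance_pgdat(): sim_zone, zone_balanced and balance_sim are hypothetical names, the zone state is held fixed to mimic fragmentation that reclaim cannot repair, and SWAP_CLUSTER_MAX is assumed to be 32 as in mainline kernels of that era.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: the mainline value; check the tree in question before relying on it. */
#define SWAP_CLUSTER_MAX 32UL

/* Hypothetical snapshot of a zone as seen by the balancing loop. */
struct sim_zone {
	bool order0_ok;     /* order-0 watermark is satisfied */
	bool high_order_ok; /* watermark for the requested high order is satisfied */
};

static bool zone_balanced(const struct sim_zone *z, int order)
{
	return order == 0 ? z->order0_ok : z->high_order_ok;
}

/*
 * Model of the retry loop. Each pass rechecks the zone at the current
 * order; reclaimed_per_pass stands in for sc.nr_reclaimed. Returns the
 * pass on which the zone was found balanced, or -1 if max_passes ran
 * out first (standing in for the endless loop seen in the core).
 */
static int balance_sim(const struct sim_zone *z, int order,
		       unsigned long reclaimed_per_pass,
		       bool apply_fix, int max_passes)
{
	assert(order >= 0 && max_passes > 0);

	for (int pass = 1; pass <= max_passes; pass++) {
		if (zone_balanced(z, order))
			return pass;

		/* The hunk above: little was reclaimed, so fall back to order 0. */
		if (apply_fix && reclaimed_per_pass < SWAP_CLUSTER_MAX)
			order = 0;
	}
	return -1;
}
```

With a zone that is fine at order 0 but can never be balanced at order 10, the model without the fallback exhausts its pass budget, while the model with the fallback drops to order 0 after the first unproductive pass and finishes on the second.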
Comment 12 RHEL Product and Program Management 2010-10-11 14:50:02 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 14 Jarod Wilson 2010-10-14 10:02:31 EDT
in kernel-2.6.18-227.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Comment 19 errata-xmlrpc 2011-01-13 16:39:59 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html
