Bug 591283

Summary: __alloc_pages_nodemask might schedule even if __GFP_WAIT not set in gfp_mask, leading to deadlock
Product: Red Hat Enterprise Linux 6 Reporter: Dan Hecht <dhecht>
Component: kernel Assignee: Larry Woodman <lwoodman>
Status: CLOSED CURRENTRELEASE QA Contact: Qian Cai <qcai>
Severity: high Docs Contact:
Priority: low    
Version: 6.0 CC: aarcange, akataria, jsavanyo, qcai
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-11-11 16:13:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Hecht 2010-05-11 19:38:00 UTC
Description of problem:

__alloc_pages_nodemask can call schedule even if __GFP_WAIT not set in gfp_mask, leading to deadlock.

The path __alloc_pages_nodemask -> __alloc_pages_slowpath -> get_page_from_freelist -> cpuset_zone_allowed_softwall can schedule, even if __GFP_WAIT is not set.  The problem seems to have been introduced by the patch listed below.

Prior to this patch, the alloc_flags computed by the gfp_to_alloc_flags call in __alloc_pages_slowpath would have cleared ALLOC_CPUSET whenever __GFP_WAIT was not set.  That prevented get_page_from_freelist from calling cpuset_zone_allowed_softwall, which might schedule if __GFP_HARDWALL is not set, which it won't be when called from this slowpath.

After the patch, ALLOC_CPUSET is cleared only if both __GFP_WAIT and __GFP_NOMEMALLOC are unset.  So, when __GFP_NOMEMALLOC is set, the code can now go down a path that might schedule even though __GFP_WAIT was clear.

This is the patch that seems to have introduced the problem:

From: Andrea Arcangeli <aarcange>
Date: Mon, 1 Feb 2010 15:17:24 -0500
Subject: [mm] dont alloc harder for gfp nomemalloc even if nowait
Message-id: <20100201152040.198156184>
Patchwork-id: 23035
O-Subject: [RHEL6 27/37] dont alloc harder for gfp nomemalloc even if nowait
Bugzilla: 556572
RH-Acked-by: Larry Woodman <lwoodman>

From: Andrea Arcangeli <aarcange>

Not worth throwing away the precious reserved free memory pool for allocations
that can fail gracefully (either through mempool or because they're transhuge
allocations later falling back to 4k allocations).

Signed-off-by: Andrea Arcangeli <aarcange>
Signed-off-by: Aristeu Rozanski <arozansk>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec9b70d..86aa0af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1762,7 +1762,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
         */
        alloc_flags |= (gfp_mask & __GFP_HIGH);

-       if (!wait) {
+       /*
+        * Not worth trying to allocate harder for __GFP_NOMEMALLOC
+        * even if it can't schedule.
+        */
+       if (!wait && !(gfp_mask & __GFP_NOMEMALLOC)) {
                alloc_flags |= ALLOC_HARDER;
                /*
                 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.


Version-Release number of selected component (if applicable): 2.6.32-19.el6

How reproducible: 100%


Steps to Reproduce:
1. Run disk intensive workload for a few hours.

See the attached core file for an example deadlock caused by this bug.
  
Actual results: Host hangs due to this deadlock.


Expected results: Host does not hang.


Additional info:

Comment 1 Dan Hecht 2010-05-11 19:41:27 UTC
The gzip'ed core file was rejected as an attachment because it was too large (74MB).  If you want the core, let me know where you'd like it sent.

Comment 3 RHEL Program Management 2010-05-11 21:56:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 4 Larry Woodman 2010-05-20 15:15:45 UTC
This patch is being removed in RHEL6-Beta2 as part of a total replacement of the Transparent Hugepage patch set.

----------------------------------------------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec9b70d..86aa0af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1762,7 +1762,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
         */
        alloc_flags |= (gfp_mask & __GFP_HIGH);

-       if (!wait) {
+       /*
+        * Not worth trying to allocate harder for __GFP_NOMEMALLOC
+        * even if it can't schedule.
+        */
+       if (!wait && !(gfp_mask & __GFP_NOMEMALLOC)) {
                alloc_flags |= ALLOC_HARDER;
                /*
                 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-----------------------------------------------------------------------------

Larry Woodman

Comment 7 Aristeu Rozanski 2010-05-25 17:43:55 UTC
Patch(es) available on kernel-2.6.32-29.el6

Comment 10 Alok Kataria 2010-06-08 23:37:28 UTC
Aristeu, where are the RPMs available for this kernel?

Comment 12 Subhendu Ghosh 2010-07-22 02:45:02 UTC
Alok, this should be covered in the public beta 2 refresh released today.

Comment 13 Alok Kataria 2010-08-02 17:09:14 UTC
Yep this seems to be fixed with the beta2 release. Thanks.

Please feel free to close it.

Comment 14 releng-rhel@redhat.com 2010-11-11 16:13:42 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.