Bug 131140

Summary: Unable to allocate ZONE_DMA mem on systems with CONFIG_HIGHMEM64GB set
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Greg Marsden <greg.marsden>
Assignee: Larry Woodman <lwoodman>
QA Contact: Brian Brock <bbrock>
CC: bill.irwin, greg.marsden, peterm, petrides, riel, tao
Doc Type: Bug Fix
Last Closed: 2006-01-04 21:19:53 UTC

Attachments:
ZONE_DMA fallback patch (trivial, removes a break;)

Description Greg Marsden 2004-08-27 21:15:58 UTC
Description of problem:
When lowmem is fragmented, kernel allocations do not fall back to
ZONE_DMA memory (see trivial patch attached). This causes systems to
fail when they run out of memory for kernel structure allocations,
even though 16 MB of unfragmented memory is still available.

Version-Release number of selected component (if applicable):
18.EL

How reproducible:
Always

Steps to Reproduce:
1. Run mempressure to allocate all 64kB chunks of lowmem (from
oss.oracle.com/projects/codefragments/src/trunk) 
2. Attempt NFS
3. Check alt-sysrq-m; note lowfree is all used up, but the DMA zone is
untouched (4*4096kB free)

Additional info:

Affects only systems with high memory enabled

(don't have sysrq-m traces on hand; will update when I have a chance
to rerun this scenario)

Comment 1 Greg Marsden 2004-08-27 21:16:39 UTC
Created attachment 103189 [details]
ZONE_DMA fallback patch (trivial, removes a break;)

Comment 2 Greg Marsden 2004-08-27 21:21:22 UTC
Taking discussion into bugzilla.

Comment 5 Larry Woodman 2004-09-08 19:33:24 UTC
Greg, I am still having trouble understanding exactly why this patch
is going to help.  Since this patch is only for a hugemem kernel, which
has almost 4GB of lowmem, what good will an additional 16MB do if we
have already consumed almost 4GB of lowmem in the slabcache, etc.?
If we aren't reclaiming buffer headers, etc., when there is almost 4GB
worth of them allocated, will 16MB more really help, or should we
really go after the try_to_reclaim_buffers path?

Thanks, Larry Woodman


Comment 6 Greg Marsden 2004-09-08 20:31:18 UTC
There's no reason why this patch only affects the hugemem kernel...
there is clearly highmem in the -smp kernels as well...

[1] gmarsden@ca-build2:/build/gmarsden/2.4.21-20.EL/SOURCES$ grep
HIGHMEM64 kernel-2.4.21-i686-smp.config
CONFIG_HIGHMEM64G=y
[0] gmarsden@ca-build2:/build/gmarsden/2.4.21-20.EL/SOURCES$ grep
HIGHMEM64 kernel-2.4.21-i686-hugemem.config
CONFIG_HIGHMEM64G=y

and of course the original code is:
#ifdef CONFIG_HIGHMEM64G
                               break;
#endif

So clearly this patch applies to -smp kernels, where it does make a
significant difference.

Greg





Comment 7 Larry Woodman 2004-09-09 11:35:28 UTC
You are correct; the CONFIG_HIGHMEM64G option is included there as well.
However, this patch would allow callers of __alloc_pages() that specify
GFP_HIGHMEM, when both the HighMem and Normal zones are exhausted, to
fall all the way back down to the DMA zone.  We want to reserve
fallback to the DMA zone for GFP_KERNEL allocations, which start in the
Normal zone.

Something like this OK with you?

***********************************************************************
--- linux-2.4.21/mm/page_alloc.c.orig   2004-09-08 18:08:38.000000000
-0400
+++ linux-2.4.21/mm/page_alloc.c        2004-09-08 18:08:41.000000000
-0400
@@ -1030,6 +1030,7 @@
                        k = ZONE_DMA;
  
                switch (k) {
+                       int has_highmem = 0;
                        default:
                                BUG();
                        /*
@@ -1042,14 +1043,14 @@
                                        BUG();
 #endif
                                        zonelist->zones[j++] = zone;
+                                       has_highmem = 1;
                                }
                        case ZONE_NORMAL:
                                zone = pgdat->node_zones + ZONE_NORMAL;
                                if (zone->size)
                                        zonelist->zones[j++] = zone;
-#ifdef CONFIG_HIGHMEM64G
+                       if (k == ZONE_HIGHMEM && has_highmem)
                                break;
-#endif
                        case ZONE_DMA:
                                zone = pgdat->node_zones + ZONE_DMA;
                                if (zone->size)


Comment 8 Greg Marsden 2004-09-09 21:15:02 UTC
has_highmem is probably not necessary here, seeing as we're talking
about RPC allocations and not highmem. I have no issue with the extra
GFP check; that would probably improve performance once we hit this
state.

Comment 9 Rik van Riel 2004-09-09 21:32:28 UTC
Greg, the thing is that we would like to reserve the DMA zone for just
kernel allocations (at least, on systems with lots of memory).

This would give the kernel an extra 16MB space that user and page
table allocations can't take. Having user allocations eat up those
16MB wouldn't help at all when trying to improve the reliability of
kernel memory allocations...

Comment 10 Greg Marsden 2004-09-09 22:05:48 UTC
This is an NFS issue with being unable to allocate memory. I still
haven't been able to find an alt-sysrq-m for the problem, but it's not
that hard to reproduce; I just haven't had the time. You just use
Manish's mempressure module to eat up most of lowmem (doing
kmallocs; the module is at
http://oss.oracle.com/projects/codefragments/src/trunk/mempressure/).
Then do some NFS-intensive operations, and you'll find that you still
have 16 MB of lowfree available while running into hangs as the kernel
runs out of memory (fragmentation) but doesn't realize it.

I'm not talking about user allocs at all.

Bill is a bit more eloquent in his description of the issue:

        *** WIRWIN  05/14/04 02:20 pm ***
        sysrq m shows that zone fallback logic in addition to fragmentation
        are involved. ZONE_DMA has sufficient contiguous memory to satisfy
        the allocations, yet on account of *lacking* memory pressure on
        ZONE_NORMAL, fallback of the allocation is forbidden. In turn, the
        allocating process sees that the memory is available but the request
        failed anyway, and so sleeps temporarily before retrying, which
        process may be repeated indefinitely. While a satisfactory solution
        to the high-level "NFS broke" issue may not come of it, the
        algorithm may be made to stop livelocking by correcting the above.

Comment 11 Greg Marsden 2004-09-13 20:43:13 UTC
Here's an alt sysrq-m from when the problem hits. Too bad this was our
build box ;)
Note fragmentation problem below:

Sep 13 13:40:11 ca-build2 kernel: SysRq : Show Memory
Sep 13 13:40:11 ca-build2 kernel: Mem-info:
Sep 13 13:40:11 ca-build2 kernel: Zone:DMA freepages:  2935 min:     0
low:
0 high:     0
Sep 13 13:40:11 ca-build2 kernel: Zone:Normal freepages: 38928 min:  
766 low:
4031 high:  5791
Sep 13 13:40:11 ca-build2 kernel: Zone:HighMem freepages:   490 min: 
 252 low:
  504 high:   756
Sep 13 13:40:11 ca-build2 kernel: Free pages:       42353 (   490 HighMem)
Sep 13 13:40:11 ca-build2 kernel: ( Active: 186042/4935,
inactive_laundry: 1321, inactive_clean: 209, free: 42353 )
Sep 13 13:40:11 ca-build2 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2935
Sep 13 13:40:11 ca-build2 kernel:   aa:1958 ac:159561 id:84 il:0 ic:0
fr:38928
Sep 13 13:40:11 ca-build2 kernel:   aa:6330 ac:18193 id:4851 il:1321
ic:209 fr:490
Sep 13 13:40:11 ca-build2 kernel: 1*4kB 3*8kB 4*16kB 2*32kB 1*64kB
2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11740kB)
Sep 13 13:40:11 ca-build2 kernel: 25604*4kB 6252*8kB 205*16kB 0*32kB
0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 155712kB)
Sep 13 13:40:11 ca-build2 kernel: 176*4kB 29*8kB 0*16kB 2*32kB 1*64kB
1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1960kB)
Sep 13 13:40:11 ca-build2 kernel: Swap cache: add 0, delete 0, find
0/0, race 0+0


Comment 18 Ernie Petrides 2004-11-15 21:01:40 UTC
Hi, Greg.  We believe that a U4 fix of Larry's on 23-Sep-2004 (in kernel
version 2.4.21-20.11.EL) along with his later follow-up on 18-Oct-2004
(in kernel version 2.4.21-22.EL) addresses this problem.  Thus, I'm
tentatively putting this bug into MODIFIED state.

If you find that the latest RHEL3 U4 beta kernel (2.4.21-25.EL), or at
least the latest one in the RHN beta channel (2.4.21-23.EL), still has
not resolved the problem, then please revert this to ASSIGNED state.


Comment 20 John Flanagan 2004-12-20 20:56:04 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html


Comment 21 Greg Marsden 2005-06-15 20:26:27 UTC
encountered on 27.EL 

0*4kB 1*8kB 2*16kB 3*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB
2*4096kB = 11656kB)
0*4kB 1*8kB 1232*16kB 66*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 21832kB)
4*4kB 4311*8kB 630*16kB 1*32kB 3*64kB 0*128kB 1*256kB 1*512kB 0*1024kB
0*2048kB 0*4096kB = 45576kB) 


Comment 22 Greg Marsden 2005-06-15 20:27:58 UTC
sysrq-m from affected system:
SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  2914 min:     0 low:     0 high:     0
Zone:Normal freepages:  5458 min:  1279 low:  4544 high:  6304
Zone:HighMem freepages: 11394 min:   255 low: 22016 high: 33024
Free pages:       19766 ( 11394 HighMem)
( Active: 1122314/272157, inactive_laundry: 45777, inactive_clean: 22888,
free: 19766 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2914
  aa:44473 ac:62436 id:32350 il:3852 ic:1289 fr:5458
  aa:434652 ac:580753 id:239807 il:41925 ic:21599 fr:11394
0*4kB 1*8kB 2*16kB 3*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB
2*4096kB = 11656kB)
0*4kB 1*8kB 1232*16kB 66*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 21832kB)
4*4kB 4311*8kB 630*16kB 1*32kB 3*64kB 0*128kB 1*256kB 1*512kB 0*1024kB
0*2048kB 0*4096kB = 45576kB)
Swap cache: add 42963655, delete 42939479, find 13182644/20538112, race 0+783
41143 pages of slabcache
1018 pages of kernel stacks
0 lowmem pagetables, 11480 highmem pagetables
Free swap:       9981644kB
1638400 pages of RAM
1343472 pages of HIGHMEM
96964 reserved pages
1107283 pages shared
24184 pages swap cached
[aime@stajh03 ~]$ uname -a
Linux stajh03 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:59:02 EST 2004 i686 i686
i386 GNU/Linux 


Comment 23 Issue Tracker 2005-06-15 20:45:22 UTC
From User-Agent: XML-RPC

During the conf call, lwoodman agreed to take a look at this bugzilla and
assess Greg's request.


This event sent from IssueTracker by martinez
 issue 74551

Comment 27 Larry Woodman 2005-06-29 19:05:54 UTC
Greg, I have built a test kernel that will display the memory stats and do a
kernel stack backtrace when __alloc_pages fails, so I can see exactly what code
is causing the memory allocation failures.  Please install this test kernel so
we can try to fix the client that is not decreasing the size of the allocation.
Removing the break; in build_zonelists causes other problems like OOM kills,
DMA zone allocation failures, and kswapd run-away due to the DMA zone getting
exhausted with kernel data structures in the slabcache.  This test kernel is
located here:

>>>http://people.redhat.com/~lwoodman/.for_oracle/

Comment 28 Greg Marsden 2005-07-02 01:01:57 UTC
Removing the break from the build_zonelists code actually causes the problem to
surface sooner, so this is clearly not the answer. Looking for more detailed
information, but we're not going to be able to deploy the above kernel until
next week.

Comment 29 Greg Marsden 2005-08-16 00:50:01 UTC
results finally...it's failing on a 0-order alloc:
Aug 11 18:59:52 stajh01 kernel: Mem-info:
Aug 11 18:59:58 stajh01 kernel: Zone:DMA freepages:  2913 min:     0 low:
0 high:     0
Aug 11 18:59:59 stajh01 kernel: Zone:Normal freepages:     0 min:  1279 low:
4544 high:  6304
Aug 11 18:59:59 stajh01 kernel: Zone:HighMem freepages:   290 min:   255
low: 22016 high: 33024
Aug 11 19:00:00 stajh01 kernel: Free pages:        3203 (   290 HighMem)
Aug 11 19:00:01 stajh01 kernel: ( Active: 322313/902009, inactive_laundry:
235580, inactive_clean: 27008, free: 3203 )
Aug 11 19:00:02 stajh01 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2913
Aug 11 19:00:03 stajh01 kernel:   aa:34128 ac:86303 id:22350 il:4666 ic:3133
fr:0
Aug 11 19:00:03 stajh01 kernel:   aa:49418 ac:152464 id:879659 il:230914
ic:23875 fr:290
Aug 11 19:00:04 stajh01 kernel: 1*4kB 2*8kB 3*16kB 4*32kB 3*64kB 2*128kB
1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11652kB)
Aug 11 19:00:05 stajh01 kernel: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB)
Aug 11 19:00:05 stajh01 kernel: 14*4kB 12*8kB 1*16kB 1*32kB 1*64kB 1*128kB
1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1160kB)
Aug 11 19:00:06 stajh01 kernel: Swap cache: add 0, delete 0, find 0/0, race
0+0
Aug 11 19:00:06 stajh01 kernel: 33170 pages of slabcache
Aug 11 19:00:07 stajh01 kernel: 424 pages of kernel stacks
Aug 11 19:00:08 stajh01 kernel: 0 lowmem pagetables, 5109 highmem pagetables
Aug 11 19:00:09 stajh01 kernel: Free swap:       10288440kB
Aug 11 19:00:10 stajh01 kernel: 1638400 pages of RAM
Aug 11 19:00:11 stajh01 kernel: 1343472 pages of HIGHMEM
Aug 11 19:00:11 stajh01 kernel: 96966 reserved pages
Aug 11 19:00:12 stajh01 kernel: 587042 pages shared
Aug 11 19:00:12 stajh01 kernel: 0 pages swap cached
Aug 11 19:00:14 stajh01 kernel: __alloc_pages: 0-order allocation failed.


Comment 30 Greg Marsden 2005-08-16 00:59:07 UTC
Note the logs don't mention the "unable to allocate RPC buffer" message, so I'm
not sure we're seeing the same problem we were, or if this is just __alloc_pages
failing on some other allocation (more likely). Waiting to get the full logs.

Comment 31 Ernie Petrides 2005-10-10 23:13:31 UTC
Greg, does this bugzilla need to remain private?  If not, please uncheck
the "Oracle Confidential Group" box below.  Thanks.

Comment 32 Larry Woodman 2005-10-24 15:29:26 UTC
The problem here is that the caller to __alloc_pages is passing GFP_ATOMIC which
prevents the allocator from using inactive clean pages.  Greg was trying to
reproduce this problem with a debug kernel that printed out the kernel stack
backtrace so we could see exactly who is making the call and determine if that
call can be changed to pass GFP_KERNEL.  This would allow the allocator to use
pages on the inactive clean list rather than just free pages which are totally
depleted.

aa:34128 ac:86303 id:22350 il:4666 ic:3133 fr:0


Larry Woodman


Comment 36 Greg Marsden 2006-01-04 21:05:20 UTC
Moving development for the reopened bug over to bug 176849.

Comment 37 Marizol Martinez 2006-01-04 21:19:53 UTC
This bug was fixed in RHEL3 U4 as mentioned in Comment #20. The issue reported
in Comment #21 and following is a separate problem; see bug 176849 for further
details.