Bug 131140
Summary: | Unable to allocate ZONE_DMA mem on systems with CONFIG_HIGHMEM64GB set | |
---|---|---|---
Product: | Red Hat Enterprise Linux 3 | Reporter: | Greg Marsden <greg.marsden>
Component: | kernel | Assignee: | Larry Woodman <lwoodman>
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock>
Severity: | high | Docs Contact: |
Priority: | medium | |
Version: | 3.0 | CC: | bill.irwin, greg.marsden, peterm, petrides, riel, tao
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | i686 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-01-04 21:19:53 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Greg Marsden
2004-08-27 21:15:58 UTC
Created attachment 103189 [details]
ZONE_DMA fallback patch (trivial, removes a break;)
Taking discussion into bugzilla.

Greg, I am still having trouble understanding exactly why this patch is going to help. Since this patch is only for a hugemem kernel, which has almost 4GB of lowmem, what good will providing an additional 16MB do if we have already consumed almost 4GB of lowmem in the slabcache, etc.? If we aren't reclaiming buffer headers, etc. when there is almost 4GB worth of them allocated, will 16MB more really help, or should we really go after the try_to_reclaim_buffers path?

Thanks, Larry Woodman

There's no reason why this patch only affects the hugemem kernel... there is clearly highmem in the -smp kernels as well...

[1] gmarsden@ca-build2:/build/gmarsden/2.4.21-20.EL/SOURCES$ grep HIGHMEM64 kernel-2.4.21-i686-smp.config
CONFIG_HIGHMEM64G=y
[0] gmarsden@ca-build2:/build/gmarsden/2.4.21-20.EL/SOURCES$ grep HIGHMEM64 kernel-2.4.21-i686-hugemem.config
CONFIG_HIGHMEM64G=y

and of course the original code is:

#ifdef CONFIG_HIGHMEM64G
        break;
#endif

So clearly this patch applies to -smp kernels, where it does make a significant difference.

Greg

You are correct, the CONFIG_HIGHMEM64G option is included as well. However, this patch will allow callers of __alloc_pages() that specify GFP_HIGHMEM to fall all the way back down to the DMA zone if both the Highmem and Normal zones are exhausted. We want to reserve fallback to the DMA zone for GFP_KERNEL allocations, which start in the Normal zone. Something like this OK with you?
***********************************************************************
--- linux-2.4.21/mm/page_alloc.c.orig 2004-09-08 18:08:38.000000000 -0400
+++ linux-2.4.21/mm/page_alloc.c 2004-09-08 18:08:41.000000000 -0400
@@ -1030,6 +1030,7 @@
         k = ZONE_DMA;
 
     switch (k) {
+        int has_highmem = 0;
         default:
             BUG();
         /*
@@ -1042,14 +1043,14 @@
                 BUG();
 #endif
                 zonelist->zones[j++] = zone;
+                has_highmem = 1;
             }
         case ZONE_NORMAL:
             zone = pgdat->node_zones + ZONE_NORMAL;
             if (zone->size)
                 zonelist->zones[j++] = zone;
-#ifdef CONFIG_HIGHMEM64G
+            if (k == ZONE_HIGHMEM && has_highmem)
             break;
-#endif
         case ZONE_DMA:
             zone = pgdat->node_zones + ZONE_DMA;
             if (zone->size)

has_highmem is probably not necessary here, seeing as we're talking about RPC allocations and not highmem. I have no issue with the extra GFP check; that would probably improve performance once we hit this state...

Greg, the thing is that we would like to reserve the DMA zone for just kernel allocations (at least, on systems with lots of memory). This would give the kernel an extra 16MB space that user and page table allocations can't take. Having user allocations eat up those 16MB wouldn't help at all when trying to improve the reliability of kernel memory allocations...

This is an NFS issue with being unable to allocate memory. I still haven't been able to find an alt-sysrq m for the problem, but it's not that hard to reproduce; I just haven't had the time. You just use Manish's mempressure module to eat up most of lowmem (doing kmallocs... the module is at http://oss.oracle.com/projects/codefragments/src/trunk/mempressure/). Then do some NFS intense operations, and you'll find that you still have 16 MB of lowfree available while running into hangs, as the kernel runs out of memory (fragmentation) but doesn't realize it. I'm not talking about user allocs at all.

Bill is a bit more eloquent in his description of the issue:

*** WIRWIN 05/14/04 02:20 pm ***
sysrq m shows that zone fallback logic in addition to fragmentation are involved.
ZONE_DMA has sufficient contiguous memory to satisfy the allocations, yet on account of *lacking* memory pressure on ZONE_NORMAL, fallback of the allocation is forbidden. In turn, the allocating process sees that the memory is available but the request failed anyway, and so sleeps temporarily before retrying, which process may be repeated indefinitely. While a satisfactory solution to the high-level "NFS broke" issue may not come of it, the algorithm may be made to stop livelocking by correcting the above.

Here's an alt sysrq-m from when the problem hits. Too bad this was our build box ;) Note fragmentation problem below:

Sep 13 13:40:11 ca-build2 kernel: SysRq : Show Memory
Sep 13 13:40:11 ca-build2 kernel: Mem-info:
Sep 13 13:40:11 ca-build2 kernel: Zone:DMA freepages: 2935 min: 0 low: 0 high: 0
Sep 13 13:40:11 ca-build2 kernel: Zone:Normal freepages: 38928 min: 766 low: 4031 high: 5791
Sep 13 13:40:11 ca-build2 kernel: Zone:HighMem freepages: 490 min: 252 low: 504 high: 756
Sep 13 13:40:11 ca-build2 kernel: Free pages: 42353 ( 490 HighMem)
Sep 13 13:40:11 ca-build2 kernel: ( Active: 186042/4935, inactive_laundry: 1321, inactive_clean: 209, free: 42353 )
Sep 13 13:40:11 ca-build2 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:2935
Sep 13 13:40:11 ca-build2 kernel: aa:1958 ac:159561 id:84 il:0 ic:0 fr:38928
Sep 13 13:40:11 ca-build2 kernel: aa:6330 ac:18193 id:4851 il:1321 ic:209 fr:490
Sep 13 13:40:11 ca-build2 kernel: 1*4kB 3*8kB 4*16kB 2*32kB 1*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11740kB)
Sep 13 13:40:11 ca-build2 kernel: 25604*4kB 6252*8kB 205*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 155712kB)
Sep 13 13:40:11 ca-build2 kernel: 176*4kB 29*8kB 0*16kB 2*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1960kB)
Sep 13 13:40:11 ca-build2 kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0

Hi, Greg.
We believe that a U4 fix of Larry's on 23-Sep-2004 (in kernel version 2.4.21-20.11.EL) along with his later follow-up on 18-Oct-2004 (in kernel version 2.4.21-22.EL) addresses this problem. Thus, I'm tentatively putting this bug into MODIFIED state. If you find that the latest RHEL3 U4 beta kernel (2.4.21-25.EL), or at least the latest one in the RHN beta channel (2.4.21-23.EL), still has not resolved the problem, then please revert this to ASSIGNED state.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html

encountered on 27.EL

0*4kB 1*8kB 2*16kB 3*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11656kB)
0*4kB 1*8kB 1232*16kB 66*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21832kB)
4*4kB 4311*8kB 630*16kB 1*32kB 3*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 45576kB)

sysrq-m from affected system:

SysRq : Show Memory
Mem-info:
Zone:DMA freepages: 2914 min: 0 low: 0 high: 0
Zone:Normal freepages: 5458 min: 1279 low: 4544 high: 6304
Zone:HighMem freepages: 11394 min: 255 low: 22016 high: 33024
Free pages: 19766 ( 11394 HighMem)
( Active: 1122314/272157, inactive_laundry: 45777, inactive_clean: 22888, free: 19766 )
aa:0 ac:0 id:0 il:0 ic:0 fr:2914
aa:44473 ac:62436 id:32350 il:3852 ic:1289 fr:5458
aa:434652 ac:580753 id:239807 il:41925 ic:21599 fr:11394
0*4kB 1*8kB 2*16kB 3*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11656kB)
0*4kB 1*8kB 1232*16kB 66*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21832kB)
4*4kB 4311*8kB 630*16kB 1*32kB 3*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 45576kB)
Swap cache: add 42963655, delete 42939479, find 13182644/20538112, race 0+783
41143 pages of slabcache
1018 pages of kernel stacks
0 lowmem pagetables, 11480 highmem pagetables
Free swap: 9981644kB
1638400 pages of RAM
1343472 pages of HIGHMEM
96964 reserved pages
1107283 pages shared
24184 pages swap cached

[aime@stajh03 ~]$ uname -a
Linux stajh03 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:59:02 EST 2004 i686 i686 i386 GNU/Linux

From User-Agent: XML-RPC

During the conf call, lwoodman agreed to take a look at this bugzilla and assess Greg's request.

This event sent from IssueTracker by martinez issue 74551
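The per-zone free-list lines in the sysrq-m dumps above (for example `25604*4kB 6252*8kB 205*16kB ... = 155712kB)`) are what show the fragmentation: the total can be large while no large contiguous block is left. A minimal Python sketch for reading one such line follows. The helper names are hypothetical and not part of any kernel tool; the only assumption is the `count*sizekB` format seen in the dumps.

```python
import re

def parse_buddy_line(line):
    """Return {block_size_kb: free_block_count} for one per-zone free-list line."""
    # The trailing "= NNNkB)" total has no "count*" prefix, so it is not matched.
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}

def total_free_kb(blocks):
    """Sum of free memory across all block sizes, in kB."""
    return sum(size * count for size, count in blocks.items())

def largest_free_block_kb(blocks):
    """Largest block size that still has at least one free block, or 0."""
    sizes = [size for size, count in blocks.items() if count > 0]
    return max(sizes) if sizes else 0

# Zone:Normal and Zone:DMA lines copied from the Sep 13 dump in this report.
normal = parse_buddy_line(
    "25604*4kB 6252*8kB 205*16kB 0*32kB 0*64kB 0*128kB 0*256kB "
    "0*512kB 0*1024kB 0*2048kB 0*4096kB = 155712kB)")
dma = parse_buddy_line(
    "1*4kB 3*8kB 4*16kB 2*32kB 1*64kB 2*128kB 2*256kB 1*512kB "
    "0*1024kB 1*2048kB 2*4096kB = 11740kB)")

print(total_free_kb(normal), largest_free_block_kb(normal))
print(total_free_kb(dma), largest_free_block_kb(dma))
```

Read this way, the Normal zone in that dump has about 152MB free but nothing larger than a 16kB block, while the much smaller DMA zone still holds 4096kB blocks.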
Greg, I have built a test kernel that will display the memory stats and do a
kernel stack backtrace when __alloc_pages fails so I can see exactly what code
is causing the memory allocation failures. Please install this test kernel so
we can try to fix the client that is not decreasing the size of the allocation.
Removing the break; in build_zonelists causes other problems like OOM kills, DMA
zone allocation failures and kswapd run-away due to the DMA zone getting
exhausted with kernel data structures in the slabcache. This test kernel is
located here:
>>>http://people.redhat.com/~lwoodman/.for_oracle/
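The three fallback behaviours discussed in this report (the stock `break;` under CONFIG_HIGHMEM64G, the attached patch that removes it, and the has_highmem variant proposed in the diff) can be compared with a toy model. This is a Python sketch of the switch fall-through only, not kernel code: the function name, the `policy` strings and the zone sizes are hypothetical, chosen for illustration.

```python
ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM = 0, 1, 2

def build_zonelist(k, zone_sizes, policy):
    """Toy model of the fall-through switch in build_zonelists.

    k          -- highest zone the allocation may start in
    zone_sizes -- {zone: size}; zones of size 0 are skipped
    policy     -- "stock": always break after ZONE_NORMAL (CONFIG_HIGHMEM64G)
                  "no_break": never break (the attached patch)
                  "highmem_only": break only for k == ZONE_HIGHMEM when a
                  HighMem zone exists (the has_highmem variant)
    """
    zonelist = []
    has_highmem = False
    if k >= ZONE_HIGHMEM and zone_sizes.get(ZONE_HIGHMEM, 0):
        zonelist.append(ZONE_HIGHMEM)
        has_highmem = True
    if k >= ZONE_NORMAL:
        if zone_sizes.get(ZONE_NORMAL, 0):
            zonelist.append(ZONE_NORMAL)
        if policy == "stock":
            return zonelist
        if policy == "highmem_only" and k == ZONE_HIGHMEM and has_highmem:
            return zonelist
    # Reached by fall-through, or directly when k == ZONE_DMA.
    if zone_sizes.get(ZONE_DMA, 0):
        zonelist.append(ZONE_DMA)
    return zonelist

# Hypothetical sizes; only "non-zero" matters to the model.
sizes = {ZONE_DMA: 4096, ZONE_NORMAL: 225280, ZONE_HIGHMEM: 1343472}

for policy in ("stock", "no_break", "highmem_only"):
    print(policy,
          build_zonelist(ZONE_NORMAL, sizes, policy),
          build_zonelist(ZONE_HIGHMEM, sizes, policy))
```

In this model only the "no_break" policy lets allocations that start in HighMem reach the DMA zone, which is the behaviour the comments above want to avoid, while "highmem_only" keeps the DMA fallback for allocations that start in the Normal zone.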
Removing the break from the build_zonelists code actually causes the problem to surface sooner, so this is clearly not the answer. Looking for more detailed information, but we're not going to be able to deploy the above kernel until next week.

Results finally... it's failing on a 0-order alloc:

Aug 11 18:59:52 stajh01 kernel: Mem-info:
Aug 11 18:59:58 stajh01 kernel: Zone:DMA freepages: 2913 min: 0 low: 0 high: 0
Aug 11 18:59:59 stajh01 kernel: Zone:Normal freepages: 0 min: 1279 low: 4544 high: 6304
Aug 11 18:59:59 stajh01 kernel: Zone:HighMem freepages: 290 min: 255 low: 22016 high: 33024
Aug 11 19:00:00 stajh01 kernel: Free pages: 3203 ( 290 HighMem)
Aug 11 19:00:01 stajh01 kernel: ( Active: 322313/902009, inactive_laundry: 235580, inactive_clean: 27008, free: 3203 )
Aug 11 19:00:02 stajh01 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:2913
Aug 11 19:00:03 stajh01 kernel: aa:34128 ac:86303 id:22350 il:4666 ic:3133 fr:0
Aug 11 19:00:03 stajh01 kernel: aa:49418 ac:152464 id:879659 il:230914 ic:23875 fr:290
Aug 11 19:00:04 stajh01 kernel: 1*4kB 2*8kB 3*16kB 4*32kB 3*64kB 2*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11652kB)
Aug 11 19:00:05 stajh01 kernel: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB)
Aug 11 19:00:05 stajh01 kernel: 14*4kB 12*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1160kB)
Aug 11 19:00:06 stajh01 kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Aug 11 19:00:06 stajh01 kernel: 33170 pages of slabcache
Aug 11 19:00:07 stajh01 kernel: 424 pages of kernel stacks
Aug 11 19:00:08 stajh01 kernel: 0 lowmem pagetables, 5109 highmem pagetables
Aug 11 19:00:09 stajh01 kernel: Free swap: 10288440kB
Aug 11 19:00:10 stajh01 kernel: 1638400 pages of RAM
Aug 11 19:00:11 stajh01 kernel: 1343472 pages of HIGHMEM
Aug 11 19:00:11 stajh01 kernel: 96966 reserved pages
Aug 11 19:00:12 stajh01 kernel: 587042 pages shared
Aug 11 19:00:12 stajh01 kernel: 0 pages swap cached
Aug 11 19:00:14 stajh01 kernel: __alloc_pages: 0-order allocation failed.

Note the logs don't mention being unable to allocate an RPC buffer, so I'm not sure we're seeing the same problem we were, or if this is just __alloc_pages failing on some other alloc (more likely). Waiting to get the full logs.

Greg, does this bugzilla need to remain private? If not, please uncheck the "Oracle Confidential Group" box below. Thanks.

The problem here is that the caller of __alloc_pages is passing GFP_ATOMIC, which prevents the allocator from using inactive clean pages. Greg was trying to reproduce this problem with a debug kernel that printed out the kernel stack backtrace so we could see exactly who is making the call and determine if that call can be changed to pass GFP_KERNEL. This would allow the allocator to use pages on the inactive clean list rather than just free pages, which are totally depleted.

aa:34128 ac:86303 id:22350 il:4666 ic:3133 fr:0

Larry Woodman

Moving development for the reopened bug over to bug 176849.

This bug was fixed on RHEL3 U4 as mentioned in Comment #20. The issue reported in Comments #21 and following is a separate problem; see bug 176849 for further details.
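The GFP_ATOMIC explanation above can be illustrated with the Normal-zone counters quoted in that comment. The Python sketch below is a deliberately simplified model based only on the statement that GFP_ATOMIC callers cannot use inactive clean pages while GFP_KERNEL callers can. It ignores watermarks, reclaim and allocation order, and the function names are hypothetical.

```python
def parse_zone_counters(line):
    """Parse a per-zone counter line such as 'aa:1 ac:2 id:3 il:4 ic:5 fr:6'."""
    return {key: int(value)
            for key, value in (field.split(":") for field in line.split())}

def pages_usable(counters, may_use_inactive_clean):
    """Pages an allocation could draw on in this simplified model.

    'fr' is the zone's free page count and 'ic' its inactive clean count
    (these sum to the 'free' and 'inactive_clean' totals in the same dump).
    """
    usable = counters["fr"]
    if may_use_inactive_clean:
        # Assumption from the comment above: only non-atomic (GFP_KERNEL-style)
        # callers may take pages from the inactive clean list.
        usable += counters["ic"]
    return usable

# Zone:Normal counters from the Aug 11 dump.
normal = parse_zone_counters("aa:34128 ac:86303 id:22350 il:4666 ic:3133 fr:0")

print("GFP_ATOMIC-style caller:", pages_usable(normal, False))
print("GFP_KERNEL-style caller:", pages_usable(normal, True))
```

Under these assumptions an atomic caller sees no usable pages in the Normal zone, while a caller allowed to use the inactive clean list would have 3133 candidates, which matches the reasoning for changing the call to GFP_KERNEL where possible.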