Bug 159326
Summary: RSS limited to 1.8GB if process pinned to one CPU
Product: Red Hat Enterprise Linux 3
Reporter: Erich Focht <efocht>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: bakerg3, dshaks, jnomura, jparadis, peterm, petrides, riek, tburke
Keywords: FutureFeature
Hardware: x86_64
OS: Linux
Fixed In Version: RHSA-2006-0144
Doc Type: Enhancement
Last Closed: 2006-03-15 16:01:35 UTC
Bug Blocks: 168424, 173390
Description
Erich Focht
2005-06-01 17:46:02 UTC
Created attachment 115045
oncpu.c: used for pinning the current shell and its children to a CPU
Created attachment 115047
eatmem.c: allocates and touches memory
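The two attachments are not reproduced in this report. As a rough illustration of what the reproduction does (pin the process to one CPU, then allocate and touch more memory than a single node holds), here is a minimal Python sketch. It is not the attached C code: the names pin_to_cpu and eat_memory are hypothetical, and the 4 kB page size is an assumption.

```python
import os

PAGE_SIZE = 4096  # assumed page size in bytes


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU (Linux only).

    Children forked afterwards inherit the same affinity mask.
    """
    os.sched_setaffinity(0, {cpu})


def eat_memory(megabytes):
    """Allocate a buffer and write one byte per page so each page is touched."""
    buf = bytearray(megabytes * 1024 * 1024)
    for offset in range(0, len(buf), PAGE_SIZE):
        buf[offset] = 1
    return buf
```

Calling pin_to_cpu(0) followed by eat_memory(4000) would roughly correspond to running oncpu and then eatmem with -m 4000, although the real tools take further options that this sketch does not model.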
Does the same problem occur when you boot with "numa=off"?

The problem does not occur with "numa=off". This means it is clearly a problem in the memory allocation with NUMA. Switching off NUMA on all Opteron machines is not an option, because it would bring significant performance degradation for all other jobs. I tested on a dual CPU machine and confirm that the problem occurs here, too. The problem does NOT occur on SUSE Professional 9.0 (kernel 2.4.21-209-smp) and of course it does not occur on SLES9 (with 2.6 kernel). Something is broken in the RHEL3 NUMA memory allocation (or zones list) part.
Regards, Erich

Hi, I found that the mentioned problem does NOT occur with RHEL3 update2 (kernel version 2.4.21-15.ELsmp)! So there was some patch in between which introduced this bad NUMA problem.
Regards, Erich

OK, thanks Erich. I'll get to the bottom of what changed that caused this degradation. So if I run your test program on RHEL3-U2 it works OK, but it does not on later kernels?
Larry Woodman

The last kernel on which the problem does not occur is 2.4.21-15.0.4.ELsmp. The next kernel (-20.ELsmp) shows the problem.
Regards, Erich Focht

Erich, I'm not seeing the same behavior internally. I suspect that it's because my memory sizes/node counts are different than yours. Can you please grab me an AltSysrq-M output before and during the running of your eatmem program so I can see exactly how much memory is on your system and how many nodes it's spread across. I will try to configure a system with the same memory/node layout as yours so I can reproduce it here.
Thanks, Larry Woodman

Hi Larry, the problem occurs when the memory of node 0 is exhausted. So you can simply tune the eatmem -m parameter (in megabytes) to be bigger than the memory of node 0. I have 2GB/CPU, so I chose 4000MB.
Here is the output I get:

----- before running eatmem -----
SysRq : Show Memory
Mem-info:
Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Zone:Normal freepages:447053 min: 1278 low: 9213 high: 13308
Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Zone:Normal freepages:496597 min: 1278 low: 9213 high: 13308
Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Zone:Normal freepages:485325 min: 1279 low: 13310 high: 19453
Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Zone:DMA freepages: 2616 min: 0 low: 0 high: 0
Zone:Normal freepages:491046 min: 1278 low: 9149 high: 13212
Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Free pages: 1922637 ( 0 HighMem)
( Active: 13716/9342, inactive_laundry: 8466, inactive_clean: 1536, free: 1922637 )
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:1098 ac:2322 id:3144 il:473 ic:512 fr:447053
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:1207 ac:2551 id:2635 il:366 ic:448 fr:496597
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:105 ac:2003 id:143 il:7144 ic:0 fr:485325
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:0 ac:0 id:0 il:0 ic:0 fr:2616
aa:1680 ac:2750 id:3420 il:483 ic:576 fr:491046
aa:0 ac:0 id:0 il:0 ic:0 fr:0
133*4kB 36*8kB 18*16kB 13*32kB 1*64kB 10*128kB 8*256kB 3*512kB 0*1024kB 0*2048kB 435*4096kB = 17882
Swap cache: add 337136, delete 329971, find 90/157, race 0+0
38930 pages of slabcache
150 pages of kernel stacks
92 lowmem pagetables, 220 highmem pagetables
Free swap: 2066496kB
2359292 pages of RAM
1958592 free pages
347660 reserved pages
13098 pages shared
7165 pages swap cached
Buffer memory: 12760kB
Cache memory: 103372kB
CLEAN: 648 buffers, 2580 kbyte, 55 used (last=518), 0 locked, 0 dirty 0 delay
LOCKED: 1 buffers, 4 kbyte, 1 used (last=1), 1 locked, 0 dirty 0 delay
DIRTY: 26 buffers, 104 kbyte, 26 used (last=26), 0 locked, 25 dirty 0 delay

-------- while running the eatmem program: --------
SysRq : Show Memory
Mem-info:
Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Zone:Normal freepages: 1293 min: 1278 low: 9213 high: 13308
Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Zone:Normal freepages:497421 min: 1278 low: 9213 high: 13308
Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Zone:DMA freepages: 0 min: 0 low: 0 high: 0
Zone:Normal freepages:485367 min: 1279 low: 13310 high: 19453
Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Zone:DMA freepages: 2616 min: 0 low: 0 high: 0
Zone:Normal freepages:490887 min: 1278 low: 9149 high: 13212
Zone:HighMem freepages: 0 min: 0 low: 0 high: 0
Free pages: 1477568 ( 0 HighMem)
( Active: 353537/93554, inactive_laundry: 26774, inactive_clean: 4608, free: 1477568 )
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:340382 ac:2335 id:87309 il:22396 ic:0 fr:1277
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:1405 ac:2579 id:2636 il:366 ic:448 fr:497421
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:157 ac:2005 id:154 il:3559 ic:3584 fr:485369
aa:0 ac:0 id:0 il:0 ic:0 fr:0
aa:0 ac:0 id:0 il:0 ic:0 fr:2616
aa:1821 ac:2853 id:3425 il:483 ic:576 fr:490887
aa:0 ac:0 id:0 il:0 ic:0 fr:0
0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 5112kB)
Swap cache: add 535468, delete 502704, find 122/259, race 0+0
37945 pages of slabcache
158 pages of kernel stacks
112 lowmem pagetables, 1460 highmem pagetables
Free swap: 1273508kB
2359292 pages of RAM
1512843 free pages
347660 reserved pages
14378 pages shared
34850 pages swap cached
Buffer memory: 13208kB
Cache memory: 197992kB
CLEAN: 615 buffers, 2448 kbyte, 58 used (last=606), 0 locked, 0 dirty 0 delay
DIRTY: 40 buffers, 160 kbyte, 40 used (last=40), 0 locked, 39 dirty 0 delay

Regards, Erich Focht

Oh, by the way: in the example above the eatmem process was not pinned to CPU0! It was running on CPU3.
Erich

Hi, we found that the problem (swapping) also occurs on RHEL3 update2 on an MSI 9245 server.
The reason why it didn't occur for the Tyan TX46 board was that the kernel didn't recognise the NUMA structure and switched off NUMA support right at boot time. So on TX46 with RHEL3u2 we have only one pseudo-node holding all memory instead of 4 nodes. We see the NUMA artefacts as expected. An MSI 9245 Dual-Opteron server gets recognised as a NUMA machine with 2 nodes and therefore runs into the same trouble as described in the first posting.
Regards, Erich

Erich, thanks for the update, that makes much more sense. There was nothing added to the kernel that caused this to start happening. The 2.4 kernel NUMA implementation has always allowed the total exhaustion of one or more nodes before touching other nodes.
Larry

Larry, I think the notable feature of this bug is that apparently we pin the thread's RSS and might even OOM-kill it even though there's memory available on other nodes. When a thread fills up its home node, it *should* fail over to using memory on other nodes. Apparently the latter is not happening correctly...

I am still working this issue; the problem is in the 2.4 kernel build_zonelists(). It does not build the zonelists to spill over to the next node when a node is exhausted, the way the 2.6 kernel does. Unfortunately, it can't be changed without creating per-node kswapds like 2.6, and that won't be done for RHEL3. I am attempting to change __alloc_pages() to look at other zones when they are on different nodes without changing kswapd.
Larry Woodman

This was an NEC-Support bug that Larry Woodman is working. Engineering has apparently suggested it is a feature instead, so NEC opened another Issue Tracker to track it... Adding to U7Proposed list for consideration.

This is also a highly desirable bug/feature request for a group that Brent Fox of RedHat is responsible for, issue #75010.

Using the technique discussed in this ticket, it appears that the RHEL3 kernel (2.4.21-37) exhibits the "bad" behavior, whereas the default "Linus" Linux kernel (2.4.31) works as expected.
Is this the behavior that RedHat sees/expects as well?

*** 2.4.31 NUMA behavior
[root@i-opteron-dt log]# uname -a
Linux i-opteron-dt 2.4.31 #11 SMP Wed Nov 9 11:57:28 CST 2005 x86_64 x86_64 x86_64 GNU/Linux
Nov 10 04:24:59 i-opteron-dt kernel: Linux version 2.4.31 (root@i-opteron-dt) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-52)) #11 SMP Wed Nov 9 11:57:28 CST 2005
Nov 10 04:24:59 i-opteron-dt kernel: Scanning NUMA topology in Northbridge 24
Nov 10 04:24:59 i-opteron-dt kernel: Node 0 MemBase 0000000000000000 Limit 00000000ffffffff
Nov 10 04:24:59 i-opteron-dt kernel: Node 1 MemBase 0000000100000000 Limit 00000001ffffffff
Nov 10 04:24:59 i-opteron-dt kernel: node 1 shift 24 addr 100000000 conflict 0
Nov 10 04:24:59 i-opteron-dt kernel: node 1 shift 25 addr 1fe000000 conflict 0
Nov 10 04:24:59 i-opteron-dt kernel: Using node hash shift of 26
Nov 10 04:24:59 i-opteron-dt kernel: Bootmem setup node 0 0000000000000000-00000000ffffffff
Nov 10 04:24:59 i-opteron-dt kernel: Bootmem setup node 1 0000000100000000-00000001ffffffff
Nov 10 04:24:59 i-opteron-dt kernel: Scan SMP from 0000010000000000 for 1024 bytes.
Nov 10 04:24:59 i-opteron-dt kernel: Scan SMP from 000001000009fc00 for 1024 bytes.
Nov 10 04:24:59 i-opteron-dt kernel: Scan SMP from 00000100000f0000 for 65536 bytes.
Nov 10 04:24:59 i-opteron-dt kernel: found SMP MP-table at 00000000000ff780
Nov 10 04:24:59 i-opteron-dt kernel: hm, page 000ff000 reserved twice.
Nov 10 04:24:59 i-opteron-dt kernel: hm, page 00100000 reserved twice.
Nov 10 04:24:59 i-opteron-dt kernel: hm, page 000fa000 reserved twice.
Nov 10 04:24:59 i-opteron-dt kernel: hm, page 000fb000 reserved twice.
Nov 10 04:24:59 i-opteron-dt kernel: setting up node 0 0-fffff
Nov 10 04:24:59 i-opteron-dt kernel: On node 0 totalpages: 1048575
Nov 10 04:24:59 i-opteron-dt kernel: zone(0): 4096 pages.
Nov 10 04:24:59 i-opteron-dt kernel: zone(1): 1044479 pages.
Nov 10 04:24:59 i-opteron-dt kernel: zone(2): 0 pages.
Nov 10 04:24:59 i-opteron-dt kernel: setting up node 1 100000-1fffff
Nov 10 04:24:59 i-opteron-dt kernel: On node 1 totalpages: 1048575
Nov 10 04:24:59 i-opteron-dt kernel: zone(0): 0 pages.
Nov 10 04:24:59 i-opteron-dt kernel: zone(1): 1048575 pages.
Nov 10 04:24:59 i-opteron-dt kernel: zone(2): 0 pages.
[root@i-opteron-dt log]# cat /proc/cmdline
ro root=/dev/hda2 apm=power-off hdc=ide-scsi
[root@i-opteron-dt tmp]# ./oncpu 0
[root@i-opteron-dt tmp]# ./eatmem -m 7000 -c 10 -p 90000 -v
10: Writing 7000MB elapse=8 pp=0.004 ms
9: Writing 7000MB elapse=5 pp=0.003 ms
8: Writing 7000MB elapse=4 pp=0.002 ms
7: Writing 7000MB elapse=4 pp=0.002 ms
6: Writing 7000MB elapse=5 pp=0.003 ms
5: Writing 7000MB elapse=4 pp=0.002 ms
4: Writing 7000MB elapse=4 pp=0.002 ms
3: Writing 7000MB elapse=4 pp=0.002 ms
2: Writing 7000MB elapse=4 pp=0.002 ms
1: Writing 7000MB elapse=5 pp=0.003 ms

*** 2.4.21-37 NUMA behavior
[root@i-opteron-dt greg]# uname -a
Linux i-opteron-dt 2.4.21-37.ELsmp #1 SMP Wed Sep 7 13:32:18 EDT 2005 x86_64 unknown unknown GNU/Linux
Nov 10 10:45:22 i-opteron-dt kernel: Linux version 2.4.21-37.ELsmp (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-53)) #1 SMP Wed Sep 7 13:32:18 EDT 2005
Nov 10 10:45:22 i-opteron-dt kernel: Scanning NUMA topology in Northbridge 24
Nov 10 10:45:22 i-opteron-dt kernel: Number of nodes: 2 (10010)
Nov 10 10:45:22 i-opteron-dt kernel: Node 0 MemBase 0000000000000000 Limit 00000000ffffffff
Nov 10 10:45:22 i-opteron-dt kernel: Node 1 MemBase 0000000100000000 Limit 00000001ffffffff
Nov 10 10:45:22 i-opteron-dt kernel: node 1 shift 24 addr 100000000 conflict 2
Nov 10 10:45:22 i-opteron-dt kernel: node 1 shift 25 addr 1fe000000 conflict 0
Nov 10 10:45:22 i-opteron-dt kernel: Using node hash shift of 26
Nov 10 10:45:22 i-opteron-dt kernel: Bootmem setup node 0 0000000000000000-00000000ffffffff
Nov 10 10:45:22 i-opteron-dt kernel: Bootmem setup node 1 0000000100000000-00000001ffffffff
Nov 10 10:45:22 i-opteron-dt kernel: found SMP MP-table at 000ff780
Nov 10 10:45:22 i-opteron-dt irqbalance: irqbalance startup succeeded
Nov 10 10:45:22 i-opteron-dt kernel: hm, page 000ff000 reserved twice.
Nov 10 10:45:22 i-opteron-dt kernel: hm, page 00100000 reserved twice.
Nov 10 10:45:22 i-opteron-dt kernel: hm, page 000fa000 reserved twice.
Nov 10 10:45:22 i-opteron-dt kernel: hm, page 000fb000 reserved twice.
Nov 10 10:45:22 i-opteron-dt kernel: setting up node 0 0-fffff
Nov 10 10:45:22 i-opteron-dt kernel: On node 0 totalpages: 1048575
Nov 10 10:45:22 i-opteron-dt kernel: zone(0): 4096 pages.
Nov 10 10:45:22 i-opteron-dt kernel: zone(1): 1044479 pages.
Nov 10 10:45:22 i-opteron-dt kernel: zone(2): 0 pages.
Nov 10 10:45:22 i-opteron-dt kernel: setting up node 1 100000-1fffff
Nov 10 10:45:22 i-opteron-dt kernel: On node 1 totalpages: 1048575
Nov 10 10:45:22 i-opteron-dt kernel: zone(0): 0 pages.
Nov 10 10:45:22 i-opteron-dt kernel: zone(1): 1048575 pages.
Nov 10 10:45:22 i-opteron-dt kernel: zone(2): 0 pages.
[root@i-opteron-dt tmp]# cat /proc/cmdline
ro root=LABEL=/ apm=power-off hdc=ide-scsi
[root@i-opteron-dt tmp]# ./eatmem -m 7000 -c 10 -p 90000 -v
10: Writing 7000MB elapse=129 pp=0.072 ms
9: Writing 7000MB <...doesn't complete in 'reasonable (10m)' timeframe....>

Repeating these tests shows consistent results. Comments?

...sorry, forgot to include this bit regarding the HW the info from above was collected on.
Forgot to include this bit regarding test below:
[root@i-opteron-dt root]# cat /proc/meminfo
total: used: free: shared: buffers: cached:
Mem: 7931113472 45940736 7885172736 0 2535424 10080256
Swap: 16063963136 8347648 16055615488
MemTotal: 7745228 kB
MemFree: 7700364 kB
MemShared: 0 kB
Buffers: 2476 kB
Cached: 9056 kB
SwapCached: 788 kB
Active: 6420 kB
Inactive: 5952 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 7745228 kB
LowFree: 7700364 kB
SwapTotal: 15687464 kB
SwapFree: 15679312 kB
[root@i-opteron-dt root]# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 248
stepping : 8
cpu MHz : 2191.279
cache size : 1024 KB
physical id : 0
siblings : 1
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4377.80
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 248
stepping : 8
cpu MHz : 2191.279
cache size : 1024 KB
physical id : 0
siblings : 1
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4377.80
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits

A fix for this problem has just been committed to the RHEL3 U7 patch pool this evening (in kernel version 2.4.21-37.12.EL). To enable an improved NUMA-friendly page allocation policy, please set /proc/sys/vm/numa_memory_allocator to 1 via the "sysctl" command (or put "vm.numa_memory_allocator = 1" in /etc/sysctl.conf).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2006-0144.html
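
For reference, the 1.8GB figure in the summary can be checked against the two SysRq dumps earlier in this report. The sketch below is only an interpretation, not something stated in the thread: it assumes 4 kB pages, and the helper name normal_zone_free_mb is hypothetical. It converts the Zone:Normal freepages counts to MB and compares the "before" and "while running" values.

```python
import re

PAGE_KB = 4  # assumed page size in kB


def normal_zone_free_mb(sysrq_text):
    """Free memory in MB for every Zone:Normal entry, in order of appearance."""
    pages = re.findall(r"Zone:Normal\s+freepages:\s*(\d+)", sysrq_text)
    return [int(p) * PAGE_KB / 1024 for p in pages]


# Zone:Normal lines copied from the two SysRq dumps in this report.
before = """
Zone:Normal freepages:447053 min: 1278 low: 9213 high: 13308
Zone:Normal freepages:496597 min: 1278 low: 9213 high: 13308
Zone:Normal freepages:485325 min: 1279 low: 13310 high: 19453
Zone:Normal freepages:491046 min: 1278 low: 9149 high: 13212
"""
during = """
Zone:Normal freepages: 1293 min: 1278 low: 9213 high: 13308
Zone:Normal freepages:497421 min: 1278 low: 9213 high: 13308
Zone:Normal freepages:485367 min: 1279 low: 13310 high: 19453
Zone:Normal freepages:490887 min: 1278 low: 9149 high: 13212
"""

free_before = normal_zone_free_mb(before)
free_during = normal_zone_free_mb(during)

# MB consumed per zone between the two dumps (negative means more free pages).
consumed = [b - d for b, d in zip(free_before, free_during)]
for index, mb in enumerate(consumed):
    print(f"Normal zone #{index}: {mb:.2f} MB consumed")
```

Under these assumptions only the first Normal zone listed shrinks, by about 1741 MB (roughly 1.8 GB in decimal units), and it stops just above its min watermark of 1278 pages, while the other three zones stay essentially unchanged. That is consistent with the process being held to the memory of a single node.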