Bug 525941
Summary: | OOM on i686 kernel-smp | |
---|---|---|---
Product: | Red Hat Enterprise Linux 4 | Reporter: | Qian Cai <qcai>
Component: | kernel | Assignee: | Jiri Pirko <jpirko>
Status: | CLOSED ERRATA | QA Contact: | Petr Beňas <pbenas>
Severity: | low | Docs Contact: |
Priority: | low | |
Version: | 4.7.z | CC: | lwoodman, pbenas, pstehlik, rkhan, vgoyal, vmayatsk
Target Milestone: | rc | Keywords: | Reopened
Target Release: | --- | |
Hardware: | i686 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2011-02-16 16:05:11 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Qian Cai
2009-09-27 12:03:55 UTC
After pulling this patch out of the 2.6.9-78.0.14.EL kernel, the OOM is gone:

linux-2.6.9-pidhashing-fix-alloc_pidmap.patch

Does this problem also contribute to this bug?

Bug 510371 - task_struct (and related slabcache) grow continously in RHEL 4
https://bugzilla.redhat.com/show_bug.cgi?id=510371

I was trying to reproduce this on intel-s5000phb-01.rhts.eng.bos.redhat.com (x86_64) with no luck. However, on gs-bl460cg1-01.rhts.bos.redhat.com (i686) with the 2.6.9-78.0.14.ELsmp kernel it is triggered in a moment.

I tried taking 2.6.9-78 and adding only linux-2.6.9-pidhashing-fix-alloc_pidmap.patch. This kernel works just fine. It is currently installed on gs-bl460cg1-01.rhts.bos.redhat.com as 2.6.9-78.EL.testsmp.

I also looked into the code, and the patched alloc_pidmap() is almost identical to the ones in RHEL5 and upstream. Therefore I think that the regression is brought in by a different patch and my patch only uncovers the issue with the pidseqchk app. Thoughts?

Very true, I also didn't find any divergence from upstream in the patch.

Jiri, I have tried the 2.6.9-78.EL.testsmp kernel you mentioned, but it does not look like the patch is applied, since BZ #479182 is still reproducible there.

# echo 99999 >/proc/sys/kernel/pid_max
# ./pidseqchk
...
sequence break: 13868 - 32767 (new 65536)
sequence break: 65536 - 70188 (new 70190)
sequence break: 70190 - 98303 (new 131072)
sequence break: 131072 - 131072 (new 300)
...

I have rebuilt the kernels and confirmed that:

* 2.6.9-78.EL.smp + linux-2.6.9-pidhashing-fix-alloc_pidmap.patch = OK
* 2.6.9-78.0.5.EL.smp + linux-2.6.9-pidhashing-fix-alloc_pidmap.patch = OOM

I am continuing to bisect.

Update: Indeed, my 2.6.9-78.EL.testsmp didn't include linux-2.6.9-pidhashing-fix-alloc_pidmap.patch. I put it there as linux-kernel-test.patch, but it looks like it is ignored. Cai did it the same way, and therefore he got a negative result with 2.6.9-78.EL.smp in comment #8.
I did another build of 2.6.9-78.EL.smp which includes my patch (2.6.9-78.EL.test2smp), and I get the OOM killer too. So it looks like there really is something wrong with this patch. Investigation continues...

According to slabtop, size-4096 goes through the roof. Therefore the suspect is this patch line:

+ void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);

During the pidseqchk run, size-4096 grows to ~300MB. This is allocated in lowmem. Because gs-bl460cg1-01 has 63GB of memory, the lowmem with 2.6.9-78.EL.test2smp is small, and the pidseqchk run fills all of it and eventually triggers the OOM killer. This doesn't happen on systems with e.g. 1GB of RAM (intel-s5000phb-01), where the lowmem is bigger. With the hugemem kernel the lowmem is significantly bigger and the OOM doesn't appear. In any case, booting the 2.6.9-78.EL.test2smp kernel on gs-bl460cg1-01 gives a warning that you are using >16GB of RAM.

Here are the relevant parts of /proc/meminfo for each host/kernel:

gs-bl460cg1-01.rhts.bos.redhat.com 2.6.9-78.EL.test2hugemem
MemTotal: 65526128 kB
MemFree: 65400908 kB
LowTotal: 2873716 kB
LowFree: 2829260 kB
-------------------------------------------------------------
gs-bl460cg1-01.rhts.bos.redhat.com 2.6.9-78.EL.test2smp
MemTotal: 65528440 kB
MemFree: 65406300 kB
LowTotal: 387368 kB
LowFree: 343580 kB
-------------------------------------------------------------
intel-s5000phb-01.rhts.eng.bos.redhat.com 2.6.9-78.EL.test2smp
MemTotal: 1028768 kB
MemFree: 909168 kB
LowTotal: 903516 kB
LowFree: 866468 kB

Hope this clears it up. Closing this as NOTABUG. Feel free to reopen.

Created attachment 363184 [details]
Free Low memory watcher
This application shows the low memory level and how it fluctuates.
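The attachment itself is not reproduced in this report. As a rough illustration of the same idea, here is a minimal Python sketch that reads the LowTotal and LowFree fields from /proc/meminfo text. The function names (parse_meminfo, low_memory_used_kb, watch) are made up for this example, and the sketch assumes a highmem 32-bit kernel, since other kernels may not export these two fields at all.

```python
import time


def parse_meminfo(text):
    """Return {field: number} for every 'Name: <digits> ...' line of meminfo text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def low_memory_used_kb(fields):
    """Low memory currently in use, in kB (LowTotal minus LowFree)."""
    return fields["LowTotal"] - fields["LowFree"]


def watch(path="/proc/meminfo", interval=1.0, samples=5):
    """Print LowFree a few times so that its fluctuation is visible."""
    for _ in range(samples):
        with open(path) as f:
            fields = parse_meminfo(f.read())
        if "LowFree" not in fields or "LowTotal" not in fields:
            # Assumption: the fields are missing on kernels without highmem.
            print("LowTotal/LowFree not reported by this kernel")
            return
        print(f"LowFree: {fields['LowFree']} kB of {fields['LowTotal']} kB")
        time.sleep(interval)
```

Calling watch() while pidseqchk runs would show LowFree shrinking. For scale, taking the ~300MB reported for size-4096 as 307200 kB, that is roughly 79% of LowTotal on gs-bl460cg1-01 with the smp kernel (387368 kB), but only about 34% on intel-s5000phb-01 (903516 kB) and about 11% with the hugemem kernel (2873716 kB).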
I did some research on RHEL5 and found that kernel-PAE also has only around 300M of low memory on this machine. However, the OOM cannot be reproduced there, because it does not allow pid_max to be set above 32768 on 32-bit.

# echo -n 32769 >/proc/sys/kernel/pid_max
-bash: echo: write error: Invalid argument

Therefore, how about bringing similar behaviour into RHEL4 as well, i.e. making it impossible to set pid_max above 32768 on 32-bit kernels other than hugemem?

Created attachment 363384 [details]
untested patch
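The patch is not shown in this report, so the following is only a sketch of the behaviour being proposed, not the attached change. It models in Python the check seen on RHEL5 above: a write of 32769 to pid_max fails with "Invalid argument", so the ceiling is assumed to be 32768 (0x8000). The names PID_MAX_LIMIT_32BIT and check_pid_max are hypothetical, and no lower bound is modelled because the report does not mention one.

```python
import errno

# Assumption: 0x8000 (32768) is the ceiling on 32-bit kernels, inferred from
# the rejected write of 32769 quoted in this report. The name is illustrative.
PID_MAX_LIMIT_32BIT = 0x8000


def check_pid_max(requested, limit=PID_MAX_LIMIT_32BIT):
    """Return the requested pid_max if it is within the limit, else raise EINVAL."""
    if requested > limit:
        # Mirrors the "write error: Invalid argument" that the shell reported.
        raise OSError(errno.EINVAL, "Invalid argument")
    return requested
```

Under this model check_pid_max(32768) is accepted, while check_pid_max(99999), the value used to trigger the OOM, is rejected before any additional pidmap pages could be needed.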
Re comment #14: Okay Cai, that seems reasonable. The following two commits solve the problem:

http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.1938.166.68
http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.1938.166.69

I've already backported these to RHEL4 and I'm going to test them.

Jiri, one question though. So it is no longer allowed to set pid_max above 0x8000, even with the hugemem kernel? Since it has the 4G/4G split, I am not sure whether any existing RHEL4 customers expect a larger pid_max in that setup.

(In reply to comment #18)
> Jiri, one question though. So it is no longer allowed to set pid_max above
> 0x8000, even with the hugemem kernel? Since it has the 4G/4G split, I am not
> sure whether any existing RHEL4 customers expect a larger pid_max in that
> setup.

Correct. This is the way it's done in the upstream kernel and also in the RHEL5 kernel.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Committed in 89.43.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Reproduced in 89.42.ELsmp and verified in 89.43.ELsmp.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html