Bug 193542
| Summary: | Oom killer killing processes with free memory available? |
|---|---|
| Product: | Red Hat Enterprise Linux 4 |
| Component: | kernel |
| Version: | 4.0 |
| Status: | CLOSED NOTABUG |
| Severity: | medium |
| Priority: | medium |
| Hardware: | All |
| OS: | Linux |
| Reporter: | James Pearson <james-p> |
| Assignee: | Larry Woodman <lwoodman> |
| QA Contact: | Brian Brock <bbrock> |
| CC: | allance.chen, averma, buckh, djuran, jbaron, jburke, kajtzu, rpersai, tao |
| Doc Type: | Bug Fix |
| Last Closed: | 2007-07-10 14:35:41 UTC |
Description
James Pearson
2006-05-30 11:40:35 UTC
Created attachment 130221 [details]
Dmesg output showing ooms
Can you do me a favor and try: `echo 100 > /proc/sys/vm/lower_zone_protection`? This will cause page reclamation to happen sooner, thus providing more 'protection' for the zones. I've seen this tuning work in other similar circumstances, so hopefully it will work here as well. Thanks.

Thanks - /proc/sys/vm/lower_zone_protection is now set to 100. I'll monitor what happens over the next day or so - we usually get a few potential ooms a day ...

No more ooms in the last 48 hours since setting lower_zone_protection to 100. Normally, we would have seen a few more by now ... but I'll keep my eye on the situation for a few more days. However, it looks like it may have 'fixed' the problem - thanks

Jim, can you get an AltSysrq-M output with lower_zone_protection set to 100 while the system is under heavy load (when it used to OOM kill)? Thanks, Larry

Output from AltSysrq-M - the machine was 'busy' (doing what it normally does) when this was run, but I have no idea whether it would have OOMed at this point:

```
SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Free pages: 119068kB (1600kB HighMem)
Active:77149 inactive:903018 dirty:35135 writeback:0 unstable:0 free:29767 slab:23031 mapped:74517 pagetables:660
DMA free:12572kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:10323225 all_unreclaimable? yes
protections[]: 0 46400 72000
Normal free:104896kB min:928kB low:1856kB high:2784kB active:64516kB inactive:588084kB present:901120kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 25600
HighMem free:1600kB min:512kB low:1024kB high:1536kB active:244080kB inactive:3023988kB present:4325376kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 3*4kB 4*8kB 3*16kB 4*32kB 3*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12572kB
Normal: 156*4kB 1342*8kB 892*16kB 251*32kB 875*64kB 113*128kB 3*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 104896kB
HighMem: 54*4kB 53*8kB 2*16kB 15*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1600kB
Swap cache: add 3241896, delete 3241760, find 5064188/5324042, race 2+3
0 bounce buffer pages
Free swap: 1964392kB
1310720 pages of RAM
819136 pages of HIGHMEM
273922 reserved pages
339074 pages shared
227 pages swap cached
```

James, is the problem 'fixed' with the 100 setting? Have you seen any new oom kills since changing that setting? Thanks.

No new ooms since setting lower_zone_protection to 100

James, any reason to keep this open?

Is setting lower_zone_protection to 100 the 'fix'? - or will the situation be different in subsequent kernel releases? (i.e. will I have to use this setting for the RHEL4U4 kernel?)

i'd like to know the answer to that as well, please

even after setting lower_zone_protection, we managed to get some serious
oom-killer activity yesterday, with what looked like plenty of free RAM.
(tech support was of little help on this, so i'm just adding to this BZ)
in the attached, note how lower_zone_protection seemed to be doing its
thing on 11 Oct, leading to page faults when running ./configure (which
is the only process that triggers this behavior that i've noticed. i
guess it's forking a lot or something, although this machine monitors
about 100 hosts with nagios, which probably does similarly, and that's
never led to an alloc failure). yesterday, 12 Oct, starting around 17:20,
however, the kernel started to run into intermittent shortfalls, despite
claiming to have plenty of high memory available, and the oom-killer was
busy
thanks for whatever help you can proffer
--buck
```
)% uname -a ; free ; /sbin/sysctl vm.lower_zone_protection
Linux foo 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:       4153528     974564    3178964          0        380      50840
-/+ buffers/cache:      923344    3230184
Swap:      4008208          0    4008208
vm.lower_zone_protection = 100
```
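For anyone else trying the workaround suggested above, here is a sketch of applying and persisting the tunable. It is not taken from this bug report: it writes to a scratch copy of the config file so it is safe to run anywhere, while on an affected RHEL 4 box the file would be `/etc/sysctl.conf` and the live writes need root.

```shell
# Sketch only: persist the lower_zone_protection workaround (value 100,
# as suggested in this bug). Uses a scratch file instead of the real
# /etc/sysctl.conf so it can be run harmlessly on any machine.
conf=./sysctl.conf.test
printf 'kernel.sysrq = 1\n' > "$conf"        # simulate existing contents

key='vm.lower_zone_protection'
# Append the setting only if it is not already present (idempotent)
grep -q "^${key}" "$conf" || echo "${key} = 100" >> "$conf"

grep "^${key}" "$conf"

# Live equivalents on the affected machine (root, RHEL 4 era kernels):
#   echo 100 > /proc/sys/vm/lower_zone_protection
#   /sbin/sysctl -w vm.lower_zone_protection=100
```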
Created attachment 138419 [details]
kern.info log

attachment mentioned in comment 12 that i now realize i shoulda just attached and commented on all at once. sorry

subject of comment 12 and comment 13, upon restart of nagios, just panic-ed: Out of memory and no killable processes, after dispatching every process (save maybe init) on the machine. unfortunately, syslog was early to be killed off in the bloodbath, so there is no log i can furnish. the last console screen indicates 3+ GB free overall, but Normal memory seems to be the problem (as almost all of the 3+ GB is high memory): 928kB free, 928kB min, 1856kB low, 2784kB high, 184kB active, 156kB inactive, 901120kB present, 1879412 pages scanned, all_unreclaimable: yes. DMA also said ``all_unreclaimable: yes'', but there were 12548kB free, well above the 128kB min (if that means anything)

we've tracked our problems down (service req. 1081734) to a bug in the kernel audit facility associated with our audit.rules containing ``-p'' restrictions (since we just copied-and-pasted the capps.rules example). disabling these precludes a leak of size-32 /proc/slabinfo allocations, which was where all our ZONE_NORMAL memory seemed to be disappearing. i attached a patch to that service req. but hesitate to attach it here, since it's probably broken and may distract from addressing whatever the root of this bugzilla is, which is quite likely distinct from ours - but, in case it's not, FYI.

Buck, the cause of your OOM kill is that the slabcache is consuming all of lowmem ("slab:212854"). Something is leaking memory and that's causing this; please get a /proc/slabinfo output when this happens, as well as "lsmod" output, so we can determine who/what is doing this. Larry Woodman

thanks for inquiring, Mr. Woodman. as mentioned in my last comment, RH support tracked this down for us already. they opened the following ``private'' bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=216667 (it's so private, i can't get into it, so i'm only taking your support tech's word for it that it's related to our service ticket.) so please forget i intruded on this bugzilla, since ours was entirely non-I/O-related

*** Bug 155201 has been marked as a duplicate of this bug. ***

*** Bug 149088 has been marked as a duplicate of this bug. ***

*** Bug 175277 has been marked as a duplicate of this bug. ***

*** Bug 180572 has been marked as a duplicate of this bug. ***

*** Bug 208210 has been marked as a duplicate of this bug. ***
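Summing the per-order buddy counts in a SysRq-M dump by hand (to check them against the zone totals the kernel prints, as in the output earlier in this bug) is error-prone. A small sketch, not part of the original report, that re-adds one such line with plain sh and awk:

```shell
# Sketch: recompute the free-memory total from one buddy-allocator line
# of a 2.6.9-era SysRq-M dump and compare it with the kernel's own sum.
# The sample line is copied from the dump attached to this bug.
line='Normal: 156*4kB 1342*8kB 892*16kB 251*32kB 875*64kB 113*128kB 3*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 104896kB'

total=$(echo "$line" | awk '{
    sum = 0
    for (i = 2; i <= NF; i++) {
        # fields like "156*4kB": count * block-size-in-kB
        if ($i ~ /^[0-9]+\*[0-9]+kB$/) {
            split($i, p, "*")
            sub(/kB/, "", p[2])
            sum += p[1] * p[2]
        }
    }
    print sum
}')

echo "computed: ${total}kB"   # prints: computed: 104896kB
```

The computed sum matching the `= 104896kB` the kernel printed confirms the dump is internally consistent; a mismatch would suggest the dump was truncated or garbled in transcription.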