Description of problem:
Running Veritas NetBackup server (v5.1) on Dell 2850s (4GB RAM, 2GB swap) using RHEL4 U3 i686. The OOM killer kills random processes when there appears to be plenty of memory, e.g. 'top' shows about 3.8GB cached and swap hardly being used. dmesg output is attached.

There is a largish number of SCSI devices attached (mostly over FC) - approx 40 tape drives, robots and disks (as seen in /proc/scsi/scsi). The FC controller is a dual port 'LSI Logic / Symbios Logic FC929X Fibre Channel Adapter' and the on-board SCSI controller is a dual port 'SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)'.

Version-Release number of selected component (if applicable):
kernel 2.6.9-34.ELsmp (i686)

How reproducible:
Run the NetBackup server.

Steps to Reproduce:
1. Run the NetBackup server under its normal workload.

Actual results:
OOM kills.

Expected results:
No OOM kills!

Additional info:
Have now disabled the OOM killer (/proc/sys/vm/oom-kill = 0) with no bad side effects ... the attached dmesg output shows later OOMs with the comment: 'Would have oom-killed but /proc/sys/vm/oom-kill is disabled'
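For reference, the device count and the OOM-killer workaround were checked/applied with commands along these lines (a rough sketch; the grep pattern is just one way to count entries in /proc/scsi/scsi):

    # rough count of attached SCSI devices (tape drives, robots, disks)
    grep -c '^Host:' /proc/scsi/scsi

    # disable the OOM killer via the RHEL4-specific sysctl mentioned above
    echo 0 > /proc/sys/vm/oom-kill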
Created attachment 130221 [details] Dmesg output showing ooms
Can you do me a favor and try: 'echo 100 > /proc/sys/vm/lower_zone_protection'? This will cause page reclamation to happen sooner, thus providing more 'protection' for the lower zones. I've seen this tuning work in other similar circumstances, so hopefully it will work here as well. Thanks.
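If it helps, a quick sketch of applying it now and making it persistent (assuming the usual sysctl mechanism; the sysctl name simply mirrors the /proc path):

    # apply immediately
    echo 100 > /proc/sys/vm/lower_zone_protection

    # make it persistent across reboots
    echo 'vm.lower_zone_protection = 100' >> /etc/sysctl.conf
    sysctl -p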
Thanks - /proc/sys/vm/lower_zone_protection is now set to 100. I'll monitor what happens over the next day or so - we usually get a few potential OOMs a day ...
No more ooms in the last 48 hours since setting lower_zone_protection to 100. Normally, we would have seen a few more by now ... but I'll keep my eye on the situation for a few more days. However, it looks like it may have 'fixed' the problem - thanks
Jim, can you get AltSysrq-M output with lower_zone_protection set to 100 while the system is under heavy load (when it used to OOM kill)? Thanks, Larry
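If a console keyboard isn't handy, the same dump can usually be triggered through /proc (a sketch, assuming magic SysRq is compiled in, as it is on the RHEL4 kernels):

    # make sure magic SysRq is enabled
    echo 1 > /proc/sys/kernel/sysrq

    # equivalent of Alt-SysRq-M: dump memory info to the kernel log
    echo m > /proc/sysrq-trigger
    dmesg | tail -60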
Output from AltSysrq-M - the machine was 'busy' (doing what it normally does) when this was run, but I have no idea if it would have done an OOM at this point:

SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Free pages: 119068kB (1600kB HighMem)
Active:77149 inactive:903018 dirty:35135 writeback:0 unstable:0 free:29767 slab:23031 mapped:74517 pagetables:660
DMA free:12572kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:10323225 all_unreclaimable? yes
protections[]: 0 46400 72000
Normal free:104896kB min:928kB low:1856kB high:2784kB active:64516kB inactive:588084kB present:901120kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 25600
HighMem free:1600kB min:512kB low:1024kB high:1536kB active:244080kB inactive:3023988kB present:4325376kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 3*4kB 4*8kB 3*16kB 4*32kB 3*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12572kB
Normal: 156*4kB 1342*8kB 892*16kB 251*32kB 875*64kB 113*128kB 3*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 104896kB
HighMem: 54*4kB 53*8kB 2*16kB 15*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1600kB
Swap cache: add 3241896, delete 3241760, find 5064188/5324042, race 2+3
0 bounce buffer pages
Free swap: 1964392kB
1310720 pages of RAM
819136 pages of HIGHMEM
273922 reserved pages
339074 pages shared
227 pages swap cached
James, is the problem 'fixed' with the 100 setting? Have you seen any new OOM kills since changing that setting? Thanks.
No new ooms since setting lower_zone_protection to 100
James, any reason to keep this open?
Is setting lower_zone_protection to 100 the 'fix', or will the situation be different in subsequent kernel releases (i.e. will I have to use this setting for the RHEL4 U4 kernel)?
I'd like to know the answer to that as well, please.
even after setting lower_zone_protection, we managed to get some serious oom-killer activity yesterday, with what looked like plenty of free RAM. (tech support was of little help on this, so i'm just adding to this BZ)

in the attached, note how lower_zone_protection seemed to be doing its thing on 11 Oct, leading to page faults when running ./configure (which is the only process that triggers this behavior that i've noticed. i guess it's forking a lot or something, although this machine monitors about 100 hosts with nagios, which probably does similarly, and that's never led to an alloc failure). yesterday, 12 Oct, starting around 17:20, however, the kernel started to run into intermittent shortfalls, despite claiming to have plenty of high memory available, and the oom-killer was busy.

thanks for whatever help you can proffer
--buck

)% uname -a ; free ; /sbin/sysctl vm.lower_zone_protection
Linux foo 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:       4153528     974564    3178964          0        380      50840
-/+ buffers/cache:      923344    3230184
Swap:      4008208          0    4008208
vm.lower_zone_protection = 100
Created attachment 138419 [details] kern.info log

this is the attachment mentioned in comment 12; i now realize i should have just attached it and commented all at once. sorry
the machine that is the subject of comment 12 and comment 13 just panicked upon restart of nagios: "Out of memory and no killable processes", after dispatching every process (save maybe init) on the machine.

unfortunately, syslog was early to be killed off in the bloodbath, so there's no log i can furnish. the last console screen indicates 3+ GB free overall, but Normal memory seems to be the problem (as almost all of the 3+ GB is high memory):

Normal: 928kB free, 928kB min, 1856kB low, 2784kB high, 184kB active, 156kB inactive, 901120kB present, 1879412 pages scanned, all_unreclaimable: yes

DMA also said "all_unreclaimable: yes", but there were 12548kB free, well above the 128kB min (if that means anything)
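in case it's useful to anyone else watching for this, here's roughly how i've been keeping an eye on lowmem since then (standard /proc files only; nothing fancy assumed):

    # lowmem totals (ZONE_DMA + ZONE_NORMAL on an i686/highmem kernel)
    grep -i '^low' /proc/meminfo

    # per-zone free pages by allocation order (same data Alt-SysRq-M shows)
    cat /proc/buddyinfo

    # slab usage, which all lives in lowmem on a box like this
    grep -i '^slab' /proc/meminfo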
we've tracked our problems down (service req. 1081734) to a bug in the kernel audit facility associated with our audit.rules containing "-p" restrictions (since we just copied-and-pasted the capps.rules example). disabling these precludes a leak of size-32 slab allocations (as seen in /proc/slabinfo), which was where all our ZONE_NORMAL memory seemed to be disappearing.

i attached a patch to that service req. but hesitate to attach it here, seeing as it's probably broken and may distract from addressing whatever the root of this bugzilla is, which is quite likely distinct from ours. but, in case it's not . . . FYI
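for the record, the rules in question were watches with '-p' permission filters, along these lines (paths purely illustrative, not copied from our actual audit.rules):

    # audit.rules watch entries with -p (permission) filters
    -w /etc/shadow -p wa
    -w /etc/passwd -p wa

    # removing the -p filters (or the watches entirely) stopped the size-32 slab growth for us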
Buck, the cause of your OOM kill is that the slab cache is consuming all of Lowmem ("slab:212854"). Something is leaking memory and that's causing this; please get a /proc/slabinfo output when this happens, as well as "lsmod" output, so we can determine who/what is doing this. Larry Woodman
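Something along these lines captured while Lowmem is shrinking would be ideal (just a sketch with standard tools):

    # snapshot the slab caches and loaded modules with a timestamp
    cat /proc/slabinfo > /tmp/slabinfo.$(date +%Y%m%d-%H%M%S)
    /sbin/lsmod > /tmp/lsmod.$(date +%Y%m%d-%H%M%S)

    # quick look at the biggest caches by object count (column 2 in slabinfo 2.x)
    sort -nr -k2 /proc/slabinfo | head -20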
thanks for inquiring, Mr. Woodman. as mentioned in my last comment, RH support tracked this down for us already. they opened the following "private" bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=216667 (it's so private, i can't get into it, so i'm only taking your support tech's word for it that it's related to our service ticket.)

so please forget i intruded on this bugzilla, since ours was entirely non-I/O-related
*** Bug 155201 has been marked as a duplicate of this bug. ***
*** Bug 149088 has been marked as a duplicate of this bug. ***
*** Bug 175277 has been marked as a duplicate of this bug. ***
*** Bug 180572 has been marked as a duplicate of this bug. ***
*** Bug 208210 has been marked as a duplicate of this bug. ***