Description of problem:
Running Veritas NetBackup server (v5.1) on Dell 2850s (4GB RAM, 2GB swap) using RHEL4 U3 i686. The OOM killer kills random processes when there appears to be plenty of memory, e.g. 'top' shows about 3.8GB cached and swap hardly being used. dmesg output is attached.

There is a largish number of SCSI devices attached (mostly over FC) - approx 40 tape drives, robots and disks (as seen in /proc/scsi/scsi). The FC controller is a dual port 'LSI Logic / Symbios Logic FC929X Fibre Channel Adapter' and the on-board SCSI controller is a dual port 'SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)'.

Version-Release number of selected component (if applicable):
kernel 2.6.9-34.ELsmp (i686)

How reproducible:
Run the NetBackup server.

Steps to Reproduce:
1. Run the NetBackup server under its normal workload.

Actual results:
OOM kills.

Expected results:
No OOM kills!

Additional info:
Have now disabled the OOM killer (/proc/sys/vm/oom-kill = 0) with no bad side effects ... the attached dmesg output shows later OOMs with the comment: 'Would have oom-killed but /proc/sys/vm/oom-kill is disabled'
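For reference, the device count and the OOM-killer workaround were checked/applied with commands along these lines (a rough sketch; the grep pattern is just one way to count entries in /proc/scsi/scsi):

    # rough count of attached SCSI devices (tape drives, robots, disks)
    grep -c '^Host:' /proc/scsi/scsi

    # disable the OOM killer via the RHEL4-specific sysctl mentioned above
    echo 0 > /proc/sys/vm/oom-kill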
Created attachment 130221 [details] Dmesg output showing ooms
Can you do me a favor and try: 'echo 100 > /proc/sys/vm/lower_zone_protection'? This will cause page reclamation to happen sooner, thus providing more 'protection' for the lower zones. I've seen this tuning work in other similar circumstances, so hopefully it will work here as well. Thanks.
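If it helps, a quick sketch of applying it now and making it persistent (assuming the usual sysctl mechanism; the sysctl name simply mirrors the /proc path):

    # apply immediately
    echo 100 > /proc/sys/vm/lower_zone_protection

    # make it persistent across reboots
    echo 'vm.lower_zone_protection = 100' >> /etc/sysctl.conf
    sysctl -p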
Thanks - /proc/sys/vm/lower_zone_protection is now set to 100. I'll monitor what happens over the next day or so - we usually get a few potential OOMs a day ...
No more ooms in the last 48 hours since setting lower_zone_protection to 100. Normally, we would have seen a few more by now ... but I'll keep my eye on the situation for a few more days. However, it looks like it may have 'fixed' the problem - thanks
Jim, can you get AltSysrq-M output with lower_zone_protection set to 100 while the system is under heavy load (when it used to OOM kill)? Thanks, Larry
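If a console keyboard isn't handy, the same dump can usually be triggered through /proc (a sketch, assuming magic SysRq is compiled in, as it is on the RHEL4 kernels):

    # make sure magic SysRq is enabled
    echo 1 > /proc/sys/kernel/sysrq

    # equivalent of Alt-SysRq-M: dump memory info to the kernel log
    echo m > /proc/sysrq-trigger
    dmesg | tail -60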
Output from AltSysrq-M - the machine was 'busy' (doing what it normally does) when this was run, but I have no idea if it would have done an OOM at this point:

SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Free pages: 119068kB (1600kB HighMem)
Active:77149 inactive:903018 dirty:35135 writeback:0 unstable:0 free:29767 slab:23031 mapped:74517 pagetables:660
DMA free:12572kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:10323225 all_unreclaimable? yes
protections[]: 0 46400 72000
Normal free:104896kB min:928kB low:1856kB high:2784kB active:64516kB inactive:588084kB present:901120kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 25600
HighMem free:1600kB min:512kB low:1024kB high:1536kB active:244080kB inactive:3023988kB present:4325376kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 3*4kB 4*8kB 3*16kB 4*32kB 3*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12572kB
Normal: 156*4kB 1342*8kB 892*16kB 251*32kB 875*64kB 113*128kB 3*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 104896kB
HighMem: 54*4kB 53*8kB 2*16kB 15*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1600kB
Swap cache: add 3241896, delete 3241760, find 5064188/5324042, race 2+3
0 bounce buffer pages
Free swap: 1964392kB
1310720 pages of RAM
819136 pages of HIGHMEM
273922 reserved pages
339074 pages shared
227 pages swap cached
James, is the problem 'fixed' with the 100 setting? Have you seen any new OOM kills since changing that setting? Thanks.
No new ooms since setting lower_zone_protection to 100
James, any reason to keep this open?
Is setting lower_zone_protection to 100 the 'fix', or will the situation be different in subsequent kernel releases (i.e. will I have to use this setting for the RHEL4 U4 kernel)?
I'd like to know the answer to that as well, please.
even after setting lower_zone_protection, we managed to get some serious oom-killer activity yesterday, with what looked like plenty of free RAM. (tech support was of little help on this, so i'm just adding to this BZ)

in the attached, note how lower_zone_protection seemed to be doing its thing on 11 Oct, leading to page faults when running ./configure (which is the only process that triggers this behavior that i've noticed. i guess it's forking a lot or something, although this machine monitors about 100 hosts with nagios, which probably does similarly, and that's never led to an alloc failure). yesterday, 12 Oct, starting around 17:20, however, the kernel started to run into intermittent shortfalls, despite claiming to have plenty of high memory available, and the oom-killer was busy.

thanks for whatever help you can proffer
--buck

)% uname -a ; free ; /sbin/sysctl vm.lower_zone_protection
Linux foo 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:       4153528     974564    3178964          0        380      50840
-/+ buffers/cache:      923344    3230184
Swap:      4008208          0    4008208
vm.lower_zone_protection = 100
Created attachment 138419 [details] kern.info log

this is the attachment mentioned in comment 12; i now realize i should have just attached it and commented all at once. sorry
the machine that is the subject of comment 12 and comment 13 just panicked upon restart of nagios: "Out of memory and no killable processes", after dispatching every process (save maybe init) on the machine.

unfortunately, syslog was early to be killed off in the bloodbath, so there's no log i can furnish. the last console screen indicates 3+ GB free overall, but Normal memory seems to be the problem (as almost all of the 3+ GB is high memory):

Normal: 928kB free, 928kB min, 1856kB low, 2784kB high, 184kB active, 156kB inactive, 901120kB present, 1879412 pages scanned, all_unreclaimable: yes

DMA also said "all_unreclaimable: yes", but there were 12548kB free, well above the 128kB min (if that means anything)
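in case it's useful to anyone else watching for this, here's roughly how i've been keeping an eye on lowmem since then (standard /proc files only; nothing fancy assumed):

    # lowmem totals (ZONE_DMA + ZONE_NORMAL on an i686/highmem kernel)
    grep -i '^low' /proc/meminfo

    # per-zone free pages by allocation order (same data Alt-SysRq-M shows)
    cat /proc/buddyinfo

    # slab usage, which all lives in lowmem on a box like this
    grep -i '^slab' /proc/meminfo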
we've tracked our problems down (service req. 1081734) to a bug in the kernel audit facility associated with our audit.rules containing "-p" restrictions (since we just copied-and-pasted the capps.rules example). disabling these precludes a leak of size-32 slab allocations (as seen in /proc/slabinfo), which was where all our ZONE_NORMAL memory seemed to be disappearing.

i attached a patch to that service req. but hesitate to attach it here, seeing as it's probably broken and may distract from addressing whatever the root of this bugzilla is, which is quite likely distinct from ours. but, in case it's not . . . FYI
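for the record, the rules in question were watches with '-p' permission filters, along these lines (paths purely illustrative, not copied from our actual audit.rules):

    # audit.rules watch entries with -p (permission) filters
    -w /etc/shadow -p wa
    -w /etc/passwd -p wa

    # removing the -p filters (or the watches entirely) stopped the size-32 slab growth for us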
Buck, the cause of your OOM kill is that the slab cache is consuming all of Lowmem ("slab:212854"). Something is leaking memory and that's causing this; please get a /proc/slabinfo output when this happens, as well as "lsmod" output, so we can determine who/what is doing this. Larry Woodman
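Something along these lines captured while Lowmem is shrinking would be ideal (just a sketch with standard tools):

    # snapshot the slab caches and loaded modules with a timestamp
    cat /proc/slabinfo > /tmp/slabinfo.$(date +%Y%m%d-%H%M%S)
    /sbin/lsmod > /tmp/lsmod.$(date +%Y%m%d-%H%M%S)

    # quick look at the biggest caches by object count (column 2 in slabinfo 2.x)
    sort -nr -k2 /proc/slabinfo | head -20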
thanks for inquiring, Mr. Woodman. as mentioned in my last comment, RH support tracked this down for us already. they opened the following "private" bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=216667 (it's so private, i can't get into it, so i'm only taking your support tech's word for it that it's related to our service ticket.)

so please forget i intruded on this bugzilla, since ours was entirely non-I/O-related
*** Bug 155201 has been marked as a duplicate of this bug. ***
*** Bug 149088 has been marked as a duplicate of this bug. ***
*** Bug 175277 has been marked as a duplicate of this bug. ***
*** Bug 180572 has been marked as a duplicate of this bug. ***
*** Bug 208210 has been marked as a duplicate of this bug. ***