Bug 150971 - RHEL 4 smp kernel has memory leak, eventually causes OOM kills
Summary: RHEL 4 smp kernel has memory leak, eventually causes OOM kills
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Larry Woodman
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-03-13 04:41 UTC by Dave Miller
Modified: 2007-11-30 22:07 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-11-16 18:36:02 UTC
Target Upstream Version:
Embargoed:


Attachments
oom-killer logs from all incidents to-date (354.56 KB, text/plain)
2005-03-13 04:46 UTC, Dave Miller
slabinfo during low lowmem period (13.58 KB, text/plain)
2005-03-22 21:06 UTC, Dave Miller
AltSysRq+M and slabinfo (15.64 KB, text/plain)
2005-03-23 18:36 UTC, Dave Miller
output from /proc/slabinfo during fit of low memory (13.06 KB, text/plain)
2005-04-28 22:39 UTC, James Ralston

Description Dave Miller 2005-03-13 04:41:35 UTC
Description of problem:

We have a dedicated MySQL database server running on RHEL 4 which, since
its deployment, was consistently running out of LowMem after about 2 1/2
days; no number of processes being killed would free up the memory,
necessitating a reboot to clear it up.  We did not originally file a bug
on it because it sounded a lot like bug 131251 or bug 149635, and we
obtained, via one of our contacts at Red Hat, a prerelease version of the
kernel which supposedly contained the fix outlined in bug 149635 (which
has since been closed as "notabug", which is why I'm opening this one).
The fast leak does appear to be gone with the pre-release kernel, but we
still have a slow leak.  With the new kernel, the system has lasted
approximately 10 days instead of 2 1/2 before exhausting LowMem.


Version-Release number of selected component (if applicable):

original kernel: 2.6.9-5.0.3.ELsmp  (this had the 2.5 day cycle)
prerelease: 2.6.9-6.16.ELsmp (this has the 10 day cycle)


How reproducible:

Always


Steps to Reproduce:

1. Boot machine.
2. Let it run for the specified time period with a MySQL server on it
under production load.


Actual results:

LowMem is exhausted and the kernel starts firing OOM kills


Expected results:

The machine continues to run indefinitely without intervention.


Additional info:

Our existing trail of information on this situation is at
https://bugzilla.mozilla.org/show_bug.cgi?id=284325

Comment 1 Dave Miller 2005-03-13 04:46:29 UTC
Created attachment 111932 [details]
oom-killer logs from all incidents to-date

Comment 2 Dave Jones 2005-03-13 04:49:15 UTC
Can you grab the latest beta from
http://people.redhat.com/davej/kernels/RHEL4/
and give that a try?  It'll print some extra diagnostic info at the
time of the OOM kill, which could be useful in tracking this down, and
it also has one or two VM tweaks.


Comment 3 Dave Miller 2005-03-13 04:57:15 UTC
At this point, the machine has about 17.5% LowMem still free, and is
losing approximately 1% (give or take 0.3%) every 2 hours, so we
expect it to start dying again in the next 12 to 15 hours.  We're
quite likely to pre-emptively reboot it before it gets that far.

This is the only i686/multiple-CPU box we have with RHEL 4 on it so
far, so I don't have other machines to compare with.

Comment 4 Dave Miller 2005-03-13 05:07:44 UTC
Cool, thanks.  Got the new kernel installed, and queued for rebooting
into.  Guess I'll go ahead and do it now instead of waiting for it to
die since I have an excuse to reboot now :)

Comment 5 Larry Woodman 2005-03-14 22:27:40 UTC
The problem here is the slab is consuming almost all of the Normal zone:

Active:455893 inactive:275863 dirty:10 writeback:1 unstable:0
free:53480 slab:216673


Please get me a /proc/slabinfo output when this happens and I'll
figure out what's leaking.


Thanks, Larry Woodman


Comment 6 Dave Miller 2005-03-14 23:26:55 UTC
Would a current slabinfo help, and perhaps another one in a week or
so?  This appears to be happening over an extended period of time.  Or
do we just need to let it die next time and get that data before
rebooting?
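
In the meantime, something like this /etc/crontab entry (the path is just an
example; note the escaped percent signs, which cron otherwise treats as
newlines) would give us hourly snapshots to compare:

0 * * * * root cat /proc/slabinfo > /var/tmp/slabinfo.$(date +\%Y\%m\%d\%H)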

Comment 7 David Tonhofer 2005-03-20 18:24:16 UTC
Just to add my 10c: I have a 2-Xeon 64-bit machine running the stock
2.6.9-5.0.3.ELsmp kernel with MySQL 4.0.20 and several Sun Tiger JVMs,
which take up a lot of memory, btw.

The machine does seem to hold up pretty well so far: 22 days uptime,
though I have the impression that the buffer cache is a bit low.

As I'm bored senseless (not really), I have set up MRTG graphs of the
current memory status on this page:

http://misato.m-plify.net/

If you tell me what numbers you are interested in, I will gladly help.



Comment 8 Larry Woodman 2005-03-22 20:51:15 UTC
Is this still a problem with the latest RHEL4 kernel?  It's located
here: http://people.redhat.com/davej/kernels/RHEL4/

If you still see the same problem, please get me a /proc/slabinfo
output so I can see where those 216673 pages of slabcache are going.


Larry Woodman


Comment 9 Dave Miller 2005-03-22 21:06:11 UTC
Created attachment 112231 [details]
slabinfo during low lowmem period

we've had it running on kernel-smp-2.6.9-6.26.EL since the last incident, and
as of the last day or two we're getting alerts from our nagios monitoring that
LowMem is running on the low side again.  As of this morning it's down to 0.3%
free LowMem, but it hasn't started firing OOM kills yet.

slabinfo is attached.

Comment 10 Dave Miller 2005-03-22 21:32:01 UTC
.

Comment 11 Larry Woodman 2005-03-22 21:35:24 UTC
The slabcache doesn't seem to be too bad here; please include an AltSysrq-M
output along with the /proc/slabinfo output.

Thanks, Larry Woodman


Comment 12 Dave Miller 2005-03-22 21:51:45 UTC
(In reply to comment #11)
> please include an AltSysrq-M output

That sounds like a keyboard combination... is that possible to do remotely?
This machine is in a colo facility.  I'll have to send somebody in to do it
if it has to be done from the console.


Comment 13 Dave Jones 2005-03-22 22:01:10 UTC
echo m > /proc/sysrq-trigger
output will be in dmesg
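
For reference, a quick way to grab both that and the slabinfo in one go
(filenames are just examples):

echo m > /proc/sysrq-trigger
dmesg > /tmp/sysrq-m.$(date +%Y%m%d%H%M)
cat /proc/slabinfo > /tmp/slabinfo.$(date +%Y%m%d%H%M)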


Comment 14 Dave Miller 2005-03-23 02:20:49 UTC
ok, that got me this:

SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16

Free pages:      243956kB (23872kB HighMem)
Active:838951 inactive:83231 dirty:271 writeback:1 unstable:0 free:60989
slab:19012 mapped:65859 pagetables:446
DMA free:6900kB min:16kB low:32kB high:48kB active:3660kB inactive:1336kB
present:16384kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Normal free:213184kB min:936kB low:1872kB high:2808kB active:372372kB
inactive:195532kB present:901120kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
HighMem free:23872kB min:512kB low:1024kB high:1536kB active:2979772kB
inactive:136056kB present:3145600kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 191*4kB 115*8kB 72*16kB 57*32kB 15*64kB 4*128kB 1*256kB 1*512kB 0*1024kB
0*2048kB 0*4096kB = 6900kB
Normal: 5704*4kB 1348*8kB 790*16kB 547*32kB 399*64kB 402*128kB 195*256kB
38*512kB 3*1024kB 0*2048kB 0*4096kB = 213184kB
HighMem: 878*4kB 379*8kB 151*16kB 100*32kB 79*64kB 38*128kB 7*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 23872kB
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap:       2047992kB
1015776 pages of RAM
786400 pages of HIGHMEM
9384 reserved pages
329113 pages shared
0 pages swap cached
IPT INPUT packet died: IN=eth0 OUT=
MAC=00:11:43:32:31:2a:00:05:85:f3:b8:9d:08:00 SRC=140.211.166.139
DST=140.211.166.201 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=27553 DF PROTO=TCP
SPT=56188 DPT=5666 WINDOW=5840 RES=0x00 SYN URGP=0 
IPT INPUT packet died: IN=eth0 OUT=
MAC=00:11:43:32:31:2a:00:05:85:f3:b8:9d:08:00 SRC=140.211.166.139
DST=140.211.166.201 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=27555 DF PROTO=TCP
SPT=56188 DPT=5666 WINDOW=5840 RES=0x00 SYN URGP=0 

Note that it's been a few hours and the memory usage has freed up; it's at 34%
free now.  It's unusual that it freed up again, so the kernel we're running now
must be better at it than the previous ones were :)

Comment 15 Dave Miller 2005-03-23 18:36:46 UTC
Created attachment 112274 [details]
AltSysRq+M and slabinfo

Down in the < 1.0% LowMem free zone again.  It seems to be holding up much
better this week than in the past, so this kernel must be dealing with it
better.  My pager's going off a lot, though, because after the first couple of
times we pointed nagios at it to keep tabs on it and warn us when it got low. :)
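
For anyone who wants to do the same, here is a rough sketch of such a check
(the thresholds are just placeholders; it assumes the LowTotal:/LowFree: lines
that the i686 highmem kernels put in /proc/meminfo):

------------------------------------------------------------------------------
#!/bin/sh
# Illustrative LowMem check sketch; thresholds and output format are placeholders.
low_total=$(awk '/^LowTotal:/ {print $2}' /proc/meminfo)
low_free=$(awk '/^LowFree:/ {print $2}' /proc/meminfo)
pct=$((low_free * 100 / low_total))
if [ "$pct" -lt 2 ]; then
    echo "LOWMEM CRITICAL: ${pct}% free"; exit 2
elif [ "$pct" -lt 10 ]; then
    echo "LOWMEM WARNING: ${pct}% free"; exit 1
fi
echo "LOWMEM OK: ${pct}% free"; exit 0
------------------------------------------------------------------------------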

Comment 16 Larry Woodman 2005-03-23 21:20:28 UTC
The latest attachment does not show any problems.  The system will use all of
the available memory to cache file system data, and as long as that memory is
reclaimable (on either the active or inactive list) it can be quickly
reclaimed.  In this case there is ~900MB of lowmem (Normal zone) and ~780MB
active+inactive.  Not a problem.
------------------------------------------------------------------------
Normal free:6280kB min:936kB low:1872kB high:2808kB active:561440kB
inactive:218364kB present:901120kB
------------------------------------------------------------------------

This should not cause an OOM kill problem, should it?

BTW, if you see Normal zone Free+Active+Inactive drop to some low % of present,
then that's a problem.
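
A quick way to eyeball that from the AltSysrq-M output, assuming the whole
Normal zone line lands on a single dmesg line as above:

dmesg | awk '/^Normal free:/ {
    for (i = 1; i <= NF; i++) {
        n = split($i, kv, ":")
        if (n == 2) { sub("kB", "", kv[2]); v[kv[1]] = kv[2] }
    }
    printf "Normal Free+Active+Inactive = %.0f%% of present\n",
        100 * (v["free"] + v["active"] + v["inactive"]) / v["present"]
}'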

Larry


Comment 17 Dave Miller 2005-03-23 22:02:37 UTC
Yeah, so far it's looking like the latest kernel we dropped on there is
reclaiming stuff before it gets to the point of an OOM kill now (which means
it's doing what it should be doing).

Comment 18 Dave Miller 2005-04-02 22:34:21 UTC
I'm going to sign off on this being fixed (for us).  We've gone three weeks now
with no problems on the 2.6.9-6.26.ELsmp kernel.  I'm leaving the bug open in
case you need to account for it in an errata...  I'm assuming the
currently-released kernel for RHEL 4 still has this problem, since I haven't
seen any new kernels since then.

Comment 19 Tony Naccarato 2005-04-04 10:26:14 UTC
It seems to do the trick for me also.

I was seeing this problem with kernel-2.6.9-5.0.3.EL, but on an s390x (z-Series)
virtual machine (with 512MB RAM).  Running a suite of automated tests for our
mail-server product, I could see free memory going down and swap usage going up,
and after 2-3 hours processes were getting killed.  /var/log/messages showed
entries like:

------------------------------------------------------------------------------
Apr  4 04:02:29 virtual-178 kernel: oom-killer: gfp_mask=0xd0
Apr  4 04:02:30 virtual-178 kernel: DMA per-cpu:
Apr  4 04:02:30 virtual-178 kernel: cpu 0 hot: low 32, high 96, batch 16
Apr  4 04:02:30 virtual-178 kernel: cpu 0 cold: low 0, high 32, batch 16
Apr  4 04:02:30 virtual-178 kernel: cpu 1 hot: low 32, high 96, batch 16
Apr  4 04:02:30 virtual-178 kernel: cpu 1 cold: low 0, high 32, batch 16
Apr  4 04:02:30 virtual-178 kernel: cpu 2 hot: low 32, high 96, batch 16
Apr  4 04:02:30 virtual-178 kernel: cpu 2 cold: low 0, high 32, batch 16
Apr  4 04:02:30 virtual-178 kernel: cpu 3 hot: low 32, high 96, batch 16
Apr  4 04:02:30 virtual-178 kernel: cpu 3 cold: low 0, high 32, batch 16
Apr  4 04:02:30 virtual-178 kernel: Normal per-cpu: empty
Apr  4 04:02:30 virtual-178 kernel: HighMem per-cpu: empty
Apr  4 04:02:30 virtual-178 kernel:
Apr  4 04:02:31 virtual-178 kernel: Free pages:       11160kB (0kB HighMem)
Apr  4 04:02:31 virtual-178 kernel: Active:340 inactive:374 dirty:0 writeback:11
unstable:0 free:2790 slab:3538 mapped:1 pagetables:116938
Apr  4 04:02:31 virtual-178 kernel: DMA free:11160kB min:724kB low:1448kB
high:2172kB active:1392kB inactive:1496kB present:524288kB
Apr  4 04:02:31 virtual-178 kernel: protections[]: 0 0 0
Apr  4 04:02:31 virtual-178 kernel: Normal free:0kB min:0kB low:0kB high:0kB
active:0kB inactive:0kB present:0kB
Apr  4 04:02:32 virtual-178 kernel: protections[]: 0 0 0
Apr  4 04:02:32 virtual-178 kernel: HighMem free:0kB min:128kB low:256kB
high:384kB active:0kB inactive:0kB present:0kB
Apr  4 04:02:32 virtual-178 kernel: protections[]: 0 0 0
Apr  4 04:02:32 virtual-178 kernel: DMA: 2428*4kB 181*8kB 0*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11160kB
Apr  4 04:02:32 virtual-178 kernel: Normal: empty
Apr  4 04:02:32 virtual-178 kernel: HighMem: empty
Apr  4 04:02:32 virtual-178 kernel: Swap cache: add 291954, delete 291934, find
135038/191384, race 2+36
Apr  4 04:02:32 virtual-178 kernel: Out of Memory: Killed process 6337 (omctmon).
-----------------------------------------------------------------------------

This was perfectly reproducible.  Our app is a 31-bit application running on the
64-bit kernel, and I think that may be significant, as I did not see this problem
when running on a VM with the 31-bit kernel.

Anyway, after installing the pre-release 2.6.9-6.36 kernel I found in the area
mentioned in comment #8, all works fine.  So I'm eagerly awaiting the formal
release of this kernel in the next RHEL4 update.  Thanks, folks!

Comment 20 James Ralston 2005-04-28 22:00:43 UTC
We've encountered [what seems to be] this problem as well, on one of our three
RHEL4 boxes.  The box has 4GB of physical memory, but eventually (after 3-5
days) it uses oom-killer to shoot itself in the head.  There are no userland
processes taking up any memory to speak of.

Before pasting slabinfo et al. in here, I'm going to go pull the kernel
mentioned in comment #8 and see if that corrects the problem.  I'll report back.


Comment 21 Dave Miller 2005-04-28 22:34:36 UTC
James: were you using the new kernel they just issued last week?  (I've been
debating downgrading to it, since it's newer than the one we originally had this
problem with, but haven't seen any assurance that it fixes this :)

I can confirm that we have had zero problems with this issue since installing
the kernel I mentioned in comment 9.  Of course now I'm wondering if I need a
newer one to address the security issues that the 5.x.x errata covered.

Comment 22 James Ralston 2005-04-28 22:39:59 UTC
Created attachment 113813 [details]
output from /proc/slabinfo during fit of low memory

Actually, I'll go ahead and attach the /proc/slabinfo I have, because it looks
to be different than the slabinfos that have already been attached to this bug.


In particular, my biovec-1 and bio numbers are much larger.  I don't know if
that's significant, though.

Comment 23 James Ralston 2005-04-28 22:47:46 UTC
Dave: 2.6.9-5.0.5.EL does *not* fix the problem; I'm experiencing the low memory
problem under 2.6.9-5.0.5.EL.


Comment 24 Larry Woodman 2005-04-29 12:43:06 UTC

Is this the latest RHEL4-U1 kernel?  There are over 190K out of 225K lowmem
pages allocated to the bio slab, and that was fixed in RHEL4-U1.

bio 5906946 5907143 128 31 1 : tunables 120 60 8 : slabdata 190553 190553 0
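
For whoever hits this next, a one-liner along these lines ranks the caches by
pages used (it assumes the 2.x slabinfo layout, where pagesperslab is field 6
and num_slabs is field 15):

awk '$1 !~ /^(#|slabinfo)/ { print $15 * $6, $1 }' /proc/slabinfo | sort -rn | head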


Larry Woodman


Comment 25 James Ralston 2005-04-29 20:52:17 UTC
The slabinfo in comment #22 is from 2.6.9-5.0.5.ELsmp.

Is RHEL4-U1 out yet?  Or is the RHEL4-U1 kernel the one mentioned in comment #8?


Comment 26 Need Real Name 2005-06-01 14:49:28 UTC
I was having the same problem.  I was using kernel-smp-2.6.9-5.0.5.EL until
yesterday.  The server has 2GB of memory, and although no process seemed to be
using it, almost all of the memory was used.  The machine crashed about once a
week because of OOM.  I went to init 1, killed every process that had nothing
to do with the kernel, and it still had 1.5GB used.
Yesterday I installed kernel-smp-2.6.9-11.EL.  The memory usage has looked very
normal so far.  You can see a graph of memory usage at
http://isis.tecnoera.com/mailscanner-mrtg/memory/memory.html
If you need info from my system, I can provide it.

Comment 27 Mike McLean 2005-06-09 16:49:48 UTC
Taking this out of NEEDINFO, since comment #25 seems to be in answer to
comment #24.  This bug was in MODIFIED before that.  If you can reproduce this
bug with the U1 kernel (2.6.9-11), please report it here.

Comment 28 Dave Miller 2005-06-09 18:44:14 UTC
I highly suspect this is fixed in 2.6.9-11 (since I'm positive it's fixed in
2.6.9-6.26, and the changes from 2.6.9-6.26 are still in the current changelog).

But I'll let you know for sure after I let it run for a week.

Comment 29 Tom Sightler 2005-06-14 19:59:49 UTC
I'm not sure if this is the exact same problem, but since upgrading to 2.6.9-11
we've been seeing OOM errors where we hadn't before.

We wrote a small script (barely even a script) that simply uses 'dd' to
continually create 40GB files from /dev/zero until our 1TB LUN is full, then
delete them, then start over.  We created the script to allow us to reproduce a
performance problem as noted in Bugzilla 156437 (and an official support ticket).
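
Roughly along these lines (an illustrative sketch only, not the actual script;
the mount point and sizes are placeholders):

------------------------------------------------------------------------------
#!/bin/sh
# Illustrative sketch: fill the filesystem with 40GB files, delete, repeat.
cd /mnt/lun || exit 1
while true; do
    i=0
    # keep writing 40GB files until dd fails (filesystem full)
    while dd if=/dev/zero of=fill.$i bs=1M count=40960; do
        i=$((i + 1))
    done
    rm -f fill.*
done
------------------------------------------------------------------------------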

We have a Dell 6450 with 8GB RAM, and when we run two copies of the script
simultaneously we start getting OOMs within a couple of hours, usually killing
gdmgreeter multiple times and eventually portmap.  After another hour or so the
system hangs hard.

Should I open a different bugzilla, or is this possibly related?  I suspect it
is not related, since I don't ever remember seeing this behaviour with 2.6.9-5.0.5.

Later,
Tom

Comment 30 James Ralston 2005-08-19 19:51:43 UTC
FWIW, we haven't seen any oom-killer problems with 2.6.9-11.  I'm fairly certain
that if the problem still existed in 2.6.9-11, we would have hit it by now.

Tom, do you think the problems you detailed in comment #29 might be caused by a
different issue?


Comment 31 David Tonhofer 2005-11-16 16:54:40 UTC
FWIW (again), this should probably be closed.  I had the problem as described
this last week on an old, old, old non-SMP kernel (2.6.9-5.0.3) which I booted
into erroneously (a mixup in boot partitions; the slab size grows until it
can't any longer), but there is no problem with the non-SMP 2.6.9-22.0.1.



