Description of problem:
We have a dedicated MySQL database server running on RHEL 4 which, since its deployment, was consistently running out of LowMem after about 2 1/2 days, and no number of processes being killed would free up the memory, necessitating a reboot to clear it up. We did not originally file a bug on it because it sounded a lot like bug 131251 or bug 149635, and we obtained via one of our contacts at Red Hat a prerelease version of the kernel which supposedly contained the fix outlined in bug 149635 (which has since been closed as "notabug", which is why I'm opening this one). The fast leak does appear to be gone with the pre-release kernel, but we still have a slow leak: with the new kernel, the system has lasted approximately 10 days instead of 2 1/2 before exhausting LowMem.

Version-Release number of selected component (if applicable):
original kernel: 2.6.9-5.0.3.ELsmp (this had the 2.5-day cycle)
prerelease: 2.6.9-6.16.ELsmp (this has the 10-day cycle)

How reproducible:
Always

Steps to Reproduce:
1. Boot machine.
2. Let it run for the specified time period with a MySQL server on it under production load.

Actual results:
LowMem is exhausted and the kernel starts firing OOM kills.

Expected results:
The machine continues to run indefinitely without intervention.

Additional info:
Our existing trail of information on this situation is at https://bugzilla.mozilla.org/show_bug.cgi?id=284325
Created attachment 111932 [details]
oom-killer logs from all incidents to date
Can you grab the latest beta from http://people.redhat.com/davej/kernels/RHEL4/ and give that a try? It'll print some extra diagnostic info at the time of the OOM kill, which could be useful in tracking this down, and it also has one or two VM tweaks.
At this point, the machine has about 17.5% LowMem still free, and is losing approximately 1% (give or take 0.3%) every 2 hours, so we expect it to start dying again in the next 12 to 15 hours. We're quite likely to pre-emptively reboot it before it gets that far. This is the only i686/multiple-CPU box we have with RHEL 4 on it so far, so I don't have other machines to compare with.
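For reference, the kind of check that could back an alert on this amounts to comparing LowFree against LowTotal from /proc/meminfo. A minimal sketch, with an inlined sample instead of a live /proc read; the sample values are illustrative, chosen to match the ~17.5% figure above:

```python
# Sketch: compute percentage of LowMem free from /proc/meminfo text.
# Assumes LowTotal:/LowFree: lines, as on 32-bit kernels of this era.
def lowmem_percent_free(meminfo_text):
    """Return free LowMem as a percentage of total LowMem."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # value in kB
    return 100.0 * fields["LowFree"] / fields["LowTotal"]

# Illustrative sample; on a live box this would be open("/proc/meminfo").read()
sample = """\
MemTotal:      4063104 kB
LowTotal:       901120 kB
LowFree:        157696 kB
"""
print("%.1f%% LowMem free" % lowmem_percent_free(sample))  # 17.5% LowMem free
```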
Cool, thanks. Got the new kernel installed, and queued for rebooting into. Guess I'll go ahead and do it now instead of waiting for it to die since I have an excuse to reboot now :)
The problem here is that the slab is consuming almost all of the Normal zone:

Active:455893 inactive:275863 dirty:10 writeback:1 unstable:0 free:53480 slab:216673

Please get me a /proc/slabinfo output when this happens and I'll figure out what's leaking. Thanks, Larry Woodman
Would a current slabinfo help, and perhaps another one in a week or so? This appears to be happening over an extended period of time. Or do we just need to let it die next time and get that data before rebooting?
Just to add my 10c: I have a 2-Xeon 64-bit machine running the stock 2.6.9-5.0.3.ELsmp kernel with MySQL 4.0.20 and several Sun Tiger JVMs, which eat up a lot of memory, btw. The machine does seem to hold up pretty well so far: 22 days uptime, though I have the impression that the buffer cache is a bit low. As I'm bored senseless (not really), I have set up MRTG graphs of the current memory status on this page: http://misato.m-plify.net/ If you tell me what numbers you are interested in, I will gladly help.
Is this still a problem with the latest RHEL4 kernel? It's located here: http://people.redhat.com/davej/kernels/RHEL4/ If you still see the same problem, please get me a /proc/slabinfo output so I can see where those 216673 pages of slabcache are going. Larry Woodman
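To answer a question like this from an attached slabinfo dump, one can rank caches by pages consumed. A rough parser, assuming the slabinfo version 2.x column layout used by these kernels; the sample lines are abbreviated illustrations, not a real dump:

```python
# Sketch: rank slab caches by approximate memory consumed, given the
# text of /proc/slabinfo (slabinfo version 2.x column layout assumed).
def top_slabs(slabinfo_text, page_kb=4, n=3):
    rows = []
    for line in slabinfo_text.splitlines():
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue  # skip header lines
        f = line.split()
        # Columns: <name> <active_objs> <num_objs> <objsize> <objperslab>
        # <pagesperslab> ... : slabdata <active_slabs> <num_slabs> <shared>
        name, pagesperslab = f[0], int(f[5])
        num_slabs = int(f[f.index("slabdata") + 2])
        rows.append((name, num_slabs * pagesperslab * page_kb))  # size in kB
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n]

# Illustrative two-line sample (sizes come out in kB).
sample = """\
bio 5906946 5907143 128 31 1 : tunables 120 60 8 : slabdata 190553 190553 0
size-64 1200 1200 64 61 1 : tunables 120 60 8 : slabdata 20 20 0
"""
print(top_slabs(sample))  # [('bio', 762212), ('size-64', 80)]
```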
Created attachment 112231 [details]
slabinfo during low lowmem period

We've had it running on kernel-smp-2.6.9-6.26.EL since the last incident, and as of the last day or two we're getting alerts from our nagios monitoring that LowMem is running on the low side again. As of this morning it's down to 0.3% free LowMem, but it hasn't started firing OOM kills yet. slabinfo is attached.
The slabcache doesn't seem to be too bad here; please include an AltSysrq-M output along with the /proc/slabinfo output. Thanks, Larry Woodman
(In reply to comment #11) > please include an AltSysrq-M output That sounds like a keyboard combination... is that possible to do from remote? This machine is in a colo facility. I'll have to send somebody in to do it if it has to be done from console.
echo m > /proc/sysrq-trigger

The output will be in dmesg.
ok, that got me this:

SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
Free pages: 243956kB (23872kB HighMem)
Active:838951 inactive:83231 dirty:271 writeback:1 unstable:0 free:60989 slab:19012 mapped:65859 pagetables:446
DMA free:6900kB min:16kB low:32kB high:48kB active:3660kB inactive:1336kB present:16384kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Normal free:213184kB min:936kB low:1872kB high:2808kB active:372372kB inactive:195532kB present:901120kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
HighMem free:23872kB min:512kB low:1024kB high:1536kB active:2979772kB inactive:136056kB present:3145600kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 191*4kB 115*8kB 72*16kB 57*32kB 15*64kB 4*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 6900kB
Normal: 5704*4kB 1348*8kB 790*16kB 547*32kB 399*64kB 402*128kB 195*256kB 38*512kB 3*1024kB 0*2048kB 0*4096kB = 213184kB
HighMem: 878*4kB 379*8kB 151*16kB 100*32kB 79*64kB 38*128kB 7*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23872kB
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap: 2047992kB
1015776 pages of RAM
786400 pages of HIGHMEM
9384 reserved pages
329113 pages shared
0 pages swap cached
IPT INPUT packet died: IN=eth0 OUT= MAC=00:11:43:32:31:2a:00:05:85:f3:b8:9d:08:00 SRC=140.211.166.139 DST=140.211.166.201 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=27553 DF PROTO=TCP SPT=56188 DPT=5666 WINDOW=5840 RES=0x00 SYN URGP=0
IPT INPUT packet died: IN=eth0 OUT= MAC=00:11:43:32:31:2a:00:05:85:f3:b8:9d:08:00 SRC=140.211.166.139 DST=140.211.166.201 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=27555 DF PROTO=TCP SPT=56188 DPT=5666 WINDOW=5840 RES=0x00 SYN URGP=0

Note that it's been a few hours and the memory usage has freed up; it's at 34% free now. It's unusual that it freed up again; the kernel we're running now must be better at it than the previous ones were :)
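As a sanity check on dumps like this, the per-order free counts in each buddy allocator line should multiply out to the printed zone total. A quick check against the Normal line from the SysRq-M output above:

```python
import re

# Sum a SysRq-M buddy line's per-order "count*sizekB" terms.
def buddy_total_kb(line):
    return sum(int(c) * int(k) for c, k in re.findall(r"(\d+)\*(\d+)kB", line))

normal = ("Normal: 5704*4kB 1348*8kB 790*16kB 547*32kB 399*64kB 402*128kB "
          "195*256kB 38*512kB 3*1024kB 0*2048kB 0*4096kB = 213184kB")
print(buddy_total_kb(normal))  # 213184, matching the reported total
```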
Created attachment 112274 [details]
AltSysRq+M and slabinfo

Down in the < 1.0% LowMem free zone again. It seems to be holding up much better this week than in the past, so this kernel must be dealing with it. My pager's going off a lot, though, because after the first couple of times we pointed nagios at it to keep tabs on it and warn us when it got low. :)
The latest attachment does not show any problems. The system will use all available memory to cache file system data, and as long as that data is reclaimable (on either the active or inactive list) it can be quickly reclaimed. In this case there is ~900MB of lowmem (Normal zone) and ~780MB active+inactive. Not a problem.

------------------------------------------------------------------------
Normal free:6280kB min:936kB low:1872kB high:2808kB active:561440kB inactive:218364kB present:901120kB
------------------------------------------------------------------------

This should not cause an OOM kill problem, should it? BTW, if you see Normal zone Free+Active+Inactive down to some low % of present, then that's a problem. Larry
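Larry's rule of thumb is easy to script against the SysRq-M numbers. A sketch, where the 10% alarm threshold is my own illustrative choice, not something from his comment:

```python
# Sketch of the heuristic above: the Normal zone is only worrying when
# Free+Active+Inactive is a low fraction of present (unreclaimable pages,
# e.g. slab, are eating the rest). Threshold is illustrative.
def lowmem_looks_leaky(free_kb, active_kb, inactive_kb, present_kb,
                       threshold=0.10):
    accounted = free_kb + active_kb + inactive_kb
    return accounted / float(present_kb) < threshold

# Numbers from the Normal line above: ~87% accounted for, so no problem.
print(lowmem_looks_leaky(6280, 561440, 218364, 901120))  # False
```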
Yeah, so far it's looking like the latest kernel we dropped on there is reclaiming stuff before it gets to the point of an OOM kill now (which means it's doing what it should be doing).
I'm going to sign off on this being fixed (for us). We've gone three weeks with no problems now on the 2.6.9-6.26.ELsmp kernel. Leaving the bug open in case you need to account for an errata... I'm assuming the currently-released kernel for RHEL 4 still has this problem, since I haven't seen any new kernels since then.
It seems to do the trick for me also. I was seeing this problem with kernel-2.6.9-5.0.3.EL, but on an s390x (Z-series) virtual machine (with 512MB RAM). Running a suite of automated tests for our mail-server product, I could see free memory going down and swap usage going up, and after 2-3 hours processes were getting killed. /var/log/messages showed entries like:

------------------------------------------------------------------------------
Apr 4 04:02:29 virtual-178 kernel: oom-killer: gfp_mask=0xd0
Apr 4 04:02:30 virtual-178 kernel: DMA per-cpu:
Apr 4 04:02:30 virtual-178 kernel: cpu 0 hot: low 32, high 96, batch 16
Apr 4 04:02:30 virtual-178 kernel: cpu 0 cold: low 0, high 32, batch 16
Apr 4 04:02:30 virtual-178 kernel: cpu 1 hot: low 32, high 96, batch 16
Apr 4 04:02:30 virtual-178 kernel: cpu 1 cold: low 0, high 32, batch 16
Apr 4 04:02:30 virtual-178 kernel: cpu 2 hot: low 32, high 96, batch 16
Apr 4 04:02:30 virtual-178 kernel: cpu 2 cold: low 0, high 32, batch 16
Apr 4 04:02:30 virtual-178 kernel: cpu 3 hot: low 32, high 96, batch 16
Apr 4 04:02:30 virtual-178 kernel: cpu 3 cold: low 0, high 32, batch 16
Apr 4 04:02:30 virtual-178 kernel: Normal per-cpu: empty
Apr 4 04:02:30 virtual-178 kernel: HighMem per-cpu: empty
Apr 4 04:02:30 virtual-178 kernel:
Apr 4 04:02:31 virtual-178 kernel: Free pages: 11160kB (0kB HighMem)
Apr 4 04:02:31 virtual-178 kernel: Active:340 inactive:374 dirty:0 writeback:11 unstable:0 free:2790 slab:3538 mapped:1 pagetables:116938
Apr 4 04:02:31 virtual-178 kernel: DMA free:11160kB min:724kB low:1448kB high:2172kB active:1392kB inactive:1496kB present:524288kB
Apr 4 04:02:31 virtual-178 kernel: protections[]: 0 0 0
Apr 4 04:02:31 virtual-178 kernel: Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB
Apr 4 04:02:32 virtual-178 kernel: protections[]: 0 0 0
Apr 4 04:02:32 virtual-178 kernel: HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
Apr 4 04:02:32 virtual-178 kernel: protections[]: 0 0 0
Apr 4 04:02:32 virtual-178 kernel: DMA: 2428*4kB 181*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11160kB
Apr 4 04:02:32 virtual-178 kernel: Normal: empty
Apr 4 04:02:32 virtual-178 kernel: HighMem: empty
Apr 4 04:02:32 virtual-178 kernel: Swap cache: add 291954, delete 291934, find 135038/191384, race 2+36
Apr 4 04:02:32 virtual-178 kernel: Out of Memory: Killed process 6337 (omctmon).
-----------------------------------------------------------------------------

This was perfectly reproducible. Our app is a 31-bit application running on the 64-bit kernel, and I think that may be significant, as I did not see this problem when running on a VM with the 31-bit kernel. Anyway, after installing the pre-release 2.6.9-6.36 kernel I found in the area mentioned in comment #8, all works fine. So I'm eagerly awaiting the formal release of this kernel in the next RHEL4 update. Thanks, folks!
We've encountered [what seems to be] this problem as well, on one of our three RHEL4 boxes. The box has 4GB of physical memory, but eventually (after 3-5 days) it uses oom-killer to shoot itself in the head. There are no userland processes taking up any memory to speak of. Before pasting slabinfo et al. in here, I'm going to pull the kernel mentioned in comment #8 and see if that corrects the problem. I'll report back.
James: were you using the new kernel they just issued last week? (I've been debating downgrading to it, since it's newer than the one we originally had this problem with, but haven't seen any assurance that it fixes this :) I can confirm that we have had zero problems with this issue since installing the kernel I mentioned in comment 9. Of course now I'm wondering if I need a newer one to address the security issues that the 5.x.x errata covered.
Created attachment 113813 [details]
output from /proc/slabinfo during fit of low memory

Actually, I'll go ahead and attach the /proc/slabinfo I have, because it looks to be different from the slabinfos that have already been attached to this bug. In particular, my biovec-1 and bio numbers are much larger. I don't know if that's significant, though.
Dave: 2.6.9-5.0.5.EL does *not* fix the problem; I'm experiencing the low memory problem under 2.6.9-5.0.5.EL.
Is this the latest RHEL4-U1 kernel? There are over 190K out of 225K lowmem pages allocated to the bio slab, and that was fixed in RHEL4-U1.

bio 5906946 5907143 128 31 1 : tunables 120 60 8 : slabdata 190553 190553 0

Larry Woodman
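Translating those slabdata columns (active_slabs, num_slabs, sharedavail): each bio slab here occupies one 4 kB page (pagesperslab is the fifth numeric column), so the arithmetic behind the ~190K-pages figure is simply:

```python
# From the bio slabinfo line above: 190553 slabs, 1 page per slab, 4 kB pages.
num_slabs, pages_per_slab, page_kb = 190553, 1, 4
bio_kb = num_slabs * pages_per_slab * page_kb
print(bio_kb, "kB =", bio_kb // 1024, "MB tied up in the bio slabcache")
```

That is roughly 744 MB of a ~900 MB Normal zone, which matches the 190K-of-225K-pages observation.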
The slabinfo in comment #22 is from 2.6.9-5.0.5.ELsmp. Is RHEL4-U1 out yet? Or is the RHEL4-U1 kernel the one mentioned in comment #8?
I was having the same problem. I was using kernel-smp-2.6.9-5.0.5.EL until yesterday. The server has 2GB of memory, and although no process seemed to be using it, almost all the memory was used. I had the machine crash once a week because of OOM. I went to init 1, killed every process that had nothing to do with the kernel, and it still had 1.5G used. Yesterday I installed kernel-smp-2.6.9-11.EL. The memory usage looks very normal so far. You can see a graph of memory usage at http://isis.tecnoera.com/mailscanner-mrtg/memory/memory.html If you need info from my system, I can provide it.
Taking this out of NEEDINFO since comment#25 seems to be in answer to comment#24. This bug was in MODIFIED before that. If you can reproduce this bug with the U1 kernel (2.6.9-11), please report.
I highly suspect this is fixed in 2.6.9-11 (since I'm positive it's fixed in 2.6.9-6.26, and the changes from 2.6.9-6.26 are still in the current changelog). But I'll let you know for sure after I let it run for a week.
I'm not sure if this is the exact same problem, but since upgrading to 2.6.9-11 we've been seeing OOM errors where we hadn't before. We wrote a small script (barely even a script) that simply uses 'dd' to continually create 40GB files from /dev/zero until our 1TB LUN is full, then delete them, then start over. We created the script to allow us to reproduce a performance problem as noted in bug 156437 (and an official support ticket). We have a Dell 6450 with 8GB RAM, and when we run two copies of the script simultaneously we start getting OOMs within a couple of hours, usually killing gdmgreeter multiple times and eventually portmap. After another hour or so the system hangs hard. Should I open a different bugzilla, or is this possibly related? I suspect it is not related, since I don't ever remember seeing this behaviour with 2.6.9-5.0.5. Later, Tom
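For anyone wanting to try the same reproduction, here is a hedged sketch of that stress loop (not Tom's actual dd script): write zero-filled files until the filesystem fills or a file limit is hit, then delete them and repeat. File size and target directory are parameters; the original used 40 GB files on a 1 TB LUN. Warning: run it against a scratch filesystem, since by design it fills the disk.

```python
import os

# Sketch of a dd-style disk-churn loop: fill, then clean up.
# file_mb/max_files are illustrative; the original wrote 40 GB files.
def fill_and_delete_once(target_dir, file_mb=40, max_files=None):
    chunk = b"\0" * (1024 * 1024)  # 1 MB of zeros, like dd from /dev/zero
    written, i = [], 0
    try:
        while max_files is None or i < max_files:
            path = os.path.join(target_dir, "fill.%d" % i)
            written.append(path)  # record before writing so partials get cleaned
            with open(path, "wb") as f:
                for _ in range(file_mb):
                    f.write(chunk)
            i += 1
    except (OSError, IOError):  # ENOSPC: filesystem full, stop and clean up
        pass
    finally:
        for p in written:
            if os.path.exists(p):
                os.remove(p)
    return i  # number of files fully written this pass
```

Calling this in an outer `while True:` loop reproduces the continuous churn described above.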
FWIW, we haven't seen any oom-killer problems with 2.6.9-11. I'm fairly certain that if the problem still existed in 2.6.9-11, we would have hit it by now. Tom, do you think the problems you detailed in comment #29 might be caused by a different issue?
FWIW bis, this should probably be closed. I had the problem as described this last week on an old, old, old non-SMP kernel (2.6.9-5.0.3) into which I erroneously booted (mix-up in boot partitions; the slab size grows until it can't any longer), but there is no problem with the non-SMP 2.6.9-22.0.1.