Bug 143628 - Out of Memory Killer is triggered

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Larry Woodman
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	146015

Reported:	2004-12-23 01:34 UTC by Racing Guo
Modified:	2007-11-30 22:07 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-02-10 12:10:34 UTC
Target Upstream Version:
Embargoed:

Attachments
Patch to fix the sysfs memory leak (326 bytes, patch) 2005-01-19 01:48 UTC, ZouNanHai
test script (253 bytes, text/plain) 2005-02-01 13:39 UTC, Andy Robinson
extract from /var/log/messages showing OOM killer (18.10 KB, text/plain) 2005-02-02 14:36 UTC, Andy Robinson
mem=256M OOM killer /var/log/messages extract (32.95 KB, text/plain) 2005-02-02 16:39 UTC, Andy Robinson
slabinfo/cmdline/meminfo when booted mem=128M (14.19 KB, text/plain) 2005-02-02 16:53 UTC, Andy Robinson

Description Racing Guo 2004-12-23 01:34:37 UTC

Description of problem:
  The Out of Memory Killer is triggered while a stress test is running.
Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Racing Guo 2004-12-23 01:37:52 UTC

OS version: rhel4-pre-rc1
kernel version: kernel-smp-2.6.9-1.849_EL

Comment 2 Rik van Riel 2004-12-23 15:00:34 UTC

Could you please test with the kernels in
http://people.redhat.com/riel/.test-kernel-bug-143628/ ?

I think I've already fixed this bug, but the patches just need to make
it into the RHEL4 tree. Please let me know if the bug still exists
with the test kernel.

Comment 3 ZouNanHai 2004-12-28 01:14:36 UTC

I still see the OOM killer on an RC2 kernel.
We will test your kernel.

Comment 4 Rik van Riel 2004-12-28 14:27:57 UTC

I added another patch (rhel4-vm-extraround.patch) and uploaded a new
test  kernel (congest3) to
http://people.redhat.com/riel/.test-kernel-bug-143628/

The previous test kernel could still trigger an OOM in our own
internal tests, though it took a few days for the error to trigger. 
Please verify that the congest3 kernel works fine for you.

Comment 5 ZouNanHai 2005-01-04 01:11:03 UTC

This kernel hung after the stress test had been running for more than 2 days.
However, this time I did not see the OOM killer.
I saw some
"SCSI error <0 0 0 0> return code = 0x800002"
messages printed on the screen.

Comment 6 Rik van Riel 2005-01-04 04:25:25 UTC

I've uploaded a -congest4 kernel with further fixes
(rhel4-nr_scanned-count-writeback.patch) to
http://people.redhat.com/riel/.test-kernel-bug-143628/

Could you please try that kernel and check how it behaves?

Comment 7 ZouNanHai 2005-01-04 04:41:42 UTC

I will test it.

Do you think the SCSI error message is related to the VM?
I tested the SCSI disk with
"dd if=/dev/sda of=/dev/null" and it completed successfully,
so it should not be a hardware problem.

Comment 8 ZouNanHai 2005-01-10 08:49:29 UTC

I have tested this kernel.

Although I have not seen an oops on screen, the system is really unusable
after running for 72 hours.
It has almost stopped responding.
Very little free memory is left in the system.
When I do cat /proc/slabinfo
I see a huge number of size-64 slab objects.
Is this a kernel memory leak?
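A quick way to spot a cache like size-64 dominating /proc/slabinfo is to rank caches by the memory their objects occupy. The following is a minimal sketch, assuming the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...); top_slabs is a hypothetical helper name, not an existing tool.

```shell
# Rank slab caches by the memory their objects occupy, largest first.
# Assumes the slabinfo 2.x column layout:
#   name active_objs num_objs objsize objperslab pagesperslab : ...
# top_slabs is a hypothetical helper name used only for this sketch.
top_slabs() {
    file=${1:-/proc/slabinfo}   # snapshot to read (defaults to the live file)
    count=${2:-10}              # how many caches to print
    awk '
        /^slabinfo/ || /^#/ { next }              # skip the two header lines
        { printf "%d %s\n", $3 * $4 / 1024, $1 }  # num_objs * objsize, in kB
    ' "$file" | sort -rn | head -n "$count"
}
```

Running top_slabs with no arguments reads the live /proc/slabinfo (which may need root on some systems) and prints "kB cache-name" pairs. The figure ignores per-slab overhead, so treat it as an estimate.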

Comment 9 ZouNanHai 2005-01-12 01:00:27 UTC

This looks like the bug linked below.
It is a memory leak in sysfs.

The base kernel has already fixed it.
http://marc.theaimsgroup.com/?l=linux-kernel&m=110204311025022&w=2

Comment 10 Jay Turner 2005-01-14 14:47:08 UTC

Back to assigned.

Comment 11 ZouNanHai 2005-01-18 06:07:54 UTC

I have tested the new RHEL4-RC.
The memory leak in sysfs is still there.
Simply running
ls -lR /sys
shows a leak in slab memory.
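The check described above can be made measurable by comparing one cache's active object count in two saved slabinfo snapshots. This is only a sketch: slab_active and slab_growth are hypothetical helper names, and the slabinfo 2.x column layout (cache name first, active_objs second) is assumed.

```shell
# Print the active object count of one slab cache.
# $1 = cache name, $2 = slabinfo file (defaults to the live /proc/slabinfo)
slab_active() {
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/proc/slabinfo}"
}

# Print how many active objects a cache gained between two snapshots.
# $1 = cache name, $2 = "before" snapshot, $3 = "after" snapshot
slab_growth() {
    echo $(( $(slab_active "$1" "$3") - $(slab_active "$1" "$2") ))
}
```

For example: save "cat /proc/slabinfo" to a file named before, run "ls -lR /sys > /dev/null", save a second snapshot named after, then run "slab_growth size-64 before after". A single positive result is not proof of a leak, because caches also grow legitimately; a count that keeps climbing across repeated runs is the stronger signal.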

Comment 12 ZouNanHai 2005-01-19 01:48:33 UTC

Created attachment 109953 [details]
Patch to fix the sysfs memory leak

Comment 13 Jason Baron 2005-01-19 22:02:21 UTC

Tim, you may want to add this to the Day-0, or at least U1 lists.

Comment 14 Andy Robinson 2005-01-31 17:42:25 UTC

Hi, I work for VERITAS and have been seeing the Out of Memory killer being
triggered by our tests. I have investigated further and still get a
memory leak without any of our products loaded.

I am running 'parted -s /dev/sdb print' in a loop and see memory
leakage.

Could this be related?

I am running on 2.6.9-1.648_ELsmp

Comment 15 Jay Turner 2005-01-31 19:32:41 UTC

Andy, that's a very old kernel.  Please try again with the latest and greatest
code which has been made available to our partners.  The latest kernel available
is 2.6.9-5.EL.  Thanks.

Comment 16 Andy Robinson 2005-02-01 11:56:49 UTC

OK, I re-installed with 2.6.9-5.ELsmp and am still seeing memory
leakage when running 'parted -s /dev/XXXX print' in a loop.

I also sometimes see the parted command hang. I think this could be
related to 140472. If I run another command against the disk while
parted is hung, that command hangs too. The processes are unkillable.

Comment 17 Tim Burke 2005-02-01 12:04:19 UTC

Andy - can we get some more info, such as:

- what architecture?  (x86, IPF, x86_64)
- what type of disk, what disk driver
- how big is the disk?  Does it reproduce on smaller, vs larger disks?
- how much memory in your system
- is it doing anything else at the time?
- please attach a small test script consisting of your parted loop
- how long does it take to reproduce the oops

Comment 18 Andy Robinson 2005-02-01 13:39:43 UTC

Created attachment 110496 [details]
test script

Comment 19 Andy Robinson 2005-02-01 13:53:57 UTC

Sun Dual Opteron x86_64
2 Gig Memory
LSI53C1030 - Fusion MPT SCSI Host driver 3.01.16
Disk - Vendor: SEAGATE   Model: ST373307LC (74 Gig)
Just running the attached script (parted -s /dev/sdb print).
You can watch memory leaking, and after about 2 hours it will
start killing processes.

I also observe the memory leak if I do the same to a fibre-attached disk:
qla2300 - 3Pardata array
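Watching memory drain during such a run is easier to compare between kernels if the samples are logged. The sketch below assumes the usual "Field: value kB" layout of /proc/meminfo; meminfo_kb and log_memfree are hypothetical helper names.

```shell
# Print the value (in kB) of one /proc/meminfo field, e.g. MemFree.
# $1 = field name without the colon, $2 = file (defaults to /proc/meminfo)
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Print a timestamped MemFree sample every $1 seconds, $2 times in total.
log_memfree() {
    interval=${1:-60}
    samples=${2:-10}
    i=0
    while [ "$i" -lt "$samples" ]; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') MemFree=$(meminfo_kb MemFree) kB"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, "log_memfree 60 120 > memfree.log &" started just before the parted loop records two hours of samples. Falling MemFree alone does not prove a leak, since the page cache also consumes free memory and can be reclaimed; it is the combination with growing slab usage or OOM kills that points to one.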

Comment 20 Andy Robinson 2005-02-01 14:20:03 UTC

I guess I should have added that I also see the message:

program parted is using a deprecated SCSI ioctl, please convert it to 
SG_IO

on the console continuously while my test is running.

Comment 21 Tim Burke 2005-02-01 15:00:34 UTC

Do you still get the OOM kills if the test is done using the qla driver?

Comment 22 Rob Kenna 2005-02-01 19:54:20 UTC

Could #145695 be triggering the same problem?

Comment 23 Tim Burke 2005-02-01 23:15:12 UTC

Andy, in comment #16 you say that with the latest kernel you still see
"leakage".  But, are you still seeing the oom kills?  Can you better describe
the specific problem exhibited with the latest kernel?

Comment 24 Larry Woodman 2005-02-02 03:40:02 UTC

Andy, can you attach the console output (/var/log/messages) from when the OOM kill
occurred?

Thanks, Larry Woodman

Comment 25 Andy Robinson 2005-02-02 14:32:40 UTC

I run the test script I have attached, which calls parted in a loop
and uses 'top' to display memory usage; free memory can be seen to decrease.

I have also used 'echo m > /proc/sysrq-trigger' to check the memory.

I booted my box with reduced memory (mem=256M) and this then did hit OOM.
I have attached an extract from /var/log/messages ...

Comment 26 Andy Robinson 2005-02-02 14:36:38 UTC

Created attachment 110548 [details]
extract from /var/log/messages showing OOM killer

OOM killer when running parted -s /dev/sda print in a loop

Comment 27 Larry Woodman 2005-02-02 15:10:06 UTC

Andy, you said you booted with 256MB?  That's weird: 256MB is 65536 pages, but
your system only has about half of that!  First of all, we don't support less than
256MB for any architecture on RHEL4, but this might indicate a problem sizing
memory when it's limited at the boot command line with the mem= option.

--------------------------
DMA: present:16384kB        which is 4096 pages
Normal: present:115712kB    which is 28928 pages
Highmem: present:0kB        which is 0 pages
--------------------------

Can you send along the outputs of "cat /proc/meminfo", "cat /proc/slabinfo" and
"cat /proc/cmdline".  Also, are you running that memory leak patch that is
attached here?

Larry Woodman
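The page counts in the comment above follow from dividing each zone's size by the page size. A minimal sketch of that arithmetic, assuming 4 kB pages (the page size on x86 and x86_64; other architectures may differ); kb_to_pages is a hypothetical helper name.

```shell
# Convert a size in kB to a page count.
# $1 = size in kB, $2 = page size in kB (assumed to be 4 if omitted)
kb_to_pages() {
    echo $(( $1 / ${2:-4} ))
}

# Zone sizes quoted from the OOM report:
#   kb_to_pages 16384     # DMA zone
#   kb_to_pages 115712    # Normal zone
# A full 256 MB, for comparison:
#   kb_to_pages $((256 * 1024))
```

The two zones add up to 132096 kB, or 129 MB, which is about half of 256 MB. On a live system, "getconf PAGESIZE" reports the actual page size in bytes.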

Comment 28 Andy Robinson 2005-02-02 15:32:03 UTC

I'm sorry, my mistake I actually booted with mem=128M in order to get the
OOM to happen quicker.

I will retry (again) with mem=256M. Also, I have not tried with the patch
included here (to fix the sysfs memleak) as I don't have a kernel build environment
set up yet for the 2.6.9-5 kernel and I am not doing anything to /sys.

Does it not look like there is a memory leak with running parted? Surely it 
would be a simple exercise for you to try this ....

I also see that the parted command sometimes hangs on one of the disks, as I said
above. This seems to be worse on the fibre disks but also happens on the
locally attached disks.

I will attach a new messages file if (when) I get OOM with 256M.

Comment 29 Larry Woodman 2005-02-02 16:17:08 UTC

Before you reboot, grab me that /proc/slabinfo data. "slab:26127" is all of
memory, which isn't a surprise when you boot with 128MB.

Larry

Comment 30 Andy Robinson 2005-02-02 16:39:20 UTC

Created attachment 110555 [details]
mem=256M OOM killer /var/log/messages extract

Booted with mem=256M and ran multiple instances of this loop:
while :
do
parted -s /dev/sda print >/dev/null 2>&1
done

Comment 31 Larry Woodman 2005-02-02 16:47:36 UTC

OK, please get me that /proc/slabinfo just after an OOM kill happens.

Larry Woodman

Comment 32 Andy Robinson 2005-02-02 16:53:41 UTC

Created attachment 110557 [details]
slabinfo/cmdline/meminfo when booted mem=128M

Information requested when booted mem=128M

Comment 33 Larry Woodman 2005-02-02 17:07:56 UTC

Andy, are you sure this /proc/slabinfo was at the time of the OOM kill? All of
the memory is accountable on the page lists and the slabcache is pretty much
empty. I need a /proc/slabinfo output at the time the OOM kills occur to debug
this problem.
--------------------------
MemTotal:       123600 kB
MemFree:         12392 kB
Buffers:          4692 kB
Cached:          58956 kB
SwapCached:          0 kB
Active:          53532 kB
Inactive:        38156 kB
-------------------------


Larry Woodman
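One way to read the figures quoted above is to subtract the free, active and inactive page lists from MemTotal; whatever remains is the most that slab and other kernel allocations can be holding. This is a rough sketch of that reading, not necessarily the exact reasoning used in the comment; off_list_kb is a hypothetical helper name.

```shell
# Print MemTotal minus the memory on the free, active and inactive lists.
# $1 = meminfo file (defaults to the live /proc/meminfo)
off_list_kb() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        $1 == "Active:"   { active = $2 }
        $1 == "Inactive:" { inactive = $2 }
        END { print total - free - active - inactive }
    ' "${1:-/proc/meminfo}"
}
```

With the values quoted above the result is 123600 - 12392 - 53532 - 38156 = 19520 kB. That is far below the roughly 102 MB implied by "slab:26127" in the earlier OOM report (26127 pages of 4 kB), which is consistent with this snapshot not having been taken at the time of the OOM kill.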

Comment 34 Andy Robinson 2005-02-02 17:16:39 UTC

Sorry, I misunderstood. I thought you wanted info on the 128M boot.
I am running another test now and will grab slabinfo when it hits OOM.
(It's quite hard to catch this ... with all the 'deprecated' noise
on the console.)
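Catching the right moment by hand is hard, so the snapshot could instead be taken by a small watcher that polls the log file rather than the console. This is a sketch under assumptions: the match pattern ("out of memory" or "oom-killer", case-insensitive) is a guess at the log wording and should be adjusted to the messages actually seen, and log_has_oom and snapshot_on_oom are hypothetical helper names.

```shell
# Succeed if the log contains an OOM message after the first $2 lines.
# $1 = log file, $2 = number of leading lines to ignore (default 0)
# The pattern is an assumption; adjust it to the wording in your log.
log_has_oom() {
    tail -n "+$(( ${2:-0} + 1 ))" "$1" | grep -Eiq 'out of memory|oom-killer'
}

# Wait for a new OOM message, then save slabinfo and meminfo.
# $1 = log file, $2 = output file, $3 = poll interval in seconds
snapshot_on_oom() {
    log=${1:-/var/log/messages}
    out=${2:-/tmp/slabinfo.oom}
    interval=${3:-5}
    skip=$(wc -l < "$log")      # ignore messages already in the log
    until log_has_oom "$log" "$skip"; do
        sleep "$interval"
    done
    cat /proc/slabinfo /proc/meminfo > "$out"
}
```

Starting "snapshot_on_oom &" as root before the test loop leaves the snapshot in /tmp/slabinfo.oom. The watcher itself can be slowed or killed under memory pressure, and log rotation would confuse the line count, so the result is best effort.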

Comment 35 Andy Robinson 2005-02-10 10:37:10 UTC

This no longer appears to be a problem on RHEL4 pre-RC3.

Originally we were unable to run our test cases to completion (on beta2)
without memory starvation. We saw OOM and even PANICs (kdb_panic()).

I tried to make a test case that showed the problem without any of our
code loaded.

I have since ported our code to RC3 and can now run our test cases to
completion without any apparent memory loss, so whatever the issue was
on beta2, it has now gone.


I also had to apply the patch we have developed for the SCSI inquiry hang
issue we have reported as bugzilla 140472.

Comment 36 Jay Turner 2005-02-10 12:10:34 UTC

Closing this out based on comment 35.
