Description of problem:
The Out of Memory killer is triggered while a stress test is running.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
OS version: rhel4-pre-rc1
kernel version: kernel-smp-2.6.9-1.849_EL
Could you please test with the kernels in http://people.redhat.com/riel/.test-kernel-bug-143628/ ? I think I've already fixed this bug, but the patches just need to make it into the RHEL4 tree. Please let me know if the bug still exists with the test kernel.
I still see the OOM killer on an RC2 kernel. We will test your kernel.
I added another patch (rhel4-vm-extraround.patch) and uploaded a new test kernel (congest3) to http://people.redhat.com/riel/.test-kernel-bug-143628/
The previous test kernel could still trigger an OOM in our own internal tests, though it took a few days for the error to trigger. Please verify that the congest3 kernel works fine for you.
This kernel hung after the stress test had been running for more than 2 days. However, this time I did not see the OOM killer. I saw some "SCSI error <0 0 0 0> return code = 0x800002" messages printed on the screen.
I've uploaded a -congest4 kernel with further fixes (rhel4-nr_scanned-count-writeback.patch) to http://people.redhat.com/riel/.test-kernel-bug-143628/
Could you please try that kernel to check how that one behaves?
I will test it. Do you think the SCSI error message is related to the VM? I have tested the SCSI disk with "dd if=/dev/sda of=/dev/null" successfully, so it should not be a hardware problem.
I have tested this kernel. Although I have not seen an oops on the screen, the system is really unusable after 72 hours of running; it has almost stopped responding. Very little free memory is left in the system. When I do cat /proc/slabinfo I see a huge number of size-64 slab objects. Kernel memory leak?
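One rough way to watch it (just a quick shell sketch on my side, nothing official) is to sample that slab line periodically:

# Print the size-64 cache line once a minute; an object count that keeps
# climbing while the workload stays the same suggests a leak rather than
# normal cache growth.
while :
do
    date
    grep '^size-64 ' /proc/slabinfo
    sleep 60
done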
It should be this bug. It is a memory leak in sysfs. The base kernel has already fixed it: http://marc.theaimsgroup.com/?l=linux-kernel&m=110204311025022&w=2
Back to assigned.
I have tested the new RHEL4 RC. The memory leak in sysfs is still there. Simply doing an ls -lR /sys shows a leak in slab memory.
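For reference, the sort of thing I mean is a loop like this (just a rough sketch; the Slab line in /proc/meminfo is only a coarse summary):

# Repeatedly walk sysfs and print total slab usage; with the leak present
# the Slab figure climbs steadily and is never given back.
while :
do
    ls -lR /sys > /dev/null
    grep Slab /proc/meminfo
done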
Created attachment 109953 [details] Patch to fix the sysfs memory leak
Tim, you may want to add this to the Day-0, or at least U1 lists.
Hi, I work for VERITAS and have been seeing the Out of Memory killer triggered by our tests. I have investigated further and still get a memory leak without any of our products loaded. I am running 'parted -s /dev/sdb print' in a loop and see memory leakage. Could this be related? I am running on 2.6.9-1.648_ELsmp.
Andy, that's a very old kernel. Please try again with the latest and greatest code which has been made available to our partners. The latest kernel available is 2.6.9-5.EL. Thanks.
OK, I re-installed with 2.6.9-5.ELsmp and am still seeing memory leakage when running 'parted -s /dev/XXXX print' in a loop. I also sometimes see the parted command hang - I think this could be related to bug 140472. If parted has hung and I issue another command to the disk, that command also hangs. The processes are unkillable.
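If they are stuck in uninterruptible sleep (which "unkillable" usually means), a quick check like this shows where they are waiting (rough one-liner; column output may differ slightly with this procps version):

# List processes in uninterruptible (D) state and the kernel function they wait in.
ps -eo pid,stat,wchan,cmd | awk '$2 ~ /^D/'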
Andy - can we get some more info, such as:
- what architecture? (x86, IPF, x86_64)
- what type of disk, and what disk driver?
- how big is the disk? Does it reproduce on smaller vs. larger disks?
- how much memory is in your system?
- is it doing anything else at the time?
- please attach a small test script consisting of your parted loop
- how long does it take to reproduce the oops?
Created attachment 110496 [details] test script
Sun dual Opteron, x86_64
2 Gig memory
LSI53C1030 - Fusion MPT SCSI Host driver 3.01.16
Disk - Vendor: SEAGATE Model: ST373307LC (74 Gig)

Just running the attached script (parted -s /dev/sdb print), you can watch memory leaking, and after about 2 hours it will start killing processes. I also observe the memory leak if I do the same to a fibre-attached disk: qla2300 - 3Pardata array.
I guess I should have added that I also see the message "program parted is using a deprecated SCSI ioctl, please convert it to SG_IO" on the console continuously while my test is running.
Do you still get the OOM kills if the test is done using the qla driver?
Could #145695 be triggering the same problem?
Andy, in comment #16 you say that with the latest kernel you still see "leakage". But, are you still seeing the oom kills? Can you better describe the specific problem exhibited with the latest kernel?
Andy, can you attach the console output (/var/log/messages) from when the OOM kill occurred? Thanks, Larry Woodman
I run the test script I have attached, which calls parted in a loop, and use 'top' to display memory usage, which can be seen to decrease. I have also used 'echo m > /proc/sysrq-trigger' to check the memory. I booted my box with reduced memory (mem=256M) and this then did hit the OOM killer. I have attached an extract from /var/log/messages.
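To keep a record of the decline rather than just watching top, I can also log MemFree periodically with something like this (rough sketch only; the interval and log file name are arbitrary):

# Append a timestamped MemFree reading once a minute so the leak rate
# shows up in a file instead of having to watch top continuously.
while :
do
    echo "$(date) $(grep MemFree /proc/meminfo)" >> memfree.log
    sleep 60
done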
Created attachment 110548 [details]
extract from /var/log/messages showing OOM killer

OOM killer when running parted -s /dev/sda print in a loop
Andy, you said you booted with 256MB? That's weird, 256MB is 65536 pages but your system only has about half of that! First of all, we don't support less than 256MB for any architecture on RHEL4, but this might indicate a problem sizing memory when it is limited at the boot command line with the mem= option.
--------------------------
DMA: present:16384kB which is 4096 pages
Normal: present:115712kB which is 28928 pages
Highmem: present:0kB which is 0 pages
--------------------------
Can you send along the outputs of "cat /proc/meminfo", "cat /proc/slabinfo" and "cat /proc/cmdline"? Also, are you running that memory leak patch that is attached here?

Larry Woodman
I'm sorry, my mistake - I actually booted with mem=128M in order to get the OOM to happen quicker. I will retry (again) with mem=256M. Also, I have not tried the patch included here (to fix the sysfs memleak), as I don't have a kernel build environment set up yet for the 2.6.9-5 kernel and I am not doing anything to /sys. Does it not look like there is a memory leak when running parted? Surely it would be a simple exercise for you to try this. I also see that the parted command sometimes hangs on one of the disks - as I said above - this seems to be worse on the fibre disks but also happens on the locally attached disks. I will attach a new messages file if (when) I get an OOM with 256M.
Before you reboot, grab me that /proc/slabinfo data; "slab:26127" is all of memory, which isn't a surprise when you boot with 128MB.

Larry
Created attachment 110555 [details]
mem=256M OOM killer /var/log/messages extract

Booted with mem=256M and ran multiple copies of:

while :
do
    parted -s /dev/sda print >/dev/null 2>&1
done
OK, please get me that /proc/slabinfo just after an OOM kill happens. Larry Woodman
Created attachment 110557 [details]
slabinfo/cmdline/meminfo when booted mem=128M

Information requested when booted mem=128M
Andy, are you sure this /proc/slabinfo was taken at the time of the OOM kill? All of the memory is accountable on the page lists and the slab cache is pretty much empty. I need /proc/slabinfo output from the time the OOM kills occur to debug this problem.
--------------------------
MemTotal:   123600 kB
MemFree:     12392 kB
Buffers:      4692 kB
Cached:      58956 kB
SwapCached:      0 kB
Active:      53532 kB
Inactive:    38156 kB
--------------------------
Larry Woodman
Sorry, I misunderstood - I thought you wanted info on the 128M boot. I am running another test now and will grab slabinfo when it OOMs. (It's quite hard to catch this, with all the 'deprecated' noise on the console.)
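Since catching it by hand is tricky, I am thinking of leaving a small watcher running that snapshots slabinfo whenever free memory gets low (the 8 MB threshold and 5-second interval are just guesses on my part):

# Snapshot /proc/slabinfo whenever MemFree drops below ~8 MB; the last
# snapshot written should be close to the point where the OOM killer fires.
while :
do
    free_kb=$(awk '/MemFree/ {print $2}' /proc/meminfo)
    if [ "$free_kb" -lt 8192 ]; then
        cat /proc/slabinfo > slabinfo.$(date +%H%M%S)
    fi
    sleep 5
done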
This no longer appears to be a problem on RHEL4 pre-RC3. Originally we were unable to run our test cases to completion (on beta2) without memory starvation - we saw OOMs and even panics (kdb_panic()). I tried to make a test case that showed the problem without any of our code loaded. I have since ported our code to RC3 and can now run our test cases to completion without any apparent memory loss - so whatever the issue was on beta2, it has now gone. I also had to apply the patch we have developed for the SCSI inquiry hang issue we reported as bugzilla 140472.
Closing this out based on comment 35.