Bug 169776

Summary: OOM-Killer kills Oracle processes, then the system
Product: Red Hat Enterprise Linux 4
Reporter: Thomas Tracy <tom.tracy>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Docs Contact:
Priority: medium
Version: 4.0
CC: davej, jbaron, riel
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-10-18 15:54:42 UTC Type: ---
Attachments:
Description: oom-killer log file from /var/log/messages (Flags: none)
Description: slabinfo file from 8 hour run before oom-killer starts (Flags: none)
Description: CPU and Memory information during a test (Flags: none)

Description Thomas Tracy 2005-10-03 14:55:16 UTC

Description of problem:
Running with either RHAS4 update 1 or RHAS4 update 2, the OOM-killer kills Oracle processes about 8 hours into a single query. This query is a complex join of 2 tables that produces high I/O and CPU load. Other types of workloads (reading tables) cause the oom-killer to rear its head after about 18 hours.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-16.EL

How reproducible:
Always

Steps to Reproduce:
1. Start system
2. Mount ocfsv2 volumes
3. Start database
4. Wait 8 hours for oom-killer to start
  

Actual Results:  System crash

Expected Results:  Queries should finish

Additional info:

Attaching logs from the messages file. For the last test, which I ran last night, I wrote a script that copied the contents of /proc/slabinfo into a text file during the test. I have seen similar, but not identical, bugs on this subject, but nothing concerning Oracle 10.1.0.4.
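
The reporter's script is not attached, so the following is only a minimal sketch of how /proc/slabinfo could be copied to text files at intervals during a long test. The function names (`snapshot`, `collect`), the output directory, and the 10-minute interval are assumptions made for illustration, not details from this report. Reading /proc/slabinfo may require root.

```python
import time
from pathlib import Path
from typing import List


def snapshot(source: Path, out_dir: Path, stamp: str) -> Path:
    """Copy the current contents of `source` into a stamped file in `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"slabinfo-{stamp}.txt"
    # Read and rewrite the text instead of copying by size:
    # files under /proc report a size of 0.
    target.write_text(source.read_text())
    return target


def collect(source: Path, out_dir: Path, interval_s: float, count: int) -> List[Path]:
    """Take `count` snapshots of `source`, sleeping `interval_s` seconds between them."""
    saved = []
    for i in range(count):
        # The sequence number keeps names unique even within one second.
        stamp = time.strftime("%Y%m%d-%H%M%S") + f"-{i:04d}"
        saved.append(snapshot(source, out_dir, stamp))
        if i + 1 < count:
            time.sleep(interval_s)
    return saved


if __name__ == "__main__":
    # Hypothetical settings: one snapshot every 10 minutes for 8 hours.
    collect(Path("/proc/slabinfo"), Path("slabinfo-snapshots"),
            interval_s=600, count=6 * 8)
```

Keeping one file per snapshot, rather than appending to a single file, makes it easy to compare two points in time and see which cache is growing.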

Comment 1 Thomas Tracy 2005-10-03 15:00:49 UTC
Created attachment 119551 [details]
oom-killer log file from /var/log/messages

Comment 2 Thomas Tracy 2005-10-03 15:01:56 UTC
Created attachment 119552 [details]
slabinfo file from 8 hour run before oom-killer starts

Comment 3 Thomas Tracy 2005-10-03 15:40:16 UTC
Created attachment 119553 [details]
CPU and Memory information during a test

Comment 4 Thomas Tracy 2005-10-06 14:25:05 UTC
I do not know if this will help, but I finally succeeded in completing an
Oracle complex-join query by turning off NFS and shutting down an internal
program called collectl, which gathers basic system statistics
(CPU, memory, I/O, network bandwidth). The query takes 17 hours to complete, which is
about right for a 2-CPU blade. I am running another test now with NFS turned on and
the collectl program off. I have seen a note within Bugzilla about changing
/proc/sys/vm/lower_zone_protection to 100. That had no effect in previous
experiments.

Comment 5 Larry Woodman 2005-10-07 19:54:26 UTC
The problem appears to be that *someone* is leaking about 700MB of lowmem via
32-byte kmalloc() allocations:

size-32 20159195 20159195 32 119 1 : tunables 120 60 8 : slabdata 169405 169405 0
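
As a rough check of the 700MB estimate, the slabinfo line above can be parsed and multiplied out. This is a sketch only: it assumes the slabinfo 2.x column order (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the tunables and slabdata groups) and a 4096-byte page size, which is the usual value on i686 but is not stated in this report. The names `SlabCache`, `parse_slabinfo_line`, and `PAGE_SIZE` are illustrative.

```python
from typing import NamedTuple

PAGE_SIZE = 4096  # assumption: standard i686 page size, not given in the report


class SlabCache(NamedTuple):
    name: str
    active_objs: int
    num_objs: int
    objsize: int
    objperslab: int
    pagesperslab: int
    num_slabs: int


def parse_slabinfo_line(line: str) -> SlabCache:
    """Parse one data line of /proc/slabinfo (assumed 2.x layout)."""
    # The line has three groups separated by ':' (cache stats, tunables, slabdata).
    head, _tunables, slabdata = [part.split() for part in line.split(":")]
    # slabdata group: "slabdata" <active_slabs> <num_slabs> <sharedavail>
    return SlabCache(
        name=head[0],
        active_objs=int(head[1]),
        num_objs=int(head[2]),
        objsize=int(head[3]),
        objperslab=int(head[4]),
        pagesperslab=int(head[5]),
        num_slabs=int(slabdata[2]),
    )


def object_bytes(cache: SlabCache) -> int:
    """Bytes occupied by the objects themselves."""
    return cache.num_objs * cache.objsize


def slab_bytes(cache: SlabCache) -> int:
    """Bytes of whole pages backing the cache, including per-slab overhead."""
    return cache.num_slabs * cache.pagesperslab * PAGE_SIZE


line = ("size-32 20159195 20159195 32 119 1 : tunables 120 60 8 "
        ": slabdata 169405 169405 0")
cache = parse_slabinfo_line(line)
print(cache.name, object_bytes(cache), slab_bytes(cache))
```

Under those assumptions the objects account for 645,094,240 bytes and the backing pages for 693,882,880 bytes (about 662 MiB), which matches the figure of about 700MB of lowmem.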


Please send along the lsmod output and the Alt-SysRq-M output from when this happens.

Larry Woodman


Comment 6 Tom Tracy 2005-10-18 14:55:55 UTC
Larry,

We can close this bug, as we discovered that the memory leak was caused
by ocfsv2 version 1.0.4-1. Working with Oracle, we have tested and verified
the fix.

Thanks
Tom