169776 – OOM-Killer kill Oracle Processes then system

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Larry Woodman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-10-03 14:55 UTC by Thomas Tracy
Modified:	2007-11-30 22:07 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-10-18 15:54:42 UTC
Target Upstream Version:
Embargoed:

Attachments
oom-killer log file from /var/log/messages (33.48 KB, application/octet-stream) 2005-10-03 15:00 UTC, Thomas Tracy
slabinfo file from 8 hour run before oom-killer starts (385.76 KB, text/plain) 2005-10-03 15:01 UTC, Thomas Tracy
CPU and Memory information during a test (3.01 KB, text/plain) 2005-10-03 15:40 UTC, Thomas Tracy

Description Thomas Tracy 2005-10-03 14:55:16 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; SV1; (R1 1.5); .NET CLR 1.1.4322; .NET CLR 1.0.3705)

Description of problem:
Running either RHAS4 Update 1 or RHAS4 Update 2, the OOM-killer kills Oracle processes 8 hours into a single query. The query is a complex join of 2 tables that produces high I/O and CPU load. Other workloads (reading tables) cause the OOM-killer to rear its head after 18 hours.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-16.EL

How reproducible:
Always

Steps to Reproduce:
1. Start system
2. Mount ocfsv2 volumes
3. Start database
4. Wait 8 hours for oom-killer to start
  

Actual Results:  System crash

Expected Results:  Queries should finish

Additional info:

Attaching logs from the messages file. For the test I ran last night, I wrote a script that copied the contents of /proc/slabinfo into a text file during the run. I have seen similar, though not identical, bugs on this subject, but nothing concerning Oracle 10.1.0.4.
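
The collection script itself is not attached. A minimal sketch of that kind of sampler, assuming one timestamped snapshot of /proc/slabinfo appended per interval, could look like the following. The function names, the output file name, and the "###" separator are hypothetical and not taken from the original script; reading /proc/slabinfo may require root.

```python
import time
from pathlib import Path


def snapshot(source, out_path):
    """Append one timestamped copy of `source` to `out_path`.

    Returns the number of characters appended.
    """
    text = Path(source).read_text()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    # "###" is a hypothetical separator so snapshots can be split apart later.
    record = f"### {stamp}\n{text}\n"
    with open(out_path, "a") as out:
        out.write(record)
    return len(record)


def sample(source="/proc/slabinfo", out_path="slabinfo.log",
           interval_s=60, count=3):
    """Take `count` snapshots of `source`, sleeping `interval_s` between them."""
    for i in range(count):
        snapshot(source, out_path)
        if i + 1 < count:
            time.sleep(interval_s)


# Example (hypothetical values): one sample per minute for 8 hours.
# sample(interval_s=60, count=8 * 60)
```

Comparing the first and last snapshots for a cache such as size-32 shows whether its object count grows steadily over the run.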

Comment 1 Thomas Tracy 2005-10-03 15:00:49 UTC

Created attachment 119551 [details]
oom-killer log file from /var/log/messages

Comment 2 Thomas Tracy 2005-10-03 15:01:56 UTC

Created attachment 119552 [details]
slabinfo file from 8 hour run before oom-killer starts

Comment 3 Thomas Tracy 2005-10-03 15:40:16 UTC

Created attachment 119553 [details]
CPU and Memory information during a test

Comment 4 Thomas Tracy 2005-10-06 14:25:05 UTC

I do not know if this will help, but I was finally successful in completing an
Oracle complex-join query by turning off NFS and shutting down an internal
program called collectl, which gathers basic system statistics
(CPU, memory, I/O, network bandwidth). The query takes 17 hours to complete, which is
about right for a 2-CPU blade. I am now running another test with NFS turned on
and the collectl program off. I have seen a note within Bugzilla about changing
/proc/sys/vm/lower_zone_protection to 100. That had no effect in previous
experiments.

Comment 5 Larry Woodman 2005-10-07 19:54:26 UTC

The problem appears to be *someone* is leaking about 700MB of lowmem via
kmalloc() of size 32 bytes:

size-32 20159195 20159195 32 119 1 : tunables 120 60 8 : slabdata 169405 169405 0


Please send along an lsmod output and an AltSysrq-M output when this happens.

Larry Woodman
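
The "about 700MB" figure in Comment 5 can be checked against the size-32 line itself. The sketch below assumes the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the tunables and slabdata groups) and 4 KiB pages on i686; the function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on i686


def parse_slab_line(line):
    """Parse one /proc/slabinfo (version 2.x layout) cache line into a dict."""
    head, _tunables, slabdata = [part.split() for part in line.split(":")]
    active_objs, num_objs, objsize, objperslab, pagesperslab = map(int, head[1:6])
    active_slabs, num_slabs, _sharedavail = map(int, slabdata[1:4])
    return {
        "name": head[0],
        "active_objs": active_objs,
        "num_objs": num_objs,
        "objsize": objsize,
        "objperslab": objperslab,
        "pagesperslab": pagesperslab,
        "active_slabs": active_slabs,
        "num_slabs": num_slabs,
    }


def slab_footprint(line, page_size=PAGE_SIZE):
    """Return (bytes held by objects, bytes held by slab pages) for one cache."""
    s = parse_slab_line(line)
    object_bytes = s["num_objs"] * s["objsize"]
    slab_bytes = s["num_slabs"] * s["pagesperslab"] * page_size
    return object_bytes, slab_bytes


line = ("size-32 20159195 20159195 32 119 1 : tunables 120 60 8 "
        ": slabdata 169405 169405 0")
object_bytes, slab_bytes = slab_footprint(line)
print(object_bytes)  # 645094240 bytes in 32-byte objects
print(slab_bytes)    # 693882880 bytes in slab pages, roughly 694 MB
```

All 169405 slabs are full (169405 x 119 = 20159195 objects, every one active), and each slab pins one lowmem page, which is where the roughly 700MB comes from.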

Comment 6 Tom Tracy 2005-10-18 14:55:55 UTC

Larry
        We can close this bug as we discovered that the memory leak was caused 
by ocfsv2 version 1.0.4-1. Working with Oracle, we have tested and verified 
the fix.

Thanks
Tom
