Bug 169776

Summary:

OOM-killer kills Oracle processes, then the system

Product:

Red Hat Enterprise Linux 4

Reporter:

Thomas Tracy <tom.tracy>

Component:

kernel

Assignee:

Larry Woodman <lwoodman>

Status:

CLOSED NOTABUG

QA Contact:

Brian Brock <bbrock>

Severity:

high

Docs Contact:

Priority:

medium

Version:

4.0

CC:

davej, jbaron, riel

Target Milestone:

---

Target Release:

---

Hardware:

i686

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2005-10-18 15:54:42 UTC

Attachments:

Description	Flags
oom-killer log file from /var/log/messages	none
slabinfo file from 8 hour run before oom-killer starts	none
CPU and Memory information during a test	none

Description Thomas Tracy 2005-10-03 14:55:16 UTC

Description of problem:
On both RHAS4 Update 1 and RHAS4 Update 2, the OOM-killer kills Oracle processes after 8 hours of running a single query. The query is a complex join of 2 tables that generates heavy I/O and CPU load. Other workloads (reading tables) trigger the OOM-killer after 18 hours.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-16.EL

How reproducible:
Always

Steps to Reproduce:
1. Start system
2. Mount ocfsv2 volumes
3. Start database
4. Wait 8 hours for oom-killer to start
  

Actual Results:  System crash

Expected Results:  Queries should finish

Additional info:

Attaching logs from the messages file. For last night's test, I wrote a script that copied the contents of /proc/slabinfo into a text file while the test ran. I have seen similar, though not identical, bugs on this subject, but nothing concerning Oracle 10.1.0.4.
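
The script itself is not attached to this report. A minimal Python sketch of that kind of sampler follows, assuming one timestamped copy of /proc/slabinfo is appended to a log file per interval. The function names, the output file name, and the 60-second interval are all hypothetical, not taken from the reporter's script.

```python
import time


def append_snapshot(src_path, out_path, stamp):
    """Append one timestamped copy of src_path to out_path.

    Returns the number of characters copied.
    """
    with open(src_path, "r") as src:
        data = src.read()
    with open(out_path, "a") as out:
        out.write("=== %s ===\n" % stamp)
        out.write(data)
        if not data.endswith("\n"):
            out.write("\n")
    return len(data)


def sample_forever(src_path="/proc/slabinfo", out_path="slabinfo.log",
                   interval_s=60):
    """Hypothetical driver loop: one snapshot per interval until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        append_snapshot(src_path, out_path, stamp)
        time.sleep(interval_s)
```

Calling sample_forever() before starting the database leaves a log in which the growth of any single slab cache can be followed over the run.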

Comment 1 Thomas Tracy 2005-10-03 15:00:49 UTC

Created attachment 119551 [details]
oom-killer log file from /var/log/messages

Comment 2 Thomas Tracy 2005-10-03 15:01:56 UTC

Created attachment 119552 [details]
slabinfo file from 8 hour run before oom-killer starts

Comment 3 Thomas Tracy 2005-10-03 15:40:16 UTC

Created attachment 119553 [details]
CPU and Memory information during a test

Comment 4 Thomas Tracy 2005-10-06 14:25:05 UTC

I do not know if this will help, but I finally succeeded in completing an
Oracle complex-join query by turning off NFS and shutting down an internal
program called collectl, which gathers basic system statistics
(CPU, memory, I/O, network bandwidth). The query takes 17 hours to complete,
which is about right for a 2-CPU blade. I am now running another test with NFS
turned on and collectl off. I have seen a note in Bugzilla about changing
/proc/sys/vm/lower_zone_protection to 100. That had no effect in previous
experiments.

Comment 5 Larry Woodman 2005-10-07 19:54:26 UTC

The problem appears to be that *someone* is leaking about 700MB of lowmem via
32-byte kmalloc() allocations:

size-32 20159195 20159195 32 119 1 : tunables 120 60 8 : slabdata 169405 169405 0
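
The roughly 700MB figure can be reproduced from the slabinfo line above. The Python sketch below assumes the standard /proc/slabinfo column order (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the tunables and slabdata groups) and a 4 KiB page size, which is the usual value on i686 but is not stated in this report. The helper names are illustrative only.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual size on i686


def parse_slabinfo_line(line):
    """Split one /proc/slabinfo data line into named fields."""
    stats, _tunables, slabdata = [part.split() for part in line.split(":")]
    return {
        "name": stats[0],
        "active_objs": int(stats[1]),
        "num_objs": int(stats[2]),
        "objsize": int(stats[3]),
        "objperslab": int(stats[4]),
        "pagesperslab": int(stats[5]),
        "active_slabs": int(slabdata[1]),
        "num_slabs": int(slabdata[2]),
    }


def cache_bytes(fields, page_size=PAGE_SIZE):
    """Memory pinned by the cache: slabs * pages per slab * page size."""
    return fields["num_slabs"] * fields["pagesperslab"] * page_size


line = ("size-32 20159195 20159195 32 119 1 : tunables 120 60 8 "
        ": slabdata 169405 169405 0")
fields = parse_slabinfo_line(line)
slab_total = cache_bytes(fields)                          # 693882880
payload_total = fields["num_objs"] * fields["objsize"]    # 645094240
print(fields["name"], slab_total, payload_total)
```

Under those assumptions the size-32 cache holds 169405 one-page slabs, or 693,882,880 bytes (about 662 MiB), of which 645,094,240 bytes are the 32-byte objects themselves. Every object is active, which is consistent with a leak and not with a cache that has simply not been shrunk.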


Please send along the lsmod output and the Alt-SysRq-M output captured when this happens.

Larry Woodman

Comment 6 Tom Tracy 2005-10-18 14:55:55 UTC

Larry,

We can close this bug, as we discovered that the memory leak was caused
by ocfsv2 version 1.0.4-1. Working with Oracle, we have tested and verified
the fix.

Thanks
Tom