From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; SV1; (R1 1.5); .NET CLR 1.1.4322; .NET CLR 1.0.3705)

Description of problem:
Running with either RHAS4 update 1 or RHAS4 update 2, the oom-killer kills Oracle processes about 8 hours into a single query. The query is a complex join of 2 tables that produces high I/O and CPU time. Other types of workloads (reading tables) cause the oom-killer to rear its head after 18 hours.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-16.EL

How reproducible:
Always

Steps to Reproduce:
1. Start system
2. Mount ocfsv2 volumes
3. Start database
4. Wait 8 hours for oom-killer to start

Actual Results: System crash

Expected Results: Queries should finish

Additional info:
Attaching logs from the messages file. For the last test, which I ran last night, I wrote a script that copied the contents of /proc/slabinfo into a text file during the test. I have seen similar but not identical bugs on this subject, but nothing concerning Oracle 10.1.0.4.
Created attachment 119551 [details] oom-killer log file from /var/log/messages
Created attachment 119552 [details] slabinfo file from 8 hour run before oom-killer starts
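For anyone trying to reproduce this, here is a minimal sketch of the kind of /proc/slabinfo sampling described in the report. It is not the reporter's script: the function names, the output file naming, and the 10-minute default interval are all assumptions made for illustration.

```python
import time
from pathlib import Path


def snapshot(src, dest_dir, stamp):
    """Copy the current contents of src into dest_dir/slabinfo.<stamp>.txt."""
    dest = Path(dest_dir) / f"slabinfo.{stamp}.txt"
    dest.write_text(Path(src).read_text())
    return dest


def sample(src="/proc/slabinfo", dest_dir=".", interval_s=600, count=6):
    """Take `count` snapshots of src, `interval_s` seconds apart.

    The defaults are hypothetical; reading /proc/slabinfo may require root.
    """
    paths = []
    for i in range(count):
        # Timestamp plus a sequence number keeps file names unique and ordered.
        stamp = f"{time.strftime('%Y%m%d-%H%M%S')}-{i:04d}"
        paths.append(snapshot(src, dest_dir, stamp))
        if i < count - 1:
            time.sleep(interval_s)
    return paths
```

Calling `sample(dest_dir="/tmp/slab", count=48)` would cover roughly 8 hours at the default interval, assuming the directory already exists. Comparing the first and last snapshots should show which cache is growing.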
Created attachment 119553 [details] CPU and Memory information during a test
I do not know if this will help, but I was finally successful in completing an Oracle complex-join query by turning off NFS and shutting down an internal program called collectl, which gathers basic system statistics (CPU, memory, I/O, network bandwidth). The query takes 17 hours to complete, which is about right for a 2-CPU blade. I am running another test now with NFS turned on and the collectl program off. I have seen a note within Bugzilla about changing /proc/sys/vm/lower_zone_protection to 100. That had no effect in previous experiments.
The problem appears to be that *someone* is leaking about 700MB of lowmem via kmalloc() of size 32 bytes:

size-32 20159195 20159195 32 119 1 : tunables 120 60 8 : slabdata 169405 169405 0

Please send along lsmod output and AltSysrq-M output when this happens.

Larry Woodman
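The "about 700MB" figure can be checked against the size-32 line itself. The sketch below assumes the slabinfo 2.x field order (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the tunables and slabdata groups) and a 4 KiB page size. The function name `slab_footprint` is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on 32-bit x86


def slab_footprint(line, page_size=PAGE_SIZE):
    """Return (name, bytes_in_objects, bytes_in_slab_pages) for one slabinfo line."""
    stats, _tunables, slabdata = line.split(":")
    fields = stats.split()
    name = fields[0]
    _active_objs, num_objs, objsize, _objperslab, pagesperslab = map(int, fields[1:6])
    # slabdata group: the literal word "slabdata", then active_slabs, num_slabs, sharedavail
    num_slabs = int(slabdata.split()[2])
    return name, num_objs * objsize, num_slabs * pagesperslab * page_size


line = ("size-32 20159195 20159195 32 119 1 : tunables 120 60 8 "
        ": slabdata 169405 169405 0")
name, obj_bytes, slab_bytes = slab_footprint(line)
print(name, obj_bytes, slab_bytes)
```

For the line above this gives 645,094,240 bytes held in 32-byte objects and 693,882,880 bytes of slab pages (169405 slabs of one page each), which matches the roughly 700MB of lowmem quoted.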
Larry,

We can close this bug, as we discovered that the memory leak was caused by ocfsv2 version 1.0.4-1. Working with Oracle, we have tested and verified the fix.

Thanks,
Tom