Bug 112891

Summary:

OOM killer strikes with lots of free swap space

Product:

Red Hat Enterprise Linux 3

Reporter:

Martin Wilck <martin.wilck>

Component:

kernel

Assignee:

Larry Woodman <lwoodman>

Status:

CLOSED ERRATA

QA Contact:

Severity:

high

Docs Contact:

Priority:

high

Version:

3.0

CC:

chrisw, david.browne, hfuchi, michigantech, nobody+wcheng, petrides, raimondi, riel, rperkins, tao

Target Milestone:

---

Target Release:

---

Hardware:

i686

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2004-08-05 21:26:33 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

107562, 114888

Attachments:

Description	Flags
Typical system log showing OOM messages	none

Description Martin Wilck 2004-01-05 14:43:25 UTC

Description of problem:

In a system running a copy/compare stress test, processes start
getting killed by the OOM killer after a few hours although there is
still plenty of swap available. We observed this on several machines
using the Adaptec DPT RAID controller (dpt_i2o) and, less frequently,
on systems with other SCSI controllers. System memory varied between
2GB and 8GB.

The copy/compare test generates a typical load with lots of memory
used for cache, but little process memory usage and _no_ swap usage.

Usually at least one kernel log message 

reclaim_page got unmapped page c1b3f834 (count 1), let's see if we
survive!

appears. Then after a while a few ENOMEM messages appear. Finally the
OOM killer starts killing processes, until the machine gets unusable
(vital processes are being killed by the OOM killer).

Repeat: This happens although there is plenty of swap space available
(actually, the swap appears to be unused entirely), and although
memory pressure comes almost exclusively from the page cache.

Version-Release number of selected component (if applicable):
2.4.20-4.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Install system with DPT RAID controller
2. Start copy/compare test
  
Actual results:
Process get killed after a few hours, system becomes unusable.

Expected results:
System keeps running.

Additional info:

Here is a typical meminfo during full load. Most of the memory is in
the ActiveCache queue, followed by Inactive_dirty and
Inactive_laundry. No massive shortage of either high or low memory is
observed. Swap and SwapCache are entirely unused.

MemTotal:      3094712 kB
MemFree:        531280 kB
MemShared:           0 kB
Buffers:        390668 kB
Cached:         898824 kB
SwapCached:          0 kB
Active:        1267056 kB
ActiveAnon:       3636 kB
ActiveCache:   1263420 kB
Inact_dirty:    746216 kB
Inact_laundry:  292328 kB
Inact_clean:     45032 kB
Inact_target:   470124 kB
HighTotal:     2228160 kB
HighFree:       256344 kB
LowTotal:       866552 kB
LowFree:        274936 kB
SwapTotal:     2040244 kB
SwapFree:      2040244 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

Comment 1 Martin Wilck 2004-01-05 14:46:27 UTC

Created attachment 96768 [details]
Typical system log showing OOM messages

Comment 2 Arjan van de Ven 2004-01-05 14:51:28 UTC

1) please repeat with the latest available kernel instead
2) the dpt_i2o driver isn't too good and can't do direct dma over 1Gb
and uses/allocated bounce buffers; on a 8Gb machine this is quite
inefficient.

Comment 3 Martin Wilck 2004-01-05 15:30:13 UTC

We were able to alleviate the problem partially by re-adding the lines 

+	/*
+	 * Enough swap space left?  Not OOM.
+	 */
+	if (nr_swap_pages > 0)
+		return;

at the beginning of out_of_memory() in mm/oom_kill.c. These lines are
deleted by the rmap patch (linux-2.4.21-rmap.patch) of the 2.4.21-4.EL
kernel. More precisely, they were removed with rmap-14b as of 2002/09/17.

With these lines re-added, the overall system behaviour is more
graceful. Processes aren't killed anymore. However the system still
runs into extreme memory pressure which is hard to explain under the
given load conditions.

Comment 4 Martin Wilck 2004-01-05 15:53:02 UTC

The newest EL3.0 kernel I know about is -4.0.1EL. The 5 added lines in
that kernel are unlikely to have solved this problem AFAICT. (We will
 try it nonetheless). Or do you want us to try the beta release?

We know about the bounce buffer problem. But our Japanese colleagues
need to release new machines with this controller and driver. The
machine I sent the log for had 3GB which is pretty common in today's
servers. 

Anyway, earlier RedHat kernels did not show this problem (the current
AS2.1 kernel doesn't, for example).

Comment 5 Martin Wilck 2004-01-08 16:02:45 UTC

We checked -6.EL now and it shows the same behavior as -4.EL, i.e. no
improvement. Processes are still killed, although there is always free
memory available (both high and low), and swap isn't used at all.

More info tomorrow.

Comment 6 Martin Wilck 2004-01-15 15:46:05 UTC

Took me somewhat longer to get information.

First of all, I confirm that the problem still occurs with -6.EL.

My most impressive finding is that the number of inactive_clean pages
in zone "Normal" becomes _negative_. Yes, negative. I have put a call
to show_mem into the oom_kill function in order to see what's
happening, here is the output when the killer strikes. See the "ic:-4"
below.

Jan 15 00:22:53 egal kernel: ENOMEM in
journal_get_undo_access_Rsmp_6317abf4, retrying.
Jan 15 00:22:53 egal kernel: ENOMEM in journal_alloc_journal_head,
retrying.
Jan 15 00:22:54 egal kernel: OOM
Jan 15 00:22:54 egal kernel: Mem-info:
Jan 15 00:22:54 egal kernel: Zone:DMA freepages:  2099 min:     0 low:
    0 high:     0
Jan 15 00:22:54 egal kernel: Zone:Normal freepages:   320 min:  1279
low:  4544 high:  6304
Jan 15 00:22:54 egal kernel: Zone:HighMem freepages:526920 min:   255
low: 12282 high: 18423
Jan 15 00:22:54 egal kernel: Free pages:      529339 (526920 HighMem)
Jan 15 00:22:54 egal kernel: ( Active: 261343/151150,
inactive_laundry: 1881, inactive_clean: 2684, free: 529339 )
Jan 15 00:22:54 egal kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2099
Jan 15 00:22:54 egal kernel:   aa:5 ac:100069 id:59756 il:0 ic:-4 fr:320
Jan 15 00:22:54 egal kernel:   aa:15713 ac:145556 id:91394 il:1881
ic:2688 fr:526920
Jan 15 00:22:54 egal kernel: 1*4kB 1*8kB 2*16kB 3*32kB 1*64kB 2*128kB
1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8396kB)
Jan 15 00:22:55 egal kernel: 0*4kB 0*8kB 0*16kB 4*32kB 4*64kB 3*128kB
2*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1280kB)
Jan 15 00:22:56 egal kernel: 79722*4kB 66197*8kB 35599*16kB 11665*32kB
2523*64kB 798*128kB 28*256kB 9*512kB 4*1024kB 0*2048kB 9*4096kB =
2107680kB)
Jan 15 00:22:57 egal kernel: Swap cache: add 491, delete 335, find
768/953, race 0+0
Jan 15 00:22:57 egal kernel: 25438 pages of slabcache
Jan 15 00:22:57 egal kernel: 176 pages of kernel stacks
Jan 15 00:22:57 egal kernel: 0 lowmem pagetables, 405 highmem pagetables
Jan 15 00:22:57 egal kernel: Free swap:       2095680kB
Jan 15 00:22:57 egal kernel: 1015536 pages of RAM
Jan 15 00:22:57 egal kernel: 786160 pages of HIGHMEM
Jan 15 00:22:57 egal kernel: 16106 reserved pages
Jan 15 00:22:57 egal kernel: 127598 pages shared
Jan 15 00:22:57 egal kernel: 156 pages swap cached
Jan 15 00:22:57 egal kernel: f6549e40 c01590ab c02b8704 004a5b9c
00000000 c015919d 00000759 c01536ed 
Jan 15 00:22:57 egal kernel:        ffffffff 00000000 f6548038
f6548000 00000001 000001f0 00000000 c0153dc1 
Jan 15 00:22:57 egal kernel:        000001f0 c03a7608 00000000
c0155b47 c03a7600 00000000 00000000 00000001 
Jan 15 00:22:57 egal kernel: Call Trace:   [<c01590ab>] oom_kill
[kernel] 0x1b (0xf6549e44)
Jan 15 00:22:57 egal kernel: [<c015919d>] out_of_memory [kernel] 0x5d
(0xf6549e54)
Jan 15 00:22:57 egal kernel: [<c01536ed>] do_try_to_free_pages
[kernel] 0x13d (0xf6549e5c)
Jan 15 00:22:57 egal kernel: [<c0153dc1>] try_to_free_pages [kernel]
0x51 (0xf6549e7c)
Jan 15 00:22:59 egal kernel: [<c0155b47>] __alloc_pages [kernel] 0x187
(0xf6549e8c)
Jan 15 00:23:00 egal kernel: [<c0220960>] sock_wfree [kernel] 0x0
(0xf6549ea8)
Jan 15 00:23:00 egal kernel: [<c0221f59>] __kfree_skb [kernel] 0x139
(0xf6549eac)
Jan 15 00:23:01 egal kernel: [<c0155d8c>] __get_free_pages [kernel]
0x1c (0xf6549ed0)
Jan 15 00:23:03 egal kernel: [<c0175101>] __pollwait [kernel] 0x31
(0xf6549ed4)
Jan 15 00:23:07 egal kernel: [<c024a9a6>] tcp_poll [kernel] 0x36
(0xf6549ee8)
Jan 15 00:23:07 egal kernel: [<c021e33c>] sock_poll [kernel] 0x2c
(0xf6549f08)
Jan 15 00:23:07 egal kernel: [<c01754de>] do_select [kernel] 0x24e
(0xf6549f1c)
Jan 15 00:23:08 egal kernel: [<c017586e>] sys_select [kernel] 0x34e
(0xf6549f60)
Jan 15 00:23:09 egal kernel: 
Jan 15 00:23:10 egal kernel: Out of Memory: Killed process 21152
(nautilus).
Jan 15 00:23:10 egal kernel: Out of Memory: Killed process 21165
(nautilus).

Comment 7 Martin Wilck 2004-01-16 09:41:57 UTC

I think I know why the zone->inactive_clean_pages counter becomes
negative. It is this part of reclaim_page() (mm/vmscan.c):

                /* You might think that we should never ever get here.
                 * But you'd be wrong. 
                 * 
                 * The VM can grab temporary page references while
                 * scanning pages.  If the page gets unlinked while such
                 * a reference is held, we end up here with the
                 * page->count == 1 (the temp ref), but no mapping.
                 */
                del_page_from_inactive_clean_list(page);
                add_page_to_inactive_dirty_list(page);
                zone->inactive_clean_pages--;

del_page_from_inactive_clean_list() decrements the
inactive_clean_pages counter, and then it is decremented again two
lines below!

I am just trying out if this is really the reason for our problem.

Comment 8 Martin Wilck 2004-01-16 12:59:02 UTC

Confirmed. Whenever the above code is executed, the
zone->inactive_clean_pages counter goes negative. I have put a message
and a correction there. It appears that the OOM killer doesn' strike
anymore. However we need to continue this test to be sure.

Anyway I strongly recommend to remove the line
zone->inactive_clean_pages--; 
from vmscan.c. I wonder why it was inserted there in the first place.

Just another pick: the signedness of the counters for pages in
different LRU lists is not treated consistently. Eg they are "unsigned
long" in the zone struct but "int" in functions such as
zone_free_limit() (mm_inline.h).

Comment 9 Larry Woodman 2004-01-16 15:27:01 UTC

Fixed and commited the patch.  It will be in the U2 update.

Larry

Comment 10 Martin Wilck 2004-01-16 15:50:54 UTC

Please note that the same bug appears to be in RH9 kernels, too.

Comment 11 Martin Wilck 2004-01-19 10:51:03 UTC

Bad news: an over-the-weekend test of our test load got stock after
about 12h (ping works, but nothing else) with the -6.EL kernel and the
"zone->inactive_clean_pages--;" removed. I don't know the reason yet.

Next thing we'll try is -9.EL with that patch applied.

Regards,
Martin

Comment 12 Larry Woodman 2004-01-19 19:56:08 UTC

reclaim_page() decrements the zone->inactive_clean_pages twice,causing
it to go negative.  This results in incorrect oom kills.




--- linux-2.4.21/mm/vmscan.c.orig	2004-01-16 09:52:10.000000000 -0500
+++ linux-2.4.21/mm/vmscan.c	2004-01-16 09:52:16.000000000 -0500
@@ -209,7 +209,6 @@
 
	 */
 
	del_page_from_inactive_clean_list(page);
 
	add_page_to_inactive_dirty_list(page);
-
	zone->inactive_clean_pages--;
 
	pte_chain_unlock(page);
 
	UnlockPage(page);
 	}


I fixed this last week so it will be in U2.  Here is the patch for
that fix.

Larry

Comment 14 Martin Wilck 2004-01-20 07:24:04 UTC

Yes, I know, thanks. When is "U2" going to be released? 

The problems reported in comment #11 occured with -6.EL with that
patch already applied. I wasn't able to reproduce them so far, though.

Regards, Martin

Comment 15 Martin Wilck 2004-01-20 07:25:27 UTC

It would be nice if you could make bug 107562 readable for me.

Comment 16 Stephen Tweedie 2004-01-20 10:41:10 UTC

Bug #107562 is an internal-only tracking bug that we use to group
certain bugs together.

Comment 17 Martin Wilck 2004-01-20 11:46:56 UTC

Ok thanks. That means there are no real bugs that depend on this one?

Comment 19 Robert Perkins 2004-01-22 14:37:32 UTC

Martin Wilck- yes, you are correct, there is not another bug that
depends upon this one.  FYI, I am Redhat's new partner manager for FSC.

Comment 21 Larry Woodman 2004-02-02 21:54:56 UTC

Can we get someone at Fujitsu-Siemens to verify that this patch 
applied to the RHEL3-U1 kernel fixes the OOM kill problems discribed
in this bug?

The same patch will apply to the 2.4.21-9.EL kernel without any problems.

Thank You, Larry Woodman

Comment 22 Martin Wilck 2004-02-03 07:27:27 UTC

Yes, the patch does fix those problems. 

We verified this on several machines, running under high load for
several days. Sorry, I just realized we didn't communicate that
clearly yet.

Please forget my comment #11, I have no clue what happened there and I
haven't been able to reproduce it.

Martin

Comment 23 Ernie Petrides 2004-02-04 20:55:11 UTC

Martin has confirmed with me that the patch as applied to the
2.4.21-9.EL kernel does indeed resolve this problem.  Larry's
fix will be committed to the RHEL3 U2 patch pool this week.  I
have put this bug back into the "assigned" state until the fix
is committed, at which time I'll change it to "modified" to
confirm the commit.

We also intend to include Larry's fix in the next available
RHEL3 security errata.  The availability date for this has
not yet been determined, but it is likely to be sometime
later this month.

Comment 24 Martin Wilck 2004-02-05 07:59:13 UTC

Thank you.

Comment 25 Ernie Petrides 2004-02-10 23:57:29 UTC

The fix for this problem was committed to the RHEL3 U2 patch pool
last night, and is initially available internally at Red Hat in
the Engineering build of kernel version 2.4.21-9.6.EL.

In order to honor Fujitsu's request for early availability of this
patch, this fix was also committed to the build of kernel version
2.4.21-9.0.1.EL, which will be made available at the end of next
week as part of the next Red Hat Security Advisory (RHSA-2004:066).

Comment 29 Ernie Petrides 2004-02-19 21:38:47 UTC

This fix is now available to customers via RHN in today's
release of Red Hat Security Advisory RHSA-2004:066.

Comment 31 Larry Troan 2004-02-25 18:03:20 UTC

FROM ISSUE TRACKER 29298
Event posted 02-21-2004 07:08pm by glen.foster with duration of 0.00  	
Issue Tracker 29298 documents the code changes needed for
certification tests to run, so I believe I am not the owner of this
issue anymore.  My "OOM" was a red herring, but clearly there are
others experiencing real OOM issues... so this issue should remain open.

Comment 33 Chris Worley 2004-07-28 23:35:43 UTC

I realize the above bug is fixed, and I've verified that the above
patch is in my U2 kernel, and, my inactive clean pages is not dropping
below zero, and I'm not getting any messages logged at all when my
nodes hang.

Otherwise, I seem to be having the same problem ;)

As this is the closest bug to the problem I'm experiencing, I'm
guessing I'm doing something wrong...

HPC apps tend to hog RAM and CPU cycles.  In previous distributions
(i.e. 7.3), hogging too much memory leads to swapping and a badly
performing HPC app (that doesn't hog lots of CPU while swapping).  In
RHEL, with the -15smp kernel, if an app starts requesting more memory
than is physically available, then, rather than swapping, the node
hangs hard (even though there's plenty of swap space to handle these
apps going overboard).  These are dual XEONs w/ 2GB RAM.

Although the node is hung, I can get a Magic SysReq on memory:

SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  3131 min:     0 low:     0 high:     0
Zone:Normal freepages:   320 min:  1279 low:  4544 high:  6304
Zone:HighMem freepages:   254 min:   255 low:  4608 high:  6912
Free pages:        3705 (   254 HighMem)
( Active: 484476/5763, inactive_laundry: 28, inactive_clean: 0, free:
3705 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:3131
  aa:198239 ac:3771 id:2684 il:28 ic:0 fr:320
  aa:279423 ac:2805 id:3318 il:0 ic:0 fr:254
3*4kB 2*8kB 3*16kB 3*32kB 1*64kB 2*128kB 1*256kB 1*512kB 1*1024kB
1*2048kB 2*4096kB = 12524kB)
4*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB
0*2048kB 0*4096kB = 1280kB)
0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB
0*2048kB 0*4096kB = 1016kB)
Swap cache: add 44665, delete 44665, find 1415/10133, race 0+0
8364 pages of slabcache
150 pages of kernel stacks
0 lowmem pagetables, 1220 highmem pagetables
Free swap:            0kB
524288 pages of RAM
294912 pages of HIGHMEM
8672 reserved pages
48835 pages shared
0 pages swap cached

This is as compared to like-configured node that is not experiencing
problems:

Mem-info:
Zone:DMA freepages:  3131 min:     0 low:     0 high:     0
Zone:Normal freepages: 59724 min:  1279 low:  4544 high:  6304
Zone:HighMem freepages:  4325 min:   255 low:  4608 high:  6912
Free pages:       67180 (  4325 HighMem)
( Active: 371202/42460, inactive_laundry: 12345, inactive_clean: 468,
free: 67180 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:3131
  aa:133220 ac:5279 id:4586 il:1415 ic:4 fr:59724
  aa:228287 ac:4416 id:37874 il:10930 ic:464 fr:4325
3*4kB 2*8kB 3*16kB 3*32kB 1*64kB 2*128kB 1*256kB 1*512kB 1*1024kB
1*2048kB 2*4096kB = 12524kB)
64*4kB 4*8kB 129*16kB 84*32kB 34*64kB 14*128kB 18*256kB 14*512kB
7*1024kB 11*2048kB 46*4096kB = 238896kB)
9*4kB 2*8kB 2*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB
0*2048kB 4*4096kB = 17300kB)
Swap cache: add 9229, delete 122, find 355/355, race 0+0
8650 pages of slabcache
150 pages of kernel stacks
0 lowmem pagetables, 1044 highmem pagetables
Free swap:        67984kB
524288 pages of RAM
294912 pages of HIGHMEM
8672 reserved pages
34404 pages shared
9107 pages swap cached

Can anybody see what I'm doing wrong?

Comment 34 Chris Worley 2004-07-29 14:10:12 UTC

Nevermind.

It looks like I had an order of magnitude less swap than I thought I had.

Comment 35 David Browne 2004-08-03 04:23:07 UTC

Hi, 
We are on RHAS 3.0 
2.4.21-9.0.1.ELsmp #1 SMP. The story goes that we are using Oracle 
and the Oracle Clustered File System (OCFS). I am simply trying to cp 
file from an ext filesystem to the OCFS and the boxing keeps running 
out of memory...we have 12GB of RAM and shed loads of swap which is 
never being used ?? There is nothing else running on the box .

Heres the messages file extract 

Aug  3 12:03:23 dev1 kernel: aacraid:Battery charge is now OK
Aug  3 12:14:20 dev1 sshd(pam_unix)[2749]: session closed for user 
root
Aug  3 12:25:25 dev1 sshd(pam_unix)[2943]: session opened for user 
root by (uid=0)
Aug  3 12:46:09 dev1 sshd(pam_unix)[3016]: session opened for user 
root by (uid=0)
Aug  3 12:51:19 dev1 kernel: Out of Memory: Killed process 2620 
(httpd).
Aug  3 12:56:28 dev1 EV_AGENT[2171]: Agent Main --Estream construct 
failed. Err: RWSockErr: in RWSocket::getpeername: NOTCONN
Aug  3 12:56:28 dev1 EV_AGENT[2171]: Agent Main --Estream construct 
failed. Err: RWSockErr: in RWSocket::getpeername: NOTCONN
Aug  3 12:56:28 dev1 kernel: ENOMEM in 
journal_get_undo_access_Rsmp_6317abf4, retrying.
Aug  3 12:56:28 dev1 kernel: Out of Memory: Killed process 2621 
(httpd).
Aug  3 12:56:33 dev1 kernel: Out of Memory: Killed process 2622 
(httpd).
Aug  3 12:56:38 dev1 kernel: Out of Memory: Killed process 2623 
(httpd).
Aug  3 12:56:43 dev1 kernel: Out of Memory: Killed process 2624 
(httpd).
Aug  3 12:56:48 dev1 kernel: Out of Memory: Killed process 2625 
(httpd).
Aug  3 12:56:53 dev1 kernel: Out of Memory: Killed process 2626 
(httpd).
Aug  3 12:56:58 dev1 kernel: Out of Memory: Killed process 2627 
(httpd).
Aug  3 12:57:03 dev1 kernel: Out of Memory: Killed process 2561 
(sendmail).
Aug  3 12:57:08 dev1 kernel: Out of Memory: Killed process 2679 (xfs).
Aug  3 13:02:29 dev1 kernel: Out of Memory: Killed process 3076 
(httpd).
Aug  3 13:02:29 dev1 EV_AGENT[2171]: Agent Main --Estream construct 
failed. Err: RWSockErr: in RWSocket::getpeername: NOTCONN
Aug  3 13:02:29 dev1 kernel: ENOMEM in 
journal_get_undo_access_Rsmp_6317abf4, retrying.
Aug  3 13:02:30 dev1 sshd(pam_unix)[2943]: session closed for user 
root
Aug  3 13:02:34 dev1 kernel: Out of Memory: Killed process 3077 
(httpd).
Aug  3 13:02:39 dev1 kernel: Out of Memory: Killed process 3078 
(httpd).
Aug  3 13:02:44 dev1 kernel: Out of Memory: Killed process 3079 
(httpd).
Aug  3 13:02:49 dev1 kernel: Out of Memory: Killed process 1895 
(rpc.statd).
Aug  3 13:02:54 dev1 kernel: Out of Memory: Killed process 1876 
(portmap).
Aug  3 13:02:59 dev1 kernel: Out of Memory: Killed process 2581 
(httpd).
Aug  3 13:03:04 dev1 kernel: Out of Memory: Killed process 2294 
(ntpd).
Aug  3 13:03:09 dev1 kernel: Out of Memory: Killed process 2731 
(mdmpd).
Aug  3 13:03:09 dev1 kernel: Out of Memory: Killed process 2732 
(mdmpd).
Aug  3 13:03:14 dev1 kernel: Out of Memory: Killed process 2171 
(naviagent).
Aug  3 13:03:14 dev1 EV_AGENT[2171]: Signal TERM received
Aug  3 13:03:14 dev1 EV_AGENT[2171]: Agent is going down
Aug  3 13:45:40 dev1 syslogd 1.4.1: restart.
Aug  3 13:45:40 dev1 syslog: syslogd startup succeeded

Update 2 has been applied to this box.

Comment 36 Larry Woodman 2004-08-03 19:58:33 UTC

David, the 2.4.21-9 kernel you are running is quite old.  Since that
kernel was distributed, we have made several changes to eliminate or
at lease reduce the occurance or OOM kills.  In addition, we have
added more debugging code the the OOM kill path so that memory
statistics are written out to the messages file when an OOK kill
occurs.  Please try out the lastest RHEL3-U3 kernel and see if that
helps these problems.  Its located here:

http://people.redhat.com/~lwoodman/.RHEL3/


Larry Woodman

Comment 37 David Browne 2004-08-03 23:50:55 UTC

Thanks Larry. Doing this today will let you know how it goes.
Dave.

Comment 38 Ernie Petrides 2004-08-05 21:26:33 UTC

The fix for the problem (as notated in comment #29) first became
available in advisory RHSA-2004:066, so I'm closing this bug.

Comment 39 David Browne 2004-08-05 22:36:37 UTC

This level of the kernel is too high for our QLA cards (ie. EMC not 
recognised at this level) so we have had to go back to 
2.4.21.15.0.4.ELsmp. As this fix is not included in this kernel - 
could you advise what we should do now ?

Comment 40 David Browne 2004-08-05 22:51:54 UTC

OK forget this. We are going with this level as this is the RedHat 
supported version. Thanks,

Comment 41 Brian Lin 2006-01-06 06:40:38 UTC

even 2.4.21-27EL-smp kernel i found out the same problem while running oracle. 
oracle bk processes would be killed automatically. how can we expect something 
fixed for this bug shortly.