130012 – OOM Kill kicks in on 64 Gig Bull Nova system

Summary: OOM Kill kicks in on 64 Gig Bull Nova system

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	ia64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Larry Woodman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-08-16 15:58 UTC by Bill Peck
Modified:	2007-11-30 22:07 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-12-20 20:55:54 UTC
Target Upstream Version:
Embargoed:

Attachments
buffers.patch (1.40 KB, text/plain) 2004-08-17 21:22 UTC, Larry Woodman

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2004:550	0	normal	SHIPPED_LIVE	Updated kernel packages available for Red Hat Enterprise Linux 3 Update 4	2004-12-20 05:00:00 UTC

Description Bill Peck 2004-08-16 15:58:59 UTC

Description of problem:
When running I/O against qla2300 on a 16-way Bull system the OOM
killer kicks in.

Larry Woodman is looking at this.  It looks like the buffer heads are
not getting reclaimed and we run out of memory.
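
One way to check whether buffer heads are piling up is to watch the
buffer_head slab cache in /proc/slabinfo during the I/O run.  The
following is a minimal sketch, assuming a slabinfo layout in which each
line starts with the cache name followed by active objects, total
objects and object size.  The helper name slab_usage is hypothetical
and is not part of any kernel or library API.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one slab cache in a slabinfo-style text file.
 * Assumed line layout: <name> <active_objs> <total_objs> <objsize> ...
 * Returns 0 and fills the out parameters on success, or -1 if the
 * file cannot be opened or the cache is not listed.
 */
int slab_usage(const char *path, const char *cache,
               unsigned long *active, unsigned long *total,
               unsigned long *objsize)
{
    char line[512];
    char name[64];
    unsigned long a, t, s;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        /* Header lines do not parse as name + three numbers and are skipped. */
        if (sscanf(line, "%63s %lu %lu %lu", name, &a, &t, &s) == 4 &&
            strcmp(name, cache) == 0) {
            *active = a;
            *total = t;
            *objsize = s;
            fclose(fp);
            return 0;
        }
    }

    fclose(fp);
    return -1;
}
```

Calling slab_usage("/proc/slabinfo", "buffer_head", ...) periodically
and multiplying the total object count by the object size gives a
rough lower bound on the memory held by buffer heads; slab overhead is
not counted.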

Version-Release number of selected component (if applicable):
happens with both 
2.4.21-9
2.4.21-18

How reproducible:
Run I/O against the qla2300 controller; the problem shows up in less
than an hour.

Steps to Reproduce:
1. 
2.
3.
  
Actual results:


Expected results:


Additional info:
Larry has built a new kernel which I am trying now

Comment 1 Larry Woodman 2004-08-16 21:26:08 UTC

There are two separate problems that are causing the OOM killer to
attack the processes on this machine:

1.) The fancyIOtlb.patch for IA64 systems without hardware IOMMUs
causes all kernel data structures (kmem_cache_alloc and kmalloc) to be
allocated out of the relatively small (2GB) DMA zone.  So it doesn't
take very long before the DMA zone is totally consumed by the slab and
the system starts OOM killing.

2.) The try_to_reclaim_buffers() routine, which is responsible for
reclaiming all buffer headers on RHEL3, is only called from kswapd and
not from other tasks via __alloc_pages.  This means that on a machine
with more than 10 processors it's possible for the OOM killer to be
invoked more than 10 times in a short timeframe without an intervening
success from kswapd.  This can result in erroneous OOM kills as well as
really lousy performance when lowmem gets consumed by buffer headers
via the slab.
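
The second problem can be illustrated with a toy model.  This is only
a sketch under stated assumptions, not the RHEL3 kernel code: it
assumes every failed allocation counts as one OOM killer invocation,
that only a kswapd success resets the count, and that a kill happens
once the count exceeds 10.  All names (oom_model, alloc_failed,
kswapd_success, simulate) are hypothetical.

```c
/* Threshold taken from the description above ("more than 10 times"). */
#define OOM_FAILURE_LIMIT 10

/* Toy model only: hypothetical names, not kernel data structures. */
struct oom_model {
    int failures;   /* failed allocations since the last kswapd success */
    int oom_kills;  /* number of times the model decided to OOM kill */
};

/* A task failed to allocate; in this model it cannot reclaim buffer
 * headers itself, so all it can do is bump the failure count. */
static void alloc_failed(struct oom_model *m)
{
    if (++m->failures > OOM_FAILURE_LIMIT) {
        m->oom_kills++;
        m->failures = 0;
    }
}

/* kswapd reclaimed something, which clears the failure count. */
static void kswapd_success(struct oom_model *m)
{
    m->failures = 0;
}

/* Each round, every CPU hits one failed allocation before kswapd gets
 * a single chance to run.  Returns the number of modeled OOM kills. */
int simulate(int ncpus, int rounds)
{
    struct oom_model m = {0, 0};

    for (int r = 0; r < rounds; r++) {
        for (int cpu = 0; cpu < ncpus; cpu++)
            alloc_failed(&m);
        kswapd_success(&m);
    }
    return m.oom_kills;
}
```

In this model a 16-CPU machine crosses the limit in every round before
kswapd can run, while a machine with 10 or fewer CPUs never does,
which matches the "more than 10 processors" observation.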

I am working on separate fixes for both problems.

Comment 2 Larry Woodman 2004-08-17 21:22:20 UTC

Created attachment 102811
buffers.patch

Comment 3 Larry Woodman 2004-08-17 21:23:43 UTC

The above patch fixes both problems described above.  It has been
submitted to rhkernel-list for comments and RHEL3-U4 consideration.

Larry

Comment 4 Ernie Petrides 2004-09-15 00:08:09 UTC

A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.6.EL).

Comment 5 Ernie Petrides 2004-09-18 05:57:37 UTC

The fix to the fix has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.7.EL).

Comment 6 John Flanagan 2004-12-20 20:55:54 UTC

An erratum has been issued which should address the problem
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html
