Red Hat Bugzilla – Bug 130012
OOM Kill kicks in on 64 Gig Bull Nova system
Last modified: 2007-11-30 17:07:03 EST
Description of problem:
When running I/O against a qla2300 on a 16-way Bull system, the OOM
killer kicks in.
Larry Woodman is looking at this. It looks like buffer heads are
not being reclaimed and we run out of memory.
Version-Release number of selected component (if applicable):
happens with both
run I/O for less than an hour against qla2300 controller
Steps to Reproduce:
Larry has built a new kernel which I am trying now
There are two separate problems causing the OOM killer to attack the
processes on this machine:
1.) The fancyIOtlb.patch for IA64 systems without hardware IOMMUs
causes all kernel data structures (kmem_cache_alloc and kmalloc) to be
allocated out of the relatively small (2GB) DMA zone. So it doesn't
take very long before the DMA zone is totally consumed by the slab and
the system starts OOM killing.
2.) The try_to_reclaim_buffers() routine, which is responsible for
reclaiming all buffer headers on RHEL3, is only called from kswapd and
not from other tasks via __alloc_pages. This means that on a machine
with more than 10 processors it's possible for the OOM killer to be
invoked more than 10 times in a short timeframe without an intervening
success from kswapd. This can result in erroneous OOM kills as well as
really lousy performance when lowmem gets consumed by buffer headers
via the slab.
I am working on separate fixes for both problems.
Created attachment 102811 [details]
The above patch fixes both problems described above. The fixes have been
submitted to rhkernel-list for comments and RHEL3-U4 consideration.
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.6.EL).
The fix to the fix has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.7.EL).
An erratum has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.