Description of problem:
When running I/O against a qla2300 on a 16-way Bull system, the OOM killer kicks in. Larry Woodman is looking at this. It looks like buffer heads are not getting reclaimed and we run out of memory.

Version-Release number of selected component (if applicable):
Happens with both 2.4.21-9 and 2.4.21-18.

How reproducible:
Run I/O for less than an hour against the qla2300 controller.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Larry has built a new kernel, which I am trying now.
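One way to check the "buffer heads are not getting reclaimed" suspicion is to watch the buffer_head slab cache in /proc/slabinfo while the I/O load runs. The sketch below is only an illustration: the function name `buffer_head_usage` and the sample text are made up, and it assumes the first three numeric columns of a slabinfo line are active objects, total objects, and object size, which should be checked against the kernel in use.

```python
# Hypothetical sample in the style of a 2.4 /proc/slabinfo; the numbers are invented.
SAMPLE_SLABINFO = """slabinfo - version: 1.1 (SMP)
kmem_cache            96     96    244    6    6    1 :  252  126
buffer_head      1850000 1900000    128 61667 63334    1 :  252  126
"""


def buffer_head_usage(slabinfo_text, cache="buffer_head"):
    """Return object counts and approximate bytes in use for one slab cache.

    Assumes columns: name, active_objs, total_objs, objsize, ...
    Returns None if the cache is not listed.
    """
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == cache:
            active, total, objsize = (int(x) for x in fields[1:4])
            return {
                "active_objs": active,
                "total_objs": total,
                "objsize": objsize,
                "active_bytes": active * objsize,
            }
    return None
```

On a live system the text would come from reading /proc/slabinfo (this may require root). Sampling it every few seconds during the test and seeing `active_bytes` climb without ever dropping would be consistent with the description above.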
There are two separate problems that are causing the OOM killer to attack the processes on this machine:

1.) The fancyIOtlb.patch for IA64 systems without hardware IOMMUs causes all kernel data structures (kmem_cache_alloc and kmalloc) to be allocated out of the relatively small (2GB) DMA zone. So it doesn't take very long before the DMA zone is totally consumed by the slab and the system starts OOM killing.

2.) The try_to_reclaim_buffers() routine, which is responsible for reclaiming all buffer headers on RHEL3, is only called from kswapd and not from other tasks via __alloc_pages. This means that on a machine with more than 10 processors it's possible for the OOM killer to be invoked more than 10 times in a short timeframe without an intervening success from kswapd. This can result in erroneous OOM kills as well as really lousy performance when lowmem gets consumed by buffer headers via the slab.

I am working on separate fixes for both problems.
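To make the reasoning in problem 2 concrete, here is a toy model, not kernel code. It assumes a single counter of failed allocation attempts that is cleared whenever kswapd makes progress, and a kill once the counter exceeds 10. The threshold comes from the comment above; the event names, the counter reset after a kill, and the function name `simulate` are assumptions made for illustration.

```python
def simulate(events, threshold=10):
    """Count OOM kills in a toy model of the behaviour described above.

    events: iterable of "fail" (a task failed to allocate and called into
            the OOM path) or "kswapd_ok" (kswapd reclaimed something).
    A kill happens when more than `threshold` failures occur with no
    intervening kswapd success. The counter restarts after a kill
    (an assumption of this model).
    """
    failures = 0
    kills = 0
    for event in events:
        if event == "kswapd_ok":
            failures = 0
        elif event == "fail":
            failures += 1
            if failures > threshold:
                kills += 1
                failures = 0
    return kills


def one_round(cpus):
    """Every CPU fails one allocation, then kswapd succeeds once."""
    return ["fail"] * cpus + ["kswapd_ok"]
```

In this model `simulate(one_round(8) * 3)` never reaches the threshold, while `simulate(one_round(16) * 3)` kills once per round even though kswapd succeeds every round. That is the sense in which the kills are erroneous: memory is reclaimable, but only kswapd is allowed to reclaim the buffer headers.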
Created attachment 102811 [details] buffers.patch
The above patch fixes both problems described above. The fixes have been submitted to rhkernel-list for comments and RHEL3-U4 consideration. Larry
A fix for this problem was committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.6.EL).
The fix to the fix has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.7.EL).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html