Red Hat Bugzilla – Bug 497611
Miscellaneous bugs in the oom killer logic
Last modified: 2014-06-09 07:48:20 EDT
I am doing some enhancements for the oom killer logic in rh5.3 (will
be valid for rh5.3 2.6.18-138 and later branches) which solve a couple
of problems with oom killing behavior:
(a) even with the checks for TIF_MEMDIE added in various parts of
alloc_pages() and the try_to_free_pages() paths, it is fairly easy
out of the box to create a situation, especially on a swapless
system, where OOM is never declared by the return value of
try_to_free_pages(). Some of this is specifically due to a Red Hat
patch (as of 5.2) that just about guarantees that OOM will never be
returned by try_to_free_pages(). In playing with it, I was able to
get it to detect OOM and at least sometimes return the OOM
indication to alloc_pages() [at which point the final checks for
available free pages are done and out_of_memory() is called]. This
is a pre-condition for the OOM killing logic to trigger in the
first place.
(b) When badness() picks a process as the target of the OOM kill, in
my experience with large multi-threaded servers only the thread
group leader is killed. This adds latency because the leader has to
land in the group exit function before it kicks the other threads.
It is not clear that this makes any sense. The goal is to get all
the threads on t->mm into exit_mm(), since the physical pages have
no prayer of being released before that.
(c) Meanwhile other threads are down in the reclaim code causing
scheduler havoc. If they are unable to reclaim many pages AND they
do not return an OOM indication, the page allocation is retried, and
if it fails they end up (subject to gfp mask bits) backing off for
HZ/10 and retrying the whole thing. If no processes exit on their
own and the process which is expected to die (because OOM kill =>
SIGKILL to it) does not die, then we have an OOM deadlock.
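To make point (b) concrete, here is a minimal userspace sketch of signalling every task that shares the victim's mm instead of only the thread group leader. It is not the attached patch. The struct layouts and the function name kill_all_on_mm() are hypothetical stand-ins for the kernel's task_struct/mm_struct handling, and the OOM_DISABLE value of -17 is the one discussed in this report.

```c
#include <stddef.h>
#include <stdbool.h>

#define OOM_DISABLE (-17)

struct mm { int id; };              /* stand-in for struct mm_struct */

struct task {                       /* stand-in for struct task_struct */
    int pid;
    int tgid;
    struct mm *mm;                  /* NULL for a kernel thread */
    int oomkilladj;
    bool sigkill_pending;
};

/*
 * Mark every task that shares the victim's mm for SIGKILL.
 * Returns the number of tasks marked, or -1 (nothing marked) if the
 * victim has no mm or if any task on that mm is set to OOM_DISABLE.
 */
static int kill_all_on_mm(struct task *tasks, size_t n,
                          const struct task *victim)
{
    size_t i;
    int killed = 0;

    if (victim->mm == NULL)
        return -1;

    /* First pass: any OOM_DISABLE sharer aborts the whole kill. */
    for (i = 0; i < n; i++)
        if (tasks[i].mm == victim->mm &&
            tasks[i].oomkilladj == OOM_DISABLE)
            return -1;

    /* Second pass: signal all sharers, not just the group leader. */
    for (i = 0; i < n; i++) {
        if (tasks[i].mm == victim->mm) {
            tasks[i].sigkill_pending = true;
            killed++;
        }
    }
    return killed;
}
```

With three threads on one mm and an unrelated fourth process, the call marks the three threads and leaves the fourth alone, so all of them can head for exit_mm() without waiting on the leader.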
Before pushing any of the patches which actually seem to solve these problems,
I have some pre-patches in mm/oom_kill.c that I'd like to submit for
consideration in rhel5.4 or some later version of rhel5. These are pretty
obvious patches. They avoid selecting kernel threads which have
borrowed user pages (e.g. AIO worker threads), add a missing task_unlock()
for the swapoff process, and fix a minor typo in the search for any
thread on the same mm that has OOM disabled (which aborts the whole thing).
Finally, when doing the oomkilladj > 0 case, if points is 0 the left
shift leaves it at 0, so the adjustment has no effect. I found this one
in some upstream kernel.org release (can't remember which).
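The left-shift problem can be shown in a few lines of userspace C. This is a sketch only: the function names are hypothetical, and the "fixed" variant assumes the remedy is to bump a zero score to 1 before shifting, which may not match the attachment line for line.

```c
/*
 * Hypothetical stand-ins for the adjustment step at the end of
 * badness(): positive oomkilladj shifts the score left, negative
 * shifts it right.
 */
static unsigned long adjust_points_buggy(unsigned long points,
                                         int oomkilladj)
{
    if (oomkilladj > 0)
        points <<= oomkilladj;      /* 0 << n is still 0 */
    else if (oomkilladj < 0)
        points >>= -oomkilladj;
    return points;
}

static unsigned long adjust_points_fixed(unsigned long points,
                                         int oomkilladj)
{
    if (oomkilladj > 0) {
        if (points == 0)
            points = 1;             /* assumed fix: give the shift something to act on */
        points <<= oomkilladj;
    } else if (oomkilladj < 0) {
        points >>= -oomkilladj;
    }
    return points;
}
```

With points == 0 and oomkilladj == 3, the first variant returns 0 and the second returns 8, so a task that volunteered with a positive adjustment is no longer scored the same as one that did not.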
None of this is very testable because the OOM killer seems pretty hard
to hit in my experience. Whoever knows the mm code pretty well would
recognize the correctness of these patches.
Also - there is the OOM_DISABLE property which a task with the
right privileges can use to exempt itself from being OOM killed.
This is a very worthwhile notion, but an added enhancement which
I've found to be worthwhile is a global variant of this which is
not wired to -17. If it is set to, say, 1 instead of -17, it exempts
all processes which have not 'volunteered' for OOM killing by
raising their oom adjustment value above this global value.
We call this limited oom killing and by doing a wrapper called
setoom around the process startup, we have predictable candidates for
the oom killing. It works particularly well if the candidate
is restartable by its parent. With such a scheme we can do massive
leak injections (to mimic a memory leak), watch the oom killer take
down the largest non-exempt process it can find and the system pretty
gracefully recovers the memory leaked. For this to work, out_of_memory()
has to be called. Also - in the interest of not thrashing while the
target of OOM killing exits and releases its memory, I have found
it very worthwhile to hook the final drop of the mm (in exit_mm())
with a wakeup of processes that are blocked waiting for that event.
Coupled with some robust timers, this avoids the thrashing in the
reclaim code which spikes the load average and in general does
nothing useful if no other processes are exiting and releasing
physical pages.
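As a rough illustration of the limited oom killing idea, the sketch below picks the largest task whose adjustment value is above a global threshold and treats everything else as exempt. The names select_victim(), struct candidate, rss_pages and oom_kill_threshold are hypothetical, and using resident pages as the size metric is a simplification of what badness() actually computes.

```c
#include <stddef.h>

struct candidate {                  /* hypothetical, not a kernel struct */
    int pid;
    unsigned long rss_pages;        /* simplified size metric */
    int oomkilladj;
};

/*
 * Return the index of the largest task that has volunteered for OOM
 * killing (oomkilladj > oom_kill_threshold), or -1 if every task is
 * exempt under the current threshold.
 */
static int select_victim(const struct candidate *tasks, size_t n,
                         int oom_kill_threshold)
{
    int best = -1;
    unsigned long best_rss = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tasks[i].oomkilladj <= oom_kill_threshold)
            continue;               /* exempt: did not volunteer */
        if (best < 0 || tasks[i].rss_pages > best_rss) {
            best = (int)i;
            best_rss = tasks[i].rss_pages;
        }
    }
    return best;
}
```

With the threshold at -17 this behaves like the per-task exemption alone. Raising it to 1 means only tasks started with an adjustment of 2 or more (for example through a wrapper like setoom) can be selected, which is what makes the candidates predictable.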
Created attachment 341271
Fixes miscellaneous bugs in mm/oom_kill.c
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
Not sure why info is needed if this bug is WONTFIX.