Bug 591137 - server hang with no OOM reaper on memory exhaustion
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.2
Hardware: All Linux
Priority: high
Severity: high
Assigned To: Luis Claudio R. Goncalves
QA Contact: David Sommerseth
Reported: 2010-05-11 10:18 EDT by Jon Thomas
Modified: 2016-05-22 19:30 EDT (History)
Doc Type: Bug Fix
Last Closed: 2010-12-17 11:07:40 EST


Attachments
log from vmcore (83.79 KB, text/plain)
2010-05-11 10:25 EDT, Jon Thomas
foreach bt dump (8.20 MB, text/plain)
2010-05-11 10:51 EDT, Jon Thomas
reproducer for 158 crash (428 bytes, text/x-csrc)
2010-06-24 13:14 EDT, Jon Thomas

Description Jon Thomas 2010-05-11 10:18:13 EDT
The customer is using the MRG RT kernel on his machines and is getting a large number of system hangs. The server is short on memory during each hang, but the OOM killer is not getting invoked.

crash> sys
    KERNEL: /usr/lib/debug/lib/modules/2.6.24.7-139.el5rt/vmlinux
  DUMPFILE: /share/773563/2010610/vmcore
      CPUS: 8
      DATE: Thu Apr 15 18:36:59 2010
    UPTIME: 6 days, 04:37:46
LOAD AVERAGE: 1378.45, 1353.83, 1114.23
     TASKS: 9974
  NODENAME: xxxxxxxxxxxxxxx
   RELEASE: 2.6.24.7-139.el5rt
   VERSION: #1 SMP PREEMPT RT Mon Nov 16 12:02:19 EST 2009
   MACHINE: x86_64  (3333 Mhz)
    MEMORY: 32 GB
     PANIC: "SysRq : Trigger a crashdump"
Comment 2 Jon Thomas 2010-05-11 10:23:26 EDT
I think there may be two issues here:

1) The reaper is not getting called. I checked the OOM flags in the sosreport, and the oom-killer should be called.

2) The hang. I suspect that this may be a dup of bz 546428. There are 928 processes in shrink_zone, which just doesn't seem right.

grep -c shrink_zone btall

928
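
The grep count above counts matching lines in the dump. To see which tasks are involved, the same dump can be grouped per task. The following is a minimal sketch, assuming the file contains standard crash "foreach bt" output, where each task starts with a PID/TASK/CPU/COMMAND header and each frame looks like "#N [address] function at address". The function name tasks_in_function is hypothetical and not part of crash.

```python
import re

# Header line that crash prints before each task's backtrace, for example:
# PID: 1234   TASK: ffff81012345a000  CPU: 3   COMMAND: "java"
TASK_HEADER = re.compile(
    r'^PID:\s*(\d+)\s+TASK:\s*\S+\s+CPU:\s*\d+\s+COMMAND:\s*"(.*)"'
)


def tasks_in_function(lines, func):
    """Return {pid: command} for tasks whose backtrace has a frame in func."""
    # Frame lines look like: " #5 [ffff8101a2b3bd10] shrink_zone at ffffffff8028e1f5"
    frame = re.compile(r'\]\s+' + re.escape(func) + r'\s+at\s')
    hits = {}
    pid = command = None
    for line in lines:
        header = TASK_HEADER.match(line)
        if header:
            # A new task begins; remember who owns the following frames.
            pid, command = int(header.group(1)), header.group(2)
        elif pid is not None and frame.search(line):
            hits[pid] = command
    return hits
```

With the dump saved as btall, len(tasks_in_function(open("btall"), "shrink_zone")) gives the number of distinct tasks, which can be lower than the grep line count if a task has the function in more than one frame.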
Comment 3 Jon Thomas 2010-05-11 10:25:33 EDT
Created attachment 413153
log from vmcore
Comment 4 Jon Thomas 2010-05-11 10:51:01 EDT
Created attachment 413160
foreach bt dump
Comment 9 Luis Claudio R. Goncalves 2010-06-11 12:20:50 EDT
I found a patch from RHEL that we have been carrying for a long time in the v1 kernel, linux-2.6-rt-oomkill.patch, which (among other things) includes a function called should_oom_kill() that attempts to avoid an OOM frenzy. As there are no comments in the patch, I am currently investigating why the patch was added in the first place and whether we need to remove the should_oom_kill() references or fix the function.

I have a test kernel running with the should_oom_kill() bits removed and it is behaving the expected way. I have several instances of three different memory hoggers running and they are being killed right away when an OOM situation happens.
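
The memory hoggers used in this test are not described in this report. As an illustration only, a bounded hogger could look like the sketch below. The function name hog and its parameters are assumptions made for this example. It writes one byte per page so that the kernel has to back the allocation with real memory rather than leaving it untouched.

```python
def hog(limit_mb, chunk_mb=1, page_size=4096):
    """Allocate up to limit_mb MiB in chunk_mb pieces, touching every page.

    Returns the list of buffers so the caller decides how long they stay alive.
    """
    chunks = []
    chunk_bytes = chunk_mb * 1024 * 1024
    for _ in range(limit_mb // chunk_mb):
        buf = bytearray(chunk_bytes)
        # Write one byte per page so the pages are actually committed.
        for offset in range(0, chunk_bytes, page_size):
            buf[offset] = 1
        chunks.append(buf)
    return chunks
```

Called with a small limit, such as hog(64), this is harmless. Provoking a real OOM situation would need a limit above RAM plus swap and should only be tried on a disposable test machine.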

As soon as the historical motivation for the inclusion of that patch is clear, I will have a fix added to our latest kernel and test it on bigger systems.
Comment 14 Jon Thomas 2010-06-24 13:14:22 EDT
Created attachment 426656
reproducer for 158 crash
Comment 19 Frederik Bijlsma 2010-07-05 10:46:29 EDT
Guys, 

what is the status of this one?

Frederik
Comment 20 Luis Claudio R. Goncalves 2010-07-08 12:29:47 EDT
A patch, described below, has been added to kernel -160 in order to enhance the behavior of the oom-killer (speed and accuracy). This patch also includes a new invocation path for the oom-killer (page fault handling).

    oom-killer: several fixes and enhancements
    
    Bugzilla: 589741 591137
    
    This patch contains the backport of portions of upstream commits related
    to the oom-killer, bringing mrg v1 oom-killer closer to upstream. The
    following upstream commits have been backported:
    
      1c0fe6e mm: invoke oom-killer from page fault
      ff0ceb9 oom: serialize out of memory calls
      28b83c5 oom: move oom_adj value from task_struct to signal_struct
      4365a56 oom-kill: fix NUMA constraint check with nodemask
      b95c35e oom: fix the unsafe usage of badness() in proc_oom_score()
      d553ad8 param: fix NULL comparison on oom
      1ac0cb5 mm: fix anonymous dirtying
      5d863b8 oom: fix oom_adjust_write() input sanity check
      6583bb6 mm: avoid endless looping for oom killed tasks
      82553a9 oom: invoke oom killer for __GFP_NOFAIL
      a12888f oom_kill: don't call for int_sqrt(0)
      4779280 mm: make get_user_pages() interruptible   (partially backported)
      7a36a75 get_user_pages(): fix possible page leak on oom
      a1e0961 relay: nopage
      e91a810 oom_kill bug
      7b1915a mm/oom_kill.c: Use list_for_each_entry instead of list_for_each
Comment 32 Jon Thomas 2010-09-01 08:23:42 EDT
FYI: per the customer, the kernel-rt supplied with MRG 1.3 beta works very well. OOM handling is done in under 2 seconds.
Comment 34 Jens Kuehnel 2010-09-10 09:08:08 EDT
What kind of information is needed?
I'm the customer with the problem.
Comment 39 Jon Thomas 2010-09-24 14:42:16 EDT
That may be possible. We're not sure yet.

But if SELinux is all that is required, will we support 1.3 rt on a rhel5.4+SELinux upgrade?
Comment 40 Jens Kuehnel 2010-09-24 16:54:51 EDT
Hi,

we have SELinux deactivated with selinux=0 in /proc/cmdline.

SELinux is not a showstopper.

CU
Jens
aka the Customer ;-)
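
The selinux=0 setting mentioned in the comment above can be checked from the boot command line. The following is a minimal sketch, assuming the contents of /proc/cmdline are passed in as a string. The helper names are hypothetical, and the parsing is a plain whitespace split that does not handle quoted values containing spaces.

```python
def cmdline_params(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def selinux_disabled_at_boot(text):
    """True if the command line carries selinux=0."""
    return cmdline_params(text).get("selinux") == "0"
```

On a live system the input would come from open("/proc/cmdline").read().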
Comment 45 Jens Kuehnel 2010-12-17 11:07:40 EST
We updated to MRG 1.3 and the bug does not exist there.

Service Ticket is already closed.
