Bug 150019

Summary: Don't oom kill TASK_UNINTERRUPTIBLE processes
Product: Red Hat Enterprise Linux 3
Reporter: Issue Tracker <tao>
Component: kernel
Assignee: Ernie Petrides <petrides>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: dhoward, jburke, petrides, riel
Hardware: All
OS: Linux
Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Last Closed: 2005-09-28 14:50:00 UTC
Bug Blocks: 156320
Attachments:
Description: Don't oom kill TASK_UNINTERRUPTIBLE processes
Flags: none

Description Issue Tracker 2005-03-01 18:05:10 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Don Howard 2005-03-01 18:08:17 UTC
LLNL reports:

We recently had a machine in which the Out of Memory (OOM) killer
continually looped trying to kill the same process.

Here's a chunk of console output:

2005-02-03 15:06:49 Mem-info:
2005-02-03 15:06:49 Zone:DMA freepages:  1055 min:  1056 low:  1088 high:  1120
2005-02-03 15:06:49 Zone:Normal freepages:  1275 min:  1279 low:  4480 high:  6208
2005-02-03 15:06:49 Zone:HighMem freepages:   254 min:   255 low:  4672 high:  7008
2005-02-03 15:06:49 Free pages:        2584 (   254 HighMem)
2005-02-03 15:06:49 ( Active: 477411/7004, inactive_laundry: 78, inactive_clean: 0, free: 2584 )
2005-02-03 15:06:49   aa:675 ac:5 id:1 il:0 ic:0 fr:1055
2005-02-03 15:06:49   aa:186476 ac:1792 id:78 il:0 ic:0 fr:1275
2005-02-03 15:06:49   aa:292693 ac:2773 id:0 il:0 ic:0 fr:254
2005-02-03 15:06:49 1*4kB 5*8kB 35*16kB 19*32kB 5*64kB 3*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 4220kB)
2005-02-03 15:06:49 15*4kB 0*8kB 17*16kB 11*32kB 3*64kB 11*128kB 7*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 5100kB)
2005-02-03 15:06:49 2*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1016kB)
2005-02-03 15:06:49 Swap cache: add 3213506, delete 3213425, find 165543/969891, race 0+0
2005-02-03 15:06:49 7266 pages of slabcache
2005-02-03 15:06:49 332 pages of kernel stacks
2005-02-03 15:06:49 1936 lowmem pagetables, 0 highmem pagetables
2005-02-03 15:06:49 Free swap:            0kB
2005-02-03 15:06:49 524288 pages of RAM
2005-02-03 15:06:49 299008 pages of HIGHMEM
2005-02-03 15:06:49 10487 reserved pages
2005-02-03 15:06:49 3596 pages shared
2005-02-03 15:06:49 103 pages swap cached
2005-02-03 15:06:49 Out of Memory: Killed process 7641 (engine_par).
2005-02-03 15:06:54 Mem-info:
2005-02-03 15:06:54 Zone:DMA freepages:  1055 min:  1056 low:  1088 high:  1120
2005-02-03 15:06:54 Zone:Normal freepages:  1277 min:  1279 low:  4480 high:  6208
2005-02-03 15:06:54 Zone:HighMem freepages:   254 min:   255 low:  4672 high:  7008
2005-02-03 15:06:54 Free pages:        2586 (   254 HighMem)
2005-02-03 15:06:54 ( Active: 484416/79, inactive_laundry: 0, inactive_clean: 0, free: 2586 )
2005-02-03 15:06:54   aa:673 ac:7 id:1 il:0 ic:0 fr:1055
2005-02-03 15:06:54   aa:184536 ac:1794 id:1939 il:79 ic:0 fr:1277
2005-02-03 15:06:54   aa:289137 ac:2774 id:3554 il:0 ic:0 fr:254
2005-02-03 15:06:54 1*4kB 5*8kB 35*16kB 19*32kB 5*64kB 3*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 4220kB)
2005-02-03 15:06:54 17*4kB 0*8kB 17*16kB 11*32kB 3*64kB 11*128kB 7*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 5108kB)
2005-02-03 15:06:54 2*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1016kB)
2005-02-03 15:06:54 Swap cache: add 3213506, delete 3213431, find 165543/969892, race 0+0
2005-02-03 15:06:54 7262 pages of slabcache
2005-02-03 15:06:54 332 pages of kernel stacks
2005-02-03 15:06:54 1936 lowmem pagetables, 0 highmem pagetables
2005-02-03 15:06:54 Free swap:            0kB
2005-02-03 15:06:54 524288 pages of RAM
2005-02-03 15:06:54 299008 pages of HIGHMEM
2005-02-03 15:06:54 10487 reserved pages
2005-02-03 15:06:54 3589 pages shared
2005-02-03 15:06:54 97 pages swap cached
2005-02-03 15:06:54 Out of Memory: Killed process 7641 (engine_par).

This went on every 5 seconds for almost 30 minutes, until one of our
admins crashed and rebooted the machine.

It turns out that the process "engine_par" (pid 7641) was in an
uninterruptible state, so the OOM killer tried to kill it but could
not.  Five seconds later the machine was still out of memory, so the
OOM killer was called again.  It picked process 7641 again, and the
kill failed again.  And on and on we go.
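
The loop described above can be illustrated with a toy model. This is only a sketch, not the kernel's code: the victim selection ("largest memory user"), the page counts, and the second task (pid 812, "sshd") are hypothetical stand-ins; only pid 7641 and the name engine_par come from the report.

```python
from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    name: str
    rss_pages: int                 # hypothetical memory footprint
    uninterruptible: bool = False  # stuck in uninterruptible sleep
    alive: bool = True


def pick_victim(tasks):
    """Toy selection (assumption): the live task using the most memory."""
    live = [t for t in tasks if t.alive]
    return max(live, key=lambda t: t.rss_pages) if live else None


def oom_round(tasks):
    """One OOM-killer pass; returns the pid that was signalled, if any."""
    victim = pick_victim(tasks)
    if victim is None:
        return None
    # The kill only takes effect once the task leaves uninterruptible
    # sleep, so a stuck task stays alive and keeps its memory.
    if not victim.uninterruptible:
        victim.alive = False
    return victim.pid


tasks = [
    Task(7641, "engine_par", 400_000, uninterruptible=True),
    Task(812, "sshd", 500),  # hypothetical bystander
]
signalled = [oom_round(tasks) for _ in range(3)]
print(signalled)  # [7641, 7641, 7641]
```

Because the stuck task never exits, it remains the best candidate on every pass, which matches the repeated "Killed process 7641" lines in the console output.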

Rik van Riel posted a patch to fix this looping problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0302.2/1713.html

This is also already upstream in the 2.4.29 kernel.

Comment 4 Don Howard 2005-03-01 18:18:14 UTC
Created attachment 111538
Don't oom kill TASK_UNINTERRUPTIBLE processes

Comment 5 Don Howard 2005-03-01 18:20:36 UTC
LLNL has indicated that they cannot readily test this.

Comment 6 Rik van Riel 2005-03-01 18:24:25 UTC
I don't agree with not killing TASK_UNINTERRUPTIBLE processes, for reasons
explained earlier.

Comment 8 Don Howard 2005-04-26 21:00:59 UTC
LLNL confirms that they have been running with the patch from comment #4 for
nearly 2 months and reports that it corrects the problem.

Comment 9 Ernie Petrides 2005-05-10 04:40:52 UTC
The patch in comment #4 has been rejected during code review.

An alternative patch (introducing a /proc/sys/vm/oom-kill sysctl)
has been posted for review on 9-May-2005.


Comment 10 Ernie Petrides 2005-05-14 05:10:13 UTC
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.4.EL).

Specifically, the fix introduces a sysctl /proc/sys/vm/oom-kill,
which defaults to 1 (enabling up to one concurrent OOM kill).  If
a sysadmin wishes to allow additional OOM kills while one is still
pending (presumably because it is stuck in an uninterruptible sleep),
then /proc/sys/vm/oom-kill should be set to the maximum number of
concurrent OOM kills to be allowed (or -1 for an unlimited number).
A value of 0 will prevent OOM kills altogether.
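
The semantics of the sysctl value described above can be sketched as a small decision function. This is an illustration of the stated policy only, not the kernel implementation: the function name and parameters are hypothetical, and treating every negative value like -1 is an assumption (the comment only specifies -1).

```python
def oom_kill_allowed(oom_kill: int, pending_kills: int) -> bool:
    """Return True if another OOM kill may start.

    oom_kill      -- value of /proc/sys/vm/oom-kill
    pending_kills -- OOM kills issued whose victims have not yet exited
    """
    if oom_kill == 0:
        return False  # 0 prevents OOM kills altogether
    if oom_kill < 0:
        return True   # -1 means unlimited (other negatives: assumption)
    return pending_kills < oom_kill  # at most oom_kill concurrent kills


print(oom_kill_allowed(1, 0))   # True: default, nothing pending
print(oom_kill_allowed(1, 1))   # False: default, one victim still stuck
print(oom_kill_allowed(2, 1))   # True: a second concurrent kill is permitted
```

With the default of 1, a victim stuck in uninterruptible sleep blocks further kills; setting /proc/sys/vm/oom-kill to 2 or more lets the OOM killer select another process while the first kill is still pending.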

Comment 24 Red Hat Bugzilla 2005-09-28 14:50:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html