Escalated to Bugzilla from IssueTracker
LLNL reports: We recently had a machine on which the Out of Memory (OOM) killer continually looped trying to kill the same process. Here's a chunk of console output:

2005-02-03 15:06:49 Mem-info:
2005-02-03 15:06:49 Zone:DMA freepages: 1055 min: 1056 low: 1088 high: 1120
2005-02-03 15:06:49 Zone:Normal freepages: 1275 min: 1279 low: 4480 high: 6208
2005-02-03 15:06:49 Zone:HighMem freepages: 254 min: 255 low: 4672 high: 7008
2005-02-03 15:06:49 Free pages: 2584 ( 254 HighMem)
2005-02-03 15:06:49 ( Active: 477411/7004, inactive_laundry: 78, inactive_clean: 0, free: 2584 )
2005-02-03 15:06:49 aa:675 ac:5 id:1 il:0 ic:0 fr:1055
2005-02-03 15:06:49 aa:186476 ac:1792 id:78 il:0 ic:0 fr:1275
2005-02-03 15:06:49 aa:292693 ac:2773 id:0 il:0 ic:0 fr:254
2005-02-03 15:06:49 1*4kB 5*8kB 35*16kB 19*32kB 5*64kB 3*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 4220kB)
2005-02-03 15:06:49 15*4kB 0*8kB 17*16kB 11*32kB 3*64kB 11*128kB 7*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 5100kB)
2005-02-03 15:06:49 2*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1016kB)
2005-02-03 15:06:49 Swap cache: add 3213506, delete 3213425, find 165543/969891, race 0+0
2005-02-03 15:06:49 7266 pages of slabcache
2005-02-03 15:06:49 332 pages of kernel stacks
2005-02-03 15:06:49 1936 lowmem pagetables, 0 highmem pagetables
2005-02-03 15:06:49 Free swap: 0kB
2005-02-03 15:06:49 524288 pages of RAM
2005-02-03 15:06:49 299008 pages of HIGHMEM
2005-02-03 15:06:49 10487 reserved pages
2005-02-03 15:06:49 3596 pages shared
2005-02-03 15:06:49 103 pages swap cached
2005-02-03 15:06:49 Out of Memory: Killed process 7641 (engine_par).
2005-02-03 15:06:54 Mem-info:
2005-02-03 15:06:54 Zone:DMA freepages: 1055 min: 1056 low: 1088 high: 1120
2005-02-03 15:06:54 Zone:Normal freepages: 1277 min: 1279 low: 4480 high: 6208
2005-02-03 15:06:54 Zone:HighMem freepages: 254 min: 255 low: 4672 high: 7008
2005-02-03 15:06:54 Free pages: 2586 ( 254 HighMem)
2005-02-03 15:06:54 ( Active: 484416/79, inactive_laundry: 0, inactive_clean: 0, free: 2586 )
2005-02-03 15:06:54 aa:673 ac:7 id:1 il:0 ic:0 fr:1055
2005-02-03 15:06:54 aa:184536 ac:1794 id:1939 il:79 ic:0 fr:1277
2005-02-03 15:06:54 aa:289137 ac:2774 id:3554 il:0 ic:0 fr:254
2005-02-03 15:06:54 1*4kB 5*8kB 35*16kB 19*32kB 5*64kB 3*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 4220kB)
2005-02-03 15:06:54 17*4kB 0*8kB 17*16kB 11*32kB 3*64kB 11*128kB 7*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 5108kB)
2005-02-03 15:06:54 2*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1016kB)
2005-02-03 15:06:54 Swap cache: add 3213506, delete 3213431, find 165543/969892, race 0+0
2005-02-03 15:06:54 7262 pages of slabcache
2005-02-03 15:06:54 332 pages of kernel stacks
2005-02-03 15:06:54 1936 lowmem pagetables, 0 highmem pagetables
2005-02-03 15:06:54 Free swap: 0kB
2005-02-03 15:06:54 524288 pages of RAM
2005-02-03 15:06:54 299008 pages of HIGHMEM
2005-02-03 15:06:54 10487 reserved pages
2005-02-03 15:06:54 3589 pages shared
2005-02-03 15:06:54 97 pages swap cached
2005-02-03 15:06:54 Out of Memory: Killed process 7641 (engine_par).

This went on every 5 seconds for almost 30 minutes, until one of our admins crashed and rebooted the machine. It turns out the process "engine_par" (pid 7641) was in an uninterruptible state, so the OOM killer tries to kill it but cannot. Five seconds later the machine is still out of memory, so the OOM killer is invoked again. It picks process 7641 again, fails again, and on and on we go.
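For anyone hitting a similar loop: a minimal shell sketch (not part of the original report) for spotting processes stuck in uninterruptible sleep, which show up as state "D" in /proc/<pid>/stat. A process in that state is exactly what the OOM killer could not terminate here.

```shell
#!/bin/sh
# Sketch, assuming a Linux /proc filesystem: list processes whose
# state field in /proc/<pid>/stat is "D" (uninterruptible sleep).
for stat in /proc/[0-9]*/stat; do
    # Fields of /proc/<pid>/stat: pid, (comm), state, ...
    # Note: this simple whitespace split misreads comm names that
    # themselves contain spaces; good enough for a quick check.
    set -- $(cat "$stat" 2>/dev/null)
    [ $# -ge 3 ] || continue
    if [ "$3" = "D" ]; then
        echo "pid $1 comm $2 is in uninterruptible sleep"
    fi
done
```

On the machine above, this would have reported pid 7641 (engine_par). `ps aux` shows the same information in its STAT column.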
Rik van Riel posted a patch to fix this looping problem: http://www.ussg.iu.edu/hypermail/linux/kernel/0302.2/1713.html The fix is also already upstream in the 2.4.29 kernel.
Created attachment 111538 [details] Don't oom kill TASK_UNINTERRUPTIBLE processes
LLNL has indicated that they cannot readily test this.
I don't agree with not killing TASK_UNINTERRUPTIBLE processes, for reasons explained earlier.
LLNL confirms that they have been running with the patch from comment #4 for nearly 2 months and reports that it corrects the problem.
The patch in comment #4 has been rejected during code review. An alternative patch (introducing a /proc/sys/vm/oom-kill sysctl) has been posted for review on 9-May-2005.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.4.EL). Specifically, the fix introduces a sysctl /proc/sys/vm/oom-kill, which defaults to 1 (allowing at most one OOM kill in flight at a time). If a sysadmin wishes to allow additional OOM kills while one is still pending (presumably because the current victim is stuck in an uninterruptible sleep), then /proc/sys/vm/oom-kill should be set to the maximum number of concurrent OOM kills to be allowed (or -1 for an unlimited number). A value of 0 disables OOM kills altogether.
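For reference, the tunable described above would be exercised as follows; this is a sketch assuming the RHEL3 U6 kernel (2.4.21-32.4.EL or later) and root privileges, since the oom-kill sysctl does not exist on other kernels.

```shell
# Read the current setting; the default is 1
# (at most one OOM kill in flight at a time).
cat /proc/sys/vm/oom-kill

# Allow a second OOM kill while one is still pending, e.g. because
# the first victim is stuck in uninterruptible sleep, as engine_par
# was in this report:
echo 2 > /proc/sys/vm/oom-kill

# Allow an unlimited number of concurrent OOM kills:
echo -1 > /proc/sys/vm/oom-kill

# Disable OOM killing entirely:
echo 0 > /proc/sys/vm/oom-kill
```

To survive a reboot, the same value can be set via a sysctl.conf entry (vm.oom-kill = 2) rather than an echo at boot time.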
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html