Bug 435662 - lockup in shrink_zone when node out of memory
Status: CLOSED DEFERRED
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6.z
Hardware: All Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Larry Woodman
QA Contact: Martin Jenner
Keywords: OtherQA, ZStream
Depends On: 205722
 
Reported: 2008-03-03 00:15 EST by CAI Qian
Modified: 2009-11-23 09:47 EST (History)

Doc Type: Bug Fix
Last Closed: 2008-03-15 10:17:00 EDT


Attachments
reproducer (1.61 KB, text/plain)
2008-03-03 00:15 EST, CAI Qian
Comment 5 Larry Woodman 2008-03-05 14:32:22 EST
The /proc/sys/vm/pagecache tunable and associated kernel code will not prevent
the system from hanging and eventually panic()'ing when running the oom-kill.c
test program.  This test program forks off several processes/threads that
allocate all of the RAM in anonymous regions on a system without any swap
space.  That patch was intended to prevent real-world mixed workloads from
causing NUMA systems to hang and eventually panic() like this.  We are working
on upstream fixes such as restructuring the VM code so it cannot get into this
type of lock-up, and queued spinlocks so that it won't stay in this state if it
does.  Neither of these can be considered for RHEL4 because they would break
the kABI and be too unstable for a maintenance release.

You can probably prevent the panic and hang from happening by making the OOM
killer more aggressive, but at the cost of killing many more processes than it
should:

------------------------------------------------------------------------------
--- linux-2.6.9/mm/oom_kill.c.orig      2008-03-05 14:31:16.000000000 -0500
+++ linux-2.6.9/mm/oom_kill.c   2008-03-05 14:32:39.000000000 -0500
@@ -252,6 +252,9 @@ void out_of_memory(int gfp_mask)
        unsigned long now, since;
  
        spin_lock(&oom_lock);
+
+       goto justkill;
+
        now = jiffies;
        since = now - last;
        last = now;
@@ -292,6 +295,7 @@ void out_of_memory(int gfp_mask)
         */
        lastkill = now;
  
+justkill:
        printk("oom-killer: gfp_mask=0x%x\n", gfp_mask);
  
        /* oom_kill() sleeps */
------------------------------------------------------------------------------
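
To make the effect of the goto concrete, here is a rough user-space model of
the throttling that the patch skips.  It is only a sketch: the thresholds
(5 seconds of quiet, 1 second of sustained failures, 10 failures, 5 seconds
between kills) are recalled from the 2.6-era mainline out_of_memory() and have
not been checked against the RHEL4 source, and the names oom_throttle,
oom_should_kill and the HZ value are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HZ 1000UL	/* assumed tick rate for this sketch only */

/* Stand-in for the static counters the kernel function keeps between calls. */
struct oom_throttle {
	unsigned long first, last, count, lastkill;
};

/*
 * Returns true if a kill would be issued for an allocation failure at tick
 * "now".  With bypass set, every check is skipped, which is what the
 * "goto justkill" in the patch above does.
 */
bool oom_should_kill(struct oom_throttle *t, unsigned long now, bool bypass)
{
	unsigned long since;

	assert(t != NULL);
	if (bypass)
		goto justkill;

	since = now - t->last;
	t->last = now;

	/* A long gap since the last failure: start a new measurement window. */
	if (since > 5 * HZ)
		goto reset;

	/* Failures have not persisted for a full second yet. */
	since = now - t->first;
	if (since < HZ)
		return false;

	/* Too few failures in this window. */
	if (++t->count < 10)
		return false;

	/* A task was killed recently; give it time to exit. */
	since = now - t->lastkill;
	if (since < 5 * HZ)
		return false;

	t->lastkill = now;

justkill:
	t->first = now;
	t->count = 0;
	return true;

reset:
	t->first = now;
	t->count = 0;
	return false;
}
```

In this model, a caller that fails an allocation every 200 ticks is first
allowed a kill 2800 ticks after its first failure, while the bypassed path
reports a kill on every call.  That is why the patch trades the hang for many
more killed processes.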

Larry Woodman
Comment 6 Larry Woodman 2008-03-05 14:55:29 EST
Cai, just to be clear here, are you saying that the z-stream RHEL4U6 kernel with
the pagecache patch is worse than the standard RHEL4U6 kernel as far as hanging
and panic()'ing is concerned when running that repro-oom program, or are you
saying that it still acts the same?

Larry
Comment 7 CAI Qian 2008-03-05 17:34:12 EST
I'd say it is not a regression from the standard RHEL4U6 kernel to the Z-stream
RHEL4U6 kernel.  As far as I remember, in the standard RHEL4U6 kernel, the test
case did hang the machine but did not panic, because the NMI watchdog is not
working there.  In the Z-stream RHEL4U6 kernel, the NMI watchdog bug has been
fixed, so the panic is now seen.
Comment 8 CAI Qian 2008-03-15 10:17:00 EDT
Based on comment #5, I am going to close this bug as DEFERRED.
Comment 9 Larry Woodman 2009-11-23 09:47:35 EST
Use this program:

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
        unsigned long   siz;
        char    *ptr1;
        char    *i;
        int     option;

        /* Both arguments are required; argv[2] is read unconditionally below. */
        if (argc != 3) {
                printf("bad args, usage: memory <Log2 of memsize> <loop option: 0=exit,1=loop,2=spin>\n");
                exit(-1);
        }
        siz = 1UL << atol(argv[1]);
        /* Heap allocation of the same size; it is never touched or freed. */
        ptr1 = (char *)malloc(siz);
        if (!ptr1) {
                printf("Unable to malloc\n");
                exit(-1);
        }
        option = atoi(argv[2]);
        printf("option=%d\n", option);
        printf("mmaping %lu anonymous bytes\n", siz);
        ptr1 = (char *)mmap((void *)0, siz, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
        if (ptr1 == (char *)MAP_FAILED) {
                /* Stop here; touching a failed mapping would only segfault. */
                perror("mmap");
                exit(-1);
        }
        printf("touching %lu pages\n", siz/4096);
loop:
        /* Write one byte per 4096-byte page so that every page is faulted in. */
        for (i = ptr1; i < ptr1+siz-1; i += 4096) {
                *i = (char)'i';
        }
        if (option == 1)
                goto loop;      /* keep re-touching the pages forever */
        else if (option == 2)
                while (1);      /* hold the memory and spin */
        else
                printf("exiting\n");
        return 0;
}
