Bug 205722

Summary: lockup in shrink_zone when node out of memory
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Issue Tracker <tao>
Assignee: Larry Woodman <lwoodman>
QA Contact: Brian Brock <bbrock>
CC: clalance, ddomingo, herbert.van.den.bergh, jbaron, jplans, jrfuller, nobody+bjmason, poelstra, tao
Keywords: OtherQA, ZStream
Target Milestone: ---
Target Release: ---
Fixed In Version: RHBA-2007-0791
Doc Type: Bug Fix
Last Closed: 2007-11-15 16:15:09 UTC
Bug Depends On: 245197    
Bug Blocks: 234251, 238901, 238902, 238904, 238905, 245198, 246621, 248141, 248673, 435662    
Attachments:
  Bootlog and crash messages (flags: none)
  Larry Woodman's test patch from IT 101232 (flags: none)

Description Issue Tracker 2006-09-08 08:30:33 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2006-09-08 08:30:48 UTC
We are seeing many occurrences of the following lockup when nodes run out of
memory without swap.


NMI Watchdog detected LOCKUP, CPU=0, registers:
CPU 0 
Modules linked in: perfctr(U) netdump(U) job(U) i2c_dev(U) i2c_core(U) ib_ipoib(U) rdma_ucm(U) rdma_cm(U) ib_addr(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) dm_mod(U) sd_mod(U) usb_storage(U) joydev(U) rtc(U) md(U) ohci_hcd(U) k8_edac(U) edac_mc(U) floppy(U) sata_nv(U) libata(U) scsi_mod(U) unionfs(U) nfs(U) lockd(U) sunrpc(U) e1000(U)
Pid: 25914, comm: lamp_DD Not tainted 2.6.9-50chaos
RIP: 0010:[<ffffffff802ed635>] <ffffffff802ed635>{.text.lock.spinlock+46}
RSP: 0018:00000103d25a18d8  EFLAGS: 00000086
RAX: 0000000000000000 RBX: 0000010300000800 RCX: 0000010300000808
RDX: 0000010300f245f0 RSI: 000000000000000e RDI: 0000010300000800
RBP: 0000000000000000 R08: 00000103d25a0000 R09: 0000000300000000
R10: 0000000300000000 R11: 0000000000000000 R12: 0000010300000780
R13: 00000103d25a1ab8 R14: 00000103d25a1b38 R15: 00000103d25a1bd8
FS:  0000000040200960(005b) GS:ffffffff804e2080(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a956fb000 CR3: 0000000000101000 CR4: 00000000000006e0
Process lamp_DD (pid: 25914, threadinfo 00000103d25a0000, task 00000101ff647550)
Stack: 0000010300000800 ffffffff80161370 409cf80000000000 0000000000000000 
       ffffffe000000020 0000000000000020 0000000000000000 000000000004ea5b 
       409cf80000000000 0000000000000246 
Call Trace:<ffffffff80161370>{shrink_zone+1456} <ffffffff8012eba5>{move_tasks+406} 
       <ffffffff8014732c>{keventd_create_kthread+0} <ffffffff802ec3df>{thread_return+0} 
       <ffffffff802ec437>{thread_return+88} <ffffffff80131396>{autoremove_wake_function+0} 
       <ffffffff80162047>{try_to_free_pages+318} <ffffffff8015a69b>{__alloc_pages+545} 
       <ffffffff8015d3c0>{do_page_cache_readahead+209} <ffffffff80157278>{filemap_nopage+338} 
       <ffffffff80165e80>{do_no_page+1045} <ffffffff8016622a>{handle_mm_fault+343} 
       <ffffffff8011fb99>{do_page_fault+545} <ffffffff80131396>{autoremove_wake_function+0} 
       <ffffffff8018fbc8>{dnotify_parent+34} <ffffffff80175c20>{vfs_read+252} 
       <ffffffff8010fd89>{error_exit+0} 

Code: 83 3b 00 7e f9 e9 91 fd ff ff f3 90 83 3b 00 7e f9 e9 cf fd 
Kernel panic - not syncing: nmi watchdog
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:74
invalid operand: 0000 [1] SMP 
CPU 0 
Modules linked in: perfctr(U) netdump(U) job(U) i2c_dev(U) i2c_core(U) ib_ipoib(U) rdma_ucm(U) rdma_cm(U) ib_addr(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) dm_mod(U) sd_mod(U) usb_storage(U) joydev(U) rtc(U) md(U) ohci_hcd(U) k8_edac(U) edac_mc(U) floppy(U) sata_nv(U) libata(U) scsi_mod(U) unionfs(U) nfs(U) lockd(U) sunrpc(U) e1000(U)
Pid: 25914, comm: lamp_DD Not tainted 2.6.9-50chaos
RIP: 0010:[<ffffffff801336f2>] <ffffffff801336f2>{panic+211}
RSP: 0018:ffffffff80471ea8  EFLAGS: 00010082
RAX: 000000000000002c RBX: ffffffff80301543 RCX: 0000000000000046
RDX: 0000000000008378 RSI: 0000000000000046 RDI: ffffffff803b9c60
RBP: ffffffff80472058 R08: 0000000000000007 R09: ffffffff80301543
R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000003d
R13: 00000103d25a1ab8 R14: 00000103d25a1b38 R15: 00000103d25a1bd8
FS:  0000000040200960(005b) GS:ffffffff804e2080(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a956fb000 CR3: 0000000000101000 CR4: 00000000000006e0
Process lamp_DD (pid: 25914, threadinfo 00000103d25a0000, task 00000101ff647550)
Stack: 0000003000000008 ffffffff80471f88 ffffffff80471ec8 0000000000000013 
       0000000000000000 0000000000000046 000000000000834c 0000000000000046 
       0000000000000007 ffffffff80303a98 
Call Trace:<ffffffff80110858>{show_stack+241} <ffffffff80110982>{show_registers+277} 
<ffffffff80110c89>{die_nmi+130} <ffffffff8011a58d>{nmi_watchdog_tick+210} 
<ffffffff80111560>{default_do_nmi+122} <ffffffff8011a643>{do_nmi+115} 
<ffffffff8011016b>{paranoid_exit+0} <ffffffff802ed635>{.text.lock.spinlock+46} 
 <EOE> <ffffffff80161370>{shrink_zone+1456} 
       <ffffffff8012eba5>{move_tasks+406} <ffffffff8014732c>{keventd_create_kthread+0} 
       <ffffffff802ec3df>{thread_return+0} <ffffffff802ec437>{thread_return+88} 
       <ffffffff80131396>{autoremove_wake_function+0} <ffffffff80162047>{try_to_free_pages+318} 
       <ffffffff8015a69b>{__alloc_pages+545} <ffffffff8015d3c0>{do_page_cache_readahead+209} 
       <ffffffff80157278>{filemap_nopage+338} <ffffffff80165e80>{do_no_page+1045} 
       <ffffffff8016622a>{handle_mm_fault+343} <ffffffff8011fb99>{do_page_fault+545} 
       <ffffffff80131396>{autoremove_wake_function+0} <ffffffff8018fbc8>{dnotify_parent+34} 
       <ffffffff80175c20>{vfs_read+252} <ffffffff8010fd89>{error_exit+0} 
       

Code: 0f 0b 9f 1a 30 80 ff ff ff ff 4a 00 31 ff e8 57 bf fe ff e8 
RIP <ffffffff801336f2>{panic+211} RSP <ffffffff80471ea8>
crash> bt
PID: 25914  TASK: 101ff647550       CPU: 0   COMMAND: "lamp_DD"
 #0 [ffffffff80471ce0] netpoll_start_netdump at ffffffffa02213b7
 #1 [ffffffff80471d10] die at ffffffff80110bf8
 #2 [ffffffff80471d30] do_invalid_op at ffffffff80110fc0
 #3 [ffffffff80471d68] panic at ffffffff801336f2
 #4 [ffffffff80471d70] release_console_sem at ffffffff8013401e
 #5 [ffffffff80471d90] vprintk at ffffffff8013424c
 #6 [ffffffff80471dc0] printk at ffffffff801342f6
 #7 [ffffffff80471df0] error_exit at ffffffff8010fd89
    [exception RIP: panic+211]
    RIP: ffffffff801336f2  RSP: ffffffff80471ea8  RFLAGS: 00010082
    RAX: 000000000000002c  RBX: ffffffff80301543  RCX: 0000000000000046
    RDX: 0000000000008378  RSI: 0000000000000046  RDI: ffffffff803b9c60
    RBP: ffffffff80472058   R8: 0000000000000007   R9: ffffffff80301543
    R10: 0000000000000000  R11: 0000000000000000  R12: 000000000000003d
    R13: 00000103d25a1ab8  R14: 00000103d25a1b38  R15: 00000103d25a1bd8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffffff80471ea0] panic at ffffffff801336de
 #9 [ffffffff80471f20] show_stack at ffffffff80110858
#10 [ffffffff80471f50] show_registers at ffffffff80110982
#11 [ffffffff80471f80] die_nmi at ffffffff80110c89
#12 [ffffffff80471fa0] nmi_watchdog_tick at ffffffff8011a58d
#13 [ffffffff80471fe0] default_do_nmi at ffffffff80111560
#14 [ffffffff80472040] do_nmi at ffffffff8011a643
    [exception RIP: .text.lock.spinlock+46]
    RIP: ffffffff802ed635  RSP: 00000103d25a18d8  RFLAGS: 00000086
    RAX: 0000000000000000  RBX: 0000010300000800  RCX: 0000010300000808
    RDX: 0000010300f245f0  RSI: 000000000000000e  RDI: 0000010300000800
    RBP: 0000000000000000   R8: 00000103d25a0000   R9: 0000000300000000
    R10: 0000000300000000  R11: 0000000000000000  R12: 0000010300000780
    R13: 00000103d25a1ab8  R14: 00000103d25a1b38  R15: 00000103d25a1bd8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
#15 [103d25a18d8] .text.lock.spinlock at ffffffff802ed635
#16 [103d25a18e0] shrink_zone at ffffffff80161370
#17 [103d25a1940] move_tasks at ffffffff8012eba5
#18 [103d25a19d0] thread_return at ffffffff802ec437
#19 [103d25a1ba0] try_to_free_pages at ffffffff80162047
#20 [103d25a1c50] __alloc_pages at ffffffff8015a69b
#21 [103d25a1cc0] do_page_cache_readahead at ffffffff8015d3c0
#22 [103d25a1d30] filemap_nopage at ffffffff80157278
#23 [103d25a1d90] do_no_page at ffffffff80165e80
#24 [103d25a1df0] handle_mm_fault at ffffffff8016622a
#25 [103d25a1e70] do_page_fault at ffffffff8011fb99
#26 [103d25a1ee0] dnotify_parent at ffffffff8018fbc8
#27 [103d25a1f10] vfs_read at ffffffff80175c20
#28 [103d25a1f50] error_exit at ffffffff8010fd89
    RIP: 0000002a96148d4c  RSP: 00000000401fd1e0  RFLAGS: 00010202
    RAX: 00000000fbad8004  RBX: 0000002a95936570  RCX: 00000000402000b0
    RDX: 0000000040200090  RSI: 00000000401fd3f8  RDI: 0000002a95936570
    RBP: 00000000401fd3f8   R8: 00000000401fd9c0   R9: 0000000000000000
    R10: 0000000040200101  R11: 0000002a96137e50  R12: 0000002a9632b080
    R13: 0000000000000000  R14: 0000000000000000  R15: 00000000401fd8c0
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  3913698      14.9 GB         ----
      FREE     6998      27.3 MB    0% of TOTAL MEM
      USED  3906700      14.9 GB   99% of TOTAL MEM
    SHARED        0            0    0% of TOTAL MEM
   BUFFERS        0            0    0% of TOTAL MEM
    CACHED    22763      88.9 MB    0% of TOTAL MEM
      SLAB        0            0    0% of TOTAL MEM

TOTAL HIGH        0            0    0% of TOTAL MEM
 FREE HIGH        0            0    0% of TOTAL HIGH
 TOTAL LOW  3913698      14.9 GB  100% of TOTAL MEM
  FREE LOW     6998      27.3 MB    0% of TOTAL LOW

TOTAL SWAP        0            0         ----
 SWAP USED        0            0  100% of TOTAL SWAP
 SWAP FREE        0            0    0% of TOTAL SWAP




This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 2 Issue Tracker 2006-09-08 08:30:53 UTC
Sorry, this is on x86_64, 4 socket, dual-core, 16GB RAM.

We have lower_zone_protection = 100

I will see about getting you a vmcore


This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 3 Issue Tracker 2006-09-08 08:31:06 UTC
usemem doesn't seem to trigger it. Maybe IB and the locking of user-space
pages in kernel space must be involved.
The bz:193695 patch is applied.


This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 4 Issue Tracker 2006-09-08 08:31:21 UTC
10:55 <neb> So usemem-mpi causes the problem reliably?
10:56 <neb> So I can see what you are saying, there are no spinlocks in
            shrink_zone
10:56 <neb> however it does call out to refill_inactive_zone, shrink_cache
            and throttle_vm_writeout.
10:57 <neb> I don't know why we wouldn't see the one holding the lock in
            the bt though.
10:57 <grondo> yeah, it must be one of those, not sure why they aren't in
               the stack trace
10:57 <neb> I suspect that it is NOT throttle_vm_writeout()
10:58 <neb> Inlining?
10:59 <neb> with the usemem code how much memory is pinned?
10:59 <grondo> ah, yeah are those inlined?
10:59 <neb> it doesn't look like it explicitly but who knows what evilness
            the compiler does in the end.
11:01 <neb> shrink_cache just recursively calls shrink_zone
11:04 <neb> refill_inactive_zone has spinlocks
11:06 <neb>    674          spin_lock_irq(&zone->lru_lock);
11:07 <neb> so the hypothesis is that for whatever allocation is trying to
            be done (and you have provided 3 examples)
11:09 <neb> The number of active pages is so large that the process of
            scanning the pages takes longer than the NMI allows.
11:10 <neb> So to the NMI watchdog it looks as though the program is stuck
            in the while loop starting at line 625
11:10 <neb> That would also explain the lost timer ticks since it is
            spin_lock_irq
11:11 <neb> I would say that a reasonable test might be to set the
            watchdog timer up to a higher value.
11:11 <neb> or to remove RAM from the machine and see if it works.
11:12 <neb> The RAM removal might reduce the time needed to scan all of
            the pages below the threshold for the watchdog



This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232
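
For reference, a simplified, hypothetical sketch of the code path being
discussed (this is illustrative only, not the actual RHEL4 2.6.9 source; the
function name with the _sketch suffix is made up): shrink_zone() ends up in
refill_inactive_zone(), which scans the zone's active list while holding
zone->lru_lock with interrupts disabled. With millions of active, pinned
pages and every core contending for the same lock, a CPU can sit in or
behind this critical section longer than the 5-second NMI watchdog window
even though it is still making forward progress.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/spinlock.h>

/* Illustrative sketch only; the real function does much more work. */
static void refill_inactive_zone_sketch(struct zone *zone, int nr_pages)
{
        LIST_HEAD(l_hold);              /* pages pulled off the active list */
        int pgscanned = 0;

        spin_lock_irq(&zone->lru_lock); /* interrupts stay off from here */
        while (pgscanned < nr_pages && !list_empty(&zone->active_list)) {
                struct page *page = list_entry(zone->active_list.prev,
                                               struct page, lru);
                list_move(&page->lru, &l_hold);
                pgscanned++;
        }
        zone->nr_active -= pgscanned;
        spin_unlock_irq(&zone->lru_lock); /* ticks/NMI pokes resume only here */

        /* ... examine l_hold and move reclaimable pages to the inactive list ... */
}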

Comment 5 Issue Tracker 2006-09-08 08:31:35 UTC
16:08 <grondo> ok, finally running my test with NMI timeout of 60s
16:09 <grondo> well, you were right
16:10 <grondo> OOM killer went off as expected and node eventually
recovered,
               though it was out to lunch for at least 45-90s
16:10 <grondo> So that is a big clue
16:10 <grondo> Thanks for the suggestion
16:12 <neb> cool.
16:12 <neb> Hmm...I'll try to get more info as to what to do about it.



This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 6 Issue Tracker 2006-09-08 08:31:39 UTC
Kent, 
Can we get some more serious VM people to weigh in on what to do about
this? I feel as though the implications of the various approaches to solving
this issue would escape me, so it would be most beneficial to draw on those
people in RH who have more experience working around issues in the VM.

The problem seems to be that when memory is tight, one or more CPUs
hang on the zone->lru_lock and then the NMI watchdog kicks off. The
contributing factors seem to be:
1) no swap
2) heavily SMP machine (8 cores); the 2nd, 3rd, or 6th CPU waiting on the
lock might have to wait a very long time.
3) very active memory utilization
4) 16GiB of memory, i.e. 4Mi pages, so scanning for active pages takes a
while.
5) only 5 secs before the NMI watchdog goes off
6) x86_64, i.e. everything is in one zone

There might be something to do with pinned memory and the IB interface, but
it could also just be that all that pinned memory makes it difficult to
find a page that can be put on the inactive list.

A piece of information that doesn't appear in the case log to this point
is that there were messages in the logs for the crashed nodes about missing
timer interrupts and drivers hogging the CPU with interrupts off. I don't
have the exact messages handy. This is consistent with the fact that the
lock is taken with interrupts disabled.

One data point that doesn't jibe with my hypothesis is the fact that the
dumps don't show refill_inactive_zone in their backtraces. Why would that
be? The only thing I could think of was that the compiler decided to inline
it even though it wasn't explicitly declared as an inline function.


This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 7 Issue Tracker 2006-09-08 08:31:52 UTC
I'm not going to have a lot of information on this other than what's been
presented. LLNL has already done most of the research and narrowing down
of what they think is going on, based on the analysis of their vmcores. At
this point, they'd like to get a resident VM expert to have a look and
weigh in on it.

I don't have a reproducer readily available...this was reproduced in an
HPC environment using MPI and infiniband hardware on x86-64 compute nodes
with 8 AMD cores and 16GB of RAM each.

If there's any additional information that you guys need let me or Ben
know.  


Issue escalated to Support Engineering Group by: kbaxley.
Internal Status set to 'Waiting on SEG'
Status set to: Waiting on Tech

This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 8 Issue Tracker 2006-09-08 08:31:56 UTC
I was able to reproduce this problem by adding

 mlockall (MCL_FUTURE); 

at the beginning of usemem.c. So the pinning of memory seems to be more
important than anything else related to IB.

Kent, if you'd like I could attach the new usemem.c, but the addition is
trivial.

mark


This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232
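
To make the reproducer concrete, here is a minimal stand-in for the kind of
modification described above. This is not LLNL's actual usemem.c (which is
not reproduced in this bug); it is just a hypothetical memory-stress loop
with the one-line mlockall(MCL_FUTURE) addition at the top. Running one copy
per core on a swapless node, as described later in comment 12, produces the
pressure pattern in question.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
        /* How many 1MiB chunks to allocate; default is a made-up value. */
        size_t chunks = (argc > 1) ? strtoul(argv[1], NULL, 0) : 1024;
        size_t i;

        /* The one-line addition under discussion: all future mappings are
         * locked into RAM, so nothing this process touches is reclaimable. */
        if (mlockall(MCL_FUTURE) != 0)
                perror("mlockall");

        for (i = 0; i < chunks; i++) {
                char *p = malloc(1 << 20);
                if (p == NULL)
                        break;
                memset(p, 0xa5, 1 << 20);  /* fault in and dirty every page */
        }
        fprintf(stderr, "allocated %zu MiB, holding...\n", i);
        pause();                           /* hold the locked memory until killed */
        return 0;
}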

Comment 10 Issue Tracker 2006-09-08 08:32:16 UTC
I agree it's unlikely to be IB related (unless IB is doing some
mlock'ing) - the mlockall is definitely a stress on the vm as those pages
are going to be completely unreclaimable by the vm. 

There's another IT with an uncannily similar footprint - 99166. 

Since raising the nmi timeout had positive impact, have they yet tested
lowering the amount of memory in the system? 

Do we have any idea what percentage of memory (or memory per node) is
locked when we're having a problem? 




This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 11 Issue Tracker 2006-09-08 08:32:32 UTC
Hi, Mark

I haven't been able to reproduce it with the modified usemem.c yet. 
Apparently you need a system with a lot of memory in it to get it to
trigger.  I tried this on a quad, dual-core opteron system with 8GB of RAM
(which is about the largest system I have access to at this time).  The OOM
killer launched and the system recovered.

I did, however, escalate it to engineering yesterday afternoon.  Here's
the response I got back from them...it looks like another customer out
there (Fujitsu) is hitting something similar (questions and comments
below):

We agree that the problem is unlikely to be IB related (unless IB is doing
some mlock'ing) - the mlockall is definitely a stress on the vm as those
pages are going to be completely unreclaimable by the vm. 

We've got another customer at Fujitsu with an uncannily similar problem.


Since raising the nmi timeout had positive impact, have they yet tested
lowering the amount of memory in the system? 

Does LLNL have any idea what percentage of memory (or memory per node) is
locked when they're having a problem?  


This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 12 Issue Tracker 2006-09-08 08:32:38 UTC
In the mlockall() case, definitely most of the memory on each node is
pinned.

I forgot to say that in my tests, I ran 1 copy of "usemem" per core on the
system, so on a quad-socket/dual-core node I was running 8 copies of usemem.
I don't know if running just one copy will trigger the problem.

MPI jobs on Infiniband definitely need to lock large regions of memory. Any
memory that might be used for communications cannot be swapped out, so the
IB drivers or the MPI implementation lock these pages. I'm not sure how much
of the application's memory is locked in this case, however. At this point,
it might be everything that is malloc'd, I'm just not sure.

I will try lowering the amount of RAM in the machine. I'm confident that at
some point this will relieve the problem.


This event sent from IssueTracker by nmurray  [SEG - Kernel]
 issue 101232

Comment 13 Norm Murray 2006-09-08 08:43:23 UTC
A very similar looking footprint has also been reported by Fujitsu Siemens,
though without the use of mlock'd memory. Tentatively linking that IT here as
well and wanting to get some conversation going with engineering on this, as
the common factor seems to be multiprocessor, multicore systems running
memory stress tests.

Comment 15 Larry Woodman 2006-09-11 16:45:57 UTC
Can someone grab a quick AltSysrq-M output so I can see the size of the
memory nodes?

Larry Woodman


Comment 16 Issue Tracker 2006-09-11 17:16:40 UTC
Working on getting that for you, Larry. However, as a preliminary note, the
nodes are 4-socket, dual-core machines with 16GiB of RAM spread evenly
amongst the NUMA nodes, so it should be 4GiB per node.

-ben



This event sent from IssueTracker by woodard 
 issue 101232

Comment 17 Issue Tracker 2006-09-11 18:46:25 UTC
2006-09-11 11:08:48 Mem-info:
2006-09-11 11:08:48 Node 3 DMA per-cpu: empty
2006-09-11 11:08:48 Node 3 Normal per-cpu:
2006-09-11 11:08:48 cpu 0 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 0 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 1 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 1 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 2 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 2 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 3 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 3 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 4 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 4 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 5 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 5 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 6 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 6 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 7 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 7 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 Node 3 HighMem per-cpu: empty
2006-09-11 11:08:48 Node 2 DMA per-cpu: empty
2006-09-11 11:08:48 Node 2 Normal per-cpu:
2006-09-11 11:08:48 cpu 0 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 0 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 1 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 1 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 2 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 2 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 3 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 3 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 4 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 4 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 5 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 5 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 6 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 6 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 7 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 7 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 Node 2 HighMem per-cpu: empty
2006-09-11 11:08:48 Node 1 DMA per-cpu: empty
2006-09-11 11:08:48 Node 1 Normal per-cpu:
2006-09-11 11:08:48 cpu 0 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 0 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 1 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 1 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 2 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 2 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 3 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 3 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 4 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 4 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 5 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 5 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 6 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 6 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 7 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 7 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 Node 1 HighMem per-cpu: empty
2006-09-11 11:08:48 Node 0 DMA per-cpu:
2006-09-11 11:08:48 cpu 0 hot: low 2, high 6, batch 1
2006-09-11 11:08:48 cpu 0 cold: low 0, high 2, batch 1
2006-09-11 11:08:48 cpu 1 hot: low 2, high 6, batch 1
2006-09-11 11:08:48 cpu 1 cold: low 0, high 2, batch 1
2006-09-11 11:08:48 cpu 2 hot: low 2, high 6, batch 1
2006-09-11 11:08:48 cpu 2 cold: low 0, high 2, batch 1
2006-09-11 11:08:48 cpu 3 hot: low 2, high 6, batch 1
2006-09-11 11:08:48 cpu 3 cold: low 0, high 2, batch 1
2006-09-11 11:08:48 cpu 4 hot: low 2, high 6, batch 1
2006-09-11 11:08:48 cpu 4 cold: low 0, high 2, batch 1
2006-09-11 11:08:48 cpu 5 hot: low 2, high 6, batch 1
2006-09-11 11:08:48 cpu 5 cold: low 0, high 2, batch 1
2006-09-11 11:08:48 cpu 6 hot: low 2, high 6, batch 1
2006-09-11 11:08:48 cpu 6 cold: low 0, high 2, batch 1
2006-09-11 11:08:48 cpu 7 hot: low 2, high 6, batch 1
2006-09-11 11:08:48 cpu 7 cold: low 0, high 2, batch 1
2006-09-11 11:08:48 Node 0 Normal per-cpu:
2006-09-11 11:08:48 cpu 0 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 0 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 1 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 1 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 2 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 2 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 3 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 3 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 4 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 4 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 5 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 5 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 6 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 6 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 cpu 7 hot: low 32, high 96, batch 16
2006-09-11 11:08:48 cpu 7 cold: low 0, high 32, batch 16
2006-09-11 11:08:48 Node 0 HighMem per-cpu: empty
2006-09-11 11:08:48
2006-09-11 11:08:48 Free pages:    15511408kB (0kB HighMem)
2006-09-11 11:08:48 Active:13026 inactive:9053 dirty:0 writeback:0
unstable:0 free:3877852 slab:6602 mapped:3965 pagetables:179
2006-09-11 11:08:48 Node 3 DMA free:0kB min:0kB low:0kB high:0kB active:0kB
inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 131000 131000
2006-09-11 11:08:48 Node 3 Normal free:4097208kB min:4192kB low:5240kB
high:6288kB active:2924kB inactive:4096kB present:4194300kB pages_scanned:0
all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 0 0
2006-09-11 11:08:48 Node 3 HighMem free:0kB min:128kB low:160kB high:192kB
active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 0 0
2006-09-11 11:08:48 Node 2 DMA free:0kB min:0kB low:0kB high:0kB active:0kB
inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 131000 131000
2006-09-11 11:08:48 Node 2 Normal free:4088368kB min:4192kB low:5240kB
high:6288kB active:21264kB inactive:11972kB present:4194300kB
pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 0 0
2006-09-11 11:08:48 Node 2 HighMem free:0kB min:128kB low:160kB high:192kB
active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 0 0
2006-09-11 11:08:48 Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB
inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 131000 131000
2006-09-11 11:08:48 Node 1 Normal free:4108168kB min:4192kB low:5240kB
high:6288kB active:6212kB inactive:4632kB present:4194300kB pages_scanned:0
all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 0 0
2006-09-11 11:08:48 Node 1 HighMem free:0kB min:128kB low:160kB high:192kB
active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 0 0
2006-09-11 11:08:48 Node 0 DMA free:11880kB min:16kB low:20kB high:24kB
active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable?
no
2006-09-11 11:08:48 protections[]: 0 105300 105300
2006-09-11 11:08:48 Node 0 Normal free:3205784kB min:3372kB low:4212kB
high:5056kB active:21704kB inactive:15512kB present:3375100kB
pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 0 0
2006-09-11 11:08:48 Node 0 HighMem free:0kB min:128kB low:160kB high:192kB
active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
2006-09-11 11:08:48 protections[]: 0 0 0
2006-09-11 11:08:48 Node 3 DMA: empty
2006-09-11 11:08:48 Node 3 Normal: 140*4kB 63*8kB 9*16kB 4*32kB 4*64kB
1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 999*4096kB = 4097208kB
2006-09-11 11:08:48 Node 3 HighMem: empty
2006-09-11 11:08:48 Node 2 DMA: empty
2006-09-11 11:08:48 Node 2 Normal: 164*4kB 40*8kB 6*16kB 4*32kB 2*64kB
0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 997*4096kB = 4088368kB
2006-09-11 11:08:48 Node 2 HighMem: empty
2006-09-11 11:08:48 Node 1 DMA: empty
2006-09-11 11:08:48 Node 1 Normal: 152*4kB 85*8kB 30*16kB 9*32kB 2*64kB
2*128kB 4*256kB 9*512kB 6*1024kB 1*2048kB 999*4096kB = 4108168kB
2006-09-11 11:08:48 Node 1 HighMem: empty
2006-09-11 11:08:48 Node 0 DMA: 4*4kB 7*8kB 4*16kB 3*32kB 2*64kB 2*128kB
2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11880kB
2006-09-11 11:08:48 Node 0 Normal: 0*4kB 3*8kB 0*16kB 12*32kB 8*64kB
4*128kB 3*256kB 1*512kB 0*1024kB 0*2048kB 782*4096kB = 3205784kB
2006-09-11 11:08:48 Node 0 HighMem: empty
2006-09-11 11:08:48 Swap cache: add 0, delete 0, find 0/0, race 0+0
2006-09-11 11:08:48 Free swap:            0kB
2006-09-11 11:08:48 3993596 pages of RAM
2006-09-11 11:08:48 79899 reserved pages
2006-09-11 11:08:48 4704 pages shared
2006-09-11 11:08:48 0 pages swap cached



This event sent from IssueTracker by woodard 
 issue 101232

Comment 21 Issue Tracker 2006-09-13 19:09:51 UTC
Hi, Larry

LLNL reports that running with numa=off doesn't resolve the issue.

However, it appears they're getting much better behavior when they bind
each task to a CPU and force memory affinity to the local node with
libnuma. The OOM killer goes off instead of the NMI watchdog.

So they're confused, then, as to why numa=off didn't help.



This event sent from IssueTracker by kbaxley 
 issue 101232

Comment 22 Sirius Rayner-Karlsson 2006-09-13 20:37:55 UTC
Created attachment 136206 [details]
Bootlog and crash messages

Hi there,

x-posted from IT#99166

FSC will try numa=off, but would like to know if there is a HOWTO or some docs
on the NUMA API, and whether we can elaborate on where the locked memory
regions come from.
They have tried the smp kernel, runlevel 3, and cpuspeed switched off, and it
still caused a panic again, which is what the attachment is about.

/A

Comment 23 Issue Tracker 2006-09-13 22:06:36 UTC
Larry, one thing that we did do was not remove the nmi watchdog but extend
the timeout. The system does get unresponsive for a while but it does OOM
kill in the end.

Mark is working on getting the Sysrq-W's for you. However, I don't
completely follow your logic. It seems to me that if multiple CPUs start
hanging on zone->lru_lock then we could easily exceed the number of seconds
for the watchdog. You are scanning through 16GB of pages that have been
intentionally kept active.

14:55 <grondo> fyi: binding to local memory doesn't work every time,
sometimes I still get an NMI timeout
14:55 <grondo> but it does seem to alleviate the hang somewhat



This event sent from IssueTracker by woodard 
 issue 101232

Comment 24 Issue Tracker 2006-09-13 23:09:24 UTC
Larry, 
When you tried to reproduce the problem, were you doing it on a machine with
at least 8 CPUs and at least 16GB RAM? Mark was just able to reproduce the
problem using his usemem.c with the addition of a mlockall() at the
beginning of the code on a stock kernel, 2.6.9-42.ELsmp. I strongly believe
that the crux of this problem is the high CPU count and the relatively
large amount of memory.

Mark and I have been looking at the problem and your comments and we have
been discussing it. We believe that you may be looking for a REAL deadlock.
We don't believe that is what is happening. We don't believe that a
processor is truly getting stuck. We believe that the problem is a slow
race between the watchdog timer and the lru_lock. If a processor doesn't
hit a code path that does a touch_nmi_watchdog() or nmi_watchdog_tick()
before the 5 seconds is up, *poof*. When we first started looking at this
problem, we were able to work around it by increasing the length of the
watchdog timer to 60 sec, and then things behaved as they should. IMHO this
is strong evidence that it isn't a case where a CPU is actually stuck; it
is a case where the CPU is not getting around to touching the watchdog
timer before it goes off.

As an experiment, we are adding a touch_nmi_watchdog() after the lru_lock
is dropped. We hypothesize that this will give an indication that
processors are moving through this code path but aren't getting around to
the petting that the nmi_watchdog needs to keep it from going off.
Mark is running that test now.


This event sent from IssueTracker by woodard 
 issue 101232
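
In terms of the sketch shown under comment 4, the experiment amounts to
petting the NMI watchdog as soon as the lru_lock is released. The snippet
below only illustrates where such a call would go; the helper name is made
up and this is not the actual test patch.

#include <linux/mmzone.h>
#include <linux/nmi.h>          /* touch_nmi_watchdog() */
#include <linux/spinlock.h>

/* Hypothetical helper, for illustration only: end the LRU critical section
 * and immediately tell the NMI watchdog that this CPU is making progress,
 * so a slow (but not deadlocked) reclaim pass does not trip the 5s timeout. */
static void leave_lru_critical_section(struct zone *zone)
{
        spin_unlock_irq(&zone->lru_lock);
        touch_nmi_watchdog();
}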

Comment 25 Issue Tracker 2006-09-13 23:45:19 UTC
Larry, 
Mark added some touch_nmi_watchdog() calls to refill_inactive_zone AFTER
the lru_lock was dropped. The result below suggests that my hypothesis, that
we are moving through the critical section with the lru_lock but failing to
reach a place where the watchdog is updated, is at least partially valid. I
hope that this is a useful clue for you and that it doesn't get you chasing
a wild goose dreamed up by a couple of 3rd rate kernel hackers. ;-)

16:32 <grondo> Those touch_nmi_watchdog() calls seem to have helped. We get
OOM kills instead of NMI timeout
16:32 <grondo> however there was other collateral damage
16:32 <grondo> TCP: time wait bucket table overflow
16:32 <grondo> got a bunch of those on the console
16:33 <grondo> Tomorrow perhaps I'll try removing touch_nmi_watchdog()
calls and see at what point NMI watchdog detects timeout again
16:35 <neb> very cool.
16:36 <neb> It definitely does suggest that we are on the right track at
least.
16:36 <grondo> system was hung for about 45s at least then came back
16:36 <grondo> yes



This event sent from IssueTracker by woodard 
 issue 101232

Comment 26 Issue Tracker 2006-09-14 05:07:02 UTC
larry,
OK, I see what you are saying. The sc->nr_to_scan gets set to the min of two
numbers, one of which is SWAP_CLUSTER_MAX, which is 32. Got it.
However unlikely, do you have another explanation for:
1) increasing the watchdog timer allows the machine to recover
2) touching the watchdog after dropping the lock allows the machine to
recover.

Maybe I'm being too theoretical here, but if the problem was a failure to
drop a lock somewhere, wouldn't that lead to a real deadlock from which
the machine wouldn't recover? If that were not the case, then you would
have to be unlocking the lock twice somewhere, and I thought that was a
no-no.

In my mind the fact that we have demonstrated that CPUs move through
refill_inactive_zone() and do drop the lock seems like a pretty
compelling argument that somewhere, with all the CPUs contending for the
lock and all the iterations through the critical section until enough
memory is freed, the watchdog timer is not getting updated. As unlikely as
it may be, I don't have a better explanation that satisfies the two items
above.

What I'm trying to figure out is where the nearest place is that the
watchdog timer actually gets poked. Once Mark is able to capture an
alt-sysrq-W, I will be interested in looking through the various backtraces
to see how many levels must be peeled away before I get to a function that
pokes the watchdog. I suspect that the sysrq-W will show that all CPUs are
heavily involved in trying to reclaim memory and several of them happen to
be hanging on the lru_lock.

I'm sorry, but I think that we may have forgotten to point out that these
are diskless, swapless nodes. That may be the key factor that has prevented
you from reproducing this problem. We seem to have no problem reproducing
it here.


This event sent from IssueTracker by woodard 
 issue 101232

Comment 27 Issue Tracker 2006-09-14 21:03:14 UTC
For the moment, setting the overcommit policy to 2 and tuning the
overcommit_ratio up higher seems like it might be a satisfactory solution.

A key part of this solution was understanding why this is a good potential
solution to the problem. If we had proposed setting the overcommit policy
earlier, I doubt it would have been accepted.

They are going to do additional testing to see if this will work with their
workloads. There is some concern that it will not. They ask that the IT
ticket stay open until more testing is done. If they run into further
problems, they will use this case to raise them.

As a general rule, we may want to document that the overcommit policy needs
to be changed from the default on swapless nodes.

(we may have dodged a bullet on this one guys -- no scary kernel change.
Keep your fingers crossed)


This event sent from IssueTracker by woodard 
 issue 101232
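
For concreteness, the tunables being discussed are
/proc/sys/vm/overcommit_memory (policy 2 means strict accounting) and
/proc/sys/vm/overcommit_ratio. The small sketch below just writes the two
values; the ratio of 95 and the helper name are assumptions for
illustration, since the comment above does not say what values were
actually chosen.

#include <stdio.h>

/* Write a single value into a /proc tunable; illustrative helper. */
static int write_tunable(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f == NULL) {
                perror(path);
                return -1;
        }
        fputs(val, f);
        fclose(f);
        return 0;
}

int main(void)
{
        /* Mode 2: strict overcommit. CommitLimit = swap + ratio% of RAM,
         * so on a swapless node allocations start failing with ENOMEM
         * before the box digs itself into the reclaim storm described
         * earlier. The ratio below is an illustrative guess. */
        write_tunable("/proc/sys/vm/overcommit_memory", "2");
        write_tunable("/proc/sys/vm/overcommit_ratio", "95");
        return 0;
}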

Comment 28 Issue Tracker 2006-09-15 11:50:43 UTC
Hi,

FSC have tried with numa=off, but still experience crashes, albeit with
different crash messages. I have enquired about access to hardware for our
engineering team to aid in debugging this, and am still waiting to hear back
on that.
FSC have tried this with, and without, swap - no difference. I have
suggested they try the overcommit policy trick, and also suggested
increasing the watchdog timer. I hope to hear back from them soon about
that.

Kind Regards,

Anders Karlsson



This event sent from IssueTracker by akarlsso 
 issue 99166

Comment 29 Larry Woodman 2006-09-15 18:07:37 UTC
FYI Ben, BZ 200885 is another case in which disabling the nmi watchdog timer via
adding "nmi_watchdog=0" to the bootline prevents an 8-way Opteron from crashing
due to lock starvation.

Larry Woodman

Comment 30 Issue Tracker 2006-09-25 22:05:52 UTC
It seems like the new test kernel is going into a hard lockup, to the point
that sysrqs don't seem to be responding, so gathering information is
difficult. Power control seems to be working, we just couldn't get any
sysrqs. The watchdog is the only source of info. It might be a deadlock with
IRQs disabled, I guess. It looks like it is stuck in a spinlock in
refill_inactive_zone()

15:03 <grondo> --- <exception stack> ---
15:03 <grondo> #15 [10208b39938] .text.lock.spinlock at ffffffff802edd69
15:03 <grondo> #16 [10208b39940] refill_inactive_zone at ffffffff80160eb2
15:03 <grondo> #17 [10208b399c0] get_writeback_state at ffffffff8015b730
15:03 <grondo> #18 [10208b399d0] get_dirty_limits at ffffffff8015b755
15:03 <grondo> #19 [10208b39a50] shrink_zone at ffffffff801614cc
15:03 <grondo> #20 [10208b39a80] fist_dprint_internal at ffffffffa00cde2d
15:03 <grondo> #21 [10208b39a90] fist_dprint_internal at ffffffffa00cde2d
15:03 <grondo> #22 [10208b39b00] fist_dprint_internal at ffffffffa00cde2d
15:03 <grondo> #23 [10208b39b10] vsnprintf at ffffffff801e2cf7
15:03 <grondo> #24 [10208b39b60] do_page_cache_readahead at
ffffffff8015d41d
15:03 <grondo> #25 [10208b39be0] __up_read at ffffffff801e1f8d
15:03 <grondo> #26 [10208b39c00] shrink_slab at ffffffff80160dbe
15:03 <grondo> #27 [10208b39c40] try_to_free_pages at ffffffff801621b9
15:03 <grondo> #28 [10208b39cf0] __alloc_pages at ffffffff8015a6a7
15:03 <grondo> #29 [10208b39d60] do_no_page at ffffffff80165e4f
15:03 <grondo> #30 [10208b39dc0] handle_mm_fault at ffffffff801663a2
15:03 <grondo> #31 [10208b39e00] follow_page at ffffffff80164a1f
15:03 <grondo> #32 [10208b39e40] get_user_pages at ffffffff80166aa7
15:03 <grondo> #33 [10208b39e90] make_pages_present at ffffffff80166d0e
15:03 <grondo> #34 [10208b39ec0] do_mmap_pgoff at ffffffff80169411
15:04 <grondo> #35 [10208b39f40] sys_mmap at ffffffff80115a75
15:04 <grondo> #36 [10208b39f80] system_call at ffffffff8010f262



This event sent from IssueTracker by woodard 
 issue 101232

Comment 32 Issue Tracker 2006-09-29 18:27:51 UTC
Larry, 

This works much, much better, but it doesn't work perfectly. We've done
about 20 tests so far. The first 10 tests all used 8 CPUs, and in every case
so far this has worked perfectly. The second 10 tests used 6 of the 8 CPUs.
In the case where we bound the processes to the first 6 CPUs, about 1 in 5
times we hit the NMI watchdog failure mode. We have yet to see a problem
when we don't bind the processes to the first 6 CPUs. So something still
seems to be odd with the case where 3 of the 4 nodes run out of memory and
then all the CPUs jump on the last node to get the memory they are asking
for. We think that the unbound case works because all four nodes are in
use. We are going to continue testing and will try to get you additional
information, but there still seem to be some bad corner cases.



This event sent from IssueTracker by woodard 
 issue 101232

Comment 33 Chris Lalancette 2006-11-13 20:08:03 UTC
Created attachment 141093 [details]
Larry Woodman's test patch from IT 101232

Comment 35 Larry Woodman 2006-11-14 19:22:45 UTC
JohnRay, what did Goldman start doing that suddenly caused this to start
happening?  This has been seen before at a few customer sites but I just want to
understand why they just started seeing it now.

And yes, the nmi_watchdog=0 workaround will prevent the system from panicking
when/if this happens.

Larry Woodman


Comment 36 Johnray Fuller 2006-11-20 22:40:37 UTC
Larry,

They did not roll out Dual Core DL585 systems until after the RHEL 3 DL585 crash
issue was resolved in late September. So, in effect, Four-way Dual Core RHEL 4
boxes are new to their environment. 

Most of the problems appear to be in the database group (mostly Sybase). I hope
that helps.

J

Comment 37 Issue Tracker 2006-11-28 00:46:29 UTC
16:27 <grondo> More info on node hangs during OOM kill
16:28 <grondo> looks like the same symptoms as before, all CPUs are stuck
               in refill_inactive_zone
16:28 <grondo> However, it only seems to happen ~10% of the time
16:28 <grondo> however, for a big job running on 256 nodes, that makes
               about 26 nodes go down in SLURM after an OOM kill event
16:31 <neb> so this is with lwoodman's patch or without?
16:32 <grondo> with the patch. Without, I think we hit this same problem
               100% of the time
16:32 <grondo> IIRC



This event sent from IssueTracker by woodard 
 issue 101232

Comment 38 Issue Tracker 2006-12-04 16:59:49 UTC
Larry,

I know this is one of your top items right now, but, when you get the
chance, can you provide me with an update on where you are / what you're
looking at with this?  The natives are beginning to get a little restless
over at LLNL.

Thanks.


This event sent from IssueTracker by kbaxley 
 issue 101232

Comment 39 Larry Woodman 2006-12-11 19:57:16 UTC
*** Bug 180041 has been marked as a duplicate of this bug. ***

Comment 40 Johnray Fuller 2006-12-13 23:20:07 UTC
Is there a known reproducer for this? I have a system in the lab ready to go,
but I wanted to make sure the usemem.c program in IT 101232 works as
advertised to reproduce the problem, and to verify the instructions posted
here:

https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=101232&gid=975&view_type=lifoall#eid_1059748

Thanks,
J

Comment 41 Ben Woodard 2006-12-14 00:45:21 UTC
So we have a reproducer that does trigger the problem with the nmi watchdog at
the longer timeout. However we do not have a reproducer which triggers the
problem even with Larry's patch.

Comment 42 Issue Tracker 2006-12-19 20:38:14 UTC
Hi Dave,

It seems like we've got two duplicate Issue Tracker tickets open.  I'm
going to close this one and work will continue on the other (98198).  I've
subscribed you to that ticket, so you'll receive e-mail update
notifications.

Please feel free to re-open this ticket if you don't think it's
appropriate to close it at this time.

Thanks.
-- Bryan

Internal Status set to 'Resolved'
Status set to: Closed by Tech
Resolution set to: 'NotABug'

This event sent from IssueTracker by bjmason 
 issue 92095

Comment 54 Larry Woodman 2007-01-18 18:11:31 UTC
I am currently testing some of the upstream changes that were made in this
area and that I backported to RHEL4.  Basically the system is reactivating
too many pages, and when it gets into a serious memory deficit all of the
memory is active and therefore all the CPUs need to deactivate pages at the
same time.  This causes all cores to try to acquire the same lock, and the
system either takes an NMI watchdog panic or it hangs.

Larry Woodman




Comment 55 Ben Woodard 2007-01-18 19:08:53 UTC
Larry,
Sounds like good progress. In addition to this problem, we are having two
other, possibly related, problems. How they are related to each other and
how they may be related to this issue is not yet clear to me. They could
very well be three different manifestations of the same problem, or they
could be three distinct problems. I really can't tell at the moment. In any
case, I think it is worth mentioning them to you in case they might give you
insight into the problem.
1) We have that problem with Lustre where all the memory ends up being
active. The Lustre guys are working on that now.
2) We have reports that a similar problem may be happening on management
nodes which don't run Lustre and which do have swap. I don't have many
details regarding this problem at the moment. That is one of the things that
I'm going to be working on today.
3) There is another problem where lots of pages seem to be active and dirty
but pdflush doesn't seem to be running to clean these pages.

The common thread seems to be too many active pages.

-ben

Comment 56 Larry Woodman 2007-01-18 19:20:53 UTC
That's right, and the reasons for that are: 1) anonymous pages start out
active and get re-activated when the system reclaims memory, unless that
activity is really heavy or unless you increase /proc/sys/vm/swappiness;
2) pagecache hits and all filesystem writes activate the associated page.

BTW Ben, do you have any vmcores from the NMI watchdog panic?

Larry


Comment 57 Ben Woodard 2007-01-19 00:06:08 UTC
Sorry Larry, no I don't. Since this was happening on production machines, we
quickly worked around the problem by increasing the watchdog timeout, so we
don't hit the problem the same way. We also have your patch in our kernel. 

Comment 62 Larry Woodman 2007-03-02 19:56:03 UTC
I have a test kernel to try out if anyone wants to run it.  It's in
>>>http://people.redhat.com/~lwoodman/GS/

I added a new /proc/sys/vm/pagecache tunable parameter that needs to be
lowered to 10 so the system favors reclaiming unmapped pagecache memory
over mapped memory.  The default is 100, so if they don't set this new
parameter the kernel won't do anything new.

Larry

Comment 67 Jason Baron 2007-05-08 18:09:59 UTC
committed in stream U6 build 55.1. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 73 Don Domingo 2007-08-23 01:47:22 UTC
adding to RHEL4.6 release notes (under "Kernel-Related Updates"):

<quote>
/proc/sys/vm/drop_caches added to clear pagecache and slabcache on demand
</quote>

please advise if any revisions are in order. thanks!

Comment 74 John Poelstra 2007-08-28 23:52:51 UTC
A fix for this issue should have been included in the packages contained in the
RHEL4.6 Beta released on RHN (also available at partners.redhat.com).  

Requested action: Please verify that your issue is fixed to ensure that it is
included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 76 John Poelstra 2007-09-05 22:19:58 UTC
A fix for this issue should have been included in the packages contained in 
the RHEL4.6-Snapshot1 on partners.redhat.com.  

Requested action: Please verify that your issue is fixed to ensure that it is 
included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed, 
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent 
symptoms of the problem you are having and change the status of the bug to 
FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test 
results to Issue Tracker.  If you need assistance accessing 
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 77 John Poelstra 2007-09-12 00:39:01 UTC
A fix for this issue should be included in RHEL4.6-Snapshot2--available soon on
partners.redhat.com.  

Please verify that your issue is fixed to ensure that it is included in this
update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test
results to Issue Tracker.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 78 John Poelstra 2007-09-20 04:28:31 UTC
A fix for this issue should have been included in the packages contained in the
RHEL4.6-Snapshot3 on partners.redhat.com.  

Please verify that your issue is fixed to ensure that it is included in this
update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test
results to Issue Tracker.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.


Comment 79 John Poelstra 2007-09-26 23:34:15 UTC
A fix for this issue should be included in the packages contained in
RHEL4.6-Snapshot4--available now on partners.redhat.com.  

Please verify that your issue is fixed ASAP to ensure that it is included in
this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test
results to Issue Tracker.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 80 John Poelstra 2007-10-05 02:56:23 UTC
A fix for this issue should be included in the packages contained in
RHEL4.6-Snapshot5--available now on partners.redhat.com.  

Please verify that your issue is fixed ASAP to ensure that it is included in
this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test
results to Issue Tracker.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 83 errata-xmlrpc 2007-11-15 16:15:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html