171321 – NMI Watchdog triggers on EM64T-based systems under heavy load

Bug 171321 - NMI Watchdog triggers on EM64T-based systems under heavy load

Summary: NMI Watchdog triggers on EM64T-based systems under heavy load

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	ia32e
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Ernie Petrides
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	170417
TreeView+	depends on / blocked

Reported:	2005-10-20 19:17 UTC by Jeff Burke
Modified:	2007-11-30 22:07 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-04-19 21:56:24 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Netdump console output, While running 2.4.21-40.3.badelf2.ia32e kernel (29.72 KB, text/plain) 2006-03-07 14:41 UTC, Jeff Burke	no flags	Details
View All

Description Jeff Burke 2005-10-20 19:17:08 UTC

Description of problem:

NMI Watchdog detected LOCKUP on CPU2, rip ffffffff80123682, registers:

Version-Release number of selected component (if applicable):
Kernel 2.4.21-37.6.EL

How reproducible:
Often

Steps to Reproduce:
1. Run the stress kernel RPM ona RHEL3 U7 kernel
  
Actual results:
[...network console startup...]
mtrr: type mismatch for c8000000,1000000 old: write-back new: write-combining
rw_swap_page: bad swap file
Unable to find swap-space signature
NMI Watchdog detected LOCKUP on CPU2, rip ffffffff80123682, registers:
CPU 2
Pid: 18547, comm: dt Not tainted
RIP: 0010:[<ffffffff80123682>]{.text.lock.sched+26}
RSP: 0018:00000100bd96fe78  EFLAGS: 00000082
RAX: 0000000000000002 RBX: 00000100a4490000 RCX: 0000000000000000
RDX: ffffffff805ef800 RSI: 0000000000000001 RDI: 00000100a4490000
RBP: 00000100bd96fea8 R08: 0000007fbfffc364 R09: 0000000000000000
R10: 0000000000517040 R11: 0000000000000000 R12: ffffffff805efa00
R13: 0000000000000000 R14: 00000100bd96fe78 R15: 0000000000000000
FS:  0000002a958d34c0(0000) GS:ffffffff805e8e80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000d4a8c18 CR3: 0000000017e89000 CR4: 00000000000006e0

Call Trace:  <EOE> [<ffffffff801218bf>]{__wake_up+111}
       [<ffffffff8016cdae>]{pipe_read+526} [<ffffffff80160e12>]{sys_read+178}
       [<ffffffff801102a7>]{system_call+119}
Process dt (pid: 18547, stackpage=100bd96f000)
Stack: 00000100bd96fe78 0000000000000018 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:  <EOE> [<ffffffff801218bf>]{__wake_up+111}
       [<ffffffff8016cdae>]{pipe_read+526} [<ffffffff80160e12>]{sys_read+178}
       [<ffffffff801102a7>]{system_call+119}

Code: 7e f5 e9 6e c4 ff ff 80 be 00 fa 5e 80 00 f3 90 7e f5 e9 9b

CPU#2 is executing netdump.
CPU#3 is frozen.
< netdump activated - performing handshake with the server. >

Pid: 18547, comm: dt Not tainted
RIP: 0010:[<ffffffff80123682>]{.text.lock.sched+26}
RSP: 0018:00000100bd96fe78  EFLAGS: 00000082
RAX: 0000000000000002 RBX: 00000100a4490000 RCX: 0000000000000000
RDX: ffffffff805ef800 RSI: 0000000000000001 RDI: 00000100a4490000
RBP: 00000100bd96fea8 R08: 0000007fbfffc364 R09: 0000000000000000
R10: 0000000000517040 R11: 0000000000000000 R12: ffffffff805efa00
R13: 0000000000000000 R14: 00000100bd96fe78 R15: 0000000000000000
FS:  0000002a958d34c0(0000) GS:ffffffff805e8e80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000d4a8c18 CR3: 0000000017e89000 CR4: 00000000000006e0

Call Trace:  <EOE> [<ffffffff801218bf>]{__wake_up+111}
       [<ffffffff8016cdae>]{pipe_read+526} [<ffffffff80160e12>]{sys_read+178}
       [<ffffffff801102a7>]{system_call+119}

                                 free                        sibling
  task                 PC        stack   pid father child younger older
init          S ffffffff801577c9  1568     1      0     6       2       (NOTLB)

Call Trace: [<ffffffff80131d7f>]{schedule_timeout+175}
       [<ffffffff80131cc0>]{process_timeout+0} [<ffffffff80175576>]{do_select+550}
       [<ffffffff801759a3>]{sys_select+947} [<ffffffff801102a7>]{system_call+119}

migration/0   S ffffffff801216cc  4600     2      0             3     1 (L-TLB)

Call Trace: [<ffffffff801233a5>]{migration_task+421}
       [<ffffffff80110cc1>]{child_rip+8} [<ffffffff80123200>]{migration_task+0}
       [<ffffffff80110cb9>]{child_rip+0}
migration/1   S ffffffff8011f91c  4600     3      0             4     2 (L-TLB)

Call Trace: [<ffffffff801233a5>]{migration_task+421}
       [<ffffffff80110cc1>]{child_rip+8} [<ffffffff80123200>]{migration_task+0}
       [<ffffffff80110cb9>]{child_rip+0}
migration/2   S ffffffff801216cc  4600     4      0             5     3 (L-TLB)

Call Trace: [<ffffffff801233a5>]{migration_task+421}
       [<ffffffff80110cc1>]{child_rip+8} [<ffffffff80123200>]{migration_task+0}
       [<ffffffff80110cb9>]{child_rip+0}
migration/3   S ffffffff801216cc  4600     5      0                   4 (L-TLB)

Call Trace: [<ffffffff801233a5>]{migration_task+421}
       [<ffffffff80110cc1>]{child_rip+8} [<ffffffff80123200>]{migration_task+0}
       [<ffffffff80110cb9>]{child_rip+0}
keventd       S ffffffff801216cc  4320     6      1             7       (L-TLB)

Call Trace: [<ffffffff801394e0>]{context_thread+240}
       [<ffffffff80110cc1>]{child_rip+8} [<ffffffff801393f0>]{context_thread+0}
       [<ffffffff80110cb9>]{child_rip+0}
ksoftirqd/0   S 0000000000000212  4648     7      1             8     6 (L-TLB)

Call Trace: [<ffffffff8012d05f>]{ksoftirqd+207} [<ffffffff80110cc1>]{child_rip+8}
       [<ffffffff8012cf90>]{ksoftirqd+0} [<ffffffff80110cb9>]{child_rip+0}

ksoftirqd/1   S ffffffff8012063c  4648     8      1             9     7 (L-TLB)

Call Trace: [<ffffffff8012d05f>]{ksoftirqd+207} [<ffffffff80110cc1>]{child_rip+8}
       [<ffffffff8012cf90>]{ksoftirqd+0} [<ffffffff80110cb9>]{child_rip+0}

ksoftirqd/2   S ffffffff801216cc  4648     9      1            10     8 (L-TLB)

Call Trace: [<ffffffff8012d05f>]{ksoftirqd+207} [<ffffffff80110cc1>]{child_rip+8}
       [<ffffffff8012cf90>]{ksoftirqd+0} [<ffffffff80110cb9>]{child_rip+0}


Expected results:


Additional info:
   I do not believe this is a regression. It is only recently that I modified
the kernel to trigger the NMI watchdog by adding argumets to the kernel boot
line. Before adding the boot option the system would just hang (deadlock).

   I have seen this behavior for several releases now. But this is the first
time I have been able to collect data. 

   Here is a pointer to the vmcore file
ndnc-1.lab.boston.redhat.com:/var/crash/192.168.79.125-2005-10-20-09:44
In that directory is also a console log file.

   I can assist is setting this test up to reproduce.

Comment 1 Ernie Petrides 2005-10-20 21:51:29 UTC

This bug is having a serious negative impact on stress testing EM64T
systems.  Adding to U7 Proposed list and raising severity to "high".

Comment 2 Ernie Petrides 2005-10-25 02:44:23 UTC

A patch to fix the default setting for nmi_watchdog on EM64T systems
was just posted for internal review.  It will serve as an aid to debug
this problem further (obviating the need for special kernel boot options).

Comment 3 Ernie Petrides 2005-10-27 02:32:03 UTC

A fix for the default nmi_watchdog setting on EM64T systems has just
been  committed to the RHEL3 U7 patch pool this evening (in kernel
version 2.4.21-37.7.EL).

Note that this bug should remain in ASSIGNED state until the EM64T
deadlock problem (formerly hangs, now NMI watchdog triggers) is resolved.

Comment 4 Larry Woodman 2005-11-03 21:03:59 UTC

Jeff, we tracked the hang Shak was having on EM64T down to the mptfusion driver
update that was included in RHEL3-U6.  Specifically the upgrade from 2.05.16.01
to 2.06.16.01 causes his system to hang:

---system hangs with this driver-------------
Fusion MPT base driver 2.06.16.01
Fusion MPT SCSI Host driver 2.06.16.01

---system doesnt hang with this driver----
Fusion MPT base driver 2.05.16.02
Fusion MPT SCSI Host driver 2.05.16.02

Can you try backing up the mptfusuion drivers to what was used in RHEL3-U5 and
see if this problem goes away as well?

Thanks, Larry

Comment 5 Jeff Burke 2005-11-03 21:25:33 UTC

Larry,
  The systems I can duplicate the issue on do not load the mptfusion driver.
Also I don not believe this is a regression I have been seeing the issue since U5
But, have only started seeing the NMI triggers because they were broken and are
now fixed.

Comment 8 Jeff Burke 2005-11-04 17:27:51 UTC

Here is the latest nmi from pe2850he: 

Unable to handle kernel paging request at virtual address 000001060014bb88
 printing rip:
ffffffff8014dcf4
PML4 8063 PGD 0
Oops: 0002
CPU 2
Pid: 27720, comm: rawio Not tainted
RIP: 0010:[<ffffffff8014dcf4>]{kmem_cache_free+116}
RSP: 0000:00000100a0bd5d28  EFLAGS: 00010446
RAX: 0000010113ad6850 RBX: 0000010017e6e008 RCX: 0000000000000000
RDX: 00000000bd05bb70 RSI: 0000010017e6dc18 RDI: 0000010017e6da20
RBP: 0000010017e6e000 R08: 0000010017e6e008 R09: 0000000000000af0
R10: 0000000000000000 R11: ffffff0000000000 R12: 000000000000003f
R13: 0000010003a11730 R14: 0000000000000002 R15: 000001013f9d3800
FS:  0000002a958d34c0(0000) GS:ffffffff805e9040(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000001060014bb88 CR3: 0000000017e89000 CR4: 00000000000006e0

Call Trace: [<ffffffff8014dced>]{kmem_cache_free+109}
       [<ffffffff80166132>]{try_to_free_buffers+178}
[<ffffffffa005847d>]{:jbd:journal_flushpage_Rsmp_b97420e3+269}
       [<ffffffff801432e5>]{do_flushpage+37}
[<ffffffff80143302>]{truncate_complete_page+18}
       [<ffffffff801434e0>]{truncate_list_pages+416}
[<ffffffff801435ba>]{truncate_inode_pages+90}
       [<ffffffffa0078fe0>]{:ext3:ext3_sops+0} [<ffffffff8017c516>]{iput+198}
       [<ffffffff801790ef>]{dput+303} [<ffffffff801620b7>]{__fput+231}
       [<ffffffff80142e15>]{exit_mmap+261} [<ffffffff80123ebc>]{mmput+92}
       [<ffffffff8012ae88>]{do_exit+456} [<ffffffff8012b259>]{do_group_exit+233}
       [<ffffffff801102a7>]{system_call+119}
Process rawio (pid: 27720, stackpage=100a0bd5000)
Stack: 00000100a0bd5d28 0000000000000000 ffffffff8014dced 0000010113ad6850
       0000000000000616 0000010113ad6850 0000010113ad6850 0000010113ad6850
       ffffffff80166132 000001000400ae78 0000010003a11730 0000010113ad6850
       0000010113ad6850 0000000000001000 ffffffffa005847d 000001013f9d3910
       0000000131989ad8 0000010113ad6850 0000000000000000 0000010003a11730
       0000010003a11730 0000000000001492 00000101381e76a0 0000000000000000
       00000100a0bd5e54 0000000000000000 ffffffff801432e5 0000000000000000
       ffffffff80143302 0000010003a11730 ffffffff801434e0 00000001032291b0
       0000000000000000 0000000000000000 00000101381e7690 00000101381e76b0
       00000101381e76a0 0000000000000000 ffffffff801435ba 00000100ae532c00
Call Trace: [<ffffffff8014dced>]{kmem_cache_free+109}
       [<ffffffff80166132>]{try_to_free_buffers+178}
[<ffffffffa005847d>]{:jbd:journal_flushpage_Rsmp_b97420e3+269}
       [<ffffffff801432e5>]{do_flushpage+37}
[<ffffffff80143302>]{truncate_complete_page+18}
       [<ffffffff801434e0>]{truncate_list_pages+416}
[<ffffffff801435ba>]{truncate_inode_pages+90}
       [<ffffffffa0078fe0>]{:ext3:ext3_sops+0} [<ffffffff8017c516>]{iput+198}
       [<ffffffff801790ef>]{dput+303} [<ffffffff801620b7>]{__fput+231}
       [<ffffffff80142e15>]{exit_mmap+261} [<ffffffff80123ebc>]{mmput+92}
       [<ffffffff8012ae88>]{do_exit+456} [<ffffffff8012b259>]{do_group_exit+233}
       [<ffffffff801102a7>]{system_call+119}

Code: 48 89 44 d5 08 ff 45 00 eb 12 ba 01 00 00 00 48 89 e6 e8 b5

Kernel panic: Fatal exception
NMI Watchdog detected LOCKUP on CPU2, rip ffffffff80166cf1, registers:
CPU 2
Pid: 27720, comm: rawio Not tainted
RIP: 0010:[<ffffffff80166cf1>]{.text.lock.buffer+7}
RSP: 0000:00000100a0bd5a18  EFLAGS: 00000082
RAX: 00000100a0bd4000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000006c48 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000ffffffff R11: 0000000000000010 R12: 0000000000000000
R13: 000001060014bb88 R14: 0000000000000002 R15: 00000100a0bd5c78
FS:  0000002a958d34c0(0000) GS:ffffffff805e9040(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000001060014bb88 CR3: 0000000017e89000 CR4: 00000000000006e0

Call Trace:  <EOE> [<ffffffff801626ee>]{sync_buffers+30}
       [<ffffffff8016285a>]{fsync_dev+10} [<ffffffff8016299b>]{sys_sync+11}
       [<ffffffff80125f48>]{panic+296} [<ffffffff80111510>]{show_trace+656}
       [<ffffffff80111668>]{show_stack+216} [<ffffffff801117c9>]{show_registers+329}
       [<ffffffff80111965>]{die+245} [<ffffffff801a9b4d>]{do_page_fault+1021}
       [<ffffffff8012170c>]{thread_return+0} [<ffffffff80110b06>]{error_exit+0}
       [<ffffffff8014dcf4>]{kmem_cache_free+116}
[<ffffffff8014dced>]{kmem_cache_free+109}
       [<ffffffff80166132>]{try_to_free_buffers+178}
[<ffffffffa005847d>]{:jbd:journal_flushpage_Rsmp_b97420e3+269}
       [<ffffffff801432e5>]{do_flushpage+37}
[<ffffffff80143302>]{truncate_complete_page+18}
       [<ffffffff801434e0>]{truncate_list_pages+416}
[<ffffffff801435ba>]{truncate_inode_pages+90}
       [<ffffffffa0078fe0>]{:ext3:ext3_sops+0} [<ffffffff8017c516>]{iput+198}
       [<ffffffff801790ef>]{dput+303} [<ffffffff801620b7>]{__fput+231}
       [<ffffffff80142e15>]{exit_mmap+261} [<ffffffff80123ebc>]{mmput+92}
       [<ffffffff8012ae88>]{do_exit+456} [<ffffffff8012b259>]{do_group_exit+233}
       [<ffffffff801102a7>]{system_call+119}
Process rawio (pid: 27720, stackpage=100a0bd5000)
Stack: 00000100a0bd5a18 0000000000000000
Call Trace:  <EOE> [<ffffffff801626ee>]{sync_buffers+30}
       [<ffffffff8016285a>]{fsync_dev+10} [<ffffffff8016299b>]{sys_sync+11}
       [<ffffffff80125f48>]{panic+296} [<ffffffff80111510>]{show_trace+656}
       [<ffffffff80111668>]{show_stack+216} [<ffffffff801117c9>]{show_registers+329}
       [<ffffffff80111965>]{die+245} [<ffffffff801a9b4d>]{do_page_fault+1021}
       [<ffffffff8012170c>]{thread_return+0} [<ffffffff80110b06>]{error_exit+0}
       [<ffffffff8014dcf4>]{kmem_cache_free+116}
[<ffffffff8014dced>]{kmem_cache_free+109}
       [<ffffffff80166132>]{try_to_free_buffers+178}
[<ffffffffa005847d>]{:jbd:journal_flushpage_Rsmp_b97420e3+269}
       [<ffffffff801432e5>]{do_flushpage+37}
[<ffffffff80143302>]{truncate_complete_page+18}
       [<ffffffff801434e0>]{truncate_list_pages+416}
[<ffffffff801435ba>]{truncate_inode_pages+90}
       [<ffffffffa0078fe0>]{:ext3:ext3_sops+0} [<ffffffff8017c516>]{iput+198}
       [<ffffffff801790ef>]{dput+303} [<ffffffff801620b7>]{__fput+231}
       [<ffffffff80142e15>]{exit_mmap+261} [<ffffffff80123ebc>]{mmput+92}
       [<ffffffff8012ae88>]{do_exit+456} [<ffffffff8012b259>]{do_group_exit+233}
       [<ffffffff801102a7>]{system_call+119}

Code: f3 90 7e f5 e9 c9 b8 ff ff 80 3d 3f 61 2d 00 00 f3 90 7e f5

console shuts up ...

Comment 9 Jeff Burke 2005-11-04 17:28:37 UTC

Larry,

Comment 11 Ernie Petrides 2006-03-03 03:52:13 UTC

Please try test kernel in kernel-2.4.21-40.3.EL.badelf.2.ia32e.rpm from
redline:~petrides/tmp.  It has debugging printk()'s that will identify
if a potentially troublesome code path was taken.

Comment 12 Jeff Burke 2006-03-07 14:41:01 UTC

Created attachment 125754 [details]
Netdump console output, While running 2.4.21-40.3.badelf2.ia32e kernel

Ernie,
  Here is the output of the netdump. I do not see the debug signature in the
out put. That leads me to believe that this may be a different issue.

Jeff

Comment 14 Jeff Burke 2006-03-14 01:32:20 UTC

Dave,
  I don't recall seeing this else where but perhaps you are correct. As of right
now the kernel stress test rpm is the only consistent way of reproducing this
that I know of.

Comment 15 Sebastien BLAISOT 2006-03-31 11:59:59 UTC

Can reproduce the hang here almost every time with a mysqldump (which add high
load and disk access to the server).

machine is an IBM x346 with a fusion mpt scsi controller.

If this can help, I have a netdump of the crash.

a support ticket has been opened under number 852722 (which contains sysreport
and log of crashdump)

Comment 16 Ernie Petrides 2006-03-31 22:47:24 UTC

Merci mille fois, Sebastien.  I will try to reproduce the problem here
by using mysqldump.

Comment 17 Ernie Petrides 2006-04-05 23:51:04 UTC

Jeff, in the log of comment #12, I see the following error message:

  rw_swap_page: bad swap file
  Unable to find swap-space signature

Any idea what brings that about?  And do you know if it immediately
precedes the NMI watchdog lock-up detection?

Coincidentally, this error is also generated by another problem I'm
looking into (IBM's pounder test suite in bug 185842).

Comment 18 Ernie Petrides 2006-04-06 01:50:10 UTC

I've discovered that in the "pounder" case, it's the LTP swapoff/swapon
tests that lead to these error messages.  Disabling those particular
tests allows LTP to complete without those errors.  I doubt they
immediately lead to the NMI lock-up detection.

Comment 19 Jeff Burke 2006-04-10 13:39:40 UTC

Ernie,
   I may have verbally mislead you. But, this test case is not the pounder test
suite.  This is the internal stress kernel RPM. This test suite also includes
LTP suite.

Comment 20 Ernie Petrides 2006-04-19 21:56:24 UTC

Sorry, I never had to time to investigate this before U8 code freeze.

Note You need to log in before you can comment on or make changes to this bug.