Description of problem: NMI Watchdog detected LOCKUP on CPU2, rip ffffffff80123682, registers: Version-Release number of selected component (if applicable): Kernel 2.4.21-37.6.EL How reproducible: Often Steps to Reproduce: 1. Run the stress kernel RPM ona RHEL3 U7 kernel Actual results: [...network console startup...] mtrr: type mismatch for c8000000,1000000 old: write-back new: write-combining rw_swap_page: bad swap file Unable to find swap-space signature NMI Watchdog detected LOCKUP on CPU2, rip ffffffff80123682, registers: CPU 2 Pid: 18547, comm: dt Not tainted RIP: 0010:[<ffffffff80123682>]{.text.lock.sched+26} RSP: 0018:00000100bd96fe78 EFLAGS: 00000082 RAX: 0000000000000002 RBX: 00000100a4490000 RCX: 0000000000000000 RDX: ffffffff805ef800 RSI: 0000000000000001 RDI: 00000100a4490000 RBP: 00000100bd96fea8 R08: 0000007fbfffc364 R09: 0000000000000000 R10: 0000000000517040 R11: 0000000000000000 R12: ffffffff805efa00 R13: 0000000000000000 R14: 00000100bd96fe78 R15: 0000000000000000 FS: 0000002a958d34c0(0000) GS:ffffffff805e8e80(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000d4a8c18 CR3: 0000000017e89000 CR4: 00000000000006e0 Call Trace: <EOE> [<ffffffff801218bf>]{__wake_up+111} [<ffffffff8016cdae>]{pipe_read+526} [<ffffffff80160e12>]{sys_read+178} [<ffffffff801102a7>]{system_call+119} Process dt (pid: 18547, stackpage=100bd96f000) Stack: 00000100bd96fe78 0000000000000018 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Call Trace: <EOE> [<ffffffff801218bf>]{__wake_up+111} [<ffffffff8016cdae>]{pipe_read+526} [<ffffffff80160e12>]{sys_read+178} [<ffffffff801102a7>]{system_call+119} Code: 7e f5 e9 6e c4 ff ff 80 be 00 fa 5e 80 00 f3 90 7e f5 e9 9b CPU#2 is executing netdump. CPU#3 is frozen. < netdump activated - performing handshake with the server. > Pid: 18547, comm: dt Not tainted RIP: 0010:[<ffffffff80123682>]{.text.lock.sched+26} RSP: 0018:00000100bd96fe78 EFLAGS: 00000082 RAX: 0000000000000002 RBX: 00000100a4490000 RCX: 0000000000000000 RDX: ffffffff805ef800 RSI: 0000000000000001 RDI: 00000100a4490000 RBP: 00000100bd96fea8 R08: 0000007fbfffc364 R09: 0000000000000000 R10: 0000000000517040 R11: 0000000000000000 R12: ffffffff805efa00 R13: 0000000000000000 R14: 00000100bd96fe78 R15: 0000000000000000 FS: 0000002a958d34c0(0000) GS:ffffffff805e8e80(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000d4a8c18 CR3: 0000000017e89000 CR4: 00000000000006e0 Call Trace: <EOE> [<ffffffff801218bf>]{__wake_up+111} [<ffffffff8016cdae>]{pipe_read+526} [<ffffffff80160e12>]{sys_read+178} [<ffffffff801102a7>]{system_call+119} free sibling task PC stack pid father child younger older init S ffffffff801577c9 1568 1 0 6 2 (NOTLB) Call Trace: [<ffffffff80131d7f>]{schedule_timeout+175} [<ffffffff80131cc0>]{process_timeout+0} [<ffffffff80175576>]{do_select+550} [<ffffffff801759a3>]{sys_select+947} [<ffffffff801102a7>]{system_call+119} migration/0 S ffffffff801216cc 4600 2 0 3 1 (L-TLB) Call Trace: [<ffffffff801233a5>]{migration_task+421} [<ffffffff80110cc1>]{child_rip+8} [<ffffffff80123200>]{migration_task+0} [<ffffffff80110cb9>]{child_rip+0} migration/1 S ffffffff8011f91c 4600 3 0 4 2 (L-TLB) Call Trace: [<ffffffff801233a5>]{migration_task+421} [<ffffffff80110cc1>]{child_rip+8} [<ffffffff80123200>]{migration_task+0} [<ffffffff80110cb9>]{child_rip+0} migration/2 S ffffffff801216cc 4600 4 0 5 3 (L-TLB) Call Trace: [<ffffffff801233a5>]{migration_task+421} [<ffffffff80110cc1>]{child_rip+8} [<ffffffff80123200>]{migration_task+0} [<ffffffff80110cb9>]{child_rip+0} migration/3 S ffffffff801216cc 4600 5 0 4 (L-TLB) Call Trace: [<ffffffff801233a5>]{migration_task+421} [<ffffffff80110cc1>]{child_rip+8} [<ffffffff80123200>]{migration_task+0} [<ffffffff80110cb9>]{child_rip+0} keventd S ffffffff801216cc 4320 6 1 7 (L-TLB) Call Trace: [<ffffffff801394e0>]{context_thread+240} [<ffffffff80110cc1>]{child_rip+8} [<ffffffff801393f0>]{context_thread+0} [<ffffffff80110cb9>]{child_rip+0} ksoftirqd/0 S 0000000000000212 4648 7 1 8 6 (L-TLB) Call Trace: [<ffffffff8012d05f>]{ksoftirqd+207} [<ffffffff80110cc1>]{child_rip+8} [<ffffffff8012cf90>]{ksoftirqd+0} [<ffffffff80110cb9>]{child_rip+0} ksoftirqd/1 S ffffffff8012063c 4648 8 1 9 7 (L-TLB) Call Trace: [<ffffffff8012d05f>]{ksoftirqd+207} [<ffffffff80110cc1>]{child_rip+8} [<ffffffff8012cf90>]{ksoftirqd+0} [<ffffffff80110cb9>]{child_rip+0} ksoftirqd/2 S ffffffff801216cc 4648 9 1 10 8 (L-TLB) Call Trace: [<ffffffff8012d05f>]{ksoftirqd+207} [<ffffffff80110cc1>]{child_rip+8} [<ffffffff8012cf90>]{ksoftirqd+0} [<ffffffff80110cb9>]{child_rip+0} Expected results: Additional info: I do not believe this is a regression. It is only recently that I modified the kernel to trigger the NMI watchdog by adding argumets to the kernel boot line. Before adding the boot option the system would just hang (deadlock). I have seen this behavior for several releases now. But this is the first time I have been able to collect data. Here is a pointer to the vmcore file ndnc-1.lab.boston.redhat.com:/var/crash/192.168.79.125-2005-10-20-09:44 In that directory is also a console log file. I can assist is setting this test up to reproduce.
This bug is having a serious negative impact on stress testing EM64T systems. Adding to U7 Proposed list and raising severity to "high".
A patch to fix the default setting for nmi_watchdog on EM64T systems was just posted for internal review. It will serve as an aid to debug this problem further (obviating the need for special kernel boot options).
A fix for the default nmi_watchdog setting on EM64T systems has just been committed to the RHEL3 U7 patch pool this evening (in kernel version 2.4.21-37.7.EL). Note that this bug should remain in ASSIGNED state until the EM64T deadlock problem (formerly hangs, now NMI watchdog triggers) is resolved.
Jeff, we tracked the hang Shak was having on EM64T down to the mptfusion driver update that was included in RHEL3-U6. Specifically the upgrade from 2.05.16.01 to 2.06.16.01 causes his system to hang: ---system hangs with this driver------------- Fusion MPT base driver 2.06.16.01 Fusion MPT SCSI Host driver 2.06.16.01 ---system doesnt hang with this driver---- Fusion MPT base driver 2.05.16.02 Fusion MPT SCSI Host driver 2.05.16.02 Can you try backing up the mptfusuion drivers to what was used in RHEL3-U5 and see if this problem goes away as well? Thanks, Larry
Larry, The systems I can duplicate the issue on do not load the mptfusion driver. Also I don not believe this is a regression I have been seeing the issue since U5 But, have only started seeing the NMI triggers because they were broken and are now fixed.
Here is the latest nmi from pe2850he: Unable to handle kernel paging request at virtual address 000001060014bb88 printing rip: ffffffff8014dcf4 PML4 8063 PGD 0 Oops: 0002 CPU 2 Pid: 27720, comm: rawio Not tainted RIP: 0010:[<ffffffff8014dcf4>]{kmem_cache_free+116} RSP: 0000:00000100a0bd5d28 EFLAGS: 00010446 RAX: 0000010113ad6850 RBX: 0000010017e6e008 RCX: 0000000000000000 RDX: 00000000bd05bb70 RSI: 0000010017e6dc18 RDI: 0000010017e6da20 RBP: 0000010017e6e000 R08: 0000010017e6e008 R09: 0000000000000af0 R10: 0000000000000000 R11: ffffff0000000000 R12: 000000000000003f R13: 0000010003a11730 R14: 0000000000000002 R15: 000001013f9d3800 FS: 0000002a958d34c0(0000) GS:ffffffff805e9040(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000001060014bb88 CR3: 0000000017e89000 CR4: 00000000000006e0 Call Trace: [<ffffffff8014dced>]{kmem_cache_free+109} [<ffffffff80166132>]{try_to_free_buffers+178} [<ffffffffa005847d>]{:jbd:journal_flushpage_Rsmp_b97420e3+269} [<ffffffff801432e5>]{do_flushpage+37} [<ffffffff80143302>]{truncate_complete_page+18} [<ffffffff801434e0>]{truncate_list_pages+416} [<ffffffff801435ba>]{truncate_inode_pages+90} [<ffffffffa0078fe0>]{:ext3:ext3_sops+0} [<ffffffff8017c516>]{iput+198} [<ffffffff801790ef>]{dput+303} [<ffffffff801620b7>]{__fput+231} [<ffffffff80142e15>]{exit_mmap+261} [<ffffffff80123ebc>]{mmput+92} [<ffffffff8012ae88>]{do_exit+456} [<ffffffff8012b259>]{do_group_exit+233} [<ffffffff801102a7>]{system_call+119} Process rawio (pid: 27720, stackpage=100a0bd5000) Stack: 00000100a0bd5d28 0000000000000000 ffffffff8014dced 0000010113ad6850 0000000000000616 0000010113ad6850 0000010113ad6850 0000010113ad6850 ffffffff80166132 000001000400ae78 0000010003a11730 0000010113ad6850 0000010113ad6850 0000000000001000 ffffffffa005847d 000001013f9d3910 0000000131989ad8 0000010113ad6850 0000000000000000 0000010003a11730 0000010003a11730 0000000000001492 00000101381e76a0 0000000000000000 00000100a0bd5e54 0000000000000000 ffffffff801432e5 0000000000000000 ffffffff80143302 0000010003a11730 ffffffff801434e0 00000001032291b0 0000000000000000 0000000000000000 00000101381e7690 00000101381e76b0 00000101381e76a0 0000000000000000 ffffffff801435ba 00000100ae532c00 Call Trace: [<ffffffff8014dced>]{kmem_cache_free+109} [<ffffffff80166132>]{try_to_free_buffers+178} [<ffffffffa005847d>]{:jbd:journal_flushpage_Rsmp_b97420e3+269} [<ffffffff801432e5>]{do_flushpage+37} [<ffffffff80143302>]{truncate_complete_page+18} [<ffffffff801434e0>]{truncate_list_pages+416} [<ffffffff801435ba>]{truncate_inode_pages+90} [<ffffffffa0078fe0>]{:ext3:ext3_sops+0} [<ffffffff8017c516>]{iput+198} [<ffffffff801790ef>]{dput+303} [<ffffffff801620b7>]{__fput+231} [<ffffffff80142e15>]{exit_mmap+261} [<ffffffff80123ebc>]{mmput+92} [<ffffffff8012ae88>]{do_exit+456} [<ffffffff8012b259>]{do_group_exit+233} [<ffffffff801102a7>]{system_call+119} Code: 48 89 44 d5 08 ff 45 00 eb 12 ba 01 00 00 00 48 89 e6 e8 b5 Kernel panic: Fatal exception NMI Watchdog detected LOCKUP on CPU2, rip ffffffff80166cf1, registers: CPU 2 Pid: 27720, comm: rawio Not tainted RIP: 0010:[<ffffffff80166cf1>]{.text.lock.buffer+7} RSP: 0000:00000100a0bd5a18 EFLAGS: 00000082 RAX: 00000100a0bd4000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000006c48 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 R10: 00000000ffffffff R11: 0000000000000010 R12: 0000000000000000 R13: 000001060014bb88 R14: 0000000000000002 R15: 00000100a0bd5c78 FS: 0000002a958d34c0(0000) GS:ffffffff805e9040(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000001060014bb88 CR3: 0000000017e89000 CR4: 00000000000006e0 Call Trace: <EOE> [<ffffffff801626ee>]{sync_buffers+30} [<ffffffff8016285a>]{fsync_dev+10} [<ffffffff8016299b>]{sys_sync+11} [<ffffffff80125f48>]{panic+296} [<ffffffff80111510>]{show_trace+656} [<ffffffff80111668>]{show_stack+216} [<ffffffff801117c9>]{show_registers+329} [<ffffffff80111965>]{die+245} [<ffffffff801a9b4d>]{do_page_fault+1021} [<ffffffff8012170c>]{thread_return+0} [<ffffffff80110b06>]{error_exit+0} [<ffffffff8014dcf4>]{kmem_cache_free+116} [<ffffffff8014dced>]{kmem_cache_free+109} [<ffffffff80166132>]{try_to_free_buffers+178} [<ffffffffa005847d>]{:jbd:journal_flushpage_Rsmp_b97420e3+269} [<ffffffff801432e5>]{do_flushpage+37} [<ffffffff80143302>]{truncate_complete_page+18} [<ffffffff801434e0>]{truncate_list_pages+416} [<ffffffff801435ba>]{truncate_inode_pages+90} [<ffffffffa0078fe0>]{:ext3:ext3_sops+0} [<ffffffff8017c516>]{iput+198} [<ffffffff801790ef>]{dput+303} [<ffffffff801620b7>]{__fput+231} [<ffffffff80142e15>]{exit_mmap+261} [<ffffffff80123ebc>]{mmput+92} [<ffffffff8012ae88>]{do_exit+456} [<ffffffff8012b259>]{do_group_exit+233} [<ffffffff801102a7>]{system_call+119} Process rawio (pid: 27720, stackpage=100a0bd5000) Stack: 00000100a0bd5a18 0000000000000000 Call Trace: <EOE> [<ffffffff801626ee>]{sync_buffers+30} [<ffffffff8016285a>]{fsync_dev+10} [<ffffffff8016299b>]{sys_sync+11} [<ffffffff80125f48>]{panic+296} [<ffffffff80111510>]{show_trace+656} [<ffffffff80111668>]{show_stack+216} [<ffffffff801117c9>]{show_registers+329} [<ffffffff80111965>]{die+245} [<ffffffff801a9b4d>]{do_page_fault+1021} [<ffffffff8012170c>]{thread_return+0} [<ffffffff80110b06>]{error_exit+0} [<ffffffff8014dcf4>]{kmem_cache_free+116} [<ffffffff8014dced>]{kmem_cache_free+109} [<ffffffff80166132>]{try_to_free_buffers+178} [<ffffffffa005847d>]{:jbd:journal_flushpage_Rsmp_b97420e3+269} [<ffffffff801432e5>]{do_flushpage+37} [<ffffffff80143302>]{truncate_complete_page+18} [<ffffffff801434e0>]{truncate_list_pages+416} [<ffffffff801435ba>]{truncate_inode_pages+90} [<ffffffffa0078fe0>]{:ext3:ext3_sops+0} [<ffffffff8017c516>]{iput+198} [<ffffffff801790ef>]{dput+303} [<ffffffff801620b7>]{__fput+231} [<ffffffff80142e15>]{exit_mmap+261} [<ffffffff80123ebc>]{mmput+92} [<ffffffff8012ae88>]{do_exit+456} [<ffffffff8012b259>]{do_group_exit+233} [<ffffffff801102a7>]{system_call+119} Code: f3 90 7e f5 e9 c9 b8 ff ff 80 3d 3f 61 2d 00 00 f3 90 7e f5 console shuts up ...
Larry,
Please try test kernel in kernel-2.4.21-40.3.EL.badelf.2.ia32e.rpm from redline:~petrides/tmp. It has debugging printk()'s that will identify if a potentially troublesome code path was taken.
Created attachment 125754 [details] Netdump console output, While running 2.4.21-40.3.badelf2.ia32e kernel Ernie, Here is the output of the netdump. I do not see the debug signature in the out put. That leads me to believe that this may be a different issue. Jeff
Dave, I don't recall seeing this else where but perhaps you are correct. As of right now the kernel stress test rpm is the only consistent way of reproducing this that I know of.
Can reproduce the hang here almost every time with a mysqldump (which add high load and disk access to the server). machine is an IBM x346 with a fusion mpt scsi controller. If this can help, I have a netdump of the crash. a support ticket has been opened under number 852722 (which contains sysreport and log of crashdump)
Merci mille fois, Sebastien. I will try to reproduce the problem here by using mysqldump.
Jeff, in the log of comment #12, I see the following error message: rw_swap_page: bad swap file Unable to find swap-space signature Any idea what brings that about? And do you know if it immediately precedes the NMI watchdog lock-up detection? Coincidentally, this error is also generated by another problem I'm looking into (IBM's pounder test suite in bug 185842).
I've discovered that in the "pounder" case, it's the LTP swapoff/swapon tests that lead to these error messages. Disabling those particular tests allows LTP to complete without those errors. I doubt they immediately lead to the NMI lock-up detection.
Ernie, I may have verbally mislead you. But, this test case is not the pounder test suite. This is the internal stress kernel RPM. This test suite also includes LTP suite.
Sorry, I never had to time to investigate this before U8 code freeze.