Description of problem:
Machine is 2x2 core 2.8Ghz AMD with 8G of RAM. Problem shows up on Java testsuite run. With i686 version of 2.6.24.3-29.el5rt, we have the following message on the console (keeps repeating):

NMI show regs on CPU#0:
apic_timer_irqs: 58692017
Pid: 9744, comm: java Tainted: G D (2.6.24.3-29.el5rt #1)
EIP: 0060:[<c0448422>] EFLAGS: 00000286 CPU: 0
EIP is at __spin_lock+0x18/0x1e
EAX: c0590b00 EBX: cc423da0 ECX: f3a70434 EDX: cc423000
ESI: 00000000 EDI: c021731b EBP: cc423cc0 ESP: cc423cc0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 preempt:00010003
CR0: 8005003b CR2: aa915000 CR3: 3344c000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c020507d>] show_trace_log_lvl+0x1a/0x2f
 [<c02058c9>] show_trace+0x12/0x14
 [<c020247a>] show_regs+0x1c/0x1f
 [<c0219a04>] irq_show_regs_callback+0x62/0x72
 [<c04492c3>] nmi_watchdog_tick+0xc2/0x20d
 [<c0448e06>] do_nmi+0x9d/0x259
NMI watchdog running again ...
 [<c04489a3>] nmi_stack_correct+0x26/0x2b
 [<c021708a>] native_smp_call_function_mask+0x46/0x1a2
 [<c02187d7>] smp_call_function+0x44/0x4c
 [<c02301d4>] on_each_cpu+0x24/0x4a
 [<c0216cfa>] flush_tlb_all+0x1e/0x20
 [<c0271e08>] kmap_high+0x29f/0x407
 [<c021fffd>] kmap+0x41/0x4c
 [<c0275016>] handle_mm_fault+0x12b/0x8fc
 [<c044a0db>] do_page_fault+0x336/0x7d4
 [<c04488fa>] error_code+0x72/0x78
 =======================

NMI show regs on CPU#1:
apic_timer_irqs: 58113825
Pid: 3773, comm: sge_execd Tainted: G D (2.6.24.3-29.el5rt #1)
EIP: 0060:[<c02171bb>] EFLAGS: 00000293 CPU: 1
EIP is at native_smp_call_function_mask+0x177/0x1a2
EAX: 000008fb EBX: f2c63ce4 ECX: c05a77c0 EDX: ffffb300
ESI: f2c63cc4 EDI: 00000003 EBP: f2c63d44 ESP: f2c63ca0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 preempt:00010003
CR0: 8005003b CR2: ae4f4784 CR3: 341ed000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c020507d>] show_trace_log_lvl+0x1a/0x2f
 [<c02058c9>] show_trace+0x12/0x14
 [<c020247a>] show_regs+0x1c/0x1f
 [<c0219a04>] irq_show_regs_callback+0x62/0x72
 [<c04492c3>] nmi_watchdog_tick+0xc2/0x20d
 [<c0448e06>] do_nmi+0x9d/0x259
 [<c04489a3>] nmi_stack_correct+0x26/0x2b
 [<c02187d7>] smp_call_function+0x44/0x4c
 [<c02301d4>] on_each_cpu+0x24/0x4a
 [<c0216cfa>] flush_tlb_all+0x1e/0x20
 [<c0271e08>] kmap_high+0x29f/0x407
 [<c021fffd>] kmap+0x41/0x4c
 [<c028cf3b>] pipe_write+0x234/0x3c4
 [<c0286e16>] do_sync_write+0xc5/0x102
 [<c02875b8>] vfs_write+0xa8/0x131
 [<c0287b99>] sys_write+0x3d/0x61
 [<c02040fa>] syscall_call+0x7/0xb
 =======================

NMI show regs on CPU#3:
apic_timer_irqs: 57510689
Pid: 9793, comm: java Tainted: G D (2.6.24.3-29.el5rt #1)
EIP: 0060:[<c044841d>] EFLAGS: 00000082 CPU: 3
EIP is at __spin_lock+0x13/0x1e
EAX: c8e56300 EBX: 00000040 ECX: 38638424 EDX: f2453000
ESI: f347b2f0 EDI: f39de1b0 EBP: f2453b8c ESP: f2453b8c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 preempt:00010005
CR0: 8005003b CR2: ffffffd0 CR3: 3286e000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c020507d>] show_trace_log_lvl+0x1a/0x2f
 [<c02058c9>] show_trace+0x12/0x14
 [<c020247a>] show_regs+0x1c/0x1f
 [<c0219a04>] irq_show_regs_callback+0x62/0x72
 [<c04492c3>] nmi_watchdog_tick+0xc2/0x20d
 [<c0448e06>] do_nmi+0x9d/0x259
 [<c04489a3>] nmi_stack_correct+0x26/0x2b
 [<c0446b31>] __schedule+0x100/0x7a7
 [<c022e7f1>] do_exit+0x6ad/0x706
 [<c02054e5>] die+0x1f2/0x1fa
 [<c044a493>] do_page_fault+0x6ee/0x7d4
 [<c04488fa>] error_code+0x72/0x78
 [<c0447354>] schedule+0xe0/0xfa
 [<c0245e01>] futex_wait+0x21c/0x2f4
 [<c0246db0>] do_futex+0x59/0x923
 [<c0247761>] sys_futex+0xe7/0xfa
 [<c02040fa>] syscall_call+0x7/0xb
 =======================

Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Good morning! Can I please get some more information about this crash:
*) How are you booting? Parameters?
*) Test suite. Can you tell us what you were running, what options it uses, and where to get ahold of that testsuite for reproducibility?
*) Command line for the kernel in question?
Did you try with a debug kernel (yet)? :)
Thanks!
Jon.
The kernel is booted with:
kernel /vmlinuz-2.6.24.3-29.el5rt ro root=LABEL=/1 console=ttyS0,9600 rhgb quiet
The crash happens overnight during a run of an internal Java test suite. We haven't figured out whether one particular test triggers the crash. I will install the debug kernel and let you know what happens.
We don't have a -debug RPM package for 2.6.24.3-29. Can you provide one?
Here is what I got with the debug kernel:

NMI show regs on CPU#0:
apic_timer_irqs: 31045376
NMI watchdog running again ...
Pid: 17705, comm: java Tainted: G D (2.6.24.3-29.el5rtdebug #1)
EIP: 0060:[<c0322cec>] EFLAGS: 00000086 CPU: 0
EIP is at delay_tsc+0x1d/0x43
EAX: aad7ee10 EBX: 00000001 ECX: aad7ee07 EDX: 000093cd
ESI: 04f4c966 EDI: 00000000 EBP: eddf0c54 ESP: eddf0c50
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 preempt:00010005
CR0: 8005003b CR2: b7fe4000 CR3: 2bcbd000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c02052f9>] show_trace_log_lvl+0x22/0x3f
 [<c0205b94>] show_trace+0x17/0x19
 [<c02024b0>] show_regs+0x21/0x24
 [<c021ae2c>] irq_show_regs_callback+0x62/0x72
 [<c04722a8>] nmi_watchdog_tick+0xc2/0x20d
 [<c0471d30>] do_nmi+0xde/0x2ac
 [<c04718cb>] nmi_stack_correct+0x26/0x2b
 [<c0322c98>] __delay+0xe/0x10
 [<c0471589>] _raw_spin_lock+0x82/0xe7
 [<c04708f7>] __spin_lock+0x59/0x67
 [<c0224c31>] double_lock_balance+0x54/0x5c
 [<c0224e7d>] pull_rt_task+0x81/0x1ad
 [<c022c32a>] pre_schedule_rt+0x22/0x2b
 [<c046e7ce>] __schedule+0x1bc/0x84b
 [<c046eff8>] schedule+0xea/0x109
 [<c025263f>] futex_wait+0x231/0x309
 [<c0253661>] do_futex+0x60/0x9a4
 [<c0254091>] sys_futex+0xec/0xff
 [<c02042c6>] syscall_call+0x7/0xb
 =======================
---------------------------
| preempt count: 00010005 ]
| 5-level deep critical section nesting:
----------------------------------------
.. [<c046e641>] .... __schedule+0x2f/0x84b
.....[<c046eff8>] .. ( <= schedule+0xea/0x109)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c046e739>] .. ( <= __schedule+0x127/0x84b)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c0224c31>] .. ( <= double_lock_balance+0x54/0x5c)
.. [<c0322ce4>] .... delay_tsc+0x15/0x43
.....[<c0322c98>] .. ( <= __delay+0xe/0x10)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c021adf5>] .. ( <= irq_show_regs_callback+0x2b/0x72)

NMI show regs on CPU#2:
apic_timer_irqs: 30774107
Pid: 32, comm: softirq-timer/2 Tainted: G D (2.6.24.3-29.el5rtdebug #1)
EIP: 0060:[<c02299d0>] EFLAGS: 00000006 CPU: 2
EIP is at add_preempt_count+0x98/0x132
EAX: 00000003 EBX: c0322c98 ECX: ab8d8863 EDX: f78dc000
ESI: 00000001 EDI: c0322ce4 EBP: f78dcdfc ESP: f78dcde0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 preempt:00010004
CR0: 8005003b CR2: aa7590d8 CR3: 2bcbd000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c02052f9>] show_trace_log_lvl+0x22/0x3f
 [<c0205b94>] show_trace+0x17/0x19
 [<c02024b0>] show_regs+0x21/0x24
 [<c021ae2c>] irq_show_regs_callback+0x62/0x72
 [<c04722a8>] nmi_watchdog_tick+0xc2/0x20d
 [<c0471d30>] do_nmi+0xde/0x2ac
 [<c04718cb>] nmi_stack_correct+0x26/0x2b
 [<c0322ce4>] delay_tsc+0x15/0x43
 [<c0322c98>] __delay+0xe/0x10
 [<c0471589>] _raw_spin_lock+0x82/0xe7
 [<c04708f7>] __spin_lock+0x59/0x67
 [<c0224c31>] double_lock_balance+0x54/0x5c
 [<c02251ad>] push_rt_task+0x95/0x1e8
 [<c0225312>] push_rt_tasks+0x12/0x19
 [<c0225339>] post_schedule_rt+0x20/0x30
 [<c0229810>] finish_task_switch+0x6e/0xb7
 [<c046edd5>] __schedule+0x7c3/0x84b
 [<c046eff8>] schedule+0xea/0x109
 [<c0235587>] ksoftirqd+0xbf/0x26b
 [<c0242bbb>] kthread+0x40/0x69
 [<c0204f13>] kernel_thread_helper+0x7/0x10
 =======================
---------------------------
| preempt count: 00010004 ]
| 4-level deep critical section nesting:
----------------------------------------
.. [<c046e641>] .... __schedule+0x2f/0x84b
.....[<c046eff8>] .. ( <= schedule+0xea/0x109)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c0225332>] .. ( <= post_schedule_rt+0x19/0x30)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c0224c31>] .. ( <= double_lock_balance+0x54/0x5c)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c021adf5>] .. ( <= irq_show_regs_callback+0x2b/0x72)

NMI show regs on CPU#3:
apic_timer_irqs: 30619233
Pid: 45, comm: softirq-timer/3 Tainted: G D (2.6.24.3-29.el5rtdebug #1)
EIP: 0060:[<c0322c92>] EFLAGS: 00000046 CPU: 3
EIP is at __delay+0x8/0x10
EAX: 00000001 EBX: d28ced80 ECX: b0208e5d EDX: 000093cd
ESI: 1c537f1a EDI: 00000000 EBP: f7908cc0 ESP: f7908cc0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 preempt:00010005
CR0: 8005003b CR2: ffffffd0 CR3: 2bcbd000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c02052f9>] show_trace_log_lvl+0x22/0x3f
 [<c0205b94>] show_trace+0x17/0x19
 [<c02024b0>] show_regs+0x21/0x24
 [<c021ae2c>] irq_show_regs_callback+0x62/0x72
 [<c04722a8>] nmi_watchdog_tick+0xc2/0x20d
 [<c0471d30>] do_nmi+0xde/0x2ac
 [<c04718cb>] nmi_stack_correct+0x26/0x2b
 [<c0471589>] _raw_spin_lock+0x82/0xe7
 [<c04708f7>] __spin_lock+0x59/0x67
 [<c046e739>] __schedule+0x127/0x84b
 [<c0233696>] do_exit+0x70f/0x768
 [<c020577e>] die+0x1f6/0x1fe
 [<c047356f>] do_page_fault+0x74e/0x834
 [<c0471822>] error_code+0x72/0x78
 [<c046eff8>] schedule+0xea/0x109
 [<c0235587>] ksoftirqd+0xbf/0x26b
 [<c0242bbb>] kthread+0x40/0x69
 [<c0204f13>] kernel_thread_helper+0x7/0x10
 =======================
---------------------------
| preempt count: 00010005 ]
| 5-level deep critical section nesting:
----------------------------------------
.. [<c046e641>] .... __schedule+0x2f/0x84b
.....[<c046eff8>] .. ( <= schedule+0xea/0x109)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c046e739>] .. ( <= __schedule+0x127/0x84b)
.. [<c046e641>] .... __schedule+0x2f/0x84b
.....[<c0233696>] .. ( <= do_exit+0x70f/0x768)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c046e739>] .. ( <= __schedule+0x127/0x84b)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c021adf5>] .. ( <= irq_show_regs_callback+0x2b/0x72)
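For readers decoding the "preempt count" values in the debug-kernel dumps, here is a minimal sketch that splits the word into its fields. The bit layout (8 bits of preempt depth, 8 bits of softirq count, hardirq count starting at bit 16) is an assumption taken from the mainline 2.6.24 include/linux/hardirq.h; the -rt tree may differ. The helper name decode_preempt_count is hypothetical.

```python
# Assumed layout (mainline 2.6.24 hardirq.h; not verified against the -rt patch):
#   bits  0-7   preempt depth (nested preempt_disable / spinlock sections)
#   bits  8-15  softirq count
#   bits 16-27  hardirq count
PREEMPT_SHIFT, PREEMPT_BITS = 0, 8
SOFTIRQ_SHIFT, SOFTIRQ_BITS = 8, 8
HARDIRQ_SHIFT, HARDIRQ_BITS = 16, 12


def _field(value: int, shift: int, bits: int) -> int:
    """Extract a `bits`-wide field starting at bit `shift`."""
    return (value >> shift) & ((1 << bits) - 1)


def decode_preempt_count(value: int) -> dict:
    """Hypothetical helper: split a preempt_count word into its fields."""
    return {
        "preempt_depth": _field(value, PREEMPT_SHIFT, PREEMPT_BITS),
        "softirq_count": _field(value, SOFTIRQ_SHIFT, SOFTIRQ_BITS),
        "hardirq_count": _field(value, HARDIRQ_SHIFT, HARDIRQ_BITS),
    }


if __name__ == "__main__":
    # Values copied from the dumps above (printed there in hex).
    for raw in ("00010005", "00010004"):
        print(raw, decode_preempt_count(int(raw, 16)))
```

Under that assumed layout, the low byte of 00010005 is 5 and the low byte of 00010004 is 4, which agrees with the "5-level" and "4-level deep critical section nesting" lines the kernel printed.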
*** Bug 442828 has been marked as a duplicate of this bug. ***
Sorry, but is #442828 (F9 beta Installation in santa rosa platform failed) really related to this issue?
No, that looks like a mistake. This is an MRG Realtime bug, so I don't see any way that F9 could intersect with it.
Clark
Potential bug in the highmem handling: we implemented kmap_atomic using kmap, and kmap can end up in flush_tlb_all() (via kmap_high(), as in the first set of traces), which uses on_each_cpu(), which can deadlock when called with IRQs disabled. However, on -rt it should not be called from such a context, nor do the above NMI traces suggest that it is. So this is likely a red herring, but still worth making a note of, hence this message.
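To illustrate why a cross-CPU call with IRQs disabled is hazardous, here is a toy user-space model, not kernel code. It assumes a simplified rule: a caller spins until every other CPU acknowledges its IPI, and a CPU can acknowledge only if its interrupts are enabled or it will re-enable them on its own (that is, it is not itself stuck spinning with IRQs off). The names Cpu and stuck_cpus are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cpu:
    irqs_enabled: bool    # can this CPU take an IPI right now?
    in_cross_call: bool   # is it spinning, waiting for IPI acks from the others?


def stuck_cpus(cpus: List[Cpu]) -> List[int]:
    """Return indices of CPUs whose cross-CPU call can never complete (toy model)."""
    n = len(cpus)
    # A CPU is responsive if it can take IPIs now, or if it is not spinning and
    # is therefore assumed to re-enable interrupts eventually.
    responsive = {i for i, c in enumerate(cpus)
                  if c.irqs_enabled or not c.in_cross_call}
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if i in responsive:
                continue
            # An IRQs-off caller finishes (and then re-enables IRQs) only once
            # every other CPU has acknowledged its IPI.
            if all(j in responsive for j in range(n) if j != i):
                responsive.add(i)
                changed = True
    return [i for i, c in enumerate(cpus)
            if c.in_cross_call
            and not all(j in responsive for j in range(n) if j != i)]


if __name__ == "__main__":
    # Two CPUs both issue a cross-call with IRQs off: neither can ack the other.
    print(stuck_cpus([Cpu(False, True), Cpu(False, True)]))
    # Same callers with IRQs on: each can service the other's IPI while spinning.
    print(stuck_cpus([Cpu(True, True), Cpu(True, True)]))
```

In the first set of traces, CPU#0 and CPU#1 are both inside on_each_cpu() with EFLAGS 00000286 and 00000293; bit 9 (IF, 0x200) is set in both values, which matches the assessment that this path is probably not the cause.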