Bug 437933 - crash with 32bit 2.6.24.3-29.el5rt
Summary: crash with 32bit 2.6.24.3-29.el5rt
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: i386
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jon Masters
QA Contact:
URL:
Whiteboard:
Duplicates: 442828
Depends On:
Blocks:
Reported: 2008-03-18 09:10 UTC by Roland Westrelin
Modified: 2008-06-03 14:22 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-06-03 14:22:48 UTC
Target Upstream Version:
Embargoed:



Description Roland Westrelin 2008-03-18 09:10:50 UTC
Description of problem:

The machine is a 2x2-core 2.8 GHz AMD with 8 GB of RAM. The problem shows up
during a Java test suite run.

With the i686 version of 2.6.24.3-29.el5rt, the following message appears on the
console (it keeps repeating):

NMI show regs on CPU#0:
apic_timer_irqs: 58692017

Pid: 9744, comm: java Tainted: G      D  (2.6.24.3-29.el5rt #1)
EIP: 0060:[<c0448422>] EFLAGS: 00000286 CPU: 0
EIP is at __spin_lock+0x18/0x1e
EAX: c0590b00 EBX: cc423da0 ECX: f3a70434 EDX: cc423000
ESI: 00000000 EDI: c021731b EBP: cc423cc0 ESP: cc423cc0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 preempt:00010003
CR0: 8005003b CR2: aa915000 CR3: 3344c000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c020507d>] show_trace_log_lvl+0x1a/0x2f
 [<c02058c9>] show_trace+0x12/0x14
 [<c020247a>] show_regs+0x1c/0x1f
 [<c0219a04>] irq_show_regs_callback+0x62/0x72
 [<c04492c3>] nmi_watchdog_tick+0xc2/0x20d
 [<c0448e06>] do_nmi+0x9d/0x259
NMI watchdog running again ...
 [<c04489a3>] nmi_stack_correct+0x26/0x2b
 [<c021708a>] native_smp_call_function_mask+0x46/0x1a2
 [<c02187d7>] smp_call_function+0x44/0x4c
 [<c02301d4>] on_each_cpu+0x24/0x4a
 [<c0216cfa>] flush_tlb_all+0x1e/0x20
 [<c0271e08>] kmap_high+0x29f/0x407
 [<c021fffd>] kmap+0x41/0x4c
 [<c0275016>] handle_mm_fault+0x12b/0x8fc
 [<c044a0db>] do_page_fault+0x336/0x7d4
 [<c04488fa>] error_code+0x72/0x78
 =======================
NMI show regs on CPU#1:
apic_timer_irqs: 58113825

Pid: 3773, comm: sge_execd Tainted: G      D  (2.6.24.3-29.el5rt #1)
EIP: 0060:[<c02171bb>] EFLAGS: 00000293 CPU: 1
EIP is at native_smp_call_function_mask+0x177/0x1a2
EAX: 000008fb EBX: f2c63ce4 ECX: c05a77c0 EDX: ffffb300
ESI: f2c63cc4 EDI: 00000003 EBP: f2c63d44 ESP: f2c63ca0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 preempt:00010003
CR0: 8005003b CR2: ae4f4784 CR3: 341ed000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c020507d>] show_trace_log_lvl+0x1a/0x2f
 [<c02058c9>] show_trace+0x12/0x14
 [<c020247a>] show_regs+0x1c/0x1f
 [<c0219a04>] irq_show_regs_callback+0x62/0x72
 [<c04492c3>] nmi_watchdog_tick+0xc2/0x20d
 [<c0448e06>] do_nmi+0x9d/0x259
 [<c04489a3>] nmi_stack_correct+0x26/0x2b
 [<c02187d7>] smp_call_function+0x44/0x4c
 [<c02301d4>] on_each_cpu+0x24/0x4a
 [<c0216cfa>] flush_tlb_all+0x1e/0x20
 [<c0271e08>] kmap_high+0x29f/0x407
 [<c021fffd>] kmap+0x41/0x4c
 [<c028cf3b>] pipe_write+0x234/0x3c4
 [<c0286e16>] do_sync_write+0xc5/0x102
 [<c02875b8>] vfs_write+0xa8/0x131
 [<c0287b99>] sys_write+0x3d/0x61
 [<c02040fa>] syscall_call+0x7/0xb
 =======================
NMI show regs on CPU#3:
apic_timer_irqs: 57510689

Pid: 9793, comm: java Tainted: G      D  (2.6.24.3-29.el5rt #1)
EIP: 0060:[<c044841d>] EFLAGS: 00000082 CPU: 3
EIP is at __spin_lock+0x13/0x1e
EAX: c8e56300 EBX: 00000040 ECX: 38638424 EDX: f2453000
ESI: f347b2f0 EDI: f39de1b0 EBP: f2453b8c ESP: f2453b8c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 preempt:00010005
CR0: 8005003b CR2: ffffffd0 CR3: 3286e000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c020507d>] show_trace_log_lvl+0x1a/0x2f
 [<c02058c9>] show_trace+0x12/0x14
 [<c020247a>] show_regs+0x1c/0x1f
 [<c0219a04>] irq_show_regs_callback+0x62/0x72
 [<c04492c3>] nmi_watchdog_tick+0xc2/0x20d
 [<c0448e06>] do_nmi+0x9d/0x259
 [<c04489a3>] nmi_stack_correct+0x26/0x2b
 [<c0446b31>] __schedule+0x100/0x7a7
 [<c022e7f1>] do_exit+0x6ad/0x706
 [<c02054e5>] die+0x1f2/0x1fa
 [<c044a493>] do_page_fault+0x6ee/0x7d4
 [<c04488fa>] error_code+0x72/0x78
 [<c0447354>] schedule+0xe0/0xfa
 [<c0245e01>] futex_wait+0x21c/0x2f4
 [<c0246db0>] do_futex+0x59/0x923
 [<c0247761>] sys_futex+0xe7/0xfa
 [<c02040fa>] syscall_call+0x7/0xb
 =======================
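
For a quick per-CPU overview of dumps like the ones above, something along these lines may help. It is only a rough sketch: the regular expressions are assumptions based on the line formats visible in this report, and the excerpt keeps just the header, Pid and EIP lines of each dump.

```python
import re

# Excerpt copied from the console output above: the header, Pid and EIP
# lines of each per-CPU dump (registers and stack frames left out).
LOG = """\
NMI show regs on CPU#0:
Pid: 9744, comm: java Tainted: G      D  (2.6.24.3-29.el5rt #1)
EIP is at __spin_lock+0x18/0x1e
NMI show regs on CPU#1:
Pid: 3773, comm: sge_execd Tainted: G      D  (2.6.24.3-29.el5rt #1)
EIP is at native_smp_call_function_mask+0x177/0x1a2
NMI show regs on CPU#3:
Pid: 9793, comm: java Tainted: G      D  (2.6.24.3-29.el5rt #1)
EIP is at __spin_lock+0x13/0x1e
"""

# Patterns are assumptions derived from the line formats seen in this report.
CPU_RE = re.compile(r"NMI show regs on CPU#(\d+):")
COMM_RE = re.compile(r"Pid: (\d+), comm: (\S+)")
EIP_RE = re.compile(r"EIP is at (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize(log):
    """Return {cpu: (pid, comm, symbol)} for each NMI register dump."""
    summary = {}
    cpu = pid = comm = None
    for line in log.splitlines():
        if m := CPU_RE.search(line):
            # A new dump starts: forget the previous task.
            cpu = int(m.group(1))
            pid = comm = None
        elif m := COMM_RE.search(line):
            pid, comm = int(m.group(1)), m.group(2)
        elif (m := EIP_RE.search(line)) and cpu is not None:
            summary[cpu] = (pid, comm, m.group(1))
    return summary


summary = summarize(LOG)
for cpu, (pid, comm, symbol) in sorted(summary.items()):
    print(f"CPU#{cpu}: pid {pid} ({comm}), EIP in {symbol}")
```

Feeding the full console capture to `summarize()` instead of the excerpt should work the same way, as long as the line formats match.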



Comment 1 Jon Masters 2008-03-19 14:12:46 UTC
Good morning!

Can I please get some more information about this crash:

*). How are you booting? Parameters?
*). Test suite. Can you tell us what you were running, what options it uses, and
where to get hold of that test suite for reproducibility?
*). Command Line for the kernel in question?

Did you try with a debug kernel (yet)? :)

Thanks!

Jon.


Comment 2 Roland Westrelin 2008-03-19 14:40:23 UTC
The kernel is booted with:
kernel /vmlinuz-2.6.24.3-29.el5rt ro root=LABEL=/1 console=ttyS0,9600 rhgb quiet

The crash happens overnight during a run of an internal Java test suite. We
haven't figured out whether one particular test triggers the crash.

I will install the debug kernel. I'll let you know what happens.

Comment 3 Roland Westrelin 2008-03-19 14:50:21 UTC
We don't have a -debug RPM package for 2.6.24.3-29. Can you provide one?

Comment 4 Roland Westrelin 2008-03-26 08:12:45 UTC
Here is what I got with the debug kernel:

NMI show regs on CPU#0:
apic_timer_irqs: 31045376
NMI watchdog running again ...

Pid: 17705, comm: java Tainted: G      D  (2.6.24.3-29.el5rtdebug #1)
EIP: 0060:[<c0322cec>] EFLAGS: 00000086 CPU: 0
EIP is at delay_tsc+0x1d/0x43
EAX: aad7ee10 EBX: 00000001 ECX: aad7ee07 EDX: 000093cd
ESI: 04f4c966 EDI: 00000000 EBP: eddf0c54 ESP: eddf0c50
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 preempt:00010005
CR0: 8005003b CR2: b7fe4000 CR3: 2bcbd000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c02052f9>] show_trace_log_lvl+0x22/0x3f
 [<c0205b94>] show_trace+0x17/0x19
 [<c02024b0>] show_regs+0x21/0x24
 [<c021ae2c>] irq_show_regs_callback+0x62/0x72
 [<c04722a8>] nmi_watchdog_tick+0xc2/0x20d
 [<c0471d30>] do_nmi+0xde/0x2ac
 [<c04718cb>] nmi_stack_correct+0x26/0x2b
 [<c0322c98>] __delay+0xe/0x10
 [<c0471589>] _raw_spin_lock+0x82/0xe7
 [<c04708f7>] __spin_lock+0x59/0x67
 [<c0224c31>] double_lock_balance+0x54/0x5c
 [<c0224e7d>] pull_rt_task+0x81/0x1ad
 [<c022c32a>] pre_schedule_rt+0x22/0x2b
 [<c046e7ce>] __schedule+0x1bc/0x84b
 [<c046eff8>] schedule+0xea/0x109
 [<c025263f>] futex_wait+0x231/0x309
 [<c0253661>] do_futex+0x60/0x9a4
 [<c0254091>] sys_futex+0xec/0xff
 [<c02042c6>] syscall_call+0x7/0xb
 =======================
---------------------------
| preempt count: 00010005 ]
| 5-level deep critical section nesting:
----------------------------------------
.. [<c046e641>] .... __schedule+0x2f/0x84b
.....[<c046eff8>] ..   ( <= schedule+0xea/0x109)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c046e739>] ..   ( <= __schedule+0x127/0x84b)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c0224c31>] ..   ( <= double_lock_balance+0x54/0x5c)
.. [<c0322ce4>] .... delay_tsc+0x15/0x43
.....[<c0322c98>] ..   ( <= __delay+0xe/0x10)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c021adf5>] ..   ( <= irq_show_regs_callback+0x2b/0x72)

NMI show regs on CPU#2:
apic_timer_irqs: 30774107

Pid: 32, comm: softirq-timer/2 Tainted: G      D  (2.6.24.3-29.el5rtdebug #1)
EIP: 0060:[<c02299d0>] EFLAGS: 00000006 CPU: 2
EIP is at add_preempt_count+0x98/0x132
EAX: 00000003 EBX: c0322c98 ECX: ab8d8863 EDX: f78dc000
ESI: 00000001 EDI: c0322ce4 EBP: f78dcdfc ESP: f78dcde0
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 preempt:00010004
CR0: 8005003b CR2: aa7590d8 CR3: 2bcbd000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c02052f9>] show_trace_log_lvl+0x22/0x3f
 [<c0205b94>] show_trace+0x17/0x19
 [<c02024b0>] show_regs+0x21/0x24
 [<c021ae2c>] irq_show_regs_callback+0x62/0x72
 [<c04722a8>] nmi_watchdog_tick+0xc2/0x20d
 [<c0471d30>] do_nmi+0xde/0x2ac
 [<c04718cb>] nmi_stack_correct+0x26/0x2b
 [<c0322ce4>] delay_tsc+0x15/0x43
 [<c0322c98>] __delay+0xe/0x10
 [<c0471589>] _raw_spin_lock+0x82/0xe7
 [<c04708f7>] __spin_lock+0x59/0x67
 [<c0224c31>] double_lock_balance+0x54/0x5c
 [<c02251ad>] push_rt_task+0x95/0x1e8
 [<c0225312>] push_rt_tasks+0x12/0x19
 [<c0225339>] post_schedule_rt+0x20/0x30
 [<c0229810>] finish_task_switch+0x6e/0xb7
 [<c046edd5>] __schedule+0x7c3/0x84b
 [<c046eff8>] schedule+0xea/0x109
 [<c0235587>] ksoftirqd+0xbf/0x26b
 [<c0242bbb>] kthread+0x40/0x69
 [<c0204f13>] kernel_thread_helper+0x7/0x10
 =======================
---------------------------
| preempt count: 00010004 ]
| 4-level deep critical section nesting:
----------------------------------------
.. [<c046e641>] .... __schedule+0x2f/0x84b
.....[<c046eff8>] ..   ( <= schedule+0xea/0x109)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c0225332>] ..   ( <= post_schedule_rt+0x19/0x30)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c0224c31>] ..   ( <= double_lock_balance+0x54/0x5c)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c021adf5>] ..   ( <= irq_show_regs_callback+0x2b/0x72)

NMI show regs on CPU#3:
apic_timer_irqs: 30619233

Pid: 45, comm: softirq-timer/3 Tainted: G      D  (2.6.24.3-29.el5rtdebug #1)
EIP: 0060:[<c0322c92>] EFLAGS: 00000046 CPU: 3
EIP is at __delay+0x8/0x10
EAX: 00000001 EBX: d28ced80 ECX: b0208e5d EDX: 000093cd
ESI: 1c537f1a EDI: 00000000 EBP: f7908cc0 ESP: f7908cc0
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 preempt:00010005
CR0: 8005003b CR2: ffffffd0 CR3: 2bcbd000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c02052f9>] show_trace_log_lvl+0x22/0x3f
 [<c0205b94>] show_trace+0x17/0x19
 [<c02024b0>] show_regs+0x21/0x24
 [<c021ae2c>] irq_show_regs_callback+0x62/0x72
 [<c04722a8>] nmi_watchdog_tick+0xc2/0x20d
 [<c0471d30>] do_nmi+0xde/0x2ac
 [<c04718cb>] nmi_stack_correct+0x26/0x2b
 [<c0471589>] _raw_spin_lock+0x82/0xe7
 [<c04708f7>] __spin_lock+0x59/0x67
 [<c046e739>] __schedule+0x127/0x84b
 [<c0233696>] do_exit+0x70f/0x768
 [<c020577e>] die+0x1f6/0x1fe
 [<c047356f>] do_page_fault+0x74e/0x834
 [<c0471822>] error_code+0x72/0x78
 [<c046eff8>] schedule+0xea/0x109
 [<c0235587>] ksoftirqd+0xbf/0x26b
 [<c0242bbb>] kthread+0x40/0x69
 [<c0204f13>] kernel_thread_helper+0x7/0x10
 =======================
---------------------------
| preempt count: 00010005 ]
| 5-level deep critical section nesting:
----------------------------------------
.. [<c046e641>] .... __schedule+0x2f/0x84b
.....[<c046eff8>] ..   ( <= schedule+0xea/0x109)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c046e739>] ..   ( <= __schedule+0x127/0x84b)
.. [<c046e641>] .... __schedule+0x2f/0x84b
.....[<c0233696>] ..   ( <= do_exit+0x70f/0x768)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c046e739>] ..   ( <= __schedule+0x127/0x84b)
.. [<c04708b7>] .... __spin_lock+0x19/0x67
.....[<c021adf5>] ..   ( <= irq_show_regs_callback+0x2b/0x72)
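
The `preempt:` values in these dumps can be split into their fields to cross-check the "N-level deep critical section nesting" lines. This is a minimal sketch, assuming the usual preempt_count layout for kernels of this era (bits 0-7 preemption depth, bits 8-15 softirq count, 12 hardirq bits starting at bit 16). That layout is not stated anywhere in this report, and `decode_preempt_count` is a made-up helper name.

```python
# Assumed layout (not confirmed by this report):
#   bits 0-7   preemption nesting depth
#   bits 8-15  softirq count
#   bits 16-27 hardirq count
PREEMPT_MASK = 0xFF
SOFTIRQ_SHIFT = 8
SOFTIRQ_MASK = 0xFF
HARDIRQ_SHIFT = 16
HARDIRQ_MASK = 0xFFF


def decode_preempt_count(text):
    """Split a hex preempt count string into its assumed fields."""
    value = int(text, 16)
    return {
        "preempt_depth": value & PREEMPT_MASK,
        "softirq": (value >> SOFTIRQ_SHIFT) & SOFTIRQ_MASK,
        "hardirq": (value >> HARDIRQ_SHIFT) & HARDIRQ_MASK,
    }


# Values taken from the debug-kernel dumps above.
for cpu, raw in [(0, "00010005"), (2, "00010004"), (3, "00010005")]:
    print(f"CPU#{cpu}: {decode_preempt_count(raw)}")
```

Under that assumption the low byte matches the nesting depth the debug kernel prints (5, 4 and 5), and the single hardirq count would be consistent with the dump having been taken from NMI context.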



Comment 6 Chris Lumens 2008-04-18 13:58:34 UTC
*** Bug 442828 has been marked as a duplicate of this bug. ***

Comment 7 austin 2008-04-21 05:39:17 UTC
Sorry, but is #442828 (F9 beta installation on the Santa Rosa platform failed)
related to this issue?

Comment 8 Clark Williams 2008-04-23 20:20:19 UTC
No, that looks like a mistake. This is an MRG Realtime bug, so I don't see any
way that F9 could intersect with it.

Clark


Comment 9 Peter Zijlstra 2008-05-01 14:52:37 UTC
Potential bug in the highmem handling: we implemented kmap_atomic using kmap,
and kmap uses flush_tlb_range(), which uses on_each_cpu(), which can deadlock
when called with IRQs disabled.

However, on -rt it should not be called from such a context, nor do the above NMI
traces suggest it is, so this is likely a red herring. It is still worth making a
note of, hence this message.
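
One way to cross-check the "IRQs disabled" question against the first set of dumps is to look at the interrupt-enable flag (IF, bit 9 of the x86 EFLAGS register) in each register dump. A rough sketch follows; the EFLAGS values are copied from the Description, the path labels are my own reading of the stack traces, and `irqs_enabled` is a hypothetical helper name.

```python
EFLAGS_IF = 1 << 9  # x86 interrupt-enable flag (0x200)

# (cpu, EFLAGS from the dump, call path seen in that CPU's stack trace)
DUMPS = [
    (0, "00000286", "kmap via handle_mm_fault"),
    (1, "00000293", "kmap via pipe_write"),
    (3, "00000082", "__schedule, no kmap in the trace"),
]


def irqs_enabled(eflags_hex):
    """Return True if the IF bit is set in the given EFLAGS value."""
    return bool(int(eflags_hex, 16) & EFLAGS_IF)


for cpu, eflags, path in DUMPS:
    state = "enabled" if irqs_enabled(eflags) else "disabled"
    print(f"CPU#{cpu}: IRQs {state} ({path})")
```

Both CPUs caught in the kmap path show IF set, which fits the reading that on_each_cpu() was not called with IRQs disabled there. EFLAGS is only a snapshot taken at the moment of the NMI, so this is a hint rather than proof.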

