453677 – nmi_watchdog=2 crashes the RT kernel on boot up

Bug 453677 - nmi_watchdog=2 crashes the RT kernel on boot up

Summary: nmi_watchdog=2 crashes the RT kernel on boot up

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	realtime-kernel
Sub Component:
Version:	1.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	1.0.1
Target Release:	---
Assignee:	Clark Williams
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-07-01 20:37 UTC by Steven Rostedt
Modified:	2008-08-26 19:57 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-08-26 19:57:18 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
boot log of dual Opteron running GA (2.6.24.7-65.el5rt) kernel (8.84 KB, text/plain) 2008-07-02 17:14 UTC, Clark Williams	no flags	Details
patch to allocate exception stacks (1.64 KB, patch) 2008-07-23 16:53 UTC, Clark Williams	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2008:0585	0	normal	SHIPPED_LIVE	Important: kernel security and bug fix update	2008-08-26 19:56:57 UTC

Description Steven Rostedt 2008-07-01 20:37:03 UTC

Description of problem:
Booting up the RT kernel (tried -65 and -69) with the kernel command line option
 nmi_watchdog=2 causes the following to happen:

SMP alternatives: switching to UP code
ACPI: Core revision 20070126
Unable to handle kernel paging request at fffffffffffffff8 RIP: 
 [<ffffffff8113c429>] __delay+0x2f/0xb6
PGD 203067 PUD 204067 PMD 0 
Oops: 0002 [1] PREEMPT SMP 
CPU 0 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.24.7-69.el5rt #1
RIP: 0010:[<ffffffff8113c429>]  [<ffffffff8113c429>] __delay+0x2f/0xb6
RSP: 0000:ffff81003fa13d90  EFLAGS: 00010246
RAX: 00000000173c2d40 RBX: 00000000002aa54d RCX: 0000000000000000
RDX: 0000000000000011 RSI: 0000000000000046 RDI: 00000000002aa54d
RBP: ffff81003fa13db0 R08: 0000000000000000 R09: 000000000000000f
R10: 000000000000000f R11: 0000000000000000 R12: 0000000000167b81
R13: 0000000000000000 R14: 000000001725b12c R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff813f3100(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff81003fa12000, task ffff81003fa11480)
Stack:  0000000000000082 0000000000000008 00000000fffb6e5f 00000000ffffffff
 ffff81003fa13dc0 ffffffff8113c4f0 ffff81003fa13df0 ffffffff814b3da8
 ffff81003fa13df0 0000000000000002 0000000000000000 000000000000d800
Call Trace:
 [<ffffffff8113c4f0>] __const_udelay+0x40/0x42
 [<ffffffff814b3da8>] timer_irq_works+0x2a/0x4e
 [<ffffffff814b41ac>] setup_IO_APIC+0x3e0/0x8ae
 [<ffffffff8101b620>] ? setup_p4_watchdog+0xa8/0xde
 [<ffffffff8101b2ec>] ? lapic_watchdog_init+0xe3/0x117
 [<ffffffff814b2bc8>] smp_prepare_cpus+0x454/0x48e
 [<ffffffff814a856b>] kernel_init+0x69/0x350
 [<ffffffff810368a0>] ? finish_task_switch+0x32/0xdc
 [<ffffffff8100d048>] child_rip+0xa/0x12
 [<ffffffff8116b538>] ? acpi_ds_init_one_object+0x0/0x88
 [<ffffffff814a8502>] ? kernel_init+0x0/0x350
 [<ffffffff8100d03e>] ? child_rip+0x0/0x12

Code: 04 25 10 00 00 00 48 89 e5 41 56 41 55 41 54 53 48 89 fb ff 80 44 e0 ff ff
65 8b 0c 25 24 00 00 00 0f 31 41 89 c6 41 89 cd 0f 31 <41> 89 c4 45 29 f4 49 39
dc 73 4c 65 48 8b 0
4 25 10 00 00 00 ff 

This is very weird since, 1) it thinks my box is UP and not SMP, and 2) it fails
 to do the alternatives. I highly suspect that the "disable NMI default" patch
has something to do with this.

Comment 1 Clark Williams 2008-07-01 22:28:37 UTC

we dropped that patch back in the 2.6.24-31 kernel. I'm thinking its something
in the IOAPIC munging we do. Looking at that now and adding jcm to the CC...

Comment 2 Clark Williams 2008-07-02 17:14:42 UTC

Created attachment 310832 [details]
boot log of dual Opteron running GA (2.6.24.7-65.el5rt) kernel

Note that this system does not claim that the box is UP (like rostedt's), so
it's not that the kernel is mis-understanding the number of cores or type of
system

Comment 3 Clark Williams 2008-07-02 17:19:11 UTC

Ok, I'm wrong; I missed the "SMP alternatives: switching to UP code" in the log

Comment 4 Clark Williams 2008-07-04 03:43:14 UTC

bisected this down to two patches:

   hotplug-page_alloc.patch
   hotplug-smp-bootstrap.patch

The panic originally reported is actually caused by hotplug-smp-bootstrap.patch;
the other patch by itself causes a dual Opteron system to lock-up booting the
second processor (which is later than the panic site above).

Still working on what nmi_watchdog=2 does to cause problems with the code
modified by hotplug-smp-bootstrap.patch.

Comment 5 Clark Williams 2008-07-07 15:47:18 UTC

works fine on 32-bit kernel (2.6.24.7-70)

Comment 6 Clark Williams 2008-07-08 22:30:51 UTC

Very odd. Both stack traces (Steven's original) and mine, point to the function
__delay() in arch/x86/lib/delay_64.c at the following line of code:

                if (unlikely(cpu != smp_processor_id())) {

If I decoded the macro expansion correctly for a 64-bit SMP kernel,
smp_processor_id() expands to raw_smp_processor_id() which turns into
read_pda(cpunumber). If it is in fact this asm function that is failing, my
guess is that something has hosed the per-cpu data. Think I'll go poke at that a
bit...

Comment 7 Dave Anderson 2008-07-15 19:48:39 UTC

FYI, here's what I saw on a 2.6.24.7-72.el5rt kernel booting with
nmi_watchdog=2, where it died in the same location as yours in
__delay+0x2f, although the actual RIP is slightly different in my
kernel than yours:

SMP alternatives: switching to UP code
ACPI: Core revision 20070126
Unable to handle kernel paging request at fffffffffffffff8 RIP:
 [<ffffffff8113c349>] __delay+0x2f/0xb6
PGD 203067 PUD 204067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.24.7-72.el5rt #1
RIP: 0010:[<ffffffff8113c349>]  [<ffffffff8113c349>] __delay+0x2f/0xb6
RSP: 0000:ffff81003fa13d90  EFLAGS: 00010246
RAX: 00000000714d6290 RBX: 00000000002aa61d RCX: 0000000000000000
RDX: 0000000000000020 RSI: 0000000000000046 RDI: 00000000002aa61d
RBP: ffff81003fa13db0 R08: 0000000000000000 R09: 000000000000000f
R10: 000000000000000f R11: 0000000000000000 R12: 0000000000176dda
R13: 0000000000000000 R14: 000000007135f41c R15: 00000000ffffffff
FS:  0000000000000000(0000) GS:ffffffff813ef100(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff81003fa12000, task ffff81003fa11480)
Stack:  0000000000000082 0000000000000008 00000000fffb6e69 00000000ffffffff
 ffff81003fa13dc0 ffffffff8113c410 ffff81003fa13df0 ffffffff814afd09
 ffff81003fa13df0 0000000000000002 0000000000000000 000000000000d800
Call Trace:
 [<ffffffff8113c410>] __const_udelay+0x40/0x42
 [<ffffffff814afd09>] timer_irq_works+0x2a/0x4f
 [<ffffffff814b010e>] setup_IO_APIC+0x3e0/0x8ae
 [<ffffffff8101b628>] ? setup_p4_watchdog+0xa8/0xde
 [<ffffffff8101b2f4>] ? lapic_watchdog_init+0xe3/0x117
 [<ffffffff814aeb2a>] smp_prepare_cpus+0x454/0x48e
 [<ffffffff814a4565>] kernel_init+0x69/0x34e
 [<ffffffff81036730>] ? finish_task_switch+0x32/0xdc
 [<ffffffff8100d048>] child_rip+0xa/0x12
 [<ffffffff81168db4>] ? acpi_ds_init_one_object+0x0/0x88
 [<ffffffff814a44fc>] ? kernel_init+0x0/0x34e
 [<ffffffff8100d03e>] ? child_rip+0x0/0x12

But, what I don't understand, is that the disassembly of __delay()
shows that the exception occurred on a register move (which makes
no sense unless the RIP itself got corrupted?), but __delay+0x2f
is beyond the smp_processor_id() that you reference:

crash> dis __delay -l
/usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c:
30
0xffffffff8113c31a <__delay>:   push   %rbp
include/asm/thread_info_64.h: 63
0xffffffff8113c31b <__delay+0x1>:       mov    %gs:0x10,%rax
/usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c:
30
0xffffffff8113c324 <__delay+0xa>:       mov    %rsp,%rbp
0xffffffff8113c327 <__delay+0xd>:       push   %r14
0xffffffff8113c329 <__delay+0xf>:       push   %r13
0xffffffff8113c32b <__delay+0x11>:      push   %r12
0xffffffff8113c32d <__delay+0x13>:      push   %rbx
0xffffffff8113c32e <__delay+0x14>:      mov    %rdi,%rbx
/usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c:
34
0xffffffff8113c331 <__delay+0x17>:      incl   0xffffffffffffe044(%rax)
/usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c:
35
0xffffffff8113c337 <__delay+0x1d>:      mov    %gs:0x24,%ecx
/usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c:
36
0xffffffff8113c33f <__delay+0x25>:      rdtsc  
0xffffffff8113c341 <__delay+0x27>:      mov    %eax,%r14d
0xffffffff8113c344 <__delay+0x2a>:      mov    %ecx,%r13d
/usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c:
38
0xffffffff8113c347 <__delay+0x2d>:      rdtsc  
/usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c:
39
0xffffffff8113c349 <__delay+0x2f>:      mov    %eax,%r12d
0xffffffff8113c34c <__delay+0x32>:      sub    %r14d,%r12d
0xffffffff8113c34f <__delay+0x35>:      cmp    %rbx,%r12
0xffffffff8113c352 <__delay+0x38>:      jae    0xffffffff8113c3a0 <__delay+0x86>

And here's __delay(), where it appears to be dying at
the instruction immediately following the second rdtcl():

     29 void __delay(unsigned long loops)
     30 {
     31         unsigned bclock, now;
     32         int cpu;
     33 
     34         preempt_disable();
     35         cpu = smp_processor_id();
     36         rdtscl(bclock);
     37         for (;;) {
     38                 rdtscl(now);
     39                 if ((now - bclock) >= loops)
     40                         break;
     
Out of curiousity, how did you pinpoint the exception in 
smp_processor_id()?  

Dave

Comment 8 Clark Williams 2008-07-15 20:52:44 UTC

I screwed up and looked at the wrong 'if' statement. I went back and caused
another panic, ran it through addr2line and got the same line number as you
(delay_64.c:39). I probably skipped down while I was looking at the function and
decided that function call was problematic, since it's hard to see where there's
an opportunity to go wrong in the first if statement. Sigh...

I'm not very familiar with x86_64 assembly; is there a memory reference in that
mov/sub/cmp block at line 39? They just look like straight register operations
to me. 

I'd say it's almost like an FP got zeroed (maybe something going wrong in a
return from interrupt?) and a register+offset local variable reference caused a
fault at fffffffffffffff8. I just don't see where that could happen from the
assembly.

Comment 9 Dave Anderson 2008-07-15 21:18:48 UTC

OK, good, so we're equally stumped.

From the appearance of things, it looks like the exception 
occurred during the "mov %eax,%r12d", which doesn't make sense.
I was also thinking maybe an interrupt occurred and things 
weren't restored correctly, but that's just a wild guess on
my part, and even then I don't know how it would lead to the
page fault seen here.  The two nearest memory references are the
preceding preempt_enable() and smp_processor_id() calls, well
before the exception address, but I still don't see how the
"0-minus-8" address (fffffffffffffff8) gets formulated?
The GS register does have the correct pointer at exception time:

crash> sym ffffffff813ef100
ffffffff813ef100 (D) boot_cpu_pda  
crash> 

So I have no explanation, nor have I ever seen such an anomoly
like this before.

Comment 10 Clark Williams 2008-07-15 21:29:28 UTC

Has to be something to do with the IO-APIC manipulations from the patches in
comment #4. I'll go back and look at the hotplug-smp-bootstrap.patch, where I
think this manfests. It's a rework of setting up exception stacks during kernel
boot.

Comment 11 Clark Williams 2008-07-23 16:53:51 UTC

Created attachment 312496 [details]
patch to allocate exception stacks

Comment 15 errata-xmlrpc 2008-08-26 19:57:18 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0585.html

Note You need to log in before you can comment on or make changes to this bug.