Description of problem: Booting up the RT kernel (tried -65 and -69) with the kernel command line option nmi_watchdog=2 causes the following to happen: SMP alternatives: switching to UP code ACPI: Core revision 20070126 Unable to handle kernel paging request at fffffffffffffff8 RIP: [<ffffffff8113c429>] __delay+0x2f/0xb6 PGD 203067 PUD 204067 PMD 0 Oops: 0002 [1] PREEMPT SMP CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.24.7-69.el5rt #1 RIP: 0010:[<ffffffff8113c429>] [<ffffffff8113c429>] __delay+0x2f/0xb6 RSP: 0000:ffff81003fa13d90 EFLAGS: 00010246 RAX: 00000000173c2d40 RBX: 00000000002aa54d RCX: 0000000000000000 RDX: 0000000000000011 RSI: 0000000000000046 RDI: 00000000002aa54d RBP: ffff81003fa13db0 R08: 0000000000000000 R09: 000000000000000f R10: 000000000000000f R11: 0000000000000000 R12: 0000000000167b81 R13: 0000000000000000 R14: 000000001725b12c R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffffffff813f3100(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: fffffffffffffff8 CR3: 0000000000201000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 1, threadinfo ffff81003fa12000, task ffff81003fa11480) Stack: 0000000000000082 0000000000000008 00000000fffb6e5f 00000000ffffffff ffff81003fa13dc0 ffffffff8113c4f0 ffff81003fa13df0 ffffffff814b3da8 ffff81003fa13df0 0000000000000002 0000000000000000 000000000000d800 Call Trace: [<ffffffff8113c4f0>] __const_udelay+0x40/0x42 [<ffffffff814b3da8>] timer_irq_works+0x2a/0x4e [<ffffffff814b41ac>] setup_IO_APIC+0x3e0/0x8ae [<ffffffff8101b620>] ? setup_p4_watchdog+0xa8/0xde [<ffffffff8101b2ec>] ? lapic_watchdog_init+0xe3/0x117 [<ffffffff814b2bc8>] smp_prepare_cpus+0x454/0x48e [<ffffffff814a856b>] kernel_init+0x69/0x350 [<ffffffff810368a0>] ? finish_task_switch+0x32/0xdc [<ffffffff8100d048>] child_rip+0xa/0x12 [<ffffffff8116b538>] ? acpi_ds_init_one_object+0x0/0x88 [<ffffffff814a8502>] ? kernel_init+0x0/0x350 [<ffffffff8100d03e>] ? child_rip+0x0/0x12 Code: 04 25 10 00 00 00 48 89 e5 41 56 41 55 41 54 53 48 89 fb ff 80 44 e0 ff ff 65 8b 0c 25 24 00 00 00 0f 31 41 89 c6 41 89 cd 0f 31 <41> 89 c4 45 29 f4 49 39 dc 73 4c 65 48 8b 0 4 25 10 00 00 00 ff This is very weird since, 1) it thinks my box is UP and not SMP, and 2) it fails to do the alternatives. I highly suspect that the "disable NMI default" patch has something to do with this.
we dropped that patch back in the 2.6.24-31 kernel. I'm thinking its something in the IOAPIC munging we do. Looking at that now and adding jcm to the CC...
Created attachment 310832 [details] boot log of dual Opteron running GA (2.6.24.7-65.el5rt) kernel Note that this system does not claim that the box is UP (like rostedt's), so it's not that the kernel is mis-understanding the number of cores or type of system
Ok, I'm wrong; I missed the "SMP alternatives: switching to UP code" in the log
bisected this down to two patches: hotplug-page_alloc.patch hotplug-smp-bootstrap.patch The panic originally reported is actually caused by hotplug-smp-bootstrap.patch; the other patch by itself causes a dual Opteron system to lock-up booting the second processor (which is later than the panic site above). Still working on what nmi_watchdog=2 does to cause problems with the code modified by hotplug-smp-bootstrap.patch.
works fine on 32-bit kernel (2.6.24.7-70)
Very odd. Both stack traces (Steven's original) and mine, point to the function __delay() in arch/x86/lib/delay_64.c at the following line of code: if (unlikely(cpu != smp_processor_id())) { If I decoded the macro expansion correctly for a 64-bit SMP kernel, smp_processor_id() expands to raw_smp_processor_id() which turns into read_pda(cpunumber). If it is in fact this asm function that is failing, my guess is that something has hosed the per-cpu data. Think I'll go poke at that a bit...
FYI, here's what I saw on a 2.6.24.7-72.el5rt kernel booting with nmi_watchdog=2, where it died in the same location as yours in __delay+0x2f, although the actual RIP is slightly different in my kernel than yours: SMP alternatives: switching to UP code ACPI: Core revision 20070126 Unable to handle kernel paging request at fffffffffffffff8 RIP: [<ffffffff8113c349>] __delay+0x2f/0xb6 PGD 203067 PUD 204067 PMD 0 Oops: 0002 [1] PREEMPT SMP CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.24.7-72.el5rt #1 RIP: 0010:[<ffffffff8113c349>] [<ffffffff8113c349>] __delay+0x2f/0xb6 RSP: 0000:ffff81003fa13d90 EFLAGS: 00010246 RAX: 00000000714d6290 RBX: 00000000002aa61d RCX: 0000000000000000 RDX: 0000000000000020 RSI: 0000000000000046 RDI: 00000000002aa61d RBP: ffff81003fa13db0 R08: 0000000000000000 R09: 000000000000000f R10: 000000000000000f R11: 0000000000000000 R12: 0000000000176dda R13: 0000000000000000 R14: 000000007135f41c R15: 00000000ffffffff FS: 0000000000000000(0000) GS:ffffffff813ef100(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: fffffffffffffff8 CR3: 0000000000201000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 1, threadinfo ffff81003fa12000, task ffff81003fa11480) Stack: 0000000000000082 0000000000000008 00000000fffb6e69 00000000ffffffff ffff81003fa13dc0 ffffffff8113c410 ffff81003fa13df0 ffffffff814afd09 ffff81003fa13df0 0000000000000002 0000000000000000 000000000000d800 Call Trace: [<ffffffff8113c410>] __const_udelay+0x40/0x42 [<ffffffff814afd09>] timer_irq_works+0x2a/0x4f [<ffffffff814b010e>] setup_IO_APIC+0x3e0/0x8ae [<ffffffff8101b628>] ? setup_p4_watchdog+0xa8/0xde [<ffffffff8101b2f4>] ? lapic_watchdog_init+0xe3/0x117 [<ffffffff814aeb2a>] smp_prepare_cpus+0x454/0x48e [<ffffffff814a4565>] kernel_init+0x69/0x34e [<ffffffff81036730>] ? finish_task_switch+0x32/0xdc [<ffffffff8100d048>] child_rip+0xa/0x12 [<ffffffff81168db4>] ? acpi_ds_init_one_object+0x0/0x88 [<ffffffff814a44fc>] ? kernel_init+0x0/0x34e [<ffffffff8100d03e>] ? child_rip+0x0/0x12 But, what I don't understand, is that the disassembly of __delay() shows that the exception occurred on a register move (which makes no sense unless the RIP itself got corrupted?), but __delay+0x2f is beyond the smp_processor_id() that you reference: crash> dis __delay -l /usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c: 30 0xffffffff8113c31a <__delay>: push %rbp include/asm/thread_info_64.h: 63 0xffffffff8113c31b <__delay+0x1>: mov %gs:0x10,%rax /usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c: 30 0xffffffff8113c324 <__delay+0xa>: mov %rsp,%rbp 0xffffffff8113c327 <__delay+0xd>: push %r14 0xffffffff8113c329 <__delay+0xf>: push %r13 0xffffffff8113c32b <__delay+0x11>: push %r12 0xffffffff8113c32d <__delay+0x13>: push %rbx 0xffffffff8113c32e <__delay+0x14>: mov %rdi,%rbx /usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c: 34 0xffffffff8113c331 <__delay+0x17>: incl 0xffffffffffffe044(%rax) /usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c: 35 0xffffffff8113c337 <__delay+0x1d>: mov %gs:0x24,%ecx /usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c: 36 0xffffffff8113c33f <__delay+0x25>: rdtsc 0xffffffff8113c341 <__delay+0x27>: mov %eax,%r14d 0xffffffff8113c344 <__delay+0x2a>: mov %ecx,%r13d /usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c: 38 0xffffffff8113c347 <__delay+0x2d>: rdtsc /usr/src/debug/kernel-rt-2.6.24.7-72.el5rt/linux-2.6.24.7.x86_64/arch/x86/lib/delay_64.c: 39 0xffffffff8113c349 <__delay+0x2f>: mov %eax,%r12d 0xffffffff8113c34c <__delay+0x32>: sub %r14d,%r12d 0xffffffff8113c34f <__delay+0x35>: cmp %rbx,%r12 0xffffffff8113c352 <__delay+0x38>: jae 0xffffffff8113c3a0 <__delay+0x86> And here's __delay(), where it appears to be dying at the instruction immediately following the second rdtcl(): 29 void __delay(unsigned long loops) 30 { 31 unsigned bclock, now; 32 int cpu; 33 34 preempt_disable(); 35 cpu = smp_processor_id(); 36 rdtscl(bclock); 37 for (;;) { 38 rdtscl(now); 39 if ((now - bclock) >= loops) 40 break; Out of curiousity, how did you pinpoint the exception in smp_processor_id()? Dave
I screwed up and looked at the wrong 'if' statement. I went back and caused another panic, ran it through addr2line and got the same line number as you (delay_64.c:39). I probably skipped down while I was looking at the function and decided that function call was problematic, since it's hard to see where there's an opportunity to go wrong in the first if statement. Sigh... I'm not very familiar with x86_64 assembly; is there a memory reference in that mov/sub/cmp block at line 39? They just look like straight register operations to me. I'd say it's almost like an FP got zeroed (maybe something going wrong in a return from interrupt?) and a register+offset local variable reference caused a fault at fffffffffffffff8. I just don't see where that could happen from the assembly.
OK, good, so we're equally stumped. From the appearance of things, it looks like the exception occurred during the "mov %eax,%r12d", which doesn't make sense. I was also thinking maybe an interrupt occurred and things weren't restored correctly, but that's just a wild guess on my part, and even then I don't know how it would lead to the page fault seen here. The two nearest memory references are the preceding preempt_enable() and smp_processor_id() calls, well before the exception address, but I still don't see how the "0-minus-8" address (fffffffffffffff8) gets formulated? The GS register does have the correct pointer at exception time: crash> sym ffffffff813ef100 ffffffff813ef100 (D) boot_cpu_pda crash> So I have no explanation, nor have I ever seen such an anomoly like this before.
Has to be something to do with the IO-APIC manipulations from the patches in comment #4. I'll go back and look at the hotplug-smp-bootstrap.patch, where I think this manfests. It's a rework of setting up exception stacks during kernel boot.
Created attachment 312496 [details] patch to allocate exception stacks
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0585.html