Hide Forgot
Description of problem: happened while working/starting lots of linux6 vms (qcow on block), server stopped responding and created a crash file once. KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.el6.x86_64/vmlinux DUMPFILE: /var/crash/127.0.0.1-2011-11-22-08:50:10/vmcore [PARTIAL DUMP] CPUS: 64 DATE: Tue Nov 22 08:48:56 2011 UPTIME: 00:23:41 LOAD AVERAGE: 8.34, 8.05, 5.75 TASKS: 2157 NODENAME: rhev-i32c-04.mpc.lab.eng.bos.redhat.com RELEASE: 2.6.32-220.el6.x86_64 VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011 MACHINE: x86_64 (1862 Mhz) MEMORY: 64 GB PANIC: "Oops: 0002 [#1] SMP " (check log for details) PID: 0 COMMAND: "swapper" TASK: ffff880875c83540 (1 of 64) [THREAD_INFO: ffff880c75cd0000] CPU: 35 STATE: TASK_RUNNING (PANIC) crash> bt PID: 0 TASK: ffff880875c83540 CPU: 35 COMMAND: "swapper" #0 [ffff880c8e603a50] machine_kexec at ffffffff81031fcb #1 [ffff880c8e603ab0] crash_kexec at ffffffff810b8f72 #2 [ffff880c8e603b80] oops_end at ffffffff814f04b0 #3 [ffff880c8e603bb0] no_context at ffffffff8104230b #4 [ffff880c8e603c00] __bad_area_nosemaphore at ffffffff81042595 #5 [ffff880c8e603c50] bad_area_nosemaphore at ffffffff81042663 #6 [ffff880c8e603c60] __do_page_fault at ffffffff81042d1d #7 [ffff880c8e603d80] do_page_fault at ffffffff814f248e #8 [ffff880c8e603db0] page_fault at ffffffff814ef845 [exception RIP: search_extable+77] RIP: ffffffff8104350d RSP: ffff880c8e603e68 RFLAGS: 00010002 RAX: 0000000000000215 RBX: ffffffff814f4de6 RCX: 0000000000000000 RDX: ffffffff814f4de6 RSI: ffffffff814fb640 RDI: ffffffff814f73a0 RBP: ffff880c8e603e68 R8: 0000000000000000 R9: 00000000000000dc R10: 0000000000000000 R11: 0000000000000001 R12: ffff880c8e603ee8 R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000023 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff880c8e603e70] search_exception_tables at ffffffff8108d64a #10 [ffff880c8e603e90] fixup_exception at ffffffff8104355d #11 [ffff880c8e603eb0] do_general_protection at ffffffff814effe8 #12 [ffff880c8e603ee0] general_protection at ffffffff814ef815 [exception RIP: smp_apic_timer_interrupt+86] RIP: ffffffff814f4de6 RSP: ffff880c8e603f98 RFLAGS: 00010083 RAX: 0000000000000023 RBX: ffff10191cc03fa8 RCX: 0000000000000016 RDX: ffff880c8e600000 RSI: 0000000000000023 RDI: 0000000000000083 RBP: ffff880c8e603fa8 R8: 0000000000000000 R9: 00000000000000dc R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: 000000000000011b R14: 0000000000000002 R15: 0000000000000023 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #13 [ffff880c8e603fb0] apic_timer_interrupt at ffffffff8100bc13 --- <IRQ stack> --- #14 [ffff880c75cd1db8] apic_timer_interrupt at ffffffff8100bc13 [exception RIP: intel_idle+222] RIP: ffffffff812c4a6e RSP: ffff880c75cd1e68 RFLAGS: 00000206 RAX: 0000000000000000 RBX: ffff880c75cd1ed8 RCX: 0000000000000000 RDX: 000000000000011b RSI: 0000000000000000 RDI: 00000000000453bb RBP: ffffffff8100bc0e R8: 0000000000000000 R9: 00000000000000dc R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff814f2545 R13: ffff880c75cd1e08 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #15 [ffff880c75cd1ee0] cpuidle_idle_call at ffffffff813f9f67 #16 [ffff880c75cd1f00] cpu_idle at ffffffff81009e06 cpuinfo:Intel(R) Xeon(R) CPU L7555 @ 1.87GHz, 64 processors Version-Release number of selected component (if applicable): 2.6.32-220.el6.x86_64 How reproducible: happened several times on this host Steps to Reproduce: 1.run 120 vms on that host 2. 3. Actual results: Expected results: Additional info:
crash> bt PID: 0 TASK: ffff880875c83540 CPU: 35 COMMAND: "swapper" #0 [ffff880c8e603a50] machine_kexec at ffffffff81031fcb #1 [ffff880c8e603ab0] crash_kexec at ffffffff810b8f72 #2 [ffff880c8e603b80] oops_end at ffffffff814f04b0 #3 [ffff880c8e603bb0] no_context at ffffffff8104230b #4 [ffff880c8e603c00] __bad_area_nosemaphore at ffffffff81042595 #5 [ffff880c8e603c50] bad_area_nosemaphore at ffffffff81042663 #6 [ffff880c8e603c60] __do_page_fault at ffffffff81042d1d #7 [ffff880c8e603d80] do_page_fault at ffffffff814f248e #8 [ffff880c8e603db0] page_fault at ffffffff814ef845 [exception RIP: search_extable+77] RIP: ffffffff8104350d RSP: ffff880c8e603e68 RFLAGS: 00010002 RAX: 0000000000000215 RBX: ffffffff814f4de6 RCX: 0000000000000000 RDX: ffffffff814f4de6 RSI: ffffffff814fb640 RDI: ffffffff814f73a0 RBP: ffff880c8e603e68 R8: 0000000000000000 R9: 00000000000000dc R10: 0000000000000000 R11: 0000000000000001 R12: ffff880c8e603ee8 R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000023 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff880c8e603e70] search_exception_tables at ffffffff8108d64a #10 [ffff880c8e603e90] fixup_exception at ffffffff8104355d #11 [ffff880c8e603eb0] do_general_protection at ffffffff814effe8 #12 [ffff880c8e603ee0] general_protection at ffffffff814ef815 [exception RIP: smp_apic_timer_interrupt+86] RIP: ffffffff814f4de6 RSP: ffff880c8e603f98 RFLAGS: 00010083 RAX: 0000000000000023 RBX: ffff10191cc03fa8 RCX: 0000000000000016 RDX: ffff880c8e600000 RSI: 0000000000000023 RDI: 0000000000000083 RBP: ffff880c8e603fa8 R8: 0000000000000000 R9: 00000000000000dc R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: 000000000000011b R14: 0000000000000002 R15: 0000000000000023 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #13 [ffff880c8e603fb0] apic_timer_interrupt at ffffffff8100bc13 --- <IRQ stack> --- #14 [ffff880c75cd1db8] apic_timer_interrupt at ffffffff8100bc13 [exception RIP: intel_idle+222] RIP: ffffffff812c4a6e RSP: ffff880c75cd1e68 RFLAGS: 00000206 RAX: 0000000000000000 RBX: ffff880c75cd1ed8 RCX: 0000000000000000 RDX: 000000000000011b RSI: 0000000000000000 RDI: 00000000000453bb RBP: ffffffff8100bc0e R8: 0000000000000000 R9: 00000000000000dc R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff814f2545 R13: ffff880c75cd1e08 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #15 [ffff880c75cd1ee0] cpuidle_idle_call at ffffffff813f9f67 #16 [ffff880c75cd1f00] cpu_idle at ffffffff81009e06 Looking at frames 12-13: (gdb) disassemble smp_apic_timer_interrupt Dump of assembler code for function smp_apic_timer_interrupt: 0xffffffff814f4d90 <+0>: push %rbp 0xffffffff814f4d91 <+1>: mov %rsp,%rbp 0xffffffff814f4d94 <+4>: push %r12 0xffffffff814f4d96 <+6>: push %rbx 0xffffffff814f4d97 <+7>: callq 0xffffffff8100adc0 0xffffffff814f4d9c <+12>: mov 0x709f7d(%rip),%rax # 0xffffffff81bfed20 0xffffffff814f4da3 <+19>: xor %esi,%esi 0xffffffff814f4da5 <+21>: mov $0xdb00,%rbx 0xffffffff814f4dac <+28>: mov %gs:0xc6d0,%r12 0xffffffff814f4db5 <+37>: mov %rdi,%gs:0xc6d0 0xffffffff814f4dbe <+46>: mov $0xb0,%edi 0xffffffff814f4dc3 <+51>: callq *0x138(%rax) 0xffffffff814f4dc9 <+57>: callq 0xffffffff81009cd0 <exit_idle> 0xffffffff814f4dce <+62>: callq 0xffffffff81071f50 <irq_enter> 0xffffffff814f4dd3 <+67>: mov %gs:0xdac0,%esi 0xffffffff814f4ddb <+75>: movslq %esi,%rax 0xffffffff814f4dde <+78>: add -0x7e4092e0(,%rax,8),%rbx 0xffffffff814f4de6 <+86>: cmpq $0x0,0x50(%rbx) 0xffffffff814f4deb <+91>: je 0xffffffff814f4e0e <smp_apic_timer_interrupt+126> 0xffffffff814f4ded <+93>: incl %gs:0x15e0c 0xffffffff814f4df5 <+101>: mov %rbx,%rdi 0xffffffff814f4df8 <+104>: callq *0x50(%rbx) 0xffffffff814f4dfb <+107>: callq 0xffffffff81071ec0 <irq_exit> 0xffffffff814f4e00 <+112>: pop %rbx 0xffffffff814f4e01 <+113>: mov %r12,%gs:0xc6d0 0xffffffff814f4e0a <+122>: pop %r12 0xffffffff814f4e0c <+124>: leaveq 0xffffffff814f4e0d <+125>: retq 0xffffffff814f4e0e <+126>: mov $0xffffffff81782970,%rdi 0xffffffff814f4e15 <+133>: xor %eax,%eax 0xffffffff814f4e17 <+135>: callq 0xffffffff814ec40c <printk> 0xffffffff814f4e1c <+140>: mov %rbx,%rsi 0xffffffff814f4e1f <+143>: mov $0x1,%edi 0xffffffff814f4e24 <+148>: callq 0xffffffff8102ad40 <lapic_timer_setup> 0xffffffff814f4e29 <+153>: jmp 0xffffffff814f4dfb <smp_apic_timer_interrupt+107> Crash is at 0xffffffff814f4de6; %rbx (ffff10191cc03fa8) is non-canonical so #GP. The code in question is: static void local_apic_timer_interrupt(void) { int cpu = smp_processor_id(); struct clock_event_device *evt = &per_cpu(lapic_events, cpu); /* * Normally we should not be here till LAPIC has been initialized but * in some cases like kdump, its possible that there is a pending LAPIC * timer interrupt from previous kernel's context and is delivered in * new kernel the moment interrupts are enabled. * * Interrupts are enabled early and LAPIC is setup much later, hence * its possible that when we get here evt->event_handler is NULL. * Check for event_handler being NULL and discard the interrupt as * spurious. */ if (!evt->event_handler) { pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu); /* Switch it off */ lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, evt); return; } /* * the NMI deadlock-detector uses this. */ inc_irq_stat(apic_timer_irqs); evt->event_handler(evt); } The crash occurs while loading evt->event_handler in order to compare it to 0; the ->event_handler field is at offset 0x50. %rax and %rsi are both 35, the cpu number, as expected. Reevaluating the load instruction that computed %rbx crash> x/2x 0-0x7e4092e0 + 0x23 * 8 0xffffffff81bf6e38: 0x8e600000 0xffff880c So %rbx should be 0xffff880c8e600000, but is 0xffff10191cc03fa8, not even close. I don't know if the value crash gives is from the core file or from vmlinux (it was a partial dump).
Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.