Bug 610463

Summary: rhel6 kvm_intel failed to boot smp guest
Product: Red Hat Enterprise Linux 6
Reporter: Qian Cai <qcai>
Component: kernel
Assignee: Gleb Natapov <gleb>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Docs Contact:
Priority: high
Version: 6.0
CC: clalance, jburke, knoel, syeghiay, tburke
Target Milestone: rc
Keywords: Regression
Target Release: ---
Flags: gleb: needinfo?
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-07-04 08:23:18 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
guest xml   | none
full dmesg  | none

Description Qian Cai 2010-07-02 07:32:41 UTC
Created attachment 428743
guest xml

Description of problem:
mce: CPU supports 10 MCE banks
Performance Events: unsupported p6 CPU model 2 no PMU driver, software events only.
alternatives: switching to unfair spinlock
ACPI: Core revision 20090903
ftrace: converting mcount calls to 0f 1f 44 00 00
ftrace: allocating 20453 entries in 81 pages
Setting APIC routing to flat
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel QEMU Virtual CPU version 0.12.1 stepping 03
Booting Node   0, Processors  #1 Ok.
kvm-clock: cpu 1, msr 0:123167c1, secondary cpu clock
Brought up 2 CPUs
Total of 2 processors activated (9576.02 BogoMIPS).
Testing NMI watchdog ... 
WARNING: CPU#0: NMI appears to be stuck (0->0)!
Please report this to bugzilla.kernel.org,
and attach the output of the 'dmesg' command.

WARNING: CPU#1: NMI appears to be stuck (0->0)!
Please report this to bugzilla.kernel.org,
and attach the output of the 'dmesg' command.
devtmpfs: initialized

Version-Release number of selected component (if applicable):
Both guest and host (using kernel-debug):
qemu-kvm-0.12.1.2-2.90.el6.x86_64
kernel-2.6.32-42.el6

How reproducible:
always
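
The stuck-NMI warnings in the boot log above can be picked out of a saved dmesg with a short script. This is only a sketch: the function name `stuck_nmi_cpus` and the `SAMPLE` text are illustrative, and the pattern is based solely on the warning lines quoted in this report. Note that the warning is a symptom to look for, not necessarily the cause of the failure.

```python
import re
from typing import List

# Matches lines such as "WARNING: CPU#0: NMI appears to be stuck (0->0)!".
# search() is used because some kernels print other text before the warning
# on the same line.
STUCK_NMI = re.compile(
    r"WARNING: CPU#(\d+): NMI appears to be stuck \((\d+)->(\d+)\)!"
)


def stuck_nmi_cpus(dmesg_text: str) -> List[int]:
    """Return the sorted CPU numbers that logged a stuck-NMI warning."""
    cpus = set()
    for line in dmesg_text.splitlines():
        match = STUCK_NMI.search(line)
        if match:
            cpus.add(int(match.group(1)))
    return sorted(cpus)


# Excerpt copied from the description of this bug.
SAMPLE = """\
Brought up 2 CPUs
Total of 2 processors activated (9576.02 BogoMIPS).
Testing NMI watchdog ...
WARNING: CPU#0: NMI appears to be stuck (0->0)!
Please report this to bugzilla.kernel.org,
and attach the output of the 'dmesg' command.

WARNING: CPU#1: NMI appears to be stuck (0->0)!
devtmpfs: initialized
"""

if __name__ == "__main__":
    print(stuck_nmi_cpus(SAMPLE))
```

In practice the text would come from the guest console log or the attached full dmesg rather than from an inline string.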

Comment 1 Qian Cai 2010-07-02 07:41:14 UTC
Adding nmi_watchdog=0 did not solve the problem.

Comment 2 Qian Cai 2010-07-02 07:43:28 UTC
Created attachment 428746
full dmesg

Comment 3 Qian Cai 2010-07-02 07:50:31 UTC
This is a host problem.

Comment 4 Qian Cai 2010-07-02 08:44:26 UTC
The non-debug kernel has the same problem. The host is a T400 laptop, and both host and guest are using the 20100701 tree, if that is any help.

Comment 5 Qian Cai 2010-07-02 10:05:03 UTC
A RHEL5 SMP guest also failed:

CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
QEMU Virtual CPU version 0.12.1 stepping 03
kvm-clock: cpu 1, msr 0:957da81, secondary cpu clock
Brought up 2 CPUs
testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!
time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.
time.c: Detected 2393.860 MHz processor.

Comment 6 Qian Cai 2010-07-02 10:06:26 UTC
A RHEL6 i386 SMP guest also failed.

Comment 9 Qian Cai 2010-07-02 13:23:36 UTC
In summary, something in the KVM subsystem appears to have gone wrong. I suspect the fix for bug 596223 caused the regression. Here is the test matrix:

                         | host-42.el6 | host-37.el6 |
------------------------------------------------------
guest-42.el6.x86_64-SMP  |    FAIL     |    PASS     |
guest-42.el6.x86_64-UP   |    PASS     |    PASS     |
guest-42.el6.i386-SMP    |    FAIL     |    PASS     |
guest-37.el6.x86_64-SMP  |    FAIL     |    PASS     |
guest-rhel5.5.x86_64-SMP |    FAIL     |    PASS     |
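
The matrix above can also be read programmatically. The following is a rough sketch that transcribes the results into a dictionary and checks which host kernel and which guest configurations are involved in the failures; the names `RESULTS`, `failing_hosts`, and `guests_with` are made up for illustration.

```python
from typing import Dict, List, Tuple

# (guest, host) -> result, transcribed from the test matrix in this comment.
RESULTS: Dict[Tuple[str, str], str] = {
    ("guest-42.el6.x86_64-SMP", "host-42.el6"): "FAIL",
    ("guest-42.el6.x86_64-SMP", "host-37.el6"): "PASS",
    ("guest-42.el6.x86_64-UP", "host-42.el6"): "PASS",
    ("guest-42.el6.x86_64-UP", "host-37.el6"): "PASS",
    ("guest-42.el6.i386-SMP", "host-42.el6"): "FAIL",
    ("guest-42.el6.i386-SMP", "host-37.el6"): "PASS",
    ("guest-37.el6.x86_64-SMP", "host-42.el6"): "FAIL",
    ("guest-37.el6.x86_64-SMP", "host-37.el6"): "PASS",
    ("guest-rhel5.5.x86_64-SMP", "host-42.el6"): "FAIL",
    ("guest-rhel5.5.x86_64-SMP", "host-37.el6"): "PASS",
}


def failing_hosts(results: Dict[Tuple[str, str], str]) -> List[str]:
    """Hosts on which at least one guest failed."""
    return sorted({host for (_, host), r in results.items() if r == "FAIL"})


def guests_with(results: Dict[Tuple[str, str], str], host: str, outcome: str) -> List[str]:
    """Guests that had the given outcome on the given host."""
    return sorted(g for (g, h), r in results.items() if h == host and r == outcome)


if __name__ == "__main__":
    print(failing_hosts(RESULTS))                       # only the -42 host fails
    print(guests_with(RESULTS, "host-42.el6", "PASS"))  # only the UP guest passes
    # Every failing guest on the -42 host is an SMP guest, whatever its kernel.
    print(all(g.endswith("-SMP") for g in guests_with(RESULTS, "host-42.el6", "FAIL")))
```

Read this way, the failures follow the host kernel and the SMP configuration rather than the guest kernel version or architecture.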

Comment 10 Chris Lalancette 2010-07-02 14:57:47 UTC
Hello Cai,
     While it is certainly possible that the patches for 596223 are at fault, there are a lot of patches that went in between -37 and -42.  Could you try the intervening kernels (-38, -39, -40, and -41) to narrow down where the problem started happening?
     Also, while this is certainly a problem on the T400, this problem doesn't seem to happen across the board in my testing.  Could you repeat your tests on another Intel box to see if you see the problem elsewhere?
     Unfortunately I'm going to be off until Wednesday of next week, so in the interim I will try to get somebody else to take a look at it.  If we haven't figured it out by the time I get back on Wednesday, I will pick it back up.

Thanks,
Chris Lalancette

Comment 11 Qian Cai 2010-07-03 01:55:17 UTC
>      While it is certainly possible that the patches for 596223 are at fault,
> there are a lot of patches that went in between -37 and -42.  Could you try the
> intervening kernels (-38, -39, -40, and -41) to narrow down where the problem
> started happening?
-38 had the problem.

>      Also, while this is certainly a problem on the T400, this problem doesn't
> seem to happen across the board in my testing.  Could you repeat your tests on
> another Intel box to see if you see the problem elsewhere?
Correction: I saw it on an X200 laptop, not a T400. I'll try the T400 next Monday.

Comment 12 Qian Cai 2010-07-03 17:08:25 UTC
Sorry Chris, it is not your patches' fault. I have narrowed it down: after pulling the following 2 patches out of the -38.el6 kernel, everything works fine again.

- [virt] account only for IRQ injected into BSP (Gleb Natapov) [601564]
- [virt] KVM: read apic->irr with ioapic lock held (Marcelo Tosatti) [579970]

I am compiling another kernel to find out which one is at fault.

Comment 13 Qian Cai 2010-07-03 17:54:16 UTC
OK, it is working again after pulling out this single patch:

- [virt] account only for IRQ injected into BSP (Gleb Natapov) [601564]
virt-account-only-for-IRQ-injected-into-BSP.patch
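
The narrowing done in the last two comments (pull both suspect patches, then pull one at a time) can be sketched as a small search. This is an illustration only: `find_culprit` is a made-up helper, and `simulated_boots_ok` is a stand-in for the manual step of rebuilding the kernel without the given patches and booting an SMP guest; it simply encodes the outcome reported here.

```python
from typing import Callable, FrozenSet, List, Optional

# The two suspect patches from the -38.el6 changelog.
SUSPECTS: List[str] = [
    "account only for IRQ injected into BSP",
    "KVM: read apic->irr with ioapic lock held",
]


def find_culprit(
    suspects: List[str],
    boots_ok: Callable[[FrozenSet[str]], bool],
) -> Optional[str]:
    """Return the one patch whose removal alone fixes the boot, else None."""
    if boots_ok(frozenset()):
        return None  # the failure does not reproduce with nothing reverted
    if not boots_ok(frozenset(suspects)):
        return None  # reverting every suspect does not help; look elsewhere
    for patch in suspects:
        if boots_ok(frozenset([patch])):
            return patch
    return None  # no single patch is responsible on its own


def simulated_boots_ok(reverted: FrozenSet[str]) -> bool:
    # Hypothetical stand-in for "build without these patches and boot an SMP
    # guest". It mirrors the result reported in this comment.
    return "account only for IRQ injected into BSP" in reverted


if __name__ == "__main__":
    print(find_culprit(SUSPECTS, simulated_boots_ok))
```

With only two suspects a linear pass is enough; for a longer patch list the same `boots_ok` callback could drive a bisection instead.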

Comment 14 Dor Laor 2010-07-04 08:15:13 UTC
So, should this bug be verified/closed?

Comment 15 Gleb Natapov 2010-07-04 08:23:18 UTC

*** This bug has been marked as a duplicate of bug 609082 ***