Bug 610463

Summary:

rhel6 kvm_intel failed to boot smp guest

Product:

Red Hat Enterprise Linux 6

Reporter:

Qian Cai <qcai>

Component:

kernel

Assignee:

Gleb Natapov <gleb>

Status:

CLOSED DUPLICATE

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

high

Docs Contact:

Priority:

high

Version:

6.0

CC:

clalance, jburke, knoel, syeghiay, tburke

Target Milestone:

Keywords:

Regression

Target Release:

---

Flags:

gleb: needinfo?

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2010-07-04 08:23:18 UTC

Attachments:

Description	Flags
guest xml	none
full dmesg	none

Description Qian Cai 2010-07-02 07:32:41 UTC

Created attachment 428743
guest xml

Description of problem:
mce: CPU supports 10 MCE banks
Performance Events: unsupported p6 CPU model 2 no PMU driver, software events only.
alternatives: switching to unfair spinlock
ACPI: Core revision 20090903
ftrace: converting mcount calls to 0f 1f 44 00 00
ftrace: allocating 20453 entries in 81 pages
Setting APIC routing to flat
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel QEMU Virtual CPU version 0.12.1 stepping 03
Booting Node   0, Processors  #1 Ok.
kvm-clock: cpu 1, msr 0:123167c1, secondary cpu clock
Brought up 2 CPUs
Total of 2 processors activated (9576.02 BogoMIPS).
Testing NMI watchdog ... 
WARNING: CPU#0: NMI appears to be stuck (0->0)!
Please report this to bugzilla.kernel.org,
and attach the output of the 'dmesg' command.

WARNING: CPU#1: NMI appears to be stuck (0->0)!
Please report this to bugzilla.kernel.org,
and attach the output of the 'dmesg' command.
devtmpfs: initialized

Version-Release number of selected component (if applicable):
Both guest and host (using kernel-debug):
qemu-kvm-0.12.1.2-2.90.el6.x86_64
kernel-2.6.32-42.el6

How reproducible:
always
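
The stuck-NMI warnings in the log above can be picked out mechanically. A minimal sketch, assuming the dmesg text is available as a string; the helper name `stuck_nmi_cpus` is hypothetical and only the warning format is taken from the log:

```python
import re

# Matches the warning line exactly as it appears in the log above.
# Using search-style matching also tolerates a console prefix such as "<4>".
STUCK_NMI = re.compile(
    r"WARNING: CPU#(\d+): NMI appears to be stuck \((\d+)->(\d+)\)!"
)

def stuck_nmi_cpus(dmesg_text):
    """Return the sorted CPU numbers that reported a stuck NMI (hypothetical helper)."""
    return sorted({int(m.group(1)) for m in STUCK_NMI.finditer(dmesg_text)})

# Excerpt copied from the description above.
sample = """Testing NMI watchdog ...
WARNING: CPU#0: NMI appears to be stuck (0->0)!
WARNING: CPU#1: NMI appears to be stuck (0->0)!
devtmpfs: initialized
"""

print(stuck_nmi_cpus(sample))
```

An empty list means the watchdog test did not report a stuck NMI on any CPU.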

Comment 1 Qian Cai 2010-07-02 07:41:14 UTC

Adding nmi_watchdog=0 did not solve the problem.

Comment 2 Qian Cai 2010-07-02 07:43:28 UTC

Created attachment 428746
full dmesg

Comment 3 Qian Cai 2010-07-02 07:50:31 UTC

This is a host problem.

Comment 4 Qian Cai 2010-07-02 08:44:26 UTC

The non-debug kernel has the same problem. The host is a T400 laptop, and both host and guest are using the 20100701 tree, if that is any help.

Comment 5 Qian Cai 2010-07-02 10:05:03 UTC

A RHEL5 SMP guest also failed:

CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
QEMU Virtual CPU version 0.12.1 stepping 03
kvm-clock: cpu 1, msr 0:957da81, secondary cpu clock
Brought up 2 CPUs
testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!
time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.
time.c: Detected 2393.860 MHz processor.

Comment 6 Qian Cai 2010-07-02 10:06:26 UTC

A RHEL6 i386 SMP guest also failed.

Comment 9 Qian Cai 2010-07-02 13:23:36 UTC

In summary, something in the kvm subsystem appears to have gone wrong. I suspect the fix for bug 596223 caused the regression here. Here is the test matrix:

                         | host-42.el6 | host-37.el6 |
------------------------------------------------------
guest-42.el6.x86_64-SMP  |    FAIL     |    PASS     |
guest-42.el6.x86_64-UP   |    PASS     |    PASS     |
guest-42.el6.i386-SMP    |    FAIL     |    PASS     |
guest-37.el6.x86_64-SMP  |    FAIL     |    PASS     |
guest-rhel5.5.x86_64-SMP |    FAIL     |    PASS     |
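
Read programmatically, the matrix points at the host kernel and at SMP guests. A minimal sketch that only transcribes the results above; the dictionary layout and function names are illustrative assumptions:

```python
# Results transcribed from the matrix above: (guest, host kernel) -> outcome.
RESULTS = {
    ("42.el6.x86_64-SMP", "42.el6"): "FAIL",
    ("42.el6.x86_64-SMP", "37.el6"): "PASS",
    ("42.el6.x86_64-UP", "42.el6"): "PASS",
    ("42.el6.x86_64-UP", "37.el6"): "PASS",
    ("42.el6.i386-SMP", "42.el6"): "FAIL",
    ("42.el6.i386-SMP", "37.el6"): "PASS",
    ("37.el6.x86_64-SMP", "42.el6"): "FAIL",
    ("37.el6.x86_64-SMP", "37.el6"): "PASS",
    ("rhel5.5.x86_64-SMP", "42.el6"): "FAIL",
    ("rhel5.5.x86_64-SMP", "37.el6"): "PASS",
}

def failing_hosts(results):
    """Host kernels on which at least one guest failed to boot."""
    return sorted({host for (_, host), outcome in results.items() if outcome == "FAIL"})

def passing_guests_on(results, host):
    """Guests that still boot on the given host kernel."""
    return sorted(g for (g, h), outcome in results.items() if h == host and outcome == "PASS")

print(failing_hosts(RESULTS))
print(passing_guests_on(RESULTS, "42.el6"))
```

Every failure is on the -42 host, and the only guest that still boots there is the UP one, regardless of guest kernel version or architecture.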

Comment 10 Chris Lalancette 2010-07-02 14:57:47 UTC

Hello Cai,
     While it is certainly possible that the patches for 596223 are responsible, there are a lot of patches that went in between -37 and -42.  Could you try the intervening kernels (-38, -39, -40, and -41) to narrow down where the problem started happening?
     Also, while this is certainly a problem on the T400, this problem doesn't seem to happen across the board in my testing.  Could you repeat your tests on another Intel box to see if you see the problem elsewhere?
     Unfortunately I'm going to be off until Wednesday of next week, so in the interim I will try to get somebody else to take a look at it.  If we haven't figured it out by the time I get back on Wednesday, I will pick it back up.

Thanks,
Chris Lalancette

Comment 11 Qian Cai 2010-07-03 01:55:17 UTC

>      While it is certainly possible that the patches for 596223, there are a
> lot of patches that went in between -37 and -42.  Could you try the intervening
> kernels (-38, -39, -40, and -41) to narrow down where the problem started
> happening?
-38 had the problem.

>      Also, while this is certainly a problem on the T400, this problem doesn't
> seem to happen across the board in my testing.  Could you repeat your tests on
> another Intel box to see if you see the problem elsewhere?
Correction: I saw it on an X200 laptop, not a T400. I'll try the T400 next Monday.

Comment 12 Qian Cai 2010-07-03 17:08:25 UTC

Sorry Chris, it is not your patches' fault. I have narrowed it down: after pulling the following 2 patches out of the -38.el6 kernel, everything works fine again.

- [virt] account only for IRQ injected into BSP (Gleb Natapov) [601564]
- [virt] KVM: read apic->irr with ioapic lock held (Marcelo Tosatti) [579970]

I am compiling another kernel to find out which one is at fault.

Comment 13 Qian Cai 2010-07-03 17:54:16 UTC

OK, it is working again after pulling out just this single patch:

- [virt] account only for IRQ injected into BSP (Gleb Natapov) [601564]
virt-account-only-for-IRQ-injected-into-BSP.patch
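
The elimination used in the last two comments (revert candidates one at a time and retest) can be sketched as below. This is only an illustration: `guest_boots` is a hypothetical stand-in for rebuilding the -38.el6 kernel without the given patches and booting the SMP guest, and it simply replays the outcome reported here.

```python
# Candidate patches named in the comments above.
CANDIDATES = [
    "account only for IRQ injected into BSP",
    "KVM: read apic->irr with ioapic lock held",
]

def guest_boots(reverted):
    """Hypothetical stand-in for: rebuild without `reverted`, then boot the SMP guest.

    It replays the result reported in this bug and is not a real test harness.
    """
    return "account only for IRQ injected into BSP" in reverted

def find_culprits(candidates, boots_ok):
    """Revert one patch at a time; keep those whose removal alone fixes the boot."""
    return [patch for patch in candidates if boots_ok({patch})]

print(find_culprits(CANDIDATES, guest_boots))
```

With two candidates this takes at most two rebuilds; for a longer list a bisection over the patch series would need fewer.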

Comment 14 Dor Laor 2010-07-04 08:15:13 UTC

So, should this bug be verified/closed?

Comment 15 Gleb Natapov 2010-07-04 08:23:18 UTC


*** This bug has been marked as a duplicate of bug 609082 ***