Bug 663912 - RHEL6 guest: Uhhuh. NMI received for unknown reason 00 on CPU 1.
Summary: RHEL6 guest: Uhhuh. NMI received for unknown reason 00 on CPU 1.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.5.z
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: rc
: ---
Assignee: Gleb Natapov
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: Rhel5KvmTier2
TreeView+ depends on / blocked
 
Reported: 2010-12-17 10:27 UTC by Amos Kong
Modified: 2015-05-25 00:06 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Currently enabling of nmi watch dog within the guest does not work. Please do not run with nmi_watchdog=1
Clone Of:
Environment:
Last Closed: 2011-01-13 19:13:23 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
guest-serial-log before enabling nmi_watchdog (29.25 KB, text/plain)
2010-12-17 10:30 UTC, Amos Kong
no flags Details
guest serial log after enabling nmi_watchdog (10.84 KB, text/plain)
2010-12-17 10:32 UTC, Amos Kong
no flags Details

Description Amos Kong 2010-12-17 10:27:22 UTC
Description of problem:
Enable nmi_watchdog of rhel6.0 guest, guest hung during boot up.

This bug can only be reproduced in 5.5.z host (kvm-83-164.el5_5.30 + 2.6.18-194.30.1.el5), & intel machine.

This bug could not reproduce on rhel5.6 host. (kvm-83-223.el5 + 2.6.18-236.el5)

Version-Release number of selected component (if applicable):
host kernel: 2.6.18-194.30.1.el5
guest kernel: 2.6.32-71.7.1.el6.x86_64
# rpm -qa |grep kvm
kmod-kvm-83-164.el5_5.30
kvm-debuginfo-83-164.el5_5.30
kvm-83-164.el5_5.30
kvm-qemu-img-83-164.el5_5.30
etherboot-zroms-kvm-5.4.4-13.el5
etherboot-roms-kvm-5.4.4-13.el5
kvm-tools-83-164.el5_5.30

How reproducible:
always

Steps to Reproduce:
1. boot rhel6 guest
2. enable nmi_watchdog in guest kernel cmdline
# grubby --update-kernel=/boot/vmlinuz-2.6.32-71.7.1.el6.x86_64 --args='nmi_watchdog=1'
3. reboot guest
  
Actual results:
guest hung.

Expected results:
guest can boot up successfully after enabling nmi_watchdog.

Additional info:
1. command line
# qemu-kvm -name vm1 -monitor unix:/tmp/monitor-humanmonitor1-20101217-132447-1fFn,server,nowait -serial unix:/tmp/serial-20101217-132447-1fFn,server,nowait -drive file=/home/autotest/client/tests/kvm/images/RHEL-Server-6.0-64-virtio.qcow2,index=0,if=virtio,media=disk,cache=none,snapshot=on,boot=on,format=qcow2 -net nic,vlan=0,model=virtio,macaddr=9a:38:a9:db:66:79 -net tap,vlan=0,ifname=t0-132447-1fFn,script=/home/autotest/client/tests/kvm/scripts/qemu-ifup-switch,downscript=no -m 512 -smp 2,cores=1,threads=1,sockets=2 -cpu qemu64,+sse2 -soundhw ac97 -vnc :1 -rtc-td-hack -M rhel5.5.0 -usbdevice tablet -no-kvm-pit-reinjection

2. host cpuinfo:
processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
stepping        : 10
cpu MHz         : 2992.133
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips        : 5984.24
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

Comment 1 Amos Kong 2010-12-17 10:30:40 UTC
Created attachment 469330 [details]
guest-serial-log before enabling nmi_watchdog

Comment 2 Amos Kong 2010-12-17 10:32:02 UTC
Created attachment 469332 [details]
guest serial log after enabling nmi_watchdog

2010-12-17 16:13:23: Setting APIC routing to physical flat
2010-12-17 16:13:23: ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
2010-12-17 16:13:23: CPU0: Intel QEMU Virtual CPU version 0.9.1 stepping 03
2010-12-17 16:13:23: APIC timer registered as dummy, due to nmi_watchdog=1!
2010-12-17 16:13:23: Booting Node   0, Processors  #1
2010-12-17 16:13:26: 
2010-12-17 16:13:26: mce: CPU supports 0 MCE banks
2010-12-17 16:13:26: kvm-clock: cpu 1, msr 0:1e36901, secondary cpu clock
2010-12-17 16:13:26: Brought up 2 CPUs
2010-12-17 16:13:26: Total of 2 processors activated (11968.52 BogoMIPS).
2010-12-17 16:13:26: Testing NMI watchdog ...
2010-12-17 16:13:26: WARNING: CPU#1: NMI appears to be stuck (1->1)!
2010-12-17 16:13:26: Please report this to bugzilla.kernel.org,
2010-12-17 16:13:26: and attach the output of the 'dmesg' command.
2010-12-17 16:13:26: OK.
2010-12-17 16:13:26: Uhhuh. NMI received for unknown reason 00 on CPU 1.
2010-12-17 16:13:26: Do you have a strange power saving mode enabled?
2010-12-17 16:13:26: Dazed and confused, but trying to continue
2010-12-17 16:13:27: Uhhuh. NMI received for unknown reason 00 on CPU 1.
2010-12-17 16:13:27: Do you have a strange power saving mode enabled?

Comment 4 Amos Kong 2010-12-17 10:58:45 UTC
This bug could not be reproduced with kvm-83-164.el5_5.29.

I've re-tested in amd machine with kvm-83-164.el5_5.30.
guest hung after enabling nmi_watchdog, the reproduce rate is very low, less than 1/10 (not sure)

2010-12-17 05:16:59: Performance Events: AMD PMU driver.
2010-12-17 05:16:59: ... version:                0
2010-12-17 05:16:59: ... bit width:              48
2010-12-17 05:16:59: ... generic registers:      4
2010-12-17 05:16:59: ... value mask:             0000ffffffffffff
2010-12-17 05:16:59: ... max period:             00007fffffffffff
2010-12-17 05:16:59: ... fixed-purpose events:   0
2010-12-17 05:16:59: ... event mask:             000000000000000f
2010-12-17 05:16:59: alternatives: switching to unfair spinlock
2010-12-17 05:16:59: SMP alternatives: switching to UP code
2010-12-17 05:17:00: ACPI: Core revision 20090903
2010-12-17 05:17:00: ftrace: converting mcount calls to 0f 1f 44 00 00
2010-12-17 05:17:00: ftrace: allocating 20282 entries in 80 pages
2010-12-17 05:17:00: Setting APIC routing to physical flat
2010-12-17 05:17:00: ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
2010-12-17 05:17:00: CPU0: AMD QEMU Virtual CPU version 0.9.1 stepping 03
(no other log produced........)

Comment 9 Gleb Natapov 2010-12-19 12:54:05 UTC
(In reply to comment #0)
> Description of problem:
> Enable nmi_watchdog of rhel6.0 guest, guest hung during boot up.
> 
> This bug can only be reproduced in 5.5.z host (kvm-83-164.el5_5.30 +
> 2.6.18-194.30.1.el5), & intel machine.
> 
> This bug could not reproduce on rhel5.6 host. (kvm-83-223.el5 + 2.6.18-236.el5)

The diff between kvm-83-164.el5_5.30 and kvm-83-223.el5 is minimal and does not include anything that may cause the difference in behaviour. Have you tested on exactly same hardware?

Comment 10 Gleb Natapov 2010-12-19 12:59:18 UTC
(In reply to comment #4)
> This bug could not be reproduced with kvm-83-164.el5_5.29.
> 
> I've re-tested in amd machine with kvm-83-164.el5_5.30.
> guest hung after enabling nmi_watchdog, the reproduce rate is very low, less
> than 1/10 (not sure)
If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi related. May be clock related. The only guest visible difference between 30 and 29 is:

Zachary Amsden (1):
      KVM: fix AMD initial TSC offset problems (additional fix)

Comment 11 Zachary Amsden 2010-12-20 01:21:02 UTC
(In reply to comment #10)
> (In reply to comment #4)
> > This bug could not be reproduced with kvm-83-164.el5_5.29.
> > 
> > I've re-tested in amd machine with kvm-83-164.el5_5.30.
> > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less
> > than 1/10 (not sure)
> If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi
> related. May be clock related. The only guest visible difference between 30 and
> 29 is:
> 
> Zachary Amsden (1):
>       KVM: fix AMD initial TSC offset problems (additional fix)

Well, this one's easy.

My fix touches only AMD specific code.  This bug is reported only on Intel hosts.

I'm guessing it will reproduce on 5_5.29, but since it's only 10% probability, it was missed.

Comment 12 Gleb Natapov 2010-12-20 05:09:30 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #4)
> > > This bug could not be reproduced with kvm-83-164.el5_5.29.
> > > 
> > > I've re-tested in amd machine with kvm-83-164.el5_5.30.
> > > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less
> > > than 1/10 (not sure)
> > If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi
> > related. May be clock related. The only guest visible difference between 30 and
> > 29 is:
> > 
> > Zachary Amsden (1):
> >       KVM: fix AMD initial TSC offset problems (additional fix)
> 
> Well, this one's easy.
> 
> My fix touches only AMD specific code.  This bug is reported only on Intel
> hosts.
Comment #4 is about AMD. But it looks like different issue than what reported in comment #0.

Comment 13 Amos Kong 2010-12-20 05:31:06 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > (In reply to comment #10)
> > > (In reply to comment #4)
> > > > This bug could not be reproduced with kvm-83-164.el5_5.29.
> > > > 
> > > > I've re-tested in amd machine with kvm-83-164.el5_5.30.
> > > > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less
> > > > than 1/10 (not sure)
> > > If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi
> > > related. May be clock related. The only guest visible difference between 30 and
> > > 29 is:
> > > 
> > > Zachary Amsden (1):
> > >       KVM: fix AMD initial TSC offset problems (additional fix)
> > 
> > Well, this one's easy.
> > 
> > My fix touches only AMD specific code.  This bug is reported only on Intel
> > hosts.
> Comment #4 is about AMD. But it looks like different issue than what reported
> in comment #0.

Hi gleb, 

Comment #4 looks like Bug 660561 - cpu stuck after reboot.

Comment 14 Amos Kong 2010-12-20 08:40:41 UTC
This bug was found on intel-8400, I reinstalled that machine, and tested more than 100 times with kvm-83-164.el5_5.30, bug was reproduced,  but reproduce rate is very low (3%).

I'm testing with kvm-83-164.el5_5.29 now.. (hasn't been reproduced until now)
I will report the final result tomorrow.

Comment 15 Amos Kong 2010-12-20 08:43:49 UTC
(In reply to comment #14)
> This bug was found on intel-8400, I reinstalled that machine, and tested more
> than 100 times with kvm-83-164.el5_5.30, bug was reproduced,  but reproduce
> rate is very low (3%).
> 
> I'm testing with kvm-83-164.el5_5.29 now.. (hasn't been reproduced until now)
> I will report the final result tomorrow.

Bug has been reproduced in kvm-83-164.el5_5.29 !!
so removing 'regression'

Comment 16 Gleb Natapov 2010-12-20 08:59:37 UTC
(In reply to comment #14)
> This bug was found on intel-8400, I reinstalled that machine, and tested more
> than 100 times with kvm-83-164.el5_5.30, bug was reproduced,  but reproduce
> rate is very low (3%).
At this point I do not know what bug are you talking about. The one from comment #4 or the one from comment #0. This is what happens when you report different issues in the same bugzilla.

Comment 17 Gleb Natapov 2010-12-20 09:04:34 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > (In reply to comment #11)
> > > (In reply to comment #10)
> > > > (In reply to comment #4)
> > > > > This bug could not be reproduced with kvm-83-164.el5_5.29.
> > > > > 
> > > > > I've re-tested in amd machine with kvm-83-164.el5_5.30.
> > > > > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less
> > > > > than 1/10 (not sure)
> > > > If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi
> > > > related. May be clock related. The only guest visible difference between 30 and
> > > > 29 is:
> > > > 
> > > > Zachary Amsden (1):
> > > >       KVM: fix AMD initial TSC offset problems (additional fix)
> > > 
> > > Well, this one's easy.
> > > 
> > > My fix touches only AMD specific code.  This bug is reported only on Intel
> > > hosts.
> > Comment #4 is about AMD. But it looks like different issue than what reported
> > in comment #0.
> 
> Hi gleb, 
> 
> Comment #4 looks like Bug 660561 - cpu stuck after reboot.

No, not at all. Comment #4 was with -smp 1 and the dmesg is different. There is no any "CPU1: Stuck" in comment #4 (how can it be if there is only one vcpu).

Comment 18 Gleb Natapov 2010-12-20 12:24:44 UTC
Okey. I tried to reproduce the problem from comment #0 and initially I failed to do so. Then I looked at the code that prints "WARNING: CPU#1: NMI appears to be stuck". It waits for nmi for 40 ticks and if nmi is not received during this period it marks nmi watchdog as disabled on a cpu. If nmi watchdog is active on at least one cpu, timer is not stopped and later, when cpu with disabled nmi watchdog receives nmi it become confused. So I added load on the host in hope that nmi will not be injected in time and with load 10 I am able to reproduce this problem pretty frequently.

The best and simplest thing we can do is to not enable nmi watchdog in a guest. Not for rhel5 and not for rhel6.

Comment 19 Amos Kong 2010-12-21 01:55:13 UTC
(In reply to comment #16)
> (In reply to comment #14)
> > This bug was found on intel-8400, I reinstalled that machine, and tested more
> > than 100 times with kvm-83-164.el5_5.30, bug was reproduced,  but reproduce
> > rate is very low (3%).
> At this point I do not know what bug are you talking about. The one from
> comment #4 or the one from comment #0. This is what happens when you report
> different issues in the same bugzilla.

ok, I'll test on AMD machine, if the problem is reproduced, I'll report another bug.

Comment 20 Dor Laor 2010-12-23 09:44:45 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Currently enabling of nmi watch dog within the guest does not work. Please do not run with nmi_watchdog=1

Comment 21 Joy Pu 2010-12-29 08:49:23 UTC
Try this test in 2.6.18-237.el5 with kvm-83-224:
rpm -qa |grep kvm
kvm-tools-83-224.el5
etherboot-zroms-kvm-5.4.4-13.el5
etherboot-roms-kvm-5.4.4-13.el5
kmod-kvm-83-224.el5
kmod-kvm-debug-83-224.el5
kvm-83-224.el5
kvm-qemu-img-83-224.el5
kvm-debuginfo-83-224.el5

Find something similar with this one, the guest can boot up successful with nmi_watchdog=1 but there will be a LOCKUP triggered by get guest's NMI counter
 with command:
# grep NMI /proc/interrupts
Hope there some useful informations.
here is the call tree:
BUG: NMI Watchdog detected LOCKUP on CPU1, ip ffffffff81013b4e, registers:
CPU 1
Modules linked in: ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc parport virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc 8139too 8139cp mii i2c_piix4 i2c_core sg ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom virtio_pci virtio_ring virtio ata_generic pata_acpi ata_piix dm_mod [last unloaded: freq_table]

Modules linked in: ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc parport virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc 8139too 8139cp mii i2c_piix4 i2c_core sg ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom virtio_pci virtio_ring virtio ata_generic pata_acpi ata_piix dm_mod [last unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.32-71.7.1.el6.x86_64 #1 KVM
RIP: 0010:[<ffffffff81013b4e>]  [<ffffffff81013b4e>] irq_return+0x0/0x12
RSP: 0018:ffff880028227fd8  EFLAGS: 00000086
RAX: 0000000000000000 RBX: ffff88013b7fffd8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81a101e8
RBP: ffff88013b7ffed8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff818a1b60
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007fa53ddf4700(0000) GS:ffff880028220000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fe72c2bd098 CR3: 000000013719a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88013b7fe000, task ffff88013b7d80c0)
Stack:
 ffffffff81013b4e 0000000000000010 0000000000000086 ffff880028227fd8
<0> 0000000000000018
Call Trace:
 <NMI>
 [<ffffffff81013b4e>] ? irq_return+0x0/0x12
 <<EOE>>
Code: 8b 54 24 08 4c 8b 4c 24 10 4c 8b 44 24 18 48 8b 44 24 20 48 8b 4c 24 28 48 8b 54 24 30 48 8b 74 24 38 48 8b 7c 24 40 48 83 c4 50 <48> cf 66 66 66 90 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 cf 0f

Comment 23 Eduardo Habkost 2011-01-13 19:13:23 UTC
Release notes set, closing as WONTFIX.


Note You need to log in before you can comment on or make changes to this bug.