Bug 663912
Summary: | RHEL6 guest: Uhhuh. NMI received for unknown reason 00 on CPU 1. | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Amos Kong <akong> | ||||||
Component: | kvm | Assignee: | Gleb Natapov <gleb> | ||||||
Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 5.5.z | CC: | ailan, ehabkost, knoel, llim, mkenneth, tburke, virt-maint, ypu, zamsden | ||||||
Target Milestone: | rc | ||||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: |
Currently enabling of nmi watch dog within the guest does not work. Please do not run with nmi_watchdog=1
|
Story Points: | --- | ||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2011-01-13 19:13:23 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 580948 | ||||||||
Attachments: |
|
Description
Amos Kong
2010-12-17 10:27:22 UTC
Created attachment 469330 [details]
guest-serial-log before enabling nmi_watchdog
Created attachment 469332 [details]
guest serial log after enabling nmi_watchdog
2010-12-17 16:13:23: Setting APIC routing to physical flat
2010-12-17 16:13:23: ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
2010-12-17 16:13:23: CPU0: Intel QEMU Virtual CPU version 0.9.1 stepping 03
2010-12-17 16:13:23: APIC timer registered as dummy, due to nmi_watchdog=1!
2010-12-17 16:13:23: Booting Node 0, Processors #1
2010-12-17 16:13:26:
2010-12-17 16:13:26: mce: CPU supports 0 MCE banks
2010-12-17 16:13:26: kvm-clock: cpu 1, msr 0:1e36901, secondary cpu clock
2010-12-17 16:13:26: Brought up 2 CPUs
2010-12-17 16:13:26: Total of 2 processors activated (11968.52 BogoMIPS).
2010-12-17 16:13:26: Testing NMI watchdog ...
2010-12-17 16:13:26: WARNING: CPU#1: NMI appears to be stuck (1->1)!
2010-12-17 16:13:26: Please report this to bugzilla.kernel.org,
2010-12-17 16:13:26: and attach the output of the 'dmesg' command.
2010-12-17 16:13:26: OK.
2010-12-17 16:13:26: Uhhuh. NMI received for unknown reason 00 on CPU 1.
2010-12-17 16:13:26: Do you have a strange power saving mode enabled?
2010-12-17 16:13:26: Dazed and confused, but trying to continue
2010-12-17 16:13:27: Uhhuh. NMI received for unknown reason 00 on CPU 1.
2010-12-17 16:13:27: Do you have a strange power saving mode enabled?
This bug could not be reproduced with kvm-83-164.el5_5.29. I've re-tested in amd machine with kvm-83-164.el5_5.30. guest hung after enabling nmi_watchdog, the reproduce rate is very low, less than 1/10 (not sure) 2010-12-17 05:16:59: Performance Events: AMD PMU driver. 2010-12-17 05:16:59: ... version: 0 2010-12-17 05:16:59: ... bit width: 48 2010-12-17 05:16:59: ... generic registers: 4 2010-12-17 05:16:59: ... value mask: 0000ffffffffffff 2010-12-17 05:16:59: ... max period: 00007fffffffffff 2010-12-17 05:16:59: ... fixed-purpose events: 0 2010-12-17 05:16:59: ... event mask: 000000000000000f 2010-12-17 05:16:59: alternatives: switching to unfair spinlock 2010-12-17 05:16:59: SMP alternatives: switching to UP code 2010-12-17 05:17:00: ACPI: Core revision 20090903 2010-12-17 05:17:00: ftrace: converting mcount calls to 0f 1f 44 00 00 2010-12-17 05:17:00: ftrace: allocating 20282 entries in 80 pages 2010-12-17 05:17:00: Setting APIC routing to physical flat 2010-12-17 05:17:00: ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1 2010-12-17 05:17:00: CPU0: AMD QEMU Virtual CPU version 0.9.1 stepping 03 (no other log produced........) (In reply to comment #0) > Description of problem: > Enable nmi_watchdog of rhel6.0 guest, guest hung during boot up. > > This bug can only be reproduced in 5.5.z host (kvm-83-164.el5_5.30 + > 2.6.18-194.30.1.el5), & intel machine. > > This bug could not reproduce on rhel5.6 host. (kvm-83-223.el5 + 2.6.18-236.el5) The diff between kvm-83-164.el5_5.30 and kvm-83-223.el5 is minimal and does not include anything that may cause the difference in behaviour. Have you tested on exactly same hardware? (In reply to comment #4) > This bug could not be reproduced with kvm-83-164.el5_5.29. > > I've re-tested in amd machine with kvm-83-164.el5_5.30. > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less > than 1/10 (not sure) If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi related. May be clock related. The only guest visible difference between 30 and 29 is: Zachary Amsden (1): KVM: fix AMD initial TSC offset problems (additional fix) (In reply to comment #10) > (In reply to comment #4) > > This bug could not be reproduced with kvm-83-164.el5_5.29. > > > > I've re-tested in amd machine with kvm-83-164.el5_5.30. > > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less > > than 1/10 (not sure) > If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi > related. May be clock related. The only guest visible difference between 30 and > 29 is: > > Zachary Amsden (1): > KVM: fix AMD initial TSC offset problems (additional fix) Well, this one's easy. My fix touches only AMD specific code. This bug is reported only on Intel hosts. I'm guessing it will reproduce on 5_5.29, but since it's only 10% probability, it was missed. (In reply to comment #11) > (In reply to comment #10) > > (In reply to comment #4) > > > This bug could not be reproduced with kvm-83-164.el5_5.29. > > > > > > I've re-tested in amd machine with kvm-83-164.el5_5.30. > > > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less > > > than 1/10 (not sure) > > If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi > > related. May be clock related. The only guest visible difference between 30 and > > 29 is: > > > > Zachary Amsden (1): > > KVM: fix AMD initial TSC offset problems (additional fix) > > Well, this one's easy. > > My fix touches only AMD specific code. This bug is reported only on Intel > hosts. Comment #4 is about AMD. But it looks like different issue than what reported in comment #0. (In reply to comment #12) > (In reply to comment #11) > > (In reply to comment #10) > > > (In reply to comment #4) > > > > This bug could not be reproduced with kvm-83-164.el5_5.29. > > > > > > > > I've re-tested in amd machine with kvm-83-164.el5_5.30. > > > > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less > > > > than 1/10 (not sure) > > > If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi > > > related. May be clock related. The only guest visible difference between 30 and > > > 29 is: > > > > > > Zachary Amsden (1): > > > KVM: fix AMD initial TSC offset problems (additional fix) > > > > Well, this one's easy. > > > > My fix touches only AMD specific code. This bug is reported only on Intel > > hosts. > Comment #4 is about AMD. But it looks like different issue than what reported > in comment #0. Hi gleb, Comment #4 looks like Bug 660561 - cpu stuck after reboot. This bug was found on intel-8400, I reinstalled that machine, and tested more than 100 times with kvm-83-164.el5_5.30, bug was reproduced, but reproduce rate is very low (3%). I'm testing with kvm-83-164.el5_5.29 now.. (hasn't been reproduced until now) I will report the final result tomorrow. (In reply to comment #14) > This bug was found on intel-8400, I reinstalled that machine, and tested more > than 100 times with kvm-83-164.el5_5.30, bug was reproduced, but reproduce > rate is very low (3%). > > I'm testing with kvm-83-164.el5_5.29 now.. (hasn't been reproduced until now) > I will report the final result tomorrow. Bug has been reproduced in kvm-83-164.el5_5.29 !! so removing 'regression' (In reply to comment #14) > This bug was found on intel-8400, I reinstalled that machine, and tested more > than 100 times with kvm-83-164.el5_5.30, bug was reproduced, but reproduce > rate is very low (3%). At this point I do not know what bug are you talking about. The one from comment #4 or the one from comment #0. This is what happens when you report different issues in the same bugzilla. (In reply to comment #13) > (In reply to comment #12) > > (In reply to comment #11) > > > (In reply to comment #10) > > > > (In reply to comment #4) > > > > > This bug could not be reproduced with kvm-83-164.el5_5.29. > > > > > > > > > > I've re-tested in amd machine with kvm-83-164.el5_5.30. > > > > > guest hung after enabling nmi_watchdog, the reproduce rate is very low, less > > > > > than 1/10 (not sure) > > > > If you are absolutely sure you can't reproduce in 5.29 then it is unlikely nmi > > > > related. May be clock related. The only guest visible difference between 30 and > > > > 29 is: > > > > > > > > Zachary Amsden (1): > > > > KVM: fix AMD initial TSC offset problems (additional fix) > > > > > > Well, this one's easy. > > > > > > My fix touches only AMD specific code. This bug is reported only on Intel > > > hosts. > > Comment #4 is about AMD. But it looks like different issue than what reported > > in comment #0. > > Hi gleb, > > Comment #4 looks like Bug 660561 - cpu stuck after reboot. No, not at all. Comment #4 was with -smp 1 and the dmesg is different. There is no any "CPU1: Stuck" in comment #4 (how can it be if there is only one vcpu). Okey. I tried to reproduce the problem from comment #0 and initially I failed to do so. Then I looked at the code that prints "WARNING: CPU#1: NMI appears to be stuck". It waits for nmi for 40 ticks and if nmi is not received during this period it marks nmi watchdog as disabled on a cpu. If nmi watchdog is active on at least one cpu, timer is not stopped and later, when cpu with disabled nmi watchdog receives nmi it become confused. So I added load on the host in hope that nmi will not be injected in time and with load 10 I am able to reproduce this problem pretty frequently. The best and simplest thing we can do is to not enable nmi watchdog in a guest. Not for rhel5 and not for rhel6. (In reply to comment #16) > (In reply to comment #14) > > This bug was found on intel-8400, I reinstalled that machine, and tested more > > than 100 times with kvm-83-164.el5_5.30, bug was reproduced, but reproduce > > rate is very low (3%). > At this point I do not know what bug are you talking about. The one from > comment #4 or the one from comment #0. This is what happens when you report > different issues in the same bugzilla. ok, I'll test on AMD machine, if the problem is reproduced, I'll report another bug. Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Currently enabling of nmi watch dog within the guest does not work. Please do not run with nmi_watchdog=1 Try this test in 2.6.18-237.el5 with kvm-83-224: rpm -qa |grep kvm kvm-tools-83-224.el5 etherboot-zroms-kvm-5.4.4-13.el5 etherboot-roms-kvm-5.4.4-13.el5 kmod-kvm-83-224.el5 kmod-kvm-debug-83-224.el5 kvm-83-224.el5 kvm-qemu-img-83-224.el5 kvm-debuginfo-83-224.el5 Find something similar with this one, the guest can boot up successful with nmi_watchdog=1 but there will be a LOCKUP triggered by get guest's NMI counter with command: # grep NMI /proc/interrupts Hope there some useful informations. here is the call tree: BUG: NMI Watchdog detected LOCKUP on CPU1, ip ffffffff81013b4e, registers: CPU 1 Modules linked in: ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc parport virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc 8139too 8139cp mii i2c_piix4 i2c_core sg ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom virtio_pci virtio_ring virtio ata_generic pata_acpi ata_piix dm_mod [last unloaded: freq_table] Modules linked in: ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log ppdev parport_pc parport virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc 8139too 8139cp mii i2c_piix4 i2c_core sg ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom virtio_pci virtio_ring virtio ata_generic pata_acpi ata_piix dm_mod [last unloaded: freq_table] Pid: 0, comm: swapper Not tainted 2.6.32-71.7.1.el6.x86_64 #1 KVM RIP: 0010:[<ffffffff81013b4e>] [<ffffffff81013b4e>] irq_return+0x0/0x12 RSP: 0018:ffff880028227fd8 EFLAGS: 00000086 RAX: 0000000000000000 RBX: ffff88013b7fffd8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81a101e8 RBP: ffff88013b7ffed8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff818a1b60 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fa53ddf4700(0000) GS:ffff880028220000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007fe72c2bd098 CR3: 000000013719a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffff88013b7fe000, task ffff88013b7d80c0) Stack: ffffffff81013b4e 0000000000000010 0000000000000086 ffff880028227fd8 <0> 0000000000000018 Call Trace: <NMI> [<ffffffff81013b4e>] ? irq_return+0x0/0x12 <<EOE>> Code: 8b 54 24 08 4c 8b 4c 24 10 4c 8b 44 24 18 48 8b 44 24 20 48 8b 4c 24 28 48 8b 54 24 30 48 8b 74 24 38 48 8b 7c 24 40 48 83 c4 50 <48> cf 66 66 66 90 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 cf 0f Release notes set, closing as WONTFIX. |