Bug 592446 - [RHEL6] qemu-kvm BUG: NMI Watchdog detected LOCKUP on CPU6
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Don Zickus
QA Contact: Virtualization Bugs
Depends On:
Blocks:
Reported: 2010-05-14 17:10 EDT by Jeff Burke
Modified: 2013-01-09 17:34 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-07-06 11:24:29 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Jeff Burke 2010-05-14 17:10:27 EDT
Description of problem:
 While running tests with the latest RHEL6 kernel, we hit "BUG: NMI Watchdog detected LOCKUP on CPU6".

Version-Release number of selected component (if applicable):
 2.6.32-26.el6.x86_64

How reproducible:
 Unknown

Actual results:

BUG: NMI Watchdog detected LOCKUP on CPU6, ip 7fff6f5ff850, registers:
CPU 6 
Modules linked in: tun(U) nls_utf8(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) ipt_MASQUERADE(U) iptable_nat(U) nf_nat(U) autofs4(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) bridge(U) stp(U) llc(U) ipv6(U) dm_mirror(U) dm_region_hash(U) dm_log(U) kvm_intel(U) kvm(U) ibmpex(U) ibmaem(U) ipmi_msghandler(U) bnx2(U) i5k_amb(U) hwmon(U) ics932s401(U) serio_raw(U) iTCO_wdt(U) iTCO_vendor_support(U) i5000_edac(U) ioatdma(U) edac_core(U) shpchp(U) sg(U) ses(U) sr_mod(U) cdrom(U) enclosure(U) i2c_i801(U) e1000e(U) ixgbe(U) dca(U) mdio(U) ext4(U) mbcache(U) jbd2(U) dm_multipath(U) sd_mod(U) crc_t10dif(U) ata_generic(U) pata_acpi(U) lpfc(U) scsi_transport_fc(U) ata_piix(U) aacraid(U) scsi_tgt(U) radeon(U) ttm(U) drm_kms_helper(U) drm(U) i2c_algo_bit(U) i2c_core(U) dm_mod(U) [last unloaded: scsi_wait_scan]
Pid: 9572, comm: qemu-kvm Not tainted 2.6.32-26.el6.x86_64 #1 IBM System x3650 -[7979AC1]-
RIP: 0033:[<00007fff6f5ff850>]  [<00007fff6f5ff850>] 0x7fff6f5ff850
RSP: 002b:00007fff6f563b00  EFLAGS: 00000212
RAX: 27ae055a179d8618 RBX: 00007fff6f563b50 RCX: 0000000000000000
RDX: 00000000a116940f RSI: 000000004bedbfd8 RDI: 00007fff6f563b50
RBP: 00007fff6f563b10 R08: 00007fff6f563a70 R09: 0000000000002564
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000a15fe8
R13: 0000000000d27c88 R14: 0000000000000001 R15: 0000000000000001
FS:  00007fd16d7c3740(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fa88f149000 CR3: 0000000852d61000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-kvm (pid: 9572, threadinfo ffff88085d404000, task ffff880821324b30)

---[ end trace 64c3b6d72cb86555 ]---
Kernel panic - not syncing: Non maskable interrupt
Pid: 9572, comm: qemu-kvm Tainted: G      D    2.6.32-26.el6.x86_64 #1
Call Trace:
 <NMI>  [<ffffffff814c7fb5>] panic+0x78/0x137
 [<ffffffff81066f73>] ? print_oops_end_marker+0x23/0x30
 [<ffffffff814cc13c>] die_nmi+0xfc/0x100
 [<ffffffff814cc6ea>] nmi_watchdog_tick+0x1aa/0x200
 [<ffffffff814cbc73>] do_nmi+0x1a3/0x2d0
 [<ffffffff814cb550>] nmi+0x20/0x30
 <<EOE>> 
[drm:drm_fb_helper_panic] *ERROR* panic occurred, switching back to text console

Additional info:
 While the test was running, I noticed that the guests were not making any progress. I ssh'd into the system to look at a few things, launched virt-manager, and then lost the connection. I looked at the serial console for the host and saw the output above.
Comment 2 Don Zickus 2010-05-19 10:06:36 EDT
Just to add my notes to this bug after I tried debugging it a little bit.

The RIP instruction looked a little strange for an NMI watchdog, but Dave A. pointed out that this fits within the VDSO of the guest.
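
One way to check a claim like this on a live system is to look the address up in /proc/<pid>/maps for the qemu-kvm process. The sketch below is only illustrative: the helper name find_mapping and the sample mappings in SAMPLE_MAPS are made up for this example (no maps output was captured in this bug). Only the RIP and RSP values are taken from the trace in the description.

```python
# Hypothetical /proc/<pid>/maps excerpt. The address ranges are invented
# so that the example is self-contained; they are NOT from the real host.
SAMPLE_MAPS = """\
00400000-0063c000 r-xp 00000000 fd:00 1234 /usr/libexec/qemu-kvm
7fff6f550000-7fff6f565000 rw-p 00000000 00:00 0 [stack]
7fff6f5ff000-7fff6f600000 r-xp 00000000 00:00 0 [vdso]
"""


def find_mapping(maps_text, addr):
    """Return (start, end, name) of the mapping that contains addr, or None.

    Each maps line has the form:
        start-end perms offset dev inode [pathname]
    """
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= addr < end:  # the end address is exclusive
            name = fields[5].strip() if len(fields) > 5 else ""
            return start, end, name
    return None


RIP = 0x7FFF6F5FF850  # from the "RIP:" line of the trace
RSP = 0x7FFF6F563B00  # from the "RSP:" line of the trace

print(find_mapping(SAMPLE_MAPS, RIP))
print(find_mapping(SAMPLE_MAPS, RSP))
```

On a real host, the contents of /proc/9572/maps (pid taken from the trace) would replace SAMPLE_MAPS, provided the process is still alive.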

Now normally when I see an NMI lockup message from a userspace app, I just assume the NMI watchdog is broken.  But because it is hung in the VDSO, I wouldn't be surprised if kvm played some tricks here to speed up guest/HV communication.  The only open question is: can a guest disable interrupts from userspace, or does it have to pass that message to the HV and have the HV do it?

Cheers,
Don
Comment 3 RHEL Product and Program Management 2010-05-28 06:55:41 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 4 Marcelo Tosatti 2010-06-04 17:20:58 EDT
Don,

(In reply to comment #2)
> Just to add my notes to this bug after I tried debugging it a little bit
> 
> The RIP instruction looked a little strange for an NMI watchdog, but Dave A.
> pointed out that this fits within the VDSO of the guest.

RIP is from the host.

> Now normally when I see an NMI lockup message from a userspace app, I just
> assume the nmi watchdog is broken.  But because it is hung in the VDSO, I
> wouldn't be surprised if kvm played some tricks here to speed up guest/HV
> communication.  

No. The RIP is from qemu-kvm process.
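
The register dump in the description supports this reading: the "RIP:" line carries code segment selector 0033 and the "RSP:" line carries stack segment selector 002b, both of which have the low two bits set to 3 (ring 3, i.e. user mode), and the RIP value lies in the lower canonical half of the address space. A minimal sketch of that decoding follows. The function names are made up for this example, and the address check assumes 48-bit virtual addresses.

```python
def describe_selector(selector):
    """Split an x86 segment selector into its three fields."""
    return {
        "rpl": selector & 0x3,                       # bits 0-1: privilege level
        "table": "LDT" if selector & 0x4 else "GDT",  # bit 2: table indicator
        "index": selector >> 3,                      # bits 3-15: descriptor index
    }


def is_user_address(addr):
    """True if addr is in the lower canonical half (assumes 48-bit VAs)."""
    return addr < 0x0000800000000000


print(describe_selector(0x33))            # selector from the "RIP:" line
print(describe_selector(0x2B))            # selector from the "RSP:" line
print(is_user_address(0x7FFF6F5FF850))    # RIP from the trace
print(is_user_address(0xFFFFFFFF814C7FB5))  # panic() address from the call trace
```

A privilege level of 3 together with a lower-half address indicates that the CPU was executing userspace code of the qemu-kvm process when the NMI arrived, not guest code and not kernel code.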

> The only thing though is can a guest disable interrupts from
> userspace? or does it have to pass that message to the HV and have the HV do
> it?

It can't. The guest interruptibility state is separate from the host and only affects interrupt/nmi injection to the guest.

NMIs are never blocked and always cause an immediate vmexit so that the host can handle them.

Reassigning to you, as this report seems to indicate that the NMI watchdog fired while the CPU was in host userspace.
Comment 5 Marcelo Tosatti 2010-07-06 11:24:29 EDT
I can't see anything wrong with KVM, so I will treat this as a spurious NMI watchdog warning.

Please reopen the bug if this is reproducible.
