Bug 500845

Summary: [RHEL5-U4] Kernel - testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!
Product: Red Hat Enterprise Linux 5
Reporter: Jeff Burke <jburke>
Component: kernel-xen
Assignee: Don Zickus <dzickus>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Docs Contact:
Priority: low
Version: 5.4
CC: arozansk, bpeck, clalance, dmair, dzickus, gozen, jburke, llim, lwang, mgahagan, mjenner, pbunyan, phan, prarit, qcai, tcamuso, xen-maint
Target Milestone: rc   
Target Release: 5.5   
Hardware: All   
OS: Linux   
URL: http://rhts.redhat.com/testlogs/58610/195983/1633506/boot.messages
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When booting a fully virtualized Xen guest, the following message may be displayed on the guest console: testing NMI watchdog ... <4> WARNING: CPU#0: NMI appears to be stuck (0->0)! This issue is caused by an implementation issue with the Xen hypervisor and can be safely ignored. (BZ#500845)
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-11-08 22:33:29 UTC Type: ---
Bug Depends On:    
Bug Blocks: 513501, 514491    
Attachments:
Program to detect when running in an HVM guest (flags: none)

Description Jeff Burke 2009-05-14 14:00:36 UTC
Description of problem:
While booting an HVM guest with the bare-metal kernel, we get an "NMI appears to be stuck" message.

Version-Release number of selected component (if applicable):
2.6.18-147.el5

How reproducible:
Always

Steps to Reproduce:
1. Install the RHEL5.4-Server-20090412.nightly tree on sun-x4600m2-01.rhts.bos.redhat.com
2. Install kernel-xen and create an HVM guest.
  
Actual results:
Quad-Core AMD Opteron(tm) Processor 8356 stepping 03
CPU 1: Syncing TSC to CPU 0.
CPU 1: synchronized TSC with CPU 0 (last diff -6 cycles, maxerr 2052 cycles)
Brought up 2 CPUs
testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!
time.c: Using 71.685220 MHz WALL HPET GTOD HPET timer.
time.c: Detected 2293.901 MHz processor.

Expected results:
We should not get an NMI stuck message on boot.

Additional info:
This is also seen with hp-xw6800-01.rhts.lab.bos.redhat.com

Comment 1 Chris Lalancette 2009-05-21 14:30:32 UTC
Yeah, this is unfortunately expected.  If I remember properly from my last foray into looking at this, we don't properly emulate the MSR writes to the performance counters.  What happens is that the Linux kernel writes to the performance counter MSRs and then expects an interrupt later on, when the performance counters drop to 0.  However, the hypervisor more or less just drops the writes to the performance counter MSRs on the ground, so the later interrupt is never generated, and then you get the "NMI appears to be stuck" message.
     I don't know what the current upstream status of this is, since I haven't looked in a while.

Chris Lalancette
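
Note on comment 1: the mechanism described there can be modelled roughly as follows. This is a simplified Python sketch, not kernel code. The function name and the threshold value are assumptions made for illustration; the only thing taken from this report is the message format, where the two numbers are the NMI count before and after the wait.

```python
# Simplified model of the boot-time NMI watchdog self-test (not kernel code).
# Assumption: the test expects the per-CPU NMI count to grow by more than
# STUCK_THRESHOLD while it waits; the exact value here is illustrative.
STUCK_THRESHOLD = 5


def nmi_stuck_message(cpu, count_before, count_after):
    """Return the warning text if too few NMIs arrived, else None."""
    if count_after - count_before <= STUCK_THRESHOLD:
        return "CPU#%d: NMI appears to be stuck (%d->%d)!" % (
            cpu, count_before, count_after)
    return None


# Bare metal: the performance counter fires and NMIs are delivered.
print(nmi_stuck_message(0, 0, 12))
# HVM guest: the MSR writes are dropped, so the count never moves.
print(nmi_stuck_message(0, 0, 0))
```

With a count that stays at 0, the model produces the same "(0->0)" text seen in the boot log above.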

Comment 2 Gurhan Ozen 2009-05-29 18:41:03 UTC
Just as a side note, this problem isn't confined to kernel-xen on AMD; it is happening all over the place for x86_64 HVM guests. I see it on Intel boxen and on x86_64 KVM guests as well.

Comment 3 Don Zickus 2009-05-29 19:50:38 UTC
(In reply to comment #2)
> Just as a side note, this problem isn't confined to kernel-xen on AMD; it is
> happening all over the place for x86_64 HVM guests. I see it on Intel
> boxen and on x86_64 KVM guests as well.

Well, the Intel ones are related to bz 500892, which is basically defective chips.  The AMD ones may be defective too; we just need to find the errata sheets for them.

Comment 5 Don Zickus 2009-06-01 14:27:21 UTC
Gurhan, is it possible to test a 5.3 distro on the AMD Shanghai machines?  If the problem is there, then this isn't new and we may have to figure out how to deal with it.  Otherwise it is a regression and will need to be fixed.

Comment 6 Gurhan Ozen 2009-06-01 14:47:50 UTC
OK, I submitted jobs to amd-shanghai-0[12] for a RHEL 5.3 tree. I will let you know the results.

Comment 7 Gurhan Ozen 2009-06-01 20:47:07 UTC
The same thing happens with the 5.3 tree on the Shanghai box too: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8356616

Comment 15 Chris Lalancette 2009-06-02 13:46:05 UTC
Created attachment 346251
Program to detect when running in an HVM guest

Comment 25 Chris Lalancette 2009-06-30 12:23:11 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
When booting a fully virtualized Xen or KVM guest, the message "testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!" may be displayed on the guest console.  This is due to an implementation issue with the Xen and KVM hypervisors, and can be safely ignored.  This implementation issue may be addressed in a future RHEL-5 release.

Comment 29 Ryan Lerch 2009-08-18 03:23:27 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,6 @@
-When booting a fully virtualized Xen or KVM guest, the message "testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!" may be displayed on the guest console.  This is due to an implementation issue with the Xen and KVM hypervisors, and can be safely ignored.  This implementation issue may be addressed in a future RHEL-5 release.
+When booting a fully virtualized Xen guest, the following message may be displayed on the guest console:
+
+testing NMI watchdog ... <4>
+WARNING: CPU#0: NMI appears to be stuck (0->0)!
+
+This issue is caused by an implementation issue with the Xen hypervisor and can be safely ignored. (BZ#500845)

Comment 34 Don Zickus 2010-04-21 14:52:44 UTC
Status update:

After talking with Chris L., implementing perfctr MSR emulation in Xen and KVM will probably never happen for RHEL-5, as it is too difficult to do.  Implementing a check this early in boot to determine whether we are on a virtualized guest is difficult too.

The current recommendation is to work around it in the scripts and close this as WONT_FIX.

Opinions?

Cheers,
Don
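
Note on comment 34: one way the script-side workaround could look is sketched below in Python. It assumes that "the scripts" are test scripts that scan boot logs for WARNING lines, which this report does not state explicitly. The function name `unexpected_warnings` and the `virtualized_guest` flag are hypothetical.

```python
import re

# The benign message quoted in this bug. The CPU number and the two
# counts are matched loosely so other CPUs are covered as well.
KNOWN_BENIGN = re.compile(
    r"WARNING: CPU#\d+: NMI appears to be stuck \(\d+->\d+\)!")


def unexpected_warnings(boot_log, virtualized_guest):
    """Return WARNING lines, skipping the NMI message on virtualized guests."""
    hits = []
    for line in boot_log.splitlines():
        if "WARNING" not in line:
            continue
        if virtualized_guest and KNOWN_BENIGN.search(line):
            continue  # known and harmless on HVM guests
        hits.append(line)
    return hits


log = ("Brought up 2 CPUs\n"
       "testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!\n"
       "time.c: Detected 2293.901 MHz processor.\n")
print(unexpected_warnings(log, virtualized_guest=True))
```

On bare metal the flag would be left False, so a stuck NMI watchdog there is still reported.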

Comment 40 Don Zickus 2010-05-10 17:07:45 UTC
Considering I posted a patch for it, I might as well own the bug.

Comment 41 RHEL Program Management 2010-06-30 19:50:33 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 42 Jarod Wilson 2010-11-08 22:33:29 UTC

*** This bug has been marked as a duplicate of bug 455323 ***

Comment 43 Bill Peck 2011-01-25 16:22:10 UTC
This bug also shows up when running on Hyper-V guests.  Gurhan's hvmdetect.c doesn't work for these virtual machines.

Comment 44 Don Zickus 2011-01-25 16:58:46 UTC
I guess I am confused.  Isn't a Hyper-V guest a Microsoft guest?  How does a Linux kernel message end up on a Microsoft guest?

Cheers,
Don

Comment 45 Bill Peck 2011-01-25 18:07:38 UTC
To be clear...

This is a regular RHEL distro running under Hyper-V, similar to how we would run under VMware.

Comment 46 Don Zickus 2011-01-25 19:25:50 UTC
OK, yes, there is no code to check whether or not a RHEL guest is running on VMware or Hyper-V.  I don't even know how to check for that.  We were able to check CPU strings for Xen and KVM.

I might have to switch checks if this is going to be an issue, and instead output a message stating that the NMI watchdog is disabled because the perf counters are not available.

Cheers,
Don

Comment 47 Chris Lalancette 2011-01-25 20:50:45 UTC
Don,
    I do not have direct access to VMware or Hyper-V hypervisors, but if Google is to be believed, we should be able to check for both of those using mechanisms similar to those for Xen and KVM.
    In particular, hypervisors commonly put an easily identifiable string in CPUID leaf 0x40000000, and bare-metal machines leave this blank.  Therefore, you should be able to call cpuid, get the output, and check for:

VMware - "VMwareVMware"
Hyper-V - "Microsoft Hv"
Xen - "XenVMMXenVMM"
KVM - "KVMKVMKVM"

(the latter two are already implemented, as you said).  All of that being said, we already have a perfectly legitimate test for whether the perf counters are working, and that is the test that causes this message to be printed.  The other option here is just to turn that "NMI appears to be stuck" message into a KERN_DEBUG statement, so it is not so obvious.  I'll leave it up to you which way you want to go.

Chris Lalancette
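
Note on comment 47: the signature check described there can be sketched in Python as below. Python cannot execute CPUID itself, so the sketch assumes the EBX, ECX and EDX values returned for leaf 0x40000000 were obtained elsewhere (for example by a small C helper) and only shows how the 12 signature bytes are decoded and matched. The function names are hypothetical. The Hyper-V entry uses the spelling "Microsoft Hv", which is how that signature is returned by CPUID.

```python
import struct

# Signatures from comment 47, mapped to a readable hypervisor name.
HYPERVISOR_SIGNATURES = {
    "VMwareVMware": "VMware",
    "Microsoft Hv": "Hyper-V",
    "XenVMMXenVMM": "Xen",
    "KVMKVMKVM": "KVM",
}


def hypervisor_signature(ebx, ecx, edx):
    """Decode the 12 signature bytes held in EBX, ECX, EDX (in that order)."""
    raw = struct.pack("<III", ebx, ecx, edx)
    # KVM's signature is shorter than 12 bytes and padded with NULs.
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


def identify_hypervisor(ebx, ecx, edx):
    """Return the hypervisor name, or None if the signature is not listed."""
    return HYPERVISOR_SIGNATURES.get(hypervisor_signature(ebx, ecx, edx))


# Example register values built from a known signature, for illustration only.
regs = struct.unpack("<III", b"KVMKVMKVM\x00\x00\x00")
print(identify_hypervisor(*regs))  # KVM
```

An unlisted or blank signature yields None, which a caller could treat as "not a known hypervisor" rather than as proof of bare metal.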

Comment 48 Chris Lalancette 2011-01-25 20:51:57 UTC
Oh, I forgot to mention: for gory details of detecting the various hypervisors, have a look at virt-what: http://people.redhat.com/~rjones/virt-what/

Chris Lalancette