Bug 494120 - XEN NMI detection fails on Dell 1950 server
Summary: XEN NMI detection fails on Dell 1950 server
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.3
Hardware: i686
OS: Linux
Priority: low
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Miroslav Rezanina
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 526775
Reported: 2009-04-04 17:17 UTC by kerdosa
Modified: 2010-04-08 16:21 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 07:45:00 UTC
Target Upstream Version:
Embargoed:


Attachments
xm dmesg when successful (5.31 KB, application/octet-stream)
2009-04-30 22:47 UTC, kerdosa
xm dmesg when failed case (5.23 KB, application/octet-stream)
2009-04-30 22:47 UTC, kerdosa
xm dmesg when failing on SuperMicro X7DBi+ (6.42 KB, text/plain)
2009-07-01 14:01 UTC, Tamas Vincze


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2010:0178 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update | 2010-03-29 12:18:21 UTC

Description kerdosa 2009-04-04 17:17:45 UTC
Description of problem: check_nmi_watchdog() in xen/arch/x86/nmi.c fails its NMI test on the X86_64 kernel-xen.


Version-Release number of selected component (if applicable):
RHEL 5.2

How reproducible:
Enable the NMI watchdog in XEN on an X86_64 kernel and check the XEN boot messages.

Steps to Reproduce:
1. Add watchdog=1 to the XEN kernel line in grub.conf
2. Reboot
3. xm dmesg | grep NMI
  
Actual results:
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck.

Expected results:
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.

Additional info: The X86_32 kernel works on the same server hardware, but the X86_64 kernel fails, so I think the bug is specific to X86_64.
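
The actual and expected results above differ only in whether each CPU reports "okay" or "stuck", so the check can be scripted. The following is a minimal sketch, not part of Xen or RHEL: the function name nmi_watchdog_status is made up for illustration, and it assumes the self-test line has exactly the format quoted in this report.

```shell
# Hypothetical helper (not shipped with Xen or RHEL).
# Reads `xm dmesg` output on stdin and classifies the NMI watchdog
# self-test line.
# Exit status: 0 = no CPU reported stuck and at least one reported okay,
#              1 = at least one CPU reported stuck,
#              2 = no self-test line found, or the line was not understood.
nmi_watchdog_status() {
    # Print the first matching line, then stop reading.
    line=$(sed -n '/Testing NMI watchdog/{p;q;}')
    if [ -z "$line" ]; then
        echo "no NMI watchdog self-test line found"
        return 2
    fi
    case "$line" in
        *stuck*) echo "FAIL: $line"; return 1 ;;
        *okay*)  echo "PASS: $line"; return 0 ;;
        *)       echo "UNKNOWN: $line"; return 2 ;;
    esac
}

# Typical use on the affected host:
#   xm dmesg | nmi_watchdog_status
```

A non-zero exit status makes the helper usable from a boot-time check script. Status 2 is kept separate so that a missing watchdog=1 option is not mistaken for a failed self-test.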

Comment 1 kerdosa 2009-04-05 21:25:44 UTC
Part of the description above is wrong: check_nmi_watchdog() fails on both i386 and X86_64. Only XEN-3.3.1 from xensource works. NMI is critical for debugging system deadlocks, and since it is not working, our options for debugging a deadlock are very limited.

Comment 3 Chris Lalancette 2009-04-27 11:51:56 UTC
Can you try to add "watchdog=1 apic_verbosity=debug" to the hypervisor command-line, and give the full output of xm dmesg after you've booted?  It might give us a clue as to where the APIC NMI programming is going wrong, since the code in check_nmi_watchdog() is exactly the same in RHEL and in upstream Xen 3.3.

Thanks,
Chris Lalancette
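
Before attaching the log, it may help to confirm that the requested options actually reached the hypervisor. The sketch below is an illustration only: the function name xen_cmdline_has is hypothetical, and it assumes xm dmesg prints the options on a line starting with "(XEN) Command line:", as in the output quoted elsewhere in this report.

```shell
# Hypothetical helper (not shipped with Xen or RHEL).
# Reads `xm dmesg` output on stdin and checks that the option given as $1
# appears as a whole word on the "(XEN) Command line:" line.
# Exit status: 0 = option present, 1 = option absent or no such line.
xen_cmdline_has() {
    opt=$1
    # Take the first command-line entry and strip its prefix.
    cmdline=$(sed -n '/^(XEN) Command line: /{s/^(XEN) Command line: //;p;q;}')
    # Pad with spaces so that watchdog=1 does not match watchdog=10.
    case " $cmdline " in
        *" $opt "*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Typical use on the affected host:
#   xm dmesg > xm-dmesg.txt
#   for o in watchdog=1 apic_verbosity=debug; do
#       xen_cmdline_has "$o" < xm-dmesg.txt || echo "missing: $o"
#   done
```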

Comment 4 kerdosa 2009-04-30 22:47:18 UTC
Created attachment 342018 [details]
xm dmesg when successful

Comment 5 kerdosa 2009-04-30 22:47:55 UTC
Created attachment 342019 [details]
xm dmesg when failed case

Comment 6 kerdosa 2009-04-30 22:51:26 UTC
Hi,

I attached the xm dmesg output for both the successful and the failed cases. check_nmi_watchdog() is the same, but much of the apic (or acpi) source code differs between the two source trees.

Thanks

Comment 8 Chris Lalancette 2009-05-04 07:48:37 UTC
(In reply to comment #6)
> Hi,
> 
> I attached the xm dmesg output for both the successful and the failed cases.
> check_nmi_watchdog() is the same, but much of the apic (or acpi) source code
> differs between the two source trees.

That's actually not true either; I looked through that code, and the apic code in upstream Xen and RHEL-5 Xen is more or less the same too. So something else is going on. I'll have to look at the logs to see whether they tell us anything.

Chris Lalancette

Comment 10 Tamas Vincze 2009-07-01 14:01:58 UTC
Created attachment 350118 [details]
xm dmesg when failing on SuperMicro X7DBi+

Same problem here running Xen version 3.1.2-128.1.14.el5 on a SuperMicro X7DBi+ board.
The server started rebooting at random, which is why I'm experimenting with the watchdog option.

Comment 11 Chris Lalancette 2009-08-25 09:57:45 UTC
I've uploaded a test kernel that should have a fix for this problem here:

http://people.redhat.com/clalance/virttest/

Can the reporters who are having problems please download and try out this test kernel?

Thanks,
Chris Lalancette

Comment 12 Don Zickus 2009-10-13 16:08:23 UTC
The fix is in kernel-2.6.18-169.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 16 Jan Tluka 2010-03-04 16:57:27 UTC
I reproduced the problem on -164.el5 and verified the fix on -190.el5xen, with these results:

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-164.el5xen i686
(XEN) Command line: com2=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck.

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-190.el5xen i686
(XEN) Command line: com2=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.

I also checked x86_64:

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-164.el5xen x86_64
(XEN) Command line: com1=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck. 

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-190.el5xen x86_64
(XEN) Command line: watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.

Comment 18 errata-xmlrpc 2010-03-30 07:45:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

