494120 – XEN NMI detection fails on Dell 1950 server

Summary: XEN NMI detection fails on Dell 1950 server

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	5.3
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Miroslav Rezanina
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	526775

Reported:	2009-04-04 17:17 UTC by kerdosa
Modified:	2010-04-08 16:21 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 07:45:00 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
xm dmesg when successful (5.31 KB, application/octet-stream) 2009-04-30 22:47 UTC, kerdosa
xm dmesg when failed case (5.23 KB, application/octet-stream) 2009-04-30 22:47 UTC, kerdosa
xm dmesg when failing on SuperMicro X7DBi+ (6.42 KB, text/plain) 2009-07-01 14:01 UTC, Tamas Vincze

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0178	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update	2010-03-29 12:18:21 UTC

Description kerdosa 2009-04-04 17:17:45 UTC

Description of problem: The NMI watchdog self-test in check_nmi_watchdog() (xen/arch/x86/nmi.c) fails on X86_64 kernel-xen.


Version-Release number of selected component (if applicable):
RHEL 5.2

How reproducible:
Enable the NMI watchdog in XEN on an X86_64 kernel and check the XEN boot messages.

Steps to Reproduce:
1. Add watchdog=1 to the XEN kernel line in grub.conf
2. Reboot
3. xm dmesg | grep NMI
  
Actual results:
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck.

Expected results:
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.

Additional info: X86_32 kernel worked OK on the same server hardware, but X86_64 kernel fails. So I think it's a bug on X86_64.
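
The check in step 3 can also be scripted instead of read by eye. This is only a minimal sketch: it assumes the hypervisor prints the self-test result on a single line in the format shown under "Actual results" and "Expected results", and the function name nmi_watchdog_status is made up for illustration (it is not part of Xen or RHEL).

```shell
#!/bin/sh
# Hypothetical helper: reads hypervisor log text on stdin and classifies
# the last "Testing NMI watchdog" line it finds.
# Prints "okay", "stuck" or "missing"; returns 0 only for "okay".
nmi_watchdog_status() {
    # Keep only the last matching line; awk exits 0 even with no match.
    line=$(awk '/Testing NMI watchdog/ { l = $0 } END { print l }')
    case $line in
        '')      echo missing; return 2 ;;
        *stuck*) echo stuck;   return 1 ;;   # any stuck CPU counts as a failure
        *okay*)  echo okay;    return 0 ;;
        *)       echo missing; return 2 ;;
    esac
}
```

On a Xen host this would be used as "xm dmesg | nmi_watchdog_status" (run as root). A line that mixes okay and stuck CPUs is reported as stuck.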

Comment 1 kerdosa 2009-04-05 21:25:44 UTC

Part of the description above is wrong: check_nmi_watchdog() fails on both i386 and X86_64. Only XEN-3.3.1 from xensource works OK. The NMI watchdog is critical for debugging system deadlocks. Since NMI is not working, our options for debugging a deadlock are very limited.

Comment 3 Chris Lalancette 2009-04-27 11:51:56 UTC

Can you try to add "watchdog=1 apic_verbosity=debug" to the hypervisor command-line, and give the full output of xm dmesg after you've booted?  It might give us a clue as to where the APIC NMI programming is going wrong, since the code in check_nmi_watchdog() is exactly the same in RHEL and in upstream Xen 3.3.
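
For anyone who wants to script the change, here is a rough sketch of appending those two options to the hypervisor line. It assumes the usual RHEL 5 layout where the hypervisor is loaded by a "kernel /xen.gz-..." line in grub.conf; the function name add_xen_options is hypothetical, and the function only prints the modified file so the result can be reviewed before anything is overwritten.

```shell
#!/bin/sh
# Hypothetical helper: append hypervisor options to every Xen "kernel" line
# of a grub.conf-style file and write the result to stdout.
# The input file itself is not modified.
# Note: options containing "|" or "&" would need escaping for sed.
add_xen_options() {
    conf=$1
    shift
    opts=$*
    # Match only "kernel" lines that load xen.gz, then append at end of line.
    sed "/^[[:space:]]*kernel[[:space:]].*xen\.gz/ s|\$| $opts|" "$conf"
}
```

Example: "add_xen_options /boot/grub/grub.conf watchdog=1 apic_verbosity=debug" prints the edited configuration. The "module" lines for the dom0 kernel and initrd are left unchanged.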

Thanks,
Chris Lalancette

Comment 4 kerdosa 2009-04-30 22:47:18 UTC

Created attachment 342018
xm dmesg when successful

Comment 5 kerdosa 2009-04-30 22:47:55 UTC

Created attachment 342019
xm dmesg when failed case

Comment 6 kerdosa 2009-04-30 22:51:26 UTC

Hi,

I attached xm dmesg output for both the successful and failed cases. check_nmi_watchdog() is the same, but much of the apic (or acpi) source code differs between the two source trees.

Thanks

Comment 8 Chris Lalancette 2009-05-04 07:48:37 UTC

(In reply to comment #6)
> Hi,
> 
> I attached xm dmesg output for both the successful and failed cases.
> check_nmi_watchdog() is the same, but much of the apic (or acpi) source code
> differs between the two source trees.

That's actually not true either. I looked through that code, and the apic code in upstream Xen and RHEL-5 Xen is more-or-less the same too. So something else is going on; I'll have to look at the logs to see if they tell us anything.

Chris Lalancette

Comment 10 Tamas Vincze 2009-07-01 14:01:58 UTC

Created attachment 350118
xm dmesg when failing on SuperMicro X7DBi+

Same problem here running Xen version 3.1.2-128.1.14.el5 on a SuperMicro X7DBi+ board.
The server started rebooting at random, which is why I'm experimenting with the watchdog option.

Comment 11 Chris Lalancette 2009-08-25 09:57:45 UTC

I've uploaded a test kernel that should have a fix for this problem here:

http://people.redhat.com/clalance/virttest/

Can the reporters who are having problems please download and try out this test kernel?

Thanks,
Chris Lalancette

Comment 12 Don Zickus 2009-10-13 16:08:23 UTC

The fix is in kernel-2.6.18-169.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 16 Jan Tluka 2010-03-04 16:57:27 UTC

I reproduced the problem on -164.el5 and verified the fix on -190.el5xen, with these results:

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-164.el5xen i686
(XEN) Command line: com2=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck.

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-190.el5xen i686
(XEN) Command line: com2=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.

I also checked x86_64:

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-164.el5xen x86_64
(XEN) Command line: com1=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck. 

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-190.el5xen x86_64
(XEN) Command line: watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.
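
Since a missing boot option would also leave the self-test line out of the log, it can help to confirm that watchdog=1 actually reached the hypervisor before reading the result. A small sketch, assuming the "(XEN) Command line:" format shown in the output above. The function name xen_watchdog_requested is hypothetical, and it only recognizes the exact spelling watchdog=1 used in this report.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the "(XEN) Command line:" entry
# in the log text on stdin contains the option watchdog=1, else exits 1.
xen_watchdog_requested() {
    awk '/\(XEN\) Command line:/ {
             # Check each whitespace-separated option on the line.
             for (i = 1; i <= NF; i++)
                 if ($i == "watchdog=1") found = 1
         }
         END { exit (found ? 0 : 1) }'
}
```

Example: "xm dmesg | xen_watchdog_requested && echo requested".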

Comment 18 errata-xmlrpc 2010-03-30 07:45:00 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
