Description of problem:
check_nmi_watchdog() in xen/arch/x86/nmi.c fails its NMI test on the X86_64 kernel-xen.

Version-Release number of selected component (if applicable):
RHEL 5.2

How reproducible:
Enable the NMI watchdog in XEN on an X86_64 kernel and check the XEN boot messages.

Steps to Reproduce:
1. Add watchdog=1 to the XEN kernel line in grub.conf
2. Reboot
3. xm dmesg | grep NMI

Actual results:
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck.

Expected results:
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.

Additional info:
The X86_32 kernel worked OK on the same server hardware, but the X86_64 kernel fails, so I think this is a bug specific to X86_64.
Part of the description above is wrong: check_nmi_watchdog() fails on both i386 and X86_64. Only XEN-3.3.1 from xensource works OK. This function is critical for debugging system deadlocks; with NMI not working, our options for debugging a deadlock are very limited.
Can you try to add "watchdog=1 apic_verbosity=debug" to the hypervisor command-line, and give the full output of xm dmesg after you've booted? It might give us a clue as to where the APIC NMI programming is going wrong, since the code in check_nmi_watchdog() is exactly the same in RHEL and in upstream Xen 3.3. Thanks, Chris Lalancette
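For anyone scripting this change across several test machines, here is a minimal sketch of appending the requested options to the hypervisor line. It assumes the usual Xen stanza layout in grub.conf, where the hypervisor is loaded by a line beginning with "kernel" that points at a xen.gz image; the function name and the sample stanza in the usage note are hypothetical and not taken from this report.

```python
import re


def add_hypervisor_options(grub_conf, options):
    """Append each missing option to every 'kernel .../xen.gz...' line.

    Assumption: hypervisor options belong on the 'kernel' line of a Xen
    stanza, while the dom0 kernel and initrd sit on 'module' lines.
    """
    out = []
    for line in grub_conf.splitlines():
        # Only touch hypervisor lines; leave 'module' lines untouched.
        if re.match(r"\s*kernel\s+\S*xen\.gz", line):
            present = line.split()
            missing = [opt for opt in options if opt not in present]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        out.append(line)
    # splitlines() drops a final newline, so restore it if it was there.
    trailing = "\n" if grub_conf.endswith("\n") else ""
    return "\n".join(out) + trailing
```

Calling add_hypervisor_options(text, ["watchdog=1", "apic_verbosity=debug"]) on the contents of grub.conf returns the edited text; options already present are not duplicated, so running it twice is harmless. Writing the result back to disk is left out on purpose.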
Created attachment 342018 [details] xm dmesg when successful
Created attachment 342019 [details] xm dmesg when failed case
Hi, I attached xm dmesg output for both the successful and the failing case. check_nmi_watchdog() is the same, but much of the apic (or acpi) source code differs between the two source trees. Thanks
(In reply to comment #6)
> Hi,
>
> I attached xm dmesg for both success and failed cases. The check_nmi_watchdog()
> is same, but many apic (or acpi) source code are different between two source
> trees.

That's actually not true either; I looked through that code, and the apic code in upstream Xen and RHEL-5 Xen is more-or-less the same too. So something else is going on. I'll have to look at the logs to see if they tell us anything.

Chris Lalancette
Created attachment 350118 [details]
xm dmesg when failing on SuperMicro X7DBi+

Same problem here running Xen version 3.1.2-128.1.14.el5 on a SuperMicro X7DBi+ board. The server started rebooting at random, which is why I'm experimenting with the watchdog option.
I've uploaded a test kernel that should have a fix for this problem here: http://people.redhat.com/clalance/virttest/ Can the reporters who are having problems please download and try out this test kernel? Thanks, Chris Lalancette
in kernel-2.6.18-169.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
I've reproduced on -164.el5 and verified on -190.el5xen, and saw these results:

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-164.el5xen i686
(XEN) Command line: com2=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck.

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-190.el5xen i686
(XEN) Command line: com2=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.

Also I have checked x86_64:

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-164.el5xen x86_64
(XEN) Command line: com1=115200n8 watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 stuck. CPU#1 stuck. CPU#2 stuck. CPU#3 stuck.

[root@dell-pe1950-06 ~]# uname -rm; xm dmesg | grep -i watchdog
2.6.18-190.el5xen x86_64
(XEN) Command line: watchdog=1
(XEN) Testing NMI watchdog --- CPU#0 okay. CPU#1 okay. CPU#2 okay. CPU#3 okay.
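The same check can be automated when verifying many hosts. The sketch below parses the "Testing NMI watchdog" line in the format shown in this report and reports the per-CPU state. The function names are hypothetical, and it assumes the hypervisor prints every CPU result on that single line, as in the output above.

```python
import re


def parse_watchdog_line(dmesg):
    """Return {cpu_number: 'okay' or 'stuck'} from 'xm dmesg' text."""
    results = {}
    for line in dmesg.splitlines():
        if "Testing NMI watchdog" not in line:
            continue
        # Each CPU is reported as 'CPU#<n> okay.' or 'CPU#<n> stuck.'
        for cpu, state in re.findall(r"CPU#(\d+) (okay|stuck)\.", line):
            results[int(cpu)] = state
    return results


def watchdog_ok(dmesg):
    """True only if at least one CPU was reported and none is stuck."""
    results = parse_watchdog_line(dmesg)
    return bool(results) and all(s == "okay" for s in results.values())
```

Feeding it the captured text of xm dmesg gives a pass/fail answer per host. A missing watchdog line counts as a failure, since that usually means watchdog=1 was not on the hypervisor command line.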
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html