+++ This bug was initially created as a clone of Bug #499848 +++

Description of problem:
While testing systems in RHTS, we noticed that some specific hosts get the following message:

testing NMI watchdog ... CPU#0: NMI appears to be stuck (0)!

ibm-maple.rhts.bos.redhat.com
hp-xw6400-01.rhts.bos.redhat.com

Version-Release number of selected component (if applicable):
2.6.9-89.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Boot and install the latest version of RHEL4.8 on either host (x86-64).

Actual results:
Brought up 4 CPUs
time.c: Using PIT/TSC based timekeeping.
testing NMI watchdog ... CPU#0: NMI appears to be stuck (0)!
checking if image is initramfs... it is

Expected results:
NMI should be working.

Additional info:
We also see the NMI message with hp-xw6400-01.rhts.bos.redhat.com in RHEL5.

--- Additional comment from prarit on 2009-05-12 10:49:30 EDT ---

>Additional info:
>We also see the NMI message with hp-xw6400-01.rhts.bos.redhat.com in RHEL5

Is there a RHEL5 counterpart to this BZ?

P.
After some (lengthy) initial investigation it appears that these HP systems require MSR_ARCH_PERFMON_PERFCTR0, and not MSR_ARCH_PERFMON_PERFCTR1, to be used in the NMI code. The idea for this came from an investigation of the code prior to the new NMI code being put into the kernel. The old code booted correctly and used MSR_ARCH_PERFMON_PERFCTR0.

Doing (sorry for the cut-and-paste):

diff --git a/arch/x86_64/kernel/perfctr-watchdog.c b/arch/x86_64/kernel/perfctr-
index f68e71c..e89b4da 100644
--- a/arch/x86_64/kernel/perfctr-watchdog.c
+++ b/arch/x86_64/kernel/perfctr-watchdog.c
@@ -625,8 +625,8 @@ static struct wd_ops intel_arch_wd_ops = {
        .setup = setup_intel_arch_watchdog,
        .rearm = p6_rearm,
        .stop = single_msr_stop_watchdog,
-       .perfctr = MSR_ARCH_PERFMON_PERFCTR1,
-       .evntsel = MSR_ARCH_PERFMON_EVENTSEL1,
+       .perfctr = MSR_ARCH_PERFMON_PERFCTR0,
+       .evntsel = MSR_ARCH_PERFMON_EVENTSEL0,
 };

resolves the problem. However, this patch doesn't explain *WHY* it resolves the problem. I have no idea what it fixed -- AFAICT it's magic :)

tcamuso -- there are a couple of possibilities here. I have not booted an upstream kernel on this system (yet). It could be broken upstream for all I know.

Of more interest, however, is this piece of code:

static void probe_nmi_watchdog(void)
{
        switch (boot_cpu_data.x86_vendor) {
        case X86_VENDOR_AMD:
                if (boot_cpu_data.x86 != 6 && boot_cpu_data.x86 != 15 &&
                    boot_cpu_data.x86 != 16)
                        return;
                wd_ops = &k7_wd_ops;
                break;
        case X86_VENDOR_INTEL:
                /* Work around Core Duo (Yonah) errata AE49 where perfctr1
                   doesn't have a working enable bit. */
                if (boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model == 14) {
                        intel_arch_wd_ops.perfctr = MSR_ARCH_PERFMON_PERFCTR0;
                        intel_arch_wd_ops.evntsel = MSR_ARCH_PERFMON_EVENTSEL0;
                }
                if (cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
                        wd_ops = &intel_arch_wd_ops;
                        break;
                }
                switch (boot_cpu_data.x86) {
                case 6:
                        if (boot_cpu_data.x86_model > 0xd)

in which the Core Duos are special-cased.
AFAICT, the xw6400 does not have these processors. From /proc/cpuinfo:

vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel(R) CPU @ 1.86GHz

Is it possible that HP noted issues with the Core Duo and has done something in the BIOS for all HP xw series systems that makes the use of MSR_ARCH_PERFMON_PERFCTR1 invalid?

P.
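For comparing hosts, the family/model check can be done mechanically on /proc/cpuinfo text like the excerpt above. The following is a minimal userspace sketch, not kernel code; cpuinfo_field() is a hypothetical helper name, and it assumes the usual "key : value" layout with a decimal value. It compares the whole key, so "model" does not accidentally match "model name".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find the first "key : value" line in a
 * /proc/cpuinfo-style buffer whose key equals `key` exactly and parse
 * its value as a decimal number. Returns 0 on success, -1 otherwise.
 */
int cpuinfo_field(const char *text, const char *key, long *out)
{
    assert(text != NULL && key != NULL && out != NULL);

    size_t klen = strlen(key);
    const char *line = text;

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);
        const char *colon = memchr(line, ':', len);

        if (colon != NULL) {
            /* Trim the spaces/tabs that pad the key before the colon. */
            const char *kend = colon;
            while (kend > line && (kend[-1] == ' ' || kend[-1] == '\t'))
                kend--;

            if ((size_t)(kend - line) == klen &&
                strncmp(line, key, klen) == 0) {
                char *end;
                long v = strtol(colon + 1, &end, 10);

                /* Require digits, and do not read past this line. */
                if (end != colon + 1 && end <= line + len) {
                    *out = v;
                    return 0;
                }
                return -1;
            }
        }
        line = eol ? eol + 1 : line + len;
    }
    return -1;
}
```

Called with the keys "cpu family" and "model" on the xw6400 excerpt, this yields 6 and 15, which is the pair that the existing model == 14 special case does not cover.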
This same problem was seen on a dell380-2.rhts.bos.redhat.com this morning. P.
I have added Jeff.Burrell, HP workstations, to the CC list. I can ask my BIOS contacts if they know anything about this, but the manifestation of this same problem on a Dell gives the problem a new twist. Maybe Intel can shed some light on this.
Back to the xw6400-01.rhts -- the upstream kernel fails to initialize the NMI when booted with "nmi_watchdog=2" (lapic nmi). P.
If I apply the patch in comment #1 upstream, the upstream kernel properly initializes the NMI:

x7040600070406, new 0x7010600070106
CPU3: Genuine Intel(R) CPU @ 1.86GHz stepping 04
checking TSC synchronization [CPU#0 -> CPU#3]: passed.
Brought up 4 CPUs
Total of 4 processors activated (14895.44 BogoMIPS).
native_smp_cpus_done: calling check_nmi_watchdog <<< my debug
check_nmi_watchdog: called <<< my debug
Testing NMI watchdog ... OK.
net_namespace: 1552 bytes
Booting paravirtualized kernel on bare hardware
Time: 12:52:27 Date: 05/15/09
NET: Registered protocol family 16
No dock devices found.
ACPI: bus type pci registered

P.
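When checking several hosts, the two outcomes quoted in this report ("NMI appears to be stuck" and "Testing NMI watchdog ... OK.") can be told apart by scanning the boot log. This is a rough sketch only; classify_nmi_boot_log() and the enum are hypothetical names, and the matched strings are just the ones seen in this bug, so other kernel versions may word the messages differently.

```c
#include <assert.h>
#include <string.h>

enum nmi_status { NMI_UNKNOWN, NMI_OK, NMI_STUCK };

/*
 * Hypothetical helper: classify a boot log by the NMI watchdog
 * self-test messages quoted in this report.
 */
enum nmi_status classify_nmi_boot_log(const char *log)
{
    assert(log != NULL);

    /* Failure case, e.g. "CPU#0: NMI appears to be stuck (0)!" */
    if (strstr(log, "NMI appears to be stuck") != NULL)
        return NMI_STUCK;

    /* The leading letter is skipped so that both "testing" and
     * "Testing" match. */
    if (strstr(log, "esting NMI watchdog ... OK") != NULL)
        return NMI_OK;

    /* The self-test message was not found in this log. */
    return NMI_UNKNOWN;
}
```

The stuck check comes first because the failure text follows the "testing NMI watchdog ..." prefix on the same line.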
http://download.intel.com/design/mobile/SPECUPDT/30922214.pdf

The AE49 text in this errata document also appears in the errata for other Intel processors, which makes it likely that this is the culprit. I do not, however, have an easy way of searching all the errata documents for the register to see if a particular processor suffers from this problem. I've identified at least two other processors:

http://download.intel.com/design/mobile/SPECUPDT/31651509.pdf

and

http://sunsite.rediris.es/pub/mirror/intel/intarch/SPECUPDT/31139202.pdf

This leads me to believe that this is a HW issue rather than a software issue.

P.
It looks like the errata AE49 (Intel® Core™ Duo processor and Intel® Core™ Solo processor on 65nm process), AN49 (Intel® Pentium® dual-core processor), and AF48 (Dual-Core Intel® Xeon® processor LV) would affect the operation of oprofile. The spec update states that some versions of the Intel® Pentium® dual-core processor do not suffer from the erratum.
(In reply to comment #7)
> It looks like the erratas AE49 (Intel® Core™ Duo processor and Intel® Core™
> Solo processor on 65nm process), AN49 (Intel® Pentium® dual-core processor),
> and AF48 (Dual-Core Intel® Xeon® processor LV) would affect the operation of
> oprofile. Spec update states that some versions Intel® Pentium® dual-core
> processor do not suffer from the errata.

Yes -- jvillalo & I did a Google search on IA32_CR_PerfEvtSel0 to find other processors that require the workaround. The issue is that the workaround implies that perfctr1 should *work* if the event select register for perfctr0 is reset -- but empirically it doesn't work...

This patch chunk should resolve the problem. However, I'm worried about other systems failing because we do not have access to all the processor errata. I also cannot find a Low Voltage Xeon to determine its model # and family #.

Sorry for the cut-and-paste:

@@ -639,9 +642,14 @@ static void probe_nmi_watchdog(void)
                wd_ops = &k7_wd_ops;
                break;
        case X86_VENDOR_INTEL:
-               /* Work around Core Duo (Yonah) errata AE49 where perfctr1
-                  doesn't have a working enable bit. */
-               if (boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model == 14) {
+               /* Work around for where perfctr1 doesn't have a working
+                * enable bit as described in the following errata:
+                * AE49 Core Duo and Intel Core Solo 65 nm
+                * AN49 Intel Pentium Dual-Core
+                * AF49 Dual-Core Intel Xeon Processor LV
+                */
+               if ((boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model == 14) ||
+                   (boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model == 15)) {
                        intel_arch_wd_ops.perfctr = MSR_ARCH_PERFMON_PERFCTR0;
                        intel_arch_wd_ops.evntsel = MSR_ARCH_PERFMON_EVENTSEL0;
                }
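For anyone who wants to reason about the check outside the kernel, the selection logic in the patch chunk above can be sketched in userspace C. This is an illustration only: struct wd_msrs and apply_perfctr1_workaround() are hypothetical names rather than kernel symbols, and the MSR numbers are the architectural IA32_PMC0/1 and IA32_PERFEVTSEL0/1 addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural perfmon MSR addresses (IA32_PMCx / IA32_PERFEVTSELx). */
#define MSR_ARCH_PERFMON_PERFCTR0   0xc1
#define MSR_ARCH_PERFMON_PERFCTR1   0xc2
#define MSR_ARCH_PERFMON_EVENTSEL0  0x186
#define MSR_ARCH_PERFMON_EVENTSEL1  0x187

/* Hypothetical stand-in for the two wd_ops fields the patch touches. */
struct wd_msrs {
    uint32_t perfctr;
    uint32_t evntsel;
};

/*
 * Mirror of the proposed check: on family 6, models 14 and 15, switch
 * the watchdog from counter 1 to counter 0. Other CPUs are left alone.
 */
void apply_perfctr1_workaround(struct wd_msrs *ops,
                               unsigned int family, unsigned int model)
{
    assert(ops != NULL);

    if (family == 6 && (model == 14 || model == 15)) {
        ops->perfctr = MSR_ARCH_PERFMON_PERFCTR0;
        ops->evntsel = MSR_ARCH_PERFMON_EVENTSEL0;
    }
}
```

Starting from the default PERFCTR1/EVENTSEL1 pair, a family 6 model 15 CPU such as the xw6400 ends up on PERFCTR0/EVENTSEL0, while any model outside the two listed keeps the default.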
Ah-ha! Found a Xeon LV :)

/proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 14
model name : Intel(R) Xeon(TM) CPU 000 @ 2.00GHz
stepping : 8
cpu MHz : 2000.395

So the above patch covers this case as well.

P.
I'm going to sit on this patch for a few days. An upstream fix is required and I'm waiting for Intel's feedback on the patch chunk in comment #8. IMO, it's LKML-worthy, but I would like to get a sign-off from Venki & Suresh before continuing. P.
Created attachment 345749 [details] RHEL5 fix for this issue
Created attachment 345750 [details] Upstream patch that fixes this issue
Created attachment 345898 [details] RHEL5 fix for this issue
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-152.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html