Bug 688547
Summary: | RHEL6.1-20110316.1 dell-pe2800 NMI received for unknown reason | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Petr Beňas <pbenas> | |
Component: | kernel | Assignee: | Don Zickus <dzickus> | |
Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> | |
Severity: | unspecified | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 6.1 | CC: | balkov, bmarson, bugproxy, charlotte.richardson, cye, david.bulkow, dbayly, eddie.williams, jparadis, kevin.paetzold, leamhall, mike, mxnovo, nstraz, phan, pstehlik, robert.evans, syeghiay | |
Target Milestone: | rc | Keywords: | Regression | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | kernel-2.6.32-131.0.5.el6 | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 692677 1020769 | Environment: | ||
Last Closed: | 2011-05-19 12:42:57 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 692677, 694811, 1020769, 1300182 |
Description
Petr Beňas
2011-03-17 10:55:38 UTC
Yeah, we accepted a patch into -119 to fix p4 machines from swallowing all the nmis in the perf layer. All we did is expose how broken the perf nmi handler is on a p4 machine. Sucks. I poked at this code before and it is convoluted. Looks like I will have to poke at it again to finally fix it (or I will just revert the patch that exposed this problem and fix it properly in 6.2).
Cheers, Don

*** Bug 683097 has been marked as a duplicate of this bug. ***
*** Bug 688711 has been marked as a duplicate of this bug. ***
*** Bug 689885 has been marked as a duplicate of this bug. ***

Adding Regression flag since this is a new message on affected systems and it causes a lot of log messages on such systems.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Got both Unknown Reason 21 and 31 on a Dell PE 6850 with 6.1 x86-64 Beta downloaded earlier this week.

(In reply to comment #13)
> Got both Unknown Reason 21 and 31 on a Dell PE 6850 with 6.1 x86-64 Beta
> downloaded earlier this week.

The patches are not pulled in yet.
Cheers, Don

*** Bug 693053 has been marked as a duplicate of this bug. ***
*** Bug 692973 has been marked as a duplicate of this bug. ***

------- Comment From masbock.com 2011-04-08 15:39 EDT-------
This bug focuses on the Uhhuh problem with family 15 processors.
In the original post I also reported a system where the NMI watchdog is not enabled at all. This is an AMD based system (LS42 blade). This appears to be a separate problem. Will track that one separately from here on.

(In reply to comment #17)
> ------- Comment From masbock.com 2011-04-08 15:39 EDT-------
> This bug focuses on the Uhhuh problem with family 15 processors.
> In the original post I also reported a system where the NMI watchdog is not
> enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> separate problem. Will track that one separately from here on.

That one might be related to bz689065. You will be able to tell in the dmesg output if the system is AMD and has "Broken BIOS" in the dmesg log. Otherwise you need to attach the dmesg log for me to analyze.
Cheers, Don

------- Comment From masbock.com 2011-04-08 16:46 EDT-------
(In reply to comment #20)
> > In the original post I also reported a system where the NMI watchdog is not
> > enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> > separate problem. Will track that one separately from here on.
> >
> > That one might be related to bz689065. You will be able to tell in the dmesg
> > output if the system is AMD and has "Broken BIOS" in the dmesg log.
>
On the LS42 we get:
Performance Events: Broken BIOS detected, using software events only.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is 430076)
NMI watchdog disabled for cpu0: unable to create perf event: -2

(I don't have access to BZ689065)

- Max

(In reply to comment #19)
> ------- Comment From masbock.com 2011-04-08 16:46 EDT-------
> (In reply to comment #20)
> > > In the original post I also reported a system where the NMI watchdog is not
> > > enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> > > separate problem. Will track that one separately from here on.
> > >
> > > That one might be related to bz689065. You will be able to tell in the dmesg
> > > output if the system is AMD and has "Broken BIOS" in the dmesg log.
> >
> On the LS42 we get:
> Performance Events: Broken BIOS detected, using software events only.
> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is
> 430076)
> NMI watchdog disabled for cpu0: unable to create perf event: -2
>
> (I don't have access to BZ689065)
>
> - Max

The new nmi watchdog detects if someone is currently using the perf counter to avoid resource contention with the BIOS. Unfortunately, AMD boxes used it for tsc calculations and forgot to disable it. As a result the perf subsystem thinks the BIOS is using it and prevents the nmi watchdog from being enabled. The fix was to remove the obsoleted check on AMD boxes. It is already included in one of the snap builds.
Cheers, Don

Don, my box is a Dell 6850 with 64-bit old Intel chips. It's reporting CPU errors 21 and 31. Is this the same thing or something different? Do you want any output for it? Nothing until Monday, though...I'm enjoying my weekend. :)

I've currently masked the problem by passing nmi_watchdog=0 on the kernel line. This is a dev box so we can play with it if necessary.

Leam

(In reply to comment #21)
> Don, my box is a Dell 6850 with 64-bit old Intel chips. It's reporting CPU
> errors 21 and 31. Is this the same thing or something different? Do you want
> any output for it? Nothing until Monday, though...I'm enjoying my weekend. :)
>
> I've currently masked the problem by passing nmi_watchdog=0 on the kernel line.
> This is a dev box so we can play with it if necessary.
>
> Leam

Hi Leam,
If you look at the output of 'cat /proc/cpuinfo', the 'cpu family' should be 15. If not then you might have a different issue.
Cheers, Don

Thanks Don! Same issue, based on:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 8
cpu MHz : 3200.000

Sounds like you already have the issue resolved, though your fix mentions AMD boxes. Same issue for "GenuineNotAMD"?
Leam

Hi Leam,
Sorry for the confusion. This bz is for Intel family 15 chips, another bz 689065 deals with the AMD problem.
I was just trying to help out another reporter as I dup'd their issue over here and they asked about the AMD problem.
Cheers, Don

Don, Is there anything I can provide to help with the Intel family 15 trouble-shooting?
Leam

Hi Leam,
No, we are alright. We have machines that reproduce the problem. Finding the strange interactions with the hardware PMU is the tricky part. The fix I posted just swallows all the NMIs for now, until we can find a proper fix in 6.2. Thanks for the offer though.
Cheers, Don

*** Bug 689658 has been marked as a duplicate of this bug. ***

------- Comment From shubgoya.com 2011-04-18 13:08 EDT-------
I was able to reproduce this issue with snap3 kernel on x3850.

*** Bug 697414 has been marked as a duplicate of this bug. ***

------- Comment From tpnoonan.com 2011-04-19 16:29 EDT-------
Hi Red Hat. Once fixed in rhel6.2, please consider for rhel6.1.z. Thanks

Patch(es) available on kernel-2.6.32-131.0.5.el6

I ran kernel-2.6.32-131.0.5.el6.x86_64 through a normal load and have not seen any of the NMI messages I was seeing before. Verified.

------- Comment From shubgoya.com 2011-04-29 08:29 EDT-------
I am verifying this issue in snap5 release. Will post my results ASAP.

------- Comment From shubgoya.com 2011-05-05 15:25 EDT-------
I verified this issue on one of the affected platforms (x3850) with snap5 kernel and did not see those 'Dazed and Confused' NMI messages under load. Looks like kernel 2.6.32-131.0.5.el6 solves the issue.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-0542.html

------- Comment From tpnoonan.com 2011-06-01 16:52 EDT-------
IBM is no longer asking for rhel6.1.z, a fix for RHBZ692677 in rhel6.2 is okay

Hello.
I have [Firmware bug] on my server (processor - Intel® Xeon® Processor L5530 (8M Cache, 2.40 GHz, 5.86 GT/s Intel® QPI)). When I reboot the server I get this error: The BIOS has corrupted hw-PMU resources (............) ERST: Can not request iomem region............ How can I fix it?

(In reply to comment #45)
> Hello. I have [Firmware bug] on my server (processor - Intel® Xeon® Processor
> L5530 (8M Cache, 2.40 GHz, 5.86 GT/s Intel® QPI)). When I reboot the server I get
> this error: The BIOS has corrupted hw-PMU resources (............) ERST: Can not
> request iomem region............ How can I fix it?

Hi Max,
You can start by opening a new bugzilla and attaching a more complete dmesg log so we can have a better idea of what is going on. :-) The reason is this bugzilla is closed and developers like myself will not look at it any more.
Thanks, Don

Hello all. I am able to reproduce this bug on start-up 100% of the time on the IBM x3800 and X3950 servers that use the Intel Xeon CPU family type 15 processors. This happens during boot up and I get this message before the system locks up:

Uhhuh. NMI received for unknown reason 35 on CPU 0.
Do you have a strange power saving mode enabled?

This occurs in all the 2.6.32-220.X series of kernels. Kind of annoying as I am stuck using the 2.6.32-131 kernel for the time being, which works just fine. The bios on these servers, from what I can tell, does not incorporate any power saving features.

(In reply to comment #47)
> Hello all. I am able to reproduce this bug on start-up 100% of the time on the
> IBM x3800 and X3950 servers that use the Intel Xeon CPU family type 15
> processors.
> This happens during boot up and I get this message before the system locks up:
>
> Uhhuh. NMI received for unknown reason 35 on CPU 0.
> Do you have a strange power saving mode enabled?
>
> This occurs in all the 2.6.32-220.X series of kernels. Kind of annoying as I
> am stuck using the 2.6.32-131 kernel for the time being, which works just fine.
> The bios on these servers, from what I can tell, does not incorporate any
> power saving features.

Hi Mike,
You will need to open a new bz and if possible attach a console log (or dmesg output if you can login). cc myself on the bz. Also add nmi_watchdog=0 on the commandline to see if it disappears.
Cheers, Don
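
For anyone triaging a similar report, the three checks asked for in this thread ('cpu family' 15 in /proc/cpuinfo, "Broken BIOS" in the dmesg log, nmi_watchdog=0 on the kernel command line) could be scripted roughly as below. This is only a sketch: the function names are made up for illustration and are not part of any Red Hat tool, and the checks are plain string matches on text you collect yourself.

```python
def cpu_family(cpuinfo_text):
    """Return (vendor_id, cpu family) for the first processor listed.

    Either element is None if the field is missing from the text.
    """
    vendor = None
    family = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "key : value" pair
        key, value = key.strip(), value.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value
        elif key == "cpu family" and family is None:
            family = int(value)
        if vendor is not None and family is not None:
            break
    return vendor, family


def is_intel_family_15(cpuinfo_text):
    """True for the Intel family 15 chips this bug is about."""
    vendor, family = cpu_family(cpuinfo_text)
    return vendor == "GenuineIntel" and family == 15


def has_broken_bios_message(dmesg_text):
    """True if dmesg carries the 'Broken BIOS' line (the separate AMD issue)."""
    return "Broken BIOS" in dmesg_text


def nmi_watchdog_disabled(cmdline):
    """True if nmi_watchdog=0 appears as a kernel command line argument."""
    return "nmi_watchdog=0" in cmdline.split()
```

On a live system the inputs would be the contents of /proc/cpuinfo, the output of dmesg, and the contents of /proc/cmdline. A box that matches is_intel_family_15 and shows the "NMI received for unknown reason" messages fits this bz; a "Broken BIOS" hit points at the separate AMD problem instead.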