Bug 688547 - RHEL6.1-20110316.1 dell-pe2800 NMI received for unknown reason
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Don Zickus
QA Contact: Red Hat Kernel QE team
Keywords: Regression
Duplicates: 683097 688711 689658 689885 692973 693053 697414
Depends On:
Blocks: 1300182 692677 694811 1020769
Reported: 2011-03-17 06:55 EDT by Petr Beňas
Modified: 2016-01-20 03:09 EST
Fixed In Version: kernel-2.6.32-131.0.5.el6
Doc Type: Bug Fix
Clones: 692677 1020769
Last Closed: 2011-05-19 08:42:57 EDT


Description Petr Beňas 2011-03-17 06:55:38 EDT
Description of problem:
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

Version-Release number of selected component (if applicable):
2.6.32-122.el6.x86_64

How reproducible:
unknown

Steps to Reproduce:
1. Install RHEL6.1-20110316.1 on dell-pe2800-01.rhts.eng.bos.redhat.com
2. log in

Actual results:
Messages from syslogd like the following appear on the console and in dmesg:
Message from syslogd@dell-pe2800-01 at Mar 17 06:51:58 ...
 kernel:Uhhuh. NMI received for unknown reason 21 on CPU 0.

Message from syslogd@dell-pe2800-01 at Mar 17 06:51:58 ...
 kernel:Do you have a strange power saving mode enabled?

Message from syslogd@dell-pe2800-01 at Mar 17 06:51:58 ...
 kernel:Dazed and confused, but trying to continue
  
Expected results:
no such messages

Additional info:
Reasons 00, 21 and 31 were seen on CPUs 0, 1 and 2.
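
A minimal Python sketch for pulling the reason code and CPU number out of these messages. It rests on two assumptions that are not stated in this report: that the two-digit reason is a hexadecimal status byte (read from I/O port 0x61 on PC-compatible hardware), and that the kernel only treats an NMI as identified when bit 7 (parity/SERR) or bit 6 (I/O check) of that byte is set. The function names are made up for illustration.

```python
import re

# Assumed meaning of the two high bits of the NMI status byte.
SERR_OR_PARITY = 0x80  # memory parity error / PCI SERR#
IO_CHECK = 0x40        # I/O channel check

# The reason value in the log line is treated as two hex digits.
NMI_LINE = re.compile(
    r"NMI received for unknown reason ([0-9a-fA-F]{2}) on CPU (\d+)"
)


def parse_unknown_nmi(line):
    """Return (reason, cpu) for an 'unknown reason' NMI line, else None."""
    match = NMI_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2))


def known_source_bits(reason):
    """Return the identifiable-source bits that are set in the reason byte."""
    return reason & (SERR_OR_PARITY | IO_CHECK)


sample = " kernel:Uhhuh. NMI received for unknown reason 21 on CPU 0."
reason, cpu = parse_unknown_nmi(sample)
print("cpu=%d reason=0x%02x known_bits=0x%02x"
      % (cpu, reason, known_source_bits(reason)))
```

Under those assumptions, none of the values reported here (00, 21, 31) has either source bit set, which would be consistent with the kernel reporting the NMI as having an unknown reason.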
Comment 2 Don Zickus 2011-03-17 17:23:55 EDT
Yeah, we accepted a patch into -119 to stop P4 machines from swallowing all the NMIs in the perf layer.  All that did was expose how broken the perf NMI handler is on a P4 machine.  Sucks.

I poked at this code before and it is convoluted.  Looks like I will have to poke at it again to finally fix it (or I will just revert the patch that exposed this problem and fix it properly in 6.2).

Cheers,
Don
Comment 3 Don Zickus 2011-03-17 17:26:02 EDT
*** Bug 683097 has been marked as a duplicate of this bug. ***
Comment 4 Don Zickus 2011-03-18 11:11:58 EDT
*** Bug 688711 has been marked as a duplicate of this bug. ***
Comment 5 Don Zickus 2011-03-22 17:22:43 EDT
*** Bug 689885 has been marked as a duplicate of this bug. ***
Comment 6 Nate Straz 2011-03-29 11:20:20 EDT
Adding Regression flag since this is a new message on affected systems and it causes a lot of log messages on such systems.
Comment 10 RHEL Product and Program Management 2011-03-31 16:39:47 EDT
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 13 Leam 2011-04-06 12:30:38 EDT
Got both Unknown Reason 21 and 31 on a Dell PE 6850 with 6.1 x86-64 Beta downloaded earlier this week.
Comment 14 Don Zickus 2011-04-06 17:08:04 EDT
(In reply to comment #13)
> Got both Unknown Reason 21 and 31 on a Dell PE 6850 with 6.1 x86-64 Beta
> downloaded earlier this week.

The patches are not pulled in yet.

Cheers,
Don
Comment 15 Don Zickus 2011-04-08 09:11:15 EDT
*** Bug 693053 has been marked as a duplicate of this bug. ***
Comment 16 Don Zickus 2011-04-08 13:29:06 EDT
*** Bug 692973 has been marked as a duplicate of this bug. ***
Comment 17 IBM Bug Proxy 2011-04-08 15:41:47 EDT
------- Comment From masbock@us.ibm.com 2011-04-08 15:39 EDT-------
This bug focuses on the Uhhuh problem with family 15 processors.
In the original post I also reported a system where the NMI watchdog is not enabled at all. This is an AMD based system (LS42 blade). This appears to be a separate problem. Will track that one separately from here on.
Comment 18 Don Zickus 2011-04-08 16:18:16 EDT
(In reply to comment #17)
> ------- Comment From masbock@us.ibm.com 2011-04-08 15:39 EDT-------
> This bug focuses on the Uhhuh problem with family 15 processors.
> In the original post I also reported a system where the NMI watchdog is not
> enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> separate problem. Will track that one separately from here on.

That one might be related to bz689065.  You will be able to tell in the dmesg output if the system is AMD and has "Broken BIOS" in the dmesg log.

Otherwise you need to attach the dmesg log for me to analyze.

Cheers,
Don
Comment 19 IBM Bug Proxy 2011-04-08 16:50:32 EDT
------- Comment From masbock@us.ibm.com 2011-04-08 16:46 EDT-------
(In reply to comment #20)
> > In the original post I also reported a system where the NMI watchdog is not
> > enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> > separate problem. Will track that one separately from here on.
>
> That one might be related to bz689065.  You will be able to tell in the dmesg
> output if the system is AMD and has "Broken BIOS" in the dmesg log.
>
On the LS42 we get:
Performance Events: Broken BIOS detected, using software events only.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is 430076)
NMI watchdog disabled for cpu0: unable to create perf event: -2
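
A small Python sketch of the dmesg check suggested in comment 18, using the three lines above as sample input. The function name and the exact substrings matched are choices made for this example; other kernel versions may word these messages differently.

```python
def summarize_pmu_messages(dmesg_text):
    """Flag the two conditions discussed above in a dmesg dump."""
    lines = dmesg_text.splitlines()
    return {
        # Perf fell back to software events because of the BIOS.
        "broken_bios": any("Broken BIOS detected" in line for line in lines),
        # The NMI watchdog could not create its perf event.
        "watchdog_disabled": any(
            "NMI watchdog disabled" in line for line in lines
        ),
    }


sample = """\
Performance Events: Broken BIOS detected, using software events only.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is 430076)
NMI watchdog disabled for cpu0: unable to create perf event: -2
"""
print(summarize_pmu_messages(sample))
```

On a live system the same function could be fed the captured output of the dmesg command.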

(I don't have access to BZ689065)

- Max
Comment 20 Don Zickus 2011-04-08 17:08:33 EDT
(In reply to comment #19)
> ------- Comment From masbock@us.ibm.com 2011-04-08 16:46 EDT-------
> (In reply to comment #20)
> > > In the original post I also reported a system where the NMI watchdog is not
> > > enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> > > separate problem. Will track that one separately from here on.
> >
> > That one might be related to bz689065.  You will be able to tell in the dmesg
> > output if the system is AMD and has "Broken BIOS" in the dmesg log.
> >
> On the LS42 we get:
> Performance Events: Broken BIOS detected, using software events only.
> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is
> 430076)
> NMI watchdog disabled for cpu0: unable to create perf event: -2
> 
> (I don't have access to BZ689065)
> 
> - Max

The new nmi watchdog detects if someone is currently using the perf counter to avoid resource contention with the BIOS.  Unfortunately, AMD boxes used it for tsc calculations and forgot to disable it.  As a result the perf subsystem thinks the BIOS is using it and prevents the nmi watchdog from being enabled.

The fix was to remove the obsoleted check on AMD boxes.  It is already included in one of the snap builds.
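
To illustrate the "counter in use" condition, here is a hedged Python sketch that decodes the value from the LS42 log (MSR c0010000 is 430076). It assumes that this MSR is the first AMD performance event-select register, that the usual x86 event-select bit layout applies, and that "in use" means the enable bit (bit 22) is set. The precise check the kernel performs is not shown in this report, so treat this as an approximation.

```python
# Assumed event-select layout (only the fields needed here).
EVENT_MASK = 0xFF      # bits 7:0, low byte of the event number
UNIT_MASK_SHIFT = 8    # bits 15:8
USR = 1 << 16          # count in user mode
OS = 1 << 17           # count in kernel mode
ENABLE = 1 << 22       # counter enabled


def decode_event_select(value):
    """Split an event-select value into the fields listed above."""
    return {
        "event": value & EVENT_MASK,
        "unit_mask": (value >> UNIT_MASK_SHIFT) & 0xFF,
        "usr": bool(value & USR),
        "os": bool(value & OS),
        "enabled": bool(value & ENABLE),
    }


fields = decode_event_select(0x430076)
print("event=0x%02x enabled=%s" % (fields["event"], fields["enabled"]))
```

With these assumptions the value decodes to event 0x76, counting in both user and kernel mode, with the enable bit still set. A counter left enabled like this is what would make the perf subsystem conclude that the BIOS owns it.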

Cheers,
Don
Comment 21 Leam 2011-04-08 21:14:28 EDT
Don, my box is a Dell 6850 with 64-bit old Intel chips. It's reporting CPU errors 21 and 31. Is this the same thing or something different? Do you want any output for it? Nothing until Monday, though...I'm enjoying my weekend.  :)

I've currently masked the problem by passing nmi_watchdog=0 on the kernel line. This is a dev box so we can play with it if necessary.

Leam
Comment 22 Don Zickus 2011-04-11 10:01:02 EDT
(In reply to comment #21)
> Don, my box is a Dell 6850 with 64-bit old Intel chips. It's reporting CPU
> errors 21 and 31. Is this the same thing or something different? Do you want
> any output for it? Nothing until Monday, though...I'm enjoying my weekend.  :)
> 
> I've currently masked the problem by passing nmi_watchdog=0 on the kernel line.
> This is a dev box so we can play with it if necessary.
> 
> Leam


Hi Leam,

If you look at the output of 'cat /proc/cpuinfo', the 'cpu family' should be 15.  If not then you might have a different issue.

Cheers,
Don
Comment 23 Leam 2011-04-11 11:25:29 EDT
Thanks Don!

Same issue, based on:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 8
cpu MHz         : 3200.000
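
The family check from comment 22 can be sketched in Python as below, using the /proc/cpuinfo excerpt above as sample input. The helper names are hypothetical, and the vendor test is an addition made here to keep AMD family 15 parts (tracked in bz 689065) out of the match.

```python
def parse_cpuinfo(cpuinfo_text):
    """Return one dict of key/value strings per processor entry."""
    processors, current = [], {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # A line without a colon (normally blank) ends an entry.
            if current:
                processors.append(current)
                current = {}
            continue
        current[key.strip()] = value.strip()
    if current:
        processors.append(current)
    return processors


def is_intel_family_15(cpu):
    """True if the entry is an Intel CPU reporting 'cpu family' 15."""
    return (cpu.get("vendor_id") == "GenuineIntel"
            and cpu.get("cpu family") == "15")


sample = """\
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 8
cpu MHz         : 3200.000
"""
cpus = parse_cpuinfo(sample)
print([is_intel_family_15(cpu) for cpu in cpus])
```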


Sounds like you already have the issue resolved, though your fix mentions AMD boxes. Same issue for "GenuineNotAMD"?  

Leam
Comment 24 Don Zickus 2011-04-11 11:33:50 EDT
Hi Leam,

Sorry for the confusion.  This bz is for Intel family 15 chips; another bz, 689065, deals with the AMD problem.  I was just trying to help out another reporter, as I dup'd their issue over here and they asked about the AMD problem.

Cheers,
Don
Comment 25 Leam 2011-04-11 12:59:04 EDT
Don,

Is there anything I can provide to help with the Intel family 15 trouble-shooting?

Leam
Comment 26 Don Zickus 2011-04-11 13:11:31 EDT
Hi Leam,

No, we are alright. We have machines that reproduce the problem.  Finding the strange interactions with the hardware PMU is the tricky part.  The fix I posted just swallows all the NMIs for now, until we can find a proper fix in 6.2.

Thanks for the offer though.

Cheers,
Don
Comment 27 Don Zickus 2011-04-13 13:29:36 EDT
*** Bug 689658 has been marked as a duplicate of this bug. ***
Comment 30 IBM Bug Proxy 2011-04-18 13:11:07 EDT
------- Comment From shubgoya@in.ibm.com 2011-04-18 13:08 EDT-------
I was able to reproduce this issue with snap3 kernel on x3850.
Comment 31 Don Zickus 2011-04-18 13:29:02 EDT
*** Bug 697414 has been marked as a duplicate of this bug. ***
Comment 35 IBM Bug Proxy 2011-04-19 16:32:05 EDT
------- Comment From tpnoonan@us.ibm.com 2011-04-19 16:29 EDT-------
Hi Red Hat. Once fixed in rhel6.2, please consider for rhel6.1.z. Thanks
Comment 36 Aristeu Rozanski 2011-04-20 18:09:08 EDT
Patch(es) available on kernel-2.6.32-131.0.5.el6
Comment 39 Nate Straz 2011-04-26 10:00:55 EDT
I ran kernel-2.6.32-131.0.5.el6.x86_64 through a normal load and have not seen any of the NMI messages I was seeing before.  Verified.
Comment 40 IBM Bug Proxy 2011-04-29 08:30:24 EDT
------- Comment From shubgoya@in.ibm.com 2011-04-29 08:29 EDT-------
I am verifying this issue in snap5 release. Will post my results ASAP.
Comment 41 IBM Bug Proxy 2011-05-05 15:30:43 EDT
------- Comment From shubgoya@in.ibm.com 2011-05-05 15:25 EDT-------
I verified this issue on one of the affected platforms (x3850) with the snap5 kernel and did not see those 'Dazed and Confused' NMI messages under load. Looks like kernel 2.6.32-131.0.5.el6 solves the issue.
Comment 42 errata-xmlrpc 2011-05-19 08:42:57 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
Comment 43 IBM Bug Proxy 2011-06-01 17:01:34 EDT
------- Comment From tpnoonan@us.ibm.com 2011-06-01 16:52 EDT-------
IBM is no longer asking for rhel6.1.z; a fix for RHBZ692677 in rhel6.2 is okay.
Comment 45 Max Novaha 2012-02-03 03:41:44 EST
Hello. I have a [Firmware Bug] message on my server (processor: Intel® Xeon® Processor L5530 (8M Cache, 2.40 GHz, 5.86 GT/s Intel® QPI)). When I reboot the server I get the error: The BIOS has corrupted hw-PMU resources (............) ERST: Can not request iomem region............ How can I fix it?
Comment 46 Don Zickus 2012-02-03 10:29:57 EST
(In reply to comment #45)
> Hello. I have a [Firmware Bug] message on my server (processor: Intel® Xeon®
> Processor L5530 (8M Cache, 2.40 GHz, 5.86 GT/s Intel® QPI)). When I reboot the
> server I get the error: The BIOS has corrupted hw-PMU resources (............)
> ERST: Can not request iomem region............ How can I fix it?

Hi Max,

You can start by opening a new bugzilla and attaching a more complete dmesg log so we can have a better idea of what is going on. :-)

The reason is that this bugzilla is closed, and developers like me will not look at it anymore.

Thanks,
Don
Comment 47 Mike Neuliep 2012-02-16 13:10:02 EST
Hello all.  I am able to reproduce this bug on start-up 100% of the time on the IBM x3800 and X3950 servers that use the Intel Xeon CPU family type 15 processors.
This happens during boot up and I get this message before the system locks up:

Uhhuh. NMI received for unknown reason 35 on CPU 0.
Do you have a strange power saving mode enabled?

This occurs in all the 2.6.32-220.X series of kernels.  Kind of annoying as I am stuck using the 2.6.32-131 kernel for the time being, which works just fine.  The bios on these servers, from what I can tell, does not incorporate any power saving features.
Comment 48 Don Zickus 2012-02-16 13:18:40 EST
(In reply to comment #47)
> Hello all.  I am able to reproduce this bug on start-up 100% of the time on the
> IBM x3800 and X3950 servers that use the Intel Xeon CPU family type 15
> processors.
> This happens during boot up and I get this message before the system locks up:
> 
> Uhhuh. NMI received for unknown reason 35 on CPU 0.
> Do you have a strange power saving mode enabled?
> 
> This occurs in all the 2.6.32-220.X series of kernels.  Kind of annoying as I
> am stuck using the 2.6.32-131 kernel for the time being, which works just fine.
>  The bios on these servers, from what I can tell, does not incorporate any
> power saving features.

Hi Mike,

You will need to open a new bz and if possible attach a console log (or dmesg output if you can login).  cc myself on the bz.

Also add nmi_watchdog=0 to the kernel command line to see if the problem disappears.
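
To confirm that the option actually reached the running kernel, a short Python sketch along these lines could be used. The function name and the sample command line are hypothetical; on a live system the string would come from /proc/cmdline. If the option appears more than once, the sketch just reports the last occurrence, which is a simplification.

```python
def nmi_watchdog_setting(cmdline):
    """Return the last nmi_watchdog= value on a command line, or None."""
    value = None
    for token in cmdline.split():
        name, sep, arg = token.partition("=")
        if name == "nmi_watchdog" and sep:
            value = arg
    return value


# Hypothetical command line, for illustration only.
sample_cmdline = "ro root=/dev/mapper/vg-root rhgb quiet nmi_watchdog=0"
print(nmi_watchdog_setting(sample_cmdline))
```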

Cheers,
Don
