Bug 688547

Summary: RHEL6.1-20110316.1 dell-pe2800 NMI received for unknown reason
Product: Red Hat Enterprise Linux 6
Reporter: Petr Beňas <pbenas>
Component: kernel
Assignee: Don Zickus <dzickus>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 6.1
CC: balkov, bmarson, bugproxy, charlotte.richardson, cye, david.bulkow, dbayly, eddie.williams, jparadis, kevin.paetzold, leamhall, mike, mxnovo, nstraz, phan, pstehlik, robert.evans, syeghiay
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-2.6.32-131.0.5.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 692677 1020769
Environment:
Last Closed: 2011-05-19 12:42:57 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 692677, 694811, 1020769, 1300182

Description Petr Beňas 2011-03-17 10:55:38 UTC
Description of problem:
Uhhuh. NMI received for unknown reason 31 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

Version-Release number of selected component (if applicable):
2.6.32-122.el6.x86_64

How reproducible:
unknown

Steps to Reproduce:
1. Install RHEL6.1-20110316.1 on dell-pe2800-01.rhts.eng.bos.redhat.com
2. log in

Actual results:
Messages from syslogd like this one in console and in dmesg
Message from syslogd@dell-pe2800-01 at Mar 17 06:51:58 ...
 kernel:Uhhuh. NMI received for unknown reason 21 on CPU 0.

Message from syslogd@dell-pe2800-01 at Mar 17 06:51:58 ...
 kernel:Do you have a strange power saving mode enabled?

Message from syslogd@dell-pe2800-01 at Mar 17 06:51:58 ...
 kernel:Dazed and confused, but trying to continue
  
Expected results:
no such messages

Additional info:
reason 00, 21 and 31 on CPU 0, 1 and 2
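
To see which reason codes show up on which CPUs, the syslog lines above can be tallied with a short script. This is a minimal sketch and not part of the original report: `tally_unknown_nmis` is a hypothetical helper name, it assumes the message format quoted above, and it treats the reason code as an opaque string rather than interpreting it.

```python
import re
from collections import Counter

# Matches the kernel message quoted in this report, e.g.
# "Uhhuh. NMI received for unknown reason 21 on CPU 0."
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]+) on CPU (\d+)")


def tally_unknown_nmis(log_text):
    """Count (reason, cpu) pairs for unknown-NMI messages found in log text."""
    counts = Counter()
    for match in NMI_RE.finditer(log_text):
        reason = match.group(1).lower()
        cpu = int(match.group(2))
        counts[(reason, cpu)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the messages in this report.
    sample = (
        "kernel:Uhhuh. NMI received for unknown reason 21 on CPU 0.\n"
        "kernel:Do you have a strange power saving mode enabled?\n"
        "kernel:Dazed and confused, but trying to continue\n"
        "kernel:Uhhuh. NMI received for unknown reason 31 on CPU 0.\n"
    )
    for (reason, cpu), n in sorted(tally_unknown_nmis(sample).items()):
        print(f"reason {reason} on CPU {cpu}: {n}")
```

Feeding it the saved output of dmesg or a syslog file would give a per-CPU summary that is easier to attach to a report than the raw messages.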

Comment 2 Don Zickus 2011-03-17 21:23:55 UTC
Yeah, we accepted a patch into -119 to stop p4 machines from swallowing all the NMIs in the perf layer.  All it did was expose how broken the perf NMI handler is on a p4 machine.  Sucks.

I poked at this code before and it is convoluted.  Looks like I will have to poke at it again to finally fix it (or I will just revert the patch that exposed this problem and fix it properly in 6.2).

Cheers,
Don

Comment 3 Don Zickus 2011-03-17 21:26:02 UTC
*** Bug 683097 has been marked as a duplicate of this bug. ***

Comment 4 Don Zickus 2011-03-18 15:11:58 UTC
*** Bug 688711 has been marked as a duplicate of this bug. ***

Comment 5 Don Zickus 2011-03-22 21:22:43 UTC
*** Bug 689885 has been marked as a duplicate of this bug. ***

Comment 6 Nate Straz 2011-03-29 15:20:20 UTC
Adding Regression flag since this is a new message on affected systems and it causes a lot of log messages on such systems.

Comment 10 RHEL Program Management 2011-03-31 20:39:47 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 13 Leam 2011-04-06 16:30:38 UTC
Got both Unknown Reason 21 and 31 on a Dell PE 6850 with 6.1 x86-64 Beta downloaded earlier this week.

Comment 14 Don Zickus 2011-04-06 21:08:04 UTC
(In reply to comment #13)
> Got both Unknown Reason 21 and 31 on a Dell PE 6850 with 6.1 x86-64 Beta
> downloaded earlier this week.

The patches are not pulled in yet.

Cheers,
Don

Comment 15 Don Zickus 2011-04-08 13:11:15 UTC
*** Bug 693053 has been marked as a duplicate of this bug. ***

Comment 16 Don Zickus 2011-04-08 17:29:06 UTC
*** Bug 692973 has been marked as a duplicate of this bug. ***

Comment 17 IBM Bug Proxy 2011-04-08 19:41:47 UTC
------- Comment From masbock.com 2011-04-08 15:39 EDT-------
This bug focuses on the Uhhuh problem with family 15 processors.
In the original post I also reported a system where the NMI watchdog is not enabled at all. This is an AMD based system (LS42 blade). This appears to be a separate problem. Will track that one separately from here on.

Comment 18 Don Zickus 2011-04-08 20:18:16 UTC
(In reply to comment #17)
> ------- Comment From masbock.com 2011-04-08 15:39 EDT-------
> This bug focuses on the Uhhuh problem with family 15 processors.
> In the original post I also reported a system where the NMI watchdog is not
> enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> separate problem. Will track that one separately from here on.

That one might be related to bz689065.  You will be able to tell in the dmesg output if the system is AMD and has "Broken BIOS" in the dmesg log.

Otherwise you need to attach the dmesg log for me to analyze.

Cheers,
Don

Comment 19 IBM Bug Proxy 2011-04-08 20:50:32 UTC
------- Comment From masbock.com 2011-04-08 16:46 EDT-------
(In reply to comment #20)
> > In the original post I also reported a system where the NMI watchdog is not
> > enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> > separate problem. Will track that one separately from here on.
>
> That one might be related to bz689065.  You will be able to tell in the dmesg
> output if the system is AMD and has "Broken BIOS" in the dmesg log.
>
On the LS42 we get:
Performance Events: Broken BIOS detected, using software events only.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is 430076)
NMI watchdog disabled for cpu0: unable to create perf event: -2

(I don't have access to BZ689065)

- Max
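
The check discussed above (an AMD system whose dmesg log contains "Broken BIOS") can be sketched in a few lines. This is only an illustration under stated assumptions: `looks_like_amd_broken_bios` is a hypothetical helper, the caller supplies the contents of /proc/cpuinfo and the dmesg output as text, and the AMD test relies on the `AuthenticAMD` vendor string.

```python
def looks_like_amd_broken_bios(cpuinfo_text, dmesg_text):
    """Return True if the CPU vendor is AMD and dmesg mentions "Broken BIOS"."""
    # /proc/cpuinfo reports "vendor_id : AuthenticAMD" on AMD processors.
    is_amd = any(
        line.startswith("vendor_id") and "AuthenticAMD" in line
        for line in cpuinfo_text.splitlines()
    )
    # The perf subsystem logs this phrase, as in the LS42 output above.
    broken_bios = any("Broken BIOS" in line for line in dmesg_text.splitlines())
    return is_amd and broken_bios


if __name__ == "__main__":
    # Sample inputs; the dmesg line is the one reported for the LS42 blade.
    cpuinfo = "processor\t: 0\nvendor_id\t: AuthenticAMD\n"
    dmesg = "Performance Events: Broken BIOS detected, using software events only.\n"
    print(looks_like_amd_broken_bios(cpuinfo, dmesg))
```

A match only suggests the AMD issue tracked separately; it does not replace attaching the full dmesg log.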

Comment 20 Don Zickus 2011-04-08 21:08:33 UTC
(In reply to comment #19)
> ------- Comment From masbock.com 2011-04-08 16:46 EDT-------
> (In reply to comment #20)
> > > In the original post I also reported a system where the NMI watchdog is not
> > > enabled at all. This is an AMD based system (LS42 blade). This appears to be a
> > > separate problem. Will track that one separately from here on.
> >
> > That one might be related to bz689065.  You will be able to tell in the dmesg
> > output if the system is AMD and has "Broken BIOS" in the dmesg log.
> >
> On the LS42 we get:
> Performance Events: Broken BIOS detected, using software events only.
> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is
> 430076)
> NMI watchdog disabled for cpu0: unable to create perf event: -2
> 
> (I don't have access to BZ689065)
> 
> - Max

The new nmi watchdog detects if someone is currently using the perf counter to avoid resource contention with the BIOS.  Unfortunately, AMD boxes used it for tsc calculations and forgot to disable it.  As a result the perf subsystem thinks the BIOS is using it and prevents the nmi watchdog from being enabled.

The fix was to remove the obsoleted check on AMD boxes.  It is already included in one of the snap builds.

Cheers,
Don

Comment 21 Leam 2011-04-09 01:14:28 UTC
Don, my box is a Dell 6850 with 64-bit old Intel chips. It's reporting CPU errors 21 and 31. Is this the same thing or something different? Do you want any output for it? Nothing until Monday, though...I'm enjoying my weekend.  :)

I've currently masked the problem by passing nmi_watchdog=0 on the kernel line. This is a dev box so we can play with it if necessary.

Leam

Comment 22 Don Zickus 2011-04-11 14:01:02 UTC
(In reply to comment #21)
> Don, my box is a Dell 6850 with 64-bit old Intel chips. It's reporting CPU
> errors 21 and 31. Is this the same thing or something different? Do you want
> any output for it? Nothing until Monday, though...I'm enjoying my weekend.  :)
> 
> I've currently masked the problem by passing nmi_watchdog=0 on the kernel line.
> This is a dev box so we can play with it if necessary.
> 
> Leam


Hi Leam,

If you look at the output of 'cat /proc/cpuinfo', the 'cpu family' should be 15.  If not then you might have a different issue.

Cheers,
Don
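
For anyone checking several machines, the 'cpu family' test can be scripted. The following is a minimal sketch, assuming the usual "key : value" layout of /proc/cpuinfo on x86; `cpu_families` is a hypothetical helper name and is not taken from this bug.

```python
def cpu_families(cpuinfo_text):
    """Return the set of 'cpu family' values found in /proc/cpuinfo text."""
    families = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Only accept well-formed "cpu family : <number>" lines.
        if sep and key.strip() == "cpu family" and value.isdigit():
            families.add(int(value))
    return families


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Fall back to a sample line when /proc/cpuinfo is unavailable.
        text = "cpu family\t: 15\n"
    families = cpu_families(text)
    print("cpu families:", sorted(families))
    print("family 15 present:", 15 in families)
```

If 15 is not in the result, the machine is probably hitting a different issue from the one tracked here.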

Comment 23 Leam 2011-04-11 15:25:29 UTC
Thanks Don!

Same issue, based on:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 8
cpu MHz         : 3200.000


Sounds like you already have the issue resolved, though your fix mentions AMD boxes. Same issue for "GenuineNotAMD"?  

Leam

Comment 24 Don Zickus 2011-04-11 15:33:50 UTC
Hi Leam,

Sorry for the confusion.  This bz is for Intel family 15 chips; another bz, 689065, deals with the AMD problem.  I was just trying to help out another reporter as I dup'd their issue over here and they asked about the AMD problem.

Cheers,
Don

Comment 25 Leam 2011-04-11 16:59:04 UTC
Don,

Is there anything I can provide to help with the Intel family 15 trouble-shooting?

Leam

Comment 26 Don Zickus 2011-04-11 17:11:31 UTC
Hi Leam,

No, we are alright. We have machines that reproduce the problem.  Finding the strange interactions with the hardware PMU is the tricky part.  The fix I posted just swallows all the NMIs for now, until we can find a proper fix in 6.2.

Thanks for the offer though.

Cheers,
Don

Comment 27 Don Zickus 2011-04-13 17:29:36 UTC
*** Bug 689658 has been marked as a duplicate of this bug. ***

Comment 30 IBM Bug Proxy 2011-04-18 17:11:07 UTC
------- Comment From shubgoya.com 2011-04-18 13:08 EDT-------
I was able to reproduce this issue with snap3 kernel on x3850.

Comment 31 Don Zickus 2011-04-18 17:29:02 UTC
*** Bug 697414 has been marked as a duplicate of this bug. ***

Comment 35 IBM Bug Proxy 2011-04-19 20:32:05 UTC
------- Comment From tpnoonan.com 2011-04-19 16:29 EDT-------
Hi Red Hat. Once fixed in rhel6.2, please consider for rhel6.1.z. Thanks

Comment 36 Aristeu Rozanski 2011-04-20 22:09:08 UTC
Patch(es) available on kernel-2.6.32-131.0.5.el6

Comment 39 Nate Straz 2011-04-26 14:00:55 UTC
I ran kernel-2.6.32-131.0.5.el6.x86_64 through a normal load and have not seen any of the NMI messages I was seeing before.  Verified.

Comment 40 IBM Bug Proxy 2011-04-29 12:30:24 UTC
------- Comment From shubgoya.com 2011-04-29 08:29 EDT-------
I am verifying this issue in snap5 release. Will post my results ASAP.

Comment 41 IBM Bug Proxy 2011-05-05 19:30:43 UTC
------- Comment From shubgoya.com 2011-05-05 15:25 EDT-------
I verified this issue on one of the affected platforms (x3850) with the snap5 kernel and did not see those 'Dazed and Confused' NMI messages under load. Looks like kernel 2.6.32-131.0.5.el6 solves the issue.

Comment 42 errata-xmlrpc 2011-05-19 12:42:57 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html

Comment 43 IBM Bug Proxy 2011-06-01 21:01:34 UTC
------- Comment From tpnoonan.com 2011-06-01 16:52 EDT-------
IBM is no longer asking for rhel6.1.z; a fix for RHBZ692677 in rhel6.2 is okay.

Comment 45 Max Novaha 2012-02-03 08:41:44 UTC
Hello. I have a [Firmware bug] on my server (processor: Intel® Xeon® Processor L5530 (8M Cache, 2.40 GHz, 5.86 GT/s Intel® QPI)). When I reboot the server I get the error: The BIOS has corrupted hw-PMU resources (............) ERST: Can not request iomem region............ How can I fix it?

Comment 46 Don Zickus 2012-02-03 15:29:57 UTC
(In reply to comment #45)
> Hello. I have a [Firmware bug] on my server (processor: Intel® Xeon® Processor
> L5530 (8M Cache, 2.40 GHz, 5.86 GT/s Intel® QPI)). When I reboot the server I
> get the error: The BIOS has corrupted hw-PMU resources (............) ERST: Can
> not request iomem region............ How can I fix it?

Hi Max,

You can start by opening a new bugzilla and attaching a more complete dmesg log so we can have a better idea of what is going on. :-)

The reason is this bugzilla is closed and developers like myself will not look at it any more.

Thanks,
Don

Comment 47 Mike Neuliep 2012-02-16 18:10:02 UTC
Hello all.  I am able to reproduce this bug on start-up 100% of the time on the IBM x3800 and X3950 servers that use the Intel Xeon CPU family type 15 processors.
This happens during boot up and I get this message before the system locks up:

Uhhuh. NMI received for unknown reason 35 on CPU 0.
Do you have a strange power saving mode enabled?

This occurs in all the 2.6.32-220.X series of kernels.  Kind of annoying as I am stuck using the 2.6.32-131 kernel for the time being, which works just fine.  The bios on these servers, from what I can tell, does not incorporate any power saving features.

Comment 48 Don Zickus 2012-02-16 18:18:40 UTC
(In reply to comment #47)
> Hello all.  I am able to reproduce this bug on start-up 100% of the time on the
> IBM x3800 and X3950 servers that use the Intel Xeon CPU family type 15
> processors.
> This happens during boot up and I get this message before the system locks up:
> 
> Uhhuh. NMI received for unknown reason 35 on CPU 0.
> Do you have a strange power saving mode enabled?
> 
> This occurs in all the 2.6.32-220.X series of kernels.  Kind of annoying as I
> am stuck using the 2.6.32-131 kernel for the time being, which works just fine.
>  The bios on these servers, from what I can tell, does not incorporate any
> power saving features.

Hi Mike,

You will need to open a new bz and if possible attach a console log (or dmesg output if you can login).  cc myself on the bz.

Also add nmi_watchdog=0 on the commandline to see if it disappears.

Cheers,
Don
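
Before retesting, it can help to confirm that the nmi_watchdog=0 option actually reached the running kernel. The sketch below is illustrative only: `nmi_watchdog_setting` is a hypothetical helper, and it assumes that when the option is repeated the last occurrence is the one that matters.

```python
def nmi_watchdog_setting(cmdline):
    """Return the value of nmi_watchdog= on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            # Assumption: if the option is repeated, keep the last one seen.
            value = token.split("=", 1)[1]
    return value


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the command line the running kernel booted with.
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        # Sample command line used when /proc/cmdline is unavailable.
        cmdline = "ro root=/dev/sda1 nmi_watchdog=0 quiet"
    print("nmi_watchdog setting:", nmi_watchdog_setting(cmdline))
```

A result of `0` means the watchdog was disabled at boot; `None` means the option was not passed at all.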