Bug 473392

Summary: [Intel 5.4 FEAT] mcelog can not support Nehalem/Dunnington processor error decoding
Product: Red Hat Enterprise Linux 5
Reporter: Song, Youquan <youquan.song>
Component: mcelog
Assignee: Prarit Bhargava <prarit>
Status: CLOSED ERRATA
QA Contact: BaseOS QE <qe-baseos-auto>
Severity: medium
Docs Contact:
Priority: high
Version: 5.4
CC: amax, bugproxy, clement.t.cole, cward, gcase, jane.lv, jay_engh, jjarvis, jlarrew, jscotka, jvillalo, keve.a.gabbert, lcm, ltroan, luyu, rlerch, rpacheco
Target Milestone: alpha
Keywords: FutureFeature, OtherQA
Target Release: 5.4
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
The mcelog package had not been updated in several releases. Therefore, the existing package could not decode MCE events from newer Intel processors such as the Intel® Core™ i7 processor and Intel® Xeon® Processor 7400 series. This update adds support for newer Intel platforms.
Last Closed: 2009-09-02 11:42:53 UTC
Bug Blocks: 445204, 450795, 452016, 483701, 483784, 485920, 488637, 488639, 488645, 488646, 488648    
Attachments:
Description: binary x86_64 rpm
Flags: none

Description Song, Youquan 2008-11-28 09:24:05 UTC
Description of problem:
The current mcelog-0.7-1.22.fc6 cannot decode errors from Nehalem/Dunnington processors.
The updated upstream version of mcelog can: git://git.kernel.org/pub/scm/utils/cpu/mce/mcelog.git

Comment 1 Jane Lv 2008-12-01 06:34:46 UTC
We would like to make this a feature request for RHEL 5.4.  The code is in the public git repository.  An upstream release is still needed, and the Fedora maintainer will be notified.

Comment 2 Keve Gabbert 2009-01-05 23:51:32 UTC
Please update the subject to include "RHEL 5.4".

Comment 3 John Villalovos 2009-01-21 02:17:58 UTC
Jon Masters,

Any status on this for RHEL 5.4?  Will you be able to bring in the upstream mcelog code to fix this issue?

Comment 5 Ronald Pacheco 2009-03-31 02:53:14 UTC
Can Intel confirm that they will assist in the testing of this patch?  Thanks!

Comment 6 John Villalovos 2009-03-31 12:34:14 UTC
Intel will assist in testing of this patch.

Comment 8 Jesse Larrew 2009-04-01 17:24:05 UTC
Hi jcm!

IBM is also interested in this feature for 5.4. Is this on track? A status update would be greatly appreciated. Thanks!

Jesse

Comment 9 Prarit Bhargava 2009-04-07 13:58:13 UTC
Jesse, I've taken over mcelog from jcm.

AFAIK, Andi Kleen is still working on this upstream.

P.

Comment 10 Prarit Bhargava 2009-04-22 12:37:46 UTC
jlarrew -- do you have a way of testing this?

P.

Comment 11 Jesse Larrew 2009-04-22 20:42:04 UTC
Yes, we can test this on our systems at IBM. I understand that Intel also has a special test harness that can inject MCEs and log the results, and they have agreed to help test this package as well. Thanks Prarit!

Comment 12 Prarit Bhargava 2009-04-22 21:32:50 UTC
(In reply to comment #11)
> Yes, we can test this on our systems at IBM. I understand that Intel also has a
> special test harness that can inject MCEs and log the results, and they have
> agreed to help test this package as well. Thanks Prarit!  

I have a ping into jvillalo to see if he can help with testing as well.

P.

Comment 13 Prarit Bhargava 2009-04-22 21:34:02 UTC
Created attachment 340832
binary x86_64 rpm

Binary x86_64 rpm.

jlarrew, jvillalo -- please test.

P.

Comment 14 Jane Lv 2009-05-07 08:50:31 UTC
(In reply to comment #13)
> Created an attachment (id=340832) [details]
> binary x86_64 rpm
> 
> Binary x86_64 rpm.
> 
> jlarrew, jvillalo -- please test.
> 
> P.  

I tested the package mcelog-0.8pre-1.23.el5 on a Nehalem processor.  It logs the memory controller MCEs created by my test harness, but some Nehalem-specific error decoding is missing, as shown below:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 BANK 8 TSC ff38315fca
MISC 100000000080 ADDR 636240
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: Unknown Error 9f
...

I tried the latest mcelog from Andi's git tree.  With it, the same kind of error is decoded fully:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 TSC 6e3ff1c1692
MISC 100000000083 ADDR 636000
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
Memory read ECC error
Memory corrected error count (CORE_ERR_CNT): 8302
Memory transaction Tracker ID (RTId): 3
Memory DIMM ID of error: 0
Memory channel ID of error: 0
Memory ECC syndrome: 1000
...


Andi's two recent patches close this gap.  Could you please add them to the RHEL 5.4 package?  Thanks.

http://git.kernel.org/?p=utils/cpu/mce/mcelog.git;a=commitdiff;h=4cb783bec1f897bf617d5c42f9dffcb873fad57f
Fix some bugs in Nehalem decoding

- Add missing terminator
- Decode DIMM numbers from correct register
- Check MISCV before decoding misc

Signed-off-by: Andi Kleen <ak.com>
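
For readers unfamiliar with the "Check MISCV before decoding misc" item, the sketch below illustrates the idea in Python. It is not mcelog's code (mcelog is written in C). The bit positions are the architectural MCi_STATUS flags from the Intel SDM; the function names are hypothetical, the wording of the three labels that do not appear in the logs in this report is my own, and the example status value in the usage note is made up, because the logs in this report do not print the raw STATUS register.

```python
# Architectural MCi_STATUS flag bits (Intel SDM, machine-check architecture).
# Three labels mirror the mcelog output quoted in this report; the
# "Uncorrected error", "Error enabled" and "Processor context corrupt"
# labels are illustrative wording.
MCI_STATUS_FLAGS = [
    (62, "Error overflow"),
    (61, "Uncorrected error"),
    (60, "Error enabled"),
    (59, "MCi_MISC register valid"),
    (58, "MCi_ADDR register valid"),
    (57, "Processor context corrupt"),
]
MISCV_BIT = 59  # MCi_MISC contents are only meaningful when this bit is set


def describe_status(status):
    """Return the labels of all flag bits set in a raw MCi_STATUS value."""
    return [label for bit, label in MCI_STATUS_FLAGS if status & (1 << bit)]


def decode_misc_if_valid(status, misc, decoder):
    """Run decoder(misc) only when MISCV is set; otherwise return None.

    `decoder` is a hypothetical callback standing in for a model-specific
    MISC decoder.
    """
    if not status & (1 << MISCV_BIT):
        return None
    return decoder(misc)
```

With a made-up status of `(1 << 63) | (1 << 62) | (1 << 59) | (1 << 58) | 0x9F`, `describe_status` returns the same three flag lines seen in the logs above, and `decode_misc_if_valid` passes the MISC value on to the decoder. With MISCV clear, the decoder is never called.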


http://git.kernel.org/?p=utils/cpu/mce/mcelog.git;a=commitdiff;h=14a64e1e7a7c3d36c8eace4101b26c90c00d198c
More Nehalem memory error decoding fixes

Signed-off-by: Andi Kleen <ak.com>
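
As background on the two logs in this comment: the older package prints "Unknown Error 9f", while the newer tree prints a memory controller read error with an unspecified channel. The following Python sketch shows how the value 0x9f maps onto the architectural memory-controller compound error code layout from the Intel SDM (000F 0000 1MMM CCCC). It is an illustration under that assumption, not the mcelog implementation; the function name and the returned dictionary shape are hypothetical.

```python
# MMM field of the memory-controller compound error code (Intel SDM).
MEM_TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_memory_controller_code(code):
    """Decode the low 16 bits of MCi_STATUS as a memory-controller error.

    Layout assumed: 000F 0000 1MMM CCCC. Returns None when the code does
    not match that pattern (for example cache, TLB or bus error codes).
    """
    code &= 0xFFFF
    # Bits 15-13 and 11-8 must be zero and bit 7 must be one;
    # bit 12 (F) is ignored here.
    if (code & 0xEF80) != 0x0080:
        return None
    mmm = (code >> 4) & 0x7
    cccc = code & 0xF
    return {
        "transaction": MEM_TRANSACTIONS.get(mmm, "reserved"),
        # CCCC == 1111 means the channel is not specified.
        "channel": None if cccc == 0xF else cccc,
    }
```

For 0x9f this yields a memory read error with no channel specified, which is consistent with the "RD_CHANNELunspecified_ERR" line printed by the newer mcelog.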

Comment 15 Keve Gabbert 2009-05-07 19:24:30 UTC
Is this now on the approved list for RHEL 5.4?

Comment 16 Jesse Larrew 2009-05-08 18:20:36 UTC
This bug has the PM ACK for 5.4, so I would say yes. John, can you please confirm?

Comment 17 Jane Lv 2009-05-11 09:35:52 UTC
Prarit,

One more patch, for Dunnington support, needs to be included in the RHEL 5.4 mcelog package.  I currently don't have a Dunnington system to check the binary, but one is expected to be available to me this week.  Until then, please check whether this patch is in the source code.  Thanks a lot.

http://git.kernel.org/?p=utils/cpu/mce/mcelog.git;a=commit;h=0aaa20fc348de7dcb814588174c07442a20fdc33

Comment 18 Prarit Bhargava 2009-05-11 22:50:51 UTC
(In reply to comment #17)
> Prarit,
> 
> Here is one more patch for Dunnington support is needed to be included for
> RHEL5.4 mcelog package.  I currently don't have Dunnington system to check the
> binary code.  The Dunnington system is expected to be available for me in this
> week.  Before that, please check if this patch is in the source code.  Thanks a
> lot.
> 
> http://git.kernel.org/?p=utils/cpu/mce/mcelog.git;a=commit;h=0aaa20fc348de7dcb814588174c07442a20fdc33 

Jane, np -- I decided to update the entire package to 0.9pre.  Could you request that testing on the new package be done at Intel?

Thanks,

P.

Comment 19 Jane Lv 2009-05-12 03:33:50 UTC
(In reply to comment #18)

> 
> Jane, np -- I decided to update the entire package to 0.9pre.  Could you
> request that testing on the new package be done at Intel?
> 
> Thanks,
> 
> P.  

Prarit,

Did you mean updating the entire package to RELEASE_0_9_PRE1 of the git tree?  The patches requested in comment #14 and comment #17 are not included in RELEASE_0_9_PRE1.  Could you please consider taking a more recent snapshot?  Thanks.

-Jane

Comment 20 Prarit Bhargava 2009-05-12 12:55:39 UTC
> 
> Did you mean updating the entire package to RELEASE_0_9_PRE1 of the git tree? 
> The above required patches in comment #14 and comment #17 are not included in
> the RELEASE_0_9_PRE1.  Could you please consider taking more recent snapshot? 
> Thanks.
> 

I used the latest-and-greatest tree.  The tip was
be08956ae9cc5afa81be36d36ffda90dfdb70636 .

Both patches were verified to be in the snapshot.

Sorry for the confusion Jane,

P.

Comment 24 Ruediger Landmann 2009-05-19 01:39:57 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
The mcelog package had not been updated in several releases. Therefore, the existing package could not decode MCE events from newer Intel processors such as the "Nehalem" and "Dunnington" series. This update adds support for newer Intel hardware.

Comment 25 Ronald Pacheco 2009-05-19 14:07:11 UTC
*** Bug 474907 has been marked as a duplicate of this bug. ***

Comment 26 Keve Gabbert 2009-05-21 21:47:33 UTC
Release notes: please replace code names with product names.
Intel® Core™ i7 processor for Nehalem 
Intel® Xeon® Processor 7400 series for Dunnington

Comment 27 Ronald Pacheco 2009-05-22 11:41:17 UTC
Keve,

I took care of this.  Are you able to edit the release notes field from your account?

Comment 28 Ronald Pacheco 2009-05-22 11:41:17 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-The mcelog package had not been updated in several releases. Therefore, the existing package could not decode MCE events from newer Intel processors such as the "Nehalem" and "Dunnington" series. This update adds support for newer Intel hardware.
+The mcelog package had not been updated in several releases. Therefore, the existing package could not decode MCE events from newer Intel processors such as the Intel® Core™ i7 processor and Intel® Xeon® Processor 7400 series. This update adds support for newer Intel platforms.

Comment 29 Chris Ward 2009-06-14 23:16:47 UTC
~~ Attention Partners RHEL 5.4 Partner Alpha Released! ~~

RHEL 5.4 Partner Alpha has been released on partners.redhat.com. There should
be a fix present that addresses this particular request. Please test and report back your results here, at your earliest convenience. Our Public Beta release is just around the corner!

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have verified the request functions as expected, please set your Partner ID in the Partner field above to indicate successful test results. Do not flip the bug status to VERIFIED. Further questions can be directed to your Red Hat Partner Manager. Thanks!

Comment 30 IBM Bug Proxy 2009-06-24 11:13:14 UTC

Comment 31 Chris Ward 2009-06-25 01:26:52 UTC
IBM, is your bugproxy broken? I'm not following the previous post. Is there anything of value in comment #30?

Comment 32 Chris Ward 2009-07-03 18:14:24 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 33 Chris Ward 2009-07-10 19:06:50 UTC
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~

RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching.

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.

Comment 34 Jane Lv 2009-07-17 03:08:36 UTC
I tested the RHEL 5.4 mcelog on an Intel Nehalem platform and have found no obvious issues so far.  I also verified that the patches in comment #14 and comment #17 have been included in the RHEL 5.4 mcelog package.  I can get the detailed decode information described in comment #14.

Comment 35 Chris Ward 2009-07-17 07:38:25 UTC
Thanks. I'm moving this bug to VERIFIED, then. If you encounter any new issues, please clone this bug to open a new one and escalate to your Partner Manager. Thanks for your feedback.

Comment 37 errata-xmlrpc 2009-09-02 11:42:53 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1374.html