Bug 129359

Summary: [RHEL3] uncorrectable ECC memory errors do NOT halt the system
Product: Red Hat Enterprise Linux 3 Reporter: Alexandre Oliva <aoliva>
Component: kernel Assignee: Dave Anderson <anderson>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: barryn, jbaron, jparadis, peterm, petrides, riel
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-12-20 20:55:51 UTC Type: ---
Bug Depends On:    
Bug Blocks: 123574    

Description Alexandre Oliva 2004-08-06 20:47:58 UTC
Linux does not protect itself from memory corruption: upon encountering
a serious memory error, it fails to panic and instead leaves the system
running in a somewhat usable state.

Arjan van de Ven wrote:

yes this is an oversight that I'll be correcting in the rhel4 kernel

Frank Hirtz wrote:

Is this something that we can get addressed from within the context of
RHEL 2.1 and 3?

Comment 3 Dave Anderson 2004-08-18 17:19:32 UTC
It looks OK to me, although I'd prefer that the "if (mem_nmi_panic)"
clutter be moved inside the mem_parity_error() function.  An AS2.1
version would require a bit more since there's no die_nmi() function
but could be easily done.  I can put together a couple patches for
both kernels, but I'd also prefer to follow Arjan's lead in how he
would implement it in RHEL4.
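
To make the suggestion above concrete, here is a minimal userspace sketch of what keeping the "if (mem_nmi_panic)" check inside mem_parity_error() could look like. It is not the actual patch, which does not appear in this report. The names die_nmi_stub(), default_do_nmi_sketch() and the nmi_outcome enum are hypothetical, as are the messages, and the 0x80 reason bit is assumed from the i386 NMI handler rather than taken from this report.

```c
#include <stdio.h>

/* Hypothetical userspace model; NOT the actual RHEL3 patch. */

/* Stand-in for the proposed sysctl: 0 = report and continue, 1 = halt. */
int mem_nmi_panic = 0;

/* Which path the handler took, so a caller can observe the decision. */
enum nmi_outcome { NMI_CONTINUED, NMI_HALTED };

/* Stub for die_nmi(). The real kernel function does not return;
 * this stub only logs and reports that a halt was requested. */
enum nmi_outcome die_nmi_stub(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    return NMI_HALTED;
}

/* The policy check lives here, so the caller needs no extra clutter. */
enum nmi_outcome mem_parity_error(unsigned char reason)
{
    if (mem_nmi_panic)
        return die_nmi_stub("NMI: memory parity error, halting (model)");

    fprintf(stderr, "NMI: memory parity error, reason %02x, continuing (model)\n",
            (unsigned int)reason);
    return NMI_CONTINUED;
}

/* Caller only dispatches on the reason bits (0x80 bit is an assumption). */
enum nmi_outcome default_do_nmi_sketch(unsigned char reason)
{
    if (reason & 0x80)
        return mem_parity_error(reason);
    return NMI_CONTINUED;
}
```

With mem_nmi_panic left at 0, default_do_nmi_sketch(0x80) logs the error and reports NMI_CONTINUED; with it set to 1, the same call goes through the die_nmi stand-in and reports NMI_HALTED.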

Comment 4 Dave Anderson 2004-08-18 17:53:21 UTC
I see now that he's just followed the lead of the "unknown_nmi_panic"
sysctl check above it, which raises the question of whether it makes
sense to put both that sysctl and the proposed mem_nmi_panic sysctl
into AS2.1 and RHEL4 to maintain consistency.


Comment 5 Alexandre Oliva 2004-08-19 11:20:34 UTC
The feature request is for both 2.1 and 3, and it should definitely be
carried over to RHEL4 to avoid a regression.

Comment 6 Dave Anderson 2004-09-14 20:46:24 UTC
RHEL3 patch posted today.

I'll start on an AS2.1 version tomorrow, noting as before
that it is not as simple because there's no die_nmi() function
in AS2.1. 

Comment 8 Dave Anderson 2004-09-15 18:58:24 UTC
I am just about to post the AS2.1 patch.

Comment 9 Dave Anderson 2004-09-15 19:14:19 UTC
AS2.1 patch posted today.

Note that the patch also adds the "unknown_nmi_panic" tuneable in
addition to the requested "mem_nmi_panic", making it consistent with
RHEL3.  RHEL4 already has "unknown_nmi_panic", and Arjan has indicated
that he will be adding "mem_nmi_panic".  
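
For anyone wanting to check whether a given kernel exposes these tunables, a small helper along the following lines may be useful. This is only a sketch: read_tunable() is a hypothetical name, and the /proc/sys/kernel/ location suggested in the note below is an assumption for mem_nmi_panic, since this report does not state where the tunable is exposed.

```c
#include <stdio.h>

/* Hypothetical helper: read an integer tunable from a procfs-style file.
 * Returns the value read, or -1 if the file is missing or unparsable.
 * (A tunable legitimately set to -1 is not distinguishable; the
 * tunables discussed here are expected to be 0 or 1.) */
int read_tunable(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (f == NULL)
        return -1;              /* tunable not present on this kernel */

    if (fscanf(f, "%d", &value) != 1)
        value = -1;             /* file exists but holds no integer */

    fclose(f);
    return value;
}
```

A call such as read_tunable("/proc/sys/kernel/unknown_nmi_panic") would then return the current setting, or -1 on a kernel without the tunable; the corresponding path for mem_nmi_panic is assumed to follow the same pattern.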

Comment 10 Ernie Petrides 2004-09-20 06:53:17 UTC
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.8.EL).


Comment 11 John Flanagan 2004-12-13 20:06:27 UTC
An erratum has been issued which should help with the problem
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-505.html


Comment 12 Ernie Petrides 2004-12-13 21:41:41 UTC
This bug was inappropriate listed in the above 2.1 Erratum (listed above),
and thus should not have yet been closed.  I'm reverting it to MODIFIED
state until the RHEL3 Erratum is released (which should be in a week).
I'm also removing it from the RHEL2.1 blocker list.


Comment 13 Ernie Petrides 2004-12-13 21:45:09 UTC
I meant to write "inappropriately".

Comment 14 John Flanagan 2004-12-20 20:55:51 UTC
An erratum has been issued which should help with the problem
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html