Bug 108623

Summary: ECC SBE appear to hang tiger 4 machines
Product: Red Hat Enterprise Linux 3
Reporter: Ben Woodard <woodard>
Component: kernel
Assignee: Jason Baron <jbaron>
Status: CLOSED DUPLICATE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: knoel, petrides, riel
Hardware: ia64   
OS: Linux   
Last Closed: 2006-02-21 18:59:34 UTC
Attachments:
  proposed patch to fix the problems (flags: none)
  patch that tries to make mca conform to the spec (flags: none)

Description Ben Woodard 2003-10-30 19:21:49 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
As far as I can see, the MCA handlers do not tell the SAL to clear the errors.
According to the Itanium Processor Family Error Handling Guide, section 2.7.3.4,
and the Itanium Processor Family System Abstraction Layer specification, section
4.3.2, if the error is corrected or if the OS can continue, the OS handler
should invoke the SAL_CLEAR_STATE_INFO procedure.

I think that since the error isn't cleared, the SAL re-asserts the error and
it repeats in a very tight loop. This essentially hangs the machine because
all the APs are rendezvoused and the BP spends all its time spewing the same
error to the console. NOTE: the way I read the code as it now exists, ANY MCA
will essentially hang the machine.
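
The suspected failure mode can be illustrated with a small userspace model.
This is only a sketch under stated assumptions, not kernel or firmware code:
`struct fake_sal`, `fake_get_state_info`, `fake_clear_state_info` and
`count_deliveries` are hypothetical names, and the premise that an uncleared
record keeps being delivered is the hypothesis from this report, not confirmed
firmware behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for firmware state: one corrected-error record
 * that stays pending until it is explicitly cleared. */
struct fake_sal {
    bool record_pending;
};

/* Models SAL_GET_STATE_INFO: reports whether a record is available.
 * Reading does NOT consume the record. */
static bool fake_get_state_info(const struct fake_sal *sal)
{
    return sal->record_pending;
}

/* Models SAL_CLEAR_STATE_INFO: marks the record as consumed. */
static void fake_clear_state_info(struct fake_sal *sal)
{
    sal->record_pending = false;
}

/* Poll the fake firmware up to max_polls times and count how often the
 * same record is delivered (and would be logged to the console). */
static int count_deliveries(bool clear_after_logging, int max_polls)
{
    struct fake_sal sal = { .record_pending = true };
    int delivered = 0;

    assert(max_polls >= 0);
    for (int i = 0; i < max_polls; i++) {
        if (!fake_get_state_info(&sal))
            break;                  /* nothing left to report */
        delivered++;                /* the handler would log the record here */
        if (clear_after_logging)
            fake_clear_state_info(&sal);
    }
    return delivered;
}
```

In this model, `count_deliveries(true, n)` stops after a single delivery, while
`count_deliveries(false, n)` delivers the same record on every poll, which is
the kind of console flood described above.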

We have an instrumented DIMM here which allows me to generate SBEs, MBEs, and
other memory errors at will. My test with the SBE was not 100% conclusive that
the machine was totally hung forever. I waited about 5 minutes after I stopped
generating the SBE for the console to potentially drain, and errors were still
flying up the screen. I didn't feel like waiting any longer, so I rebooted
the node. However, while the console was still spewing, I tried pinging the
machine and there was no response. This was not a really carefully controlled
test, but the fact that 5 minutes after the SBEs stopped the machine still
wouldn't respond to a ping is a problem in and of itself.

Version-Release number of selected component (if applicable):
kernel-2.4.21-4.EL

How reproducible:
Always

Steps to Reproduce:
1. Install the instrumented DIMM in a Tiger 4 ia64 machine.
2. Connect the jumper that generates an SBE on the instrumented DIMM.
3. Disconnect the jumper.
    

Actual Results:  Machine unpingable for >5 minutes after SBEs stop.

Expected Results:  Machine should become responsive shortly after SBEs stop.

Additional info:

I went through the kernel source pretty carefully. There is only one place where
ia64_sal_clear_state_info is called, and that is buried in the implementation of
the SAL /proc filesystem code, in salinfo.c.

It seems to me that the "correct" place to acknowledge the corrected error would
be in mca_platform_handler, which is now an empty function.

I am not sure how this problem escaped detection previously. I think that the
MCA code may not have been terribly well exercised, or it may be that the Tiger 4
motherboard's chipset behaves slightly differently from other motherboards and
keeps reasserting the same error if it is not acknowledged. If I understand the
SAL's MCA error buffer correctly, it is a circular buffer, and it could be
that on other motherboards the fact that the error remains unacknowledged doesn't
matter. The errors just wrap around.
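
The wrap-around idea can be sketched with a tiny ring buffer. This is purely
illustrative: the layout of the real SAL error buffer is not described in this
report, and `struct error_ring`, `RING_SLOTS`, `ring_push` and
`ring_overwritten` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4    /* illustrative capacity, not a real SAL value */

/* Hypothetical circular error log in which nothing is ever acknowledged. */
struct error_ring {
    int    slots[RING_SLOTS];
    size_t next;        /* index the next write will use */
    size_t written;     /* total number of records ever written */
};

/* Store a record, silently overwriting the oldest one once the ring is full. */
static void ring_push(struct error_ring *r, int record_id)
{
    assert(r != NULL);
    r->slots[r->next] = record_id;
    r->next = (r->next + 1) % RING_SLOTS;
    r->written++;
}

/* Number of unacknowledged records that have been lost to wrap-around. */
static size_t ring_overwritten(const struct error_ring *r)
{
    assert(r != NULL);
    return r->written > RING_SLOTS ? r->written - RING_SLOTS : 0;
}
```

Pushing six records into a zero-initialized four-slot ring loses the two
oldest ones without stalling the writer, which is the behavior speculated
about for other motherboards.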

Comment 1 Ben Woodard 2003-10-30 19:23:58 UTC
I'm going to start working on a patch to try to fix this bug.

Comment 2 Ben Woodard 2003-10-30 22:53:09 UTC
There is a comment in ia64_mca_log_sal_error_record:
	/* TODO:
	 * 1. analyze error logs to determine recoverability
	 * 2. perform error recovery procedures, if applicable
	 * 3. set ia64_os_mca_recovery_successful flag, if applicable
	 */
This suggests that this is the place where an error should be recovered.
However, it is strange that they would put the error recovery code down in a
function that concerns itself with logging rather than in the function that is
actually called mca_platform_handler. Any suggestions?

Comment 3 Ben Woodard 2003-10-30 23:47:14 UTC
My review had a bit of an error in it. MCAs come in through the SAL layer, but
CPEs such as an ECC SBE come in through the CPE handler, which is an interrupt
handler. However, both code paths have the same problem: neither calls
SAL_CLEAR_STATE_INFO. Ref: Itanium Processor Family Error Handling Guide section
2.7.3.3

Furthermore, according to section 2.7.3.3, the error handler must repeatedly call
SAL_GET_STATE_INFO until the SAL returns "no information available". Neither the
interrupt handler, ia64_mca_log_sal_error_record, nor ia64_log_get iterates
through the calls this way.
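
The required iteration can be sketched in userspace as follows. This is a
hedged illustration, not the kernel code or the attached patch:
`struct sal_model`, `model_get_state_info`, `model_clear_state_info`,
`drain_records` and the `MODEL_*` status values are all hypothetical stand-ins
for the SAL procedures and their return codes.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative status codes; the real SAL return values are not used here. */
enum model_status { MODEL_OK = 0, MODEL_NO_INFO = -1 };

#define MODEL_MAX_RECORDS 8

/* Hypothetical firmware-side queue of pending error records of one type. */
struct sal_model {
    int    records[MODEL_MAX_RECORDS];
    size_t head;    /* index of the oldest pending record */
    size_t count;   /* number of pending records */
};

/* Models SAL_GET_STATE_INFO: peek at the oldest pending record. */
static enum model_status model_get_state_info(const struct sal_model *m, int *out)
{
    if (m->count == 0)
        return MODEL_NO_INFO;
    *out = m->records[m->head];
    return MODEL_OK;
}

/* Models SAL_CLEAR_STATE_INFO: drop the record that was just read. */
static void model_clear_state_info(struct sal_model *m)
{
    if (m->count == 0)
        return;
    m->head = (m->head + 1) % MODEL_MAX_RECORDS;
    m->count--;
}

/* Read and clear records until the model reports "no information
 * available". Up to seen_cap record ids are copied into seen[]; the
 * return value is the total number of records consumed. */
static size_t drain_records(struct sal_model *m, int *seen, size_t seen_cap)
{
    size_t n = 0;
    int rec;

    assert(m != NULL);
    while (model_get_state_info(m, &rec) == MODEL_OK) {
        if (seen != NULL && n < seen_cap)
            seen[n] = rec;          /* the handler would log the record here */
        n++;
        model_clear_state_info(m);  /* acknowledge before fetching the next */
    }
    return n;
}
```

The point of the loop shape is that every successful get is paired with a
clear, and the handler only returns once the get call reports that nothing is
left.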

Section 2.7.3.2 of the Error Handling Guide also requires the same treatment for
CMCs.

This strongly suggests that the function that needs to be modified is
ia64_mca_log_sal_error_record rather than the platform handler.

Comment 4 Ben Woodard 2003-10-31 01:15:14 UTC
Created attachment 95621 [details]
proposed patch to fix the problems

I haven't had a chance to test the patch yet. I will do that on Monday.

This patch addresses the two facets of this bug:
1) It clears the error within the SAL.
2) It iterates through all the errors of that particular SAL info type
until all the errors are consumed.

The two other smaller elements of the patch are that it gets rid of some
unnecessary return codes and it adds RH to the credits for the file. We are
going to be doing a lot more work in the MCA area here at LLNL, and the combined
changes are going to be substantial. These spot fixes are just the beginning.

Comment 5 Ben Woodard 2003-11-05 20:08:23 UTC
Created attachment 95740 [details]
patch that tries to make mca conform to the spec

I tried this patch and it doesn't solve the problem: there still seems to be too
much data for the console to handle. However, it has been confirmed to move the
codebase closer to the spec.

Comment 6 Ben Woodard 2003-11-06 01:07:29 UTC
The test that I did this morning was invalid. The effectiveness of the
patch still remains unknown. After looking closely at the results of
running without the patch (while thinking the patch was applied), I
feel more convinced that the patch will fix the problem.

Comment 7 Bill Nottingham 2003-12-04 21:23:57 UTC

*** This bug has been marked as a duplicate of 104667 ***

Comment 8 Red Hat Bugzilla 2006-02-21 18:59:34 UTC
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.