Red Hat Bugzilla – Bug 108623
ECC SBE appear to hang tiger 4 machines
Last modified: 2013-03-06 00:56:16 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
Description of problem:
As far as I can see the MCA handlers do not tell the SAL to clear the errors.
According to the Itanium Processor Family Error Handling Guide section 22.214.171.124
and the Itanium Processor Family System Abstraction Layer section 4.3.2 if the
error is corrected or if the OS can continue, the SAL procedure should invoke
the SAL_CLEAR_STATE_INFO procedure.
I think that since the error isn't cleared, the SAL re-asserts the error and
it repeats in a very tight loop. This essentially hangs
the machine because all the APs are rendezvoused and the BP spends all
its time spewing the same error to the console. NOTE: the way that I
read the code as it now exists, ANY MCA will essentially hang the machine.
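To illustrate the suspected failure mode, here is a minimal userspace sketch. It is not kernel code: mock_sal, mock_get_state_info, mock_clear_state_info and run_handler are hypothetical stand-ins, and the model assumes (as suspected above, not confirmed) that a record stays pending until it is explicitly cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the SAL error log: records stay pending
 * until they are explicitly cleared (an assumption of this sketch). */
struct mock_sal {
    int pending; /* number of records waiting to be consumed */
};

/* Stand-in for SAL_GET_STATE_INFO: reports whether a record is
 * available, but does not consume it. */
static bool mock_get_state_info(const struct mock_sal *sal)
{
    return sal->pending > 0;
}

/* Stand-in for SAL_CLEAR_STATE_INFO: marks the current record consumed. */
static void mock_clear_state_info(struct mock_sal *sal)
{
    if (sal->pending > 0)
        sal->pending--;
}

/* Run the handler for at most max_passes passes and return how many
 * times a record was logged. */
static int run_handler(struct mock_sal *sal, bool clear_after_log,
                       int max_passes)
{
    int logged = 0;

    assert(sal != NULL && max_passes >= 0);
    for (int pass = 0; pass < max_passes; pass++) {
        if (!mock_get_state_info(sal))
            break;                      /* nothing left to report */
        logged++;                       /* stands in for the console print */
        if (clear_after_log)
            mock_clear_state_info(sal); /* acknowledge the record */
    }
    return logged;
}
```

In this model, a single pending record is logged on every pass when clear_after_log is false, and exactly once when it is true.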
We have an instrumented DIMM here which allows me to generate SBEs, MBEs, and
other memory errors at will. My test with the SBE was not 100% conclusive that
the machine was totally hung forever. I waited about 5 minutes after I stopped
generating the SBE for the console to potentially drain and errors were still
flying up the screen. I didn't feel like waiting any longer and so I rebooted
the node. However, while the console was still spewing, I tried pinging the
machine and there was no response. This was not a really carefully controlled
test but the fact that 5 minutes after the SBEs stopped the machine still
wouldn't respond to a ping is a problem in and of itself.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. install instrumented DIMM in Tiger 4 ia64 machine.
2. connect the jumper that generates a SBE on the instrumented DIMM
3. disconnect the jumper
Actual Results: Machine unpingable for >5 minutes after SBEs stop.
Expected Results: Machine should become responsive shortly after SBEs stop.
Went through the kernel source pretty carefully. There is only one place where
ia64_sal_clear_state_info is called and that is buried in the implementation of
the sal /proc filesystem code, in salinfo.c.
It seems to me like the "correct" place to acknowledge the corrected error would
be in mca_platform_handler, which is now an empty function.
I am not sure how this problem escaped detection previously. I think that the
MCA code may have not been terribly well exercised or it may be that the tiger 4
motherboard's chipsets behave slightly differently than other motherboards and
keep reasserting the same error if it is not acknowledged. If I understand the
SAL's MCA error buffer correctly, then it is a circular buffer and it could be
that on other motherboards the fact that the error remains unacknowledged doesn't
matter. The errors just wrap around.
I'm going to start working on a patch to try to fix this bug.
There is a comment in ia64_mca_log_sal_error_record:
* 1. analyze error logs to determine recoverability
* 2. perform error recovery procedures, if applicable
* 3. set ia64_os_mca_recovery_successful flag, if applicable
This suggests that this is the place where an error should be recovered.
However, it is strange that they would put the error recovery code down in a
function that concerns itself with logging rather than in the function that is
actually called mca_platform_handler. Any suggestions?
My review had a bit of an error in it. MCAs come in through the SAL layer but
CPEs such as an ECC SBE come in through the CPE handler which is an interrupt
handler. However, both code paths have the same problem. They both do not call
SAL_CLEAR_STATE_INFO. Ref: Itanium Processor Family Error Handling Guide section
Furthermore, according to section 126.96.36.199 the error handler must repeatedly call
SAL_GET_STATE_INFO until the SAL returns "no information available". Neither the
interrupt handler, ia64_mca_log_sal_error_record, nor ia64_log_get iterate
through the calls this way.
Section 188.8.131.52 of the Error Handling Guide also requires the same treatment for
This strongly suggests that the function that needs to be modified is
ia64_mca_log_sal_error_record rather than mca_platform_handler.
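The fetch-and-clear loop described above can be sketched as follows. This is a userspace model under stated assumptions, not the kernel implementation: mock_sal_log, the MOCK_* status codes, and the function names are all hypothetical, and real SAL return values and record formats differ.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_LOG_CAPACITY 8

/* Hypothetical status codes; real SAL return values differ. */
enum mock_status { MOCK_OK = 0, MOCK_NO_INFO = -1 };

/* Hypothetical queue of pending error records of one SAL info type. */
struct mock_sal_log {
    int records[MOCK_LOG_CAPACITY]; /* queued error record ids */
    size_t head;                    /* index of the oldest record */
    size_t count;                   /* number of pending records */
};

/* Stand-in for SAL_GET_STATE_INFO: copies out the oldest record
 * without consuming it. */
static enum mock_status mock_get_state_info(const struct mock_sal_log *log,
                                            int *record)
{
    if (log->count == 0)
        return MOCK_NO_INFO;
    *record = log->records[log->head];
    return MOCK_OK;
}

/* Stand-in for SAL_CLEAR_STATE_INFO: drops the oldest record. */
static void mock_clear_state_info(struct mock_sal_log *log)
{
    if (log->count == 0)
        return;
    log->head = (log->head + 1) % MOCK_LOG_CAPACITY;
    log->count--;
}

/* Fetch, log, and clear records until the mock SAL reports that no
 * information is available. Returns the number of records handled. */
static size_t drain_error_records(struct mock_sal_log *log,
                                  int *seen, size_t seen_cap)
{
    size_t handled = 0;
    int record;

    assert(log != NULL && seen != NULL);
    while (handled < seen_cap &&
           mock_get_state_info(log, &record) == MOCK_OK) {
        seen[handled++] = record;   /* stands in for logging the record */
        mock_clear_state_info(log); /* acknowledge before fetching again */
    }
    return handled;
}
```

Clearing inside the loop is what lets the next fetch return a different record; without it the loop would only stop when seen_cap is reached.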
Created attachment 95621
proposed patch to fix the problems
I haven't had a chance to test the patch yet. I will do that on Monday.
This patch addresses the two facets of this bug:
1) First of all, it clears the error within the SAL.
2) It iterates through all the errors of that particular SAL info type
until all the errors are consumed.
The two other smaller elements of the patch are that it gets rid of some
unnecessary return codes and it adds RH to the credits for the file. We are
going to be doing a lot more work in the MCA area here at LLNL and the combined
changes are going to be substantial. These spot fixes are just the beginning.
Created attachment 95740
patch that tries to make mca conform to the spec
Tried this patch and it doesn't solve the problem. There still seems to be too
much data for the console to handle, but it has been confirmed to move the
codebase closer to the spec.
The test that I did this morning was invalid. The effectiveness of the
patch still remains unknown. After looking closely at the results of
running without the patch (while thinking the patch was applied), I
feel more convinced that the patch will fix the problem.
*** This bug has been marked as a duplicate of 104667 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.