From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
As far as I can see, the MCA handlers do not tell the SAL to clear the errors. According to the Itanium Processor Family Error Handling Guide section 2.7.3.4 and the Itanium Processor Family System Abstraction Layer section 4.3.2, if the error is corrected or if the OS can continue, the SAL procedure should invoke the SAL_CLEAR_STATE_INFO procedure.

I think that since the error isn't cleared, the SAL re-asserts the error and it repeats in a very tight loop. This essentially hangs the machine because all the APs are rendezvoused and the BP spends all its time spewing the same error to the console.

NOTE: the way that I read the code as it now exists, ANY MCA will essentially hang the machine.

We have an instrumented DIMM here which allows me to generate SBEs, MBEs, and other memory errors at will. My test with the SBE was not 100% conclusive that the machine was totally hung forever. I waited about 5 minutes after I stopped generating the SBE for the console to potentially drain, and errors were still flying up the screen. I didn't feel like waiting any longer, so I rebooted the node. However, while the console was still spewing, I tried pinging the machine and there was no response. This was not a really carefully controlled test, but the fact that 5 minutes after the SBEs stopped the machine still wouldn't respond to a ping is a problem in and of itself.

Version-Release number of selected component (if applicable):
kernel-2.4.21-4.EL

How reproducible:
Always

Steps to Reproduce:
1. Install the instrumented DIMM in a Tiger 4 ia64 machine.
2. Connect the jumper that generates an SBE on the instrumented DIMM.
3. Disconnect the jumper.

Actual Results:
Machine unpingable for >5 minutes after SBEs stop.

Expected Results:
Machine should become responsive shortly after SBEs stop.

Additional info:
Went through the kernel source pretty carefully.
There is only one place where ia64_sal_clear_state_info is called, and that is buried in the implementation of the SAL /proc filesystem code, in salinfo.c. It seems to me that the "correct" place to acknowledge the corrected error would be in mca_platform_handler, which is now an empty function. I am not sure how this problem escaped detection previously. I think that the MCA code may not have been terribly well exercised, or it may be that the Tiger 4 motherboard's chipsets behave slightly differently from other motherboards and keep reasserting the same error if it is not acknowledged. If I understand the SAL's MCA error buffer correctly, it is a circular buffer, and it could be that on other motherboards the fact that the error remains unacknowledged doesn't matter. The errors just wrap around.
I'm going to start working on a patch to try to fix this bug.
There is a comment in ia64_mca_log_sal_error_record:

    /* TODO:
     * 1. analyze error logs to determine recoverability
     * 2. perform error recovery procedures, if applicable
     * 3. set ia64_os_mca_recovery_successful flag, if applicable
     */

This suggests that this is the place where an error should be recovered. However, it is strange that they would put the error recovery code down in a function that concerns itself with logging rather than in the function that is actually called mca_platform_handler. Any suggestions?
My review had a bit of an error in it. MCAs come in through the SAL layer, but CPEs such as an ECC SBE come in through the CPE handler, which is an interrupt handler. However, both code paths have the same problem: neither calls SAL_CLEAR_STATE_INFO.

Ref: Itanium Processor Family Error Handling Guide section 2.7.3.3

Furthermore, according to section 2.7.3.3 the error handler must repeatedly call SAL_GET_STATE_INFO until the SAL returns "no information available". Neither the interrupt handler, ia64_mca_log_sal_error_record, nor ia64_log_get iterates through the calls this way. Section 2.7.3.2 of the Error Handling Guide also requires the same treatment for CMCs. This strongly suggests that the function that needs to be modified is ia64_mca_log_sal_error_record rather than the platform handler.
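To make the required sequence concrete, here is a minimal user-space sketch of the drain-and-clear loop as I read sections 2.7.3.2 and 2.7.3.3. It is not the kernel patch. Every name in it (mock_sal_get_state_info, mock_sal_clear_state_info, mock_sal_inject, drain_error_records) is a hypothetical stand-in, the record is reduced to an int, and the mock assumes that an uncleared record stays at the head of the queue, which is the behavior suspected above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the SAL's queue of pending error records. */
#define MOCK_MAX_RECORDS 8

static int mock_pending[MOCK_MAX_RECORDS];
static size_t mock_count;
static int mock_clear_calls;

/* Test helper: queue a fake error record (silently dropped if full). */
void mock_sal_inject(int id)
{
    if (mock_count < MOCK_MAX_RECORDS)
        mock_pending[mock_count++] = id;
}

/*
 * Stand-in for SAL_GET_STATE_INFO. Returns 1 and copies out the oldest
 * pending record, or 0 for "no information available". It does NOT
 * consume the record: until it is cleared, the same record comes back.
 */
int mock_sal_get_state_info(int *record)
{
    assert(record != NULL);
    if (mock_count == 0)
        return 0;
    *record = mock_pending[0];
    return 1;
}

/* Stand-in for SAL_CLEAR_STATE_INFO: discard the oldest pending record. */
void mock_sal_clear_state_info(void)
{
    size_t i;

    if (mock_count == 0)
        return;
    for (i = 1; i < mock_count; i++)
        mock_pending[i - 1] = mock_pending[i];
    mock_count--;
    mock_clear_calls++;
}

/*
 * Fetch, log, and clear records until the mock SAL reports that nothing
 * is left. Returns the number of records handled.
 */
int drain_error_records(void)
{
    int record;
    int handled = 0;

    while (mock_sal_get_state_info(&record)) {
        printf("logged error record %d\n", record);
        mock_sal_clear_state_info();  /* acknowledge before fetching the next */
        handled++;
    }
    return handled;
}
```

In this mock, dropping the mock_sal_clear_state_info() call from the loop makes it fetch the same record forever, which is the tight loop described in the original report.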
Created attachment 95621 [details]
proposed patch to fix the problems

I haven't had a chance to test the patch yet. I will do that on Monday. This patch addresses the two facets of this bug:
1) It clears the error within the SAL.
2) It iterates through all the errors of that particular SAL info type until all the errors are consumed.

The two other smaller elements of the patch are that it gets rid of some unnecessary return codes and it adds RH to the credits for the file. We are going to be doing a lot more work in the MCA area here at LLNL, and the combined changes are going to be substantial. These spot fixes are just the beginning.
Created attachment 95740 [details]
patch that tries to make mca conform to the spec

Tried this patch and it doesn't solve the problem. There still seems to be too much data for the console to handle, but it has been confirmed to move the codebase closer to the spec.
The test that I did this morning was invalid. The effectiveness of the patch still remains unknown. After looking closely at the results of running without the patch (while thinking the patch was applied), I feel more convinced that the patch will fix the problem.
*** This bug has been marked as a duplicate of 104667 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.