Bug 108623
Summary: ECC SBE appear to hang tiger 4 machines
Product: Red Hat Enterprise Linux 3
Reporter: Ben Woodard <woodard>
Component: kernel
Assignee: Jason Baron <jbaron>
Status: CLOSED DUPLICATE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: knoel, petrides, riel
Hardware: ia64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-02-21 18:59:34 UTC
Description
Ben Woodard
2003-10-30 19:21:49 UTC
I'm going to start working on a patch to try to fix this bug. There is a comment in ia64_mca_log_sal_error_record:

/* TODO:
 * 1. analyze error logs to determine recoverability
 * 2. perform error recovery procedures, if applicable
 * 3. set ia64_os_mca_recovery_successful flag, if applicable
 */

which suggests that this is the place where an error should be recovered. However, it is strange that they would put the error recovery code down in a function that concerns itself with logging rather than in the function that is actually called mca_platform_handler. Any suggestions?

My review had a bit of an error in it. MCAs come in through the SAL layer, but CPEs such as an ECC SBE come in through the CPE handler, which is an interrupt handler. However, both code paths have the same problem: neither calls SAL_CLEAR_STATE_INFO. Ref: Itanium Processor Family Error Handling Guide, section 2.7.3.3.

Furthermore, according to section 2.7.3.3, the error handler must repeatedly call SAL_GET_STATE_INFO until the SAL returns "no information available". None of the interrupt handler, ia64_mca_log_sal_error_record, or ia64_log_get iterates through the calls this way. Section 2.7.3.2 of the Error Handling Guide also requires the same treatment for CMCs. This strongly suggests that the function that needs to be modified is ia64_mca_log_sal_error_record rather than platform error.

Created attachment 95621 [details]
proposed patch to fix the problems
I haven't had a chance to test the patch yet. I will do that on Monday.
This patch addresses the two facets of this bug:
1) First of all, it clears the error within the SAL.
2) It iterates through all the errors of that particular SAL info type
until all the errors are consumed.
The two other smaller elements of the patch are that it gets rid of some
unnecessary return codes and it adds RH to the credits for the file. We are
going to be doing a lot more work in the MCA area here at LLNL and the combined
changes are going to be substantial. These spot fixes are just the beginning.
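
For readers following along, the two facets described above (clear each error in the SAL, and keep fetching until the SAL reports that nothing is left) can be sketched as a small stand-alone loop. This is only an illustration under stated assumptions and is not the code from the attachment: the fake_sal structure, the fake_sal_* functions, the status values, and drain_sal_records are all hypothetical stand-ins that simulate the firmware with an in-memory queue. The real SAL_GET_STATE_INFO and SAL_CLEAR_STATE_INFO calls take different arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical status codes for this sketch; the values are arbitrary. */
enum { FAKE_SAL_OK = 0, FAKE_SAL_NO_INFO = 1 };

#define MAX_PENDING 8

/* Simulated firmware state for one SAL info type (hypothetical). */
struct fake_sal {
    int pending[MAX_PENDING]; /* simulated error record ids */
    size_t head;              /* index of the oldest uncleared record */
    size_t count;             /* number of records queued in total */
    int cleared;              /* how many records have been cleared */
};

/* Stand-in for SAL_GET_STATE_INFO: returns the oldest uncleared record. */
static int fake_sal_get_state_info(const struct fake_sal *sal, int *record)
{
    if (sal->head >= sal->count)
        return FAKE_SAL_NO_INFO; /* "no information available" */
    *record = sal->pending[sal->head];
    return FAKE_SAL_OK;
}

/* Stand-in for SAL_CLEAR_STATE_INFO: marks the oldest record consumed.
 * In this model, a record that is never cleared stays pending and is
 * returned again by the next get call. */
static void fake_sal_clear_state_info(struct fake_sal *sal)
{
    if (sal->head < sal->count) {
        sal->head++;
        sal->cleared++;
    }
}

/* Get, log, clear, and repeat until the simulated SAL has nothing left.
 * Returns the number of records handled. */
static size_t drain_sal_records(struct fake_sal *sal)
{
    size_t handled = 0;
    int record;

    assert(sal != NULL);
    while (fake_sal_get_state_info(sal, &record) == FAKE_SAL_OK) {
        printf("logged simulated record %d\n", record);
        fake_sal_clear_state_info(sal);
        handled++;
    }
    return handled;
}
```

In this model, calling drain_sal_records on a queue holding three records handles all three and leaves the queue empty, so a second call finds nothing to do.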
Created attachment 95740 [details]
patch that tries to make mca conform to the spec
Tried this patch and it doesn't solve the problem; there still seems to be too
much data for the console to handle. However, it has been confirmed to move the
codebase closer to the spec.
The test that I did this morning was invalid. The effectiveness of the patch still remains unknown. After looking closely at the results of running without the patch (while thinking the patch was applied), I feel more convinced that the patch will fix the problem.

*** This bug has been marked as a duplicate of 104667 ***

Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.