PV#958194

Description of problem:

**Customer Reported**

On a 2048p cluster configured as four 512p hosts interconnected via NumaLink, in yesterday's tests we ran a 512p MPI job split 2x256 across two hosts. This worked the first time we tried it, but then failed the next 3 times. The signature of the failure was that one host would spit out this message:

    XPMEM error: 5
    MPI: recv_xp/xpmem_copy: Input/output error
    Killed

and the other host would crash.

After crashing three of the hosts, we shut down the whole cluster and ran diagnostics: both btet and rtrt showed no errors. We reset all of the hardware (including all of the routers) and brought the machine back up; we were then able to run jobs across hosts, getting in 7 successful runs.

One of the four hosts was shut down in order to return it to the users, and MPI promptly stopped working. At this point, though, it merely failed: attempts to run never got started, rather than crashing the machine. Unloading and reloading the XPC module restored service.

Given that experience, we then deliberately downed one of the remaining three machines and tried to run across the remaining two. This crashed with the same signature as given above.

Version-Release number of selected component (if applicable):

How reproducible:
Reliably.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

With the failing test case, we quickly isolated the problem to bte_unaligned_copy(). In cases where the transfer ends exactly on a cache line, bte_unaligned_copy() was adding one additional cache line to the transfer and putting that in a temporary buffer. This is normally not an issue, as the data is not copied into the user buffer as part of the bcopy.

The fix is fairly straightforward: do not transfer the extra cache line. I pulled the code which calculates start and length into a userland program, verified the existing code was the source of the problem, then fixed the code and verified we get the correct results.
I then made equivalent changes in the kernel and am now testing this on revenue.

The patch is attached below. I have verified it applies with nothing more than line number fixups to Linus' kernel, sles10-latest, sles10-update, sles9sp3, and lbs3.6 trees. Code has been reviewed by dcn.

Index: linux-2.6.16/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.16.orig/arch/ia64/sn/kernel/bte.c	2006-03-19 23:53:29.000000000 -0600
+++ linux-2.6.16/arch/ia64/sn/kernel/bte.c	2006-11-14 11:09:14.669830260 -0600
@@ -383,14 +383,13 @@ bte_result_t bte_unaligned_copy(u64 src,
 		 * bcopy to the destination.
 		 */
 
-		/* Add the leader from source */
-		headBteLen = len + (src & L1_CACHE_MASK);
-		/* Add the trailing bytes from footer. */
-		headBteLen += L1_CACHE_BYTES - (headBteLen & L1_CACHE_MASK);
-		headBteSource = src & ~L1_CACHE_MASK;
 		headBcopySrcOffset = src & L1_CACHE_MASK;
 		headBcopyDest = dest;
 		headBcopyLen = len;
+
+		headBteSource = src - headBcopySrcOffset;
+		/* Add the leading and trailing bytes from source */
+		headBteLen = L1_CACHE_ALIGN(len + headBcopySrcOffset);
 	}
 
 	if (headBcopyLen > 0) {
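For anyone who wants to check the arithmetic outside the kernel, here is a minimal userland sketch of the two length calculations from the patch above. It is not the original test program. The function names old_bte_len(), new_bte_len() and check_examples() are hypothetical, the 128-byte cache line size is an assumption, and the macros are local stand-ins for the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 128-byte cache lines. The macros below are local
 * stand-ins for the kernel definitions, not copies of them. */
#define L1_CACHE_BYTES    UINT64_C(128)
#define L1_CACHE_MASK     (L1_CACHE_BYTES - 1)
#define L1_CACHE_ALIGN(x) (((x) + L1_CACHE_MASK) & ~L1_CACHE_MASK)

/* Hypothetical helper: BTE length as computed by the removed lines. */
static uint64_t old_bte_len(uint64_t src, uint64_t len)
{
	/* Add the leader from source */
	uint64_t n = len + (src & L1_CACHE_MASK);

	/* Pad the tail. When n is already a multiple of the line size,
	 * (n & L1_CACHE_MASK) is 0 and a whole extra line is added. */
	n += L1_CACHE_BYTES - (n & L1_CACHE_MASK);
	return n;
}

/* Hypothetical helper: BTE length as computed by the added lines. */
static uint64_t new_bte_len(uint64_t src, uint64_t len)
{
	uint64_t offset = src & L1_CACHE_MASK;

	/* Round up only when the transfer does not end on a line. */
	return L1_CACHE_ALIGN(len + offset);
}

/* Hypothetical self-check covering the boundary and ordinary cases. */
static void check_examples(void)
{
	/* Starts 32 bytes into a line, 96 bytes long: ends on a boundary. */
	assert(old_bte_len(0x1020, 96) == 256);	/* one line too many */
	assert(new_bte_len(0x1020, 96) == 128);

	/* Same start, 100 bytes long: ends mid-line, both agree. */
	assert(old_bte_len(0x1020, 100) == 256);
	assert(new_bte_len(0x1020, 100) == 256);
}
```

Calling check_examples() from a small main() exercises both cases. The two calculations differ only when the offset plus the length is an exact multiple of the cache line size, which matches the description above of a transfer that ends exactly on a cache line.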
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Just to make sure that this is clearly understood... You understand that this is really a silent data corruption problem? Right?
in 2.6.18-60.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot1--available now on partners.redhat.com. Please test and confirm that your issue is fixed. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot3--available now on partners.redhat.com. Please test and confirm that your issue is fixed. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
Greetings Red Hat Partner, A fix for this issue should be included in the latest packages contained in RHEL5.2-Snapshot4--available now on partners.redhat.com. Please test and confirm that your issue is fixed. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to ASSIGNED. If you are receiving this message in Issue Tracker, please reply with a message to Issue Tracker about your results and I will update bugzilla for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager. Thank you
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html