Bug 218298 - bte_unaligned_copy transfers extra cache line beyond end of page
Summary: bte_unaligned_copy transfers extra cache line beyond end of page
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: ia64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Luming Yu
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 253733
 
Reported: 2006-12-04 15:21 UTC by George Beshers
Modified: 2013-08-06 01:42 UTC (History)

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 14:40:21 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHBA-2008:0314
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5.2
Last Updated: 2008-05-20 18:43:34 UTC

Description George Beshers 2006-12-04 15:21:30 UTC
PV#958194

Description of problem:

**Customer Reported**
This occurred on a 2048p cluster configured as four 512p hosts interconnected via NumaLink.

In yesterday's tests, we ran a 512p MPI job split 2x256 across two hosts.
This worked the first time we tried it, but then failed the next 3 times.
The signature of the failure was that one host would spit out this message:

  XPMEM error: 5
  MPI: recv_xp/xpmem_copy: Input/output error
  Killed

and the other host would crash.  After crashing three of the hosts, we
shut down the whole cluster and ran diagnostics: both btet and rtrt showed
no errors.  We reset all of the hardware (including all of the routers)
and brought the machine back up; we were then able to run jobs across
hosts, getting in 7 successful runs.  One of the four hosts was then shut
down in order to return it to the users, and MPI promptly stopped working.
At this point it merely failed: attempts to run never got started, rather
than crashing the machine.  Unloading and reloading the XPC module
restored service.  Given that experience, we then deliberately downed one
of the remaining three machines and tried to run across the remaining
two.  This crashed with the same signature as given above.

How reproducible:
Reliably.


Additional info:

With the failing test case, we quickly isolated the problem to
bte_unaligned_copy().  In cases where the transfer ends exactly on a cache
line boundary, bte_unaligned_copy() was adding one additional cache line to
the transfer and putting that in a temporary buffer.  This is normally not
an issue, as the extra data is not copied into the user buffer as part of
the bcopy.  When the transfer ends exactly at the end of a page, however,
the extra cache line is read from beyond the end of that page.

The fix is fairly straightforward: do not transfer the extra cache line.

I pulled the code which calculates start and length into a userland
program, verified the existing code was the source of the problem, then
fixed the code and verified we get the correct results.  I then made
equivalent changes in the kernel and am now testing this on revenue.
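
For illustration, a minimal sketch of what such a userland check might look
like. This is not the reporter's actual program. It assumes 128-byte L1 cache
lines (the real value comes from the kernel configuration), and the function
names bte_len_old, bte_len_new and bte_src_new are hypothetical; only the
arithmetic is taken from bte_unaligned_copy() before and after the fix.

```c
#include <stdint.h>

/* Assumption: 128-byte L1 cache lines; the kernel derives this from its config. */
#define L1_CACHE_BYTES    128ULL
#define L1_CACHE_MASK     (L1_CACHE_BYTES - 1)
#define L1_CACHE_ALIGN(x) (((x) + L1_CACHE_MASK) & ~L1_CACHE_MASK)

/* BTE transfer length as computed before the fix (hypothetical helper name). */
uint64_t bte_len_old(uint64_t src, uint64_t len)
{
	/* Add the leading bytes between the line start and src. */
	uint64_t bte_len = len + (src & L1_CACHE_MASK);

	/*
	 * Pad to the next line boundary. This always adds between 1 and
	 * L1_CACHE_BYTES bytes, so a length that is already line-aligned
	 * grows by one full extra cache line.
	 */
	bte_len += L1_CACHE_BYTES - (bte_len & L1_CACHE_MASK);
	return bte_len;
}

/* BTE transfer length as computed after the fix (hypothetical helper name). */
uint64_t bte_len_new(uint64_t src, uint64_t len)
{
	uint64_t offset = src & L1_CACHE_MASK;

	/* Round up only when needed; an aligned length is left unchanged. */
	return L1_CACHE_ALIGN(len + offset);
}

/* Line-aligned source address for the BTE transfer, as in the fixed code. */
uint64_t bte_src_new(uint64_t src)
{
	return src - (src & L1_CACHE_MASK);
}
```

With src = 0x1010 and len = 112 the copy ends exactly on a line boundary:
bte_len_old() returns 256 (one line too many) while bte_len_new() returns
128.  With len = 100 both return 128, which is why the problem only shows
up for transfers that end on a cache line boundary.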

The patch is attached below.  I have verified it applies with nothing more
than line number fixups to Linus' kernel, sles10-latest, sles10-update,
sles9sp3, and lbs3.6 trees.

Code has been reviewed by dcn.

Index: linux-2.6.16/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.16.orig/arch/ia64/sn/kernel/bte.c	2006-03-19 23:53:29.000000000 -0600
+++ linux-2.6.16/arch/ia64/sn/kernel/bte.c	2006-11-14 11:09:14.669830260 -0600
@@ -383,14 +383,13 @@ bte_result_t bte_unaligned_copy(u64 src,
 		 * bcopy to the destination.
 		 */
 
-		/* Add the leader from source */
-		headBteLen = len + (src & L1_CACHE_MASK);
-		/* Add the trailing bytes from footer. */
-		headBteLen += L1_CACHE_BYTES - (headBteLen & L1_CACHE_MASK);
-		headBteSource = src & ~L1_CACHE_MASK;
 		headBcopySrcOffset = src & L1_CACHE_MASK;
 		headBcopyDest = dest;
 		headBcopyLen = len;
+
+		headBteSource = src - headBcopySrcOffset;
+		/* Add the leading and trailing bytes from source */
+		headBteLen = L1_CACHE_ALIGN(len + headBcopySrcOffset);
 	}
 
 	if (headBcopyLen > 0) {

Comment 1 RHEL Program Management 2007-11-20 05:06:14 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Tony Ernst 2007-11-21 14:46:44 UTC
Just to make sure that this is clearly understood...  You understand that this
is really a silent data corruption problem?  Right?


Comment 4 Don Zickus 2007-12-14 18:34:01 UTC
The fix is in 2.6.18-60.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 6 John Poelstra 2008-03-21 03:59:26 UTC
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot1--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you

Comment 7 John Poelstra 2008-04-02 21:40:09 UTC
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot3--available now on partners.redhat.com.  



Comment 8 John Poelstra 2008-04-09 22:45:52 UTC
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot4--available now on partners.redhat.com.  



Comment 10 errata-xmlrpc 2008-05-21 14:40:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


