Bug 764186 (GLUSTER-2454) - rdma data corruption
Summary: rdma data corruption
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-2454
Product: GlusterFS
Classification: Community
Component: rdma
Version: mainline
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-02-23 14:56 UTC by Raghavendra G
Modified: 2015-12-01 16:45 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTP
Mount Type: fuse
Documentation: DNR
CRM:
Verified Versions:


Attachments
debug messages (410.55 KB, application/octet-stream)
2011-02-23 12:01 UTC, Raghavendra G

Description Raghavendra G 2011-02-23 12:01:19 UTC
Created attachment 439 [details]
/var/log/XFree86.0.log following seggie of XFree86 -configure

Debug messages revealed that the second payload vector passed from rpcsvc to server-protocol was corrupted. The same vector was intact in transport/rdma, which points to the corruption happening in rpcsvc.

Comment 1 Raghavendra G 2011-02-23 14:56:33 UTC
Reported by "Beat Rubischon"<beat.rubischon>

<bug report>
Hello!

I found some memory corruption in the RDMA transport layer.

Setup is CentOS 5.5, Mellanox OFED 1.5.2 / OpenFabrics OFED 1.5.2,
ConnectX-2 cards, GlusterFS 3.1.2 / Git Master Branch.

Application is ANSYS CFX with transient cases, running with strange
core counts like 6 or 12.

Symptoms are failure during the write out of the case. Errors are
recorded in the brick's and client's logs:

node24:/var/log/glusterfs/home.log
[2011-02-04 15:41:19.688110] W [fuse-bridge.c:1761:fuse_writev_cbk]
glusterfs-fuse: 29810266: WRITE => -1 (Bad address)

server2:/var/log/glusterfs/bricks/brick07.log
[2011-02-04 15:41:19.687733] E [posix.c:2504:posix_writev] home-posix:
write failed: offset 538534184, Bad address

I was able to reproduce the error using a single brick and a single
client. Running server and client on the same system did not trigger the
error; the data must cross the wire to trigger the bug. Switching to TCP
over IPoIB was a successful workaround.

It looks like a pointer in the iovec structure used by the writev is
screwed up during the transport over RDMA. I can imagine that the
debugging would be rather hard, hopefully you'll be able to find the
root cause. Feel free to ask for additional logs or traces, I'll try to
provide them.

Beat
</bug report>
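
For readers unfamiliar with the symptom: "Bad address" is the message for errno EFAULT, which write(2) and writev(2) return when a buffer pointer lies outside the caller's accessible address space. The sketch below shows how a bad iov_base in the second vector produces that error. It is an illustration only, not GlusterFS code. Both helper names (write_vectors, make_inaccessible_page) are hypothetical, and writing each vector with its own write(2) call is an assumption made to keep the failure easy to see.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper (not GlusterFS code): write every vector with its own
 * write(2) call. Returns 0 on success, or the errno of the first failure. */
static int write_vectors(int fd, const struct iovec *vec, int count)
{
    assert(vec != NULL || count == 0);

    for (int i = 0; i < count; i++) {
        size_t done = 0;

        while (done < vec[i].iov_len) {
            ssize_t n = write(fd, (const char *)vec[i].iov_base + done,
                              vec[i].iov_len - done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted, retry the same chunk */
                return errno;      /* EFAULT if iov_base is not readable */
            }
            if (n == 0)
                return EIO;        /* no progress, avoid looping forever */
            done += (size_t)n;
        }
    }
    return 0;
}

/* Hypothetical stand-in for a corrupted iov_base: one page mapped with
 * PROT_NONE, so any attempt to read from it fails. Returns NULL on error. */
static void *make_inaccessible_page(void)
{
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Passing a two-element array whose first entry points at valid data and whose second entry points at the page from make_inaccessible_page() makes write_vectors() fail on the second vector, which mirrors the "second payload vector" observation in the description.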

Comment 2 Anand Avati 2011-03-01 04:40:39 UTC
PATCH: http://patches.gluster.com/patch/6250 in master (rpcsvc: Handle more than one payload vectors.)
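
The patch title indicates that rpcsvc needed to handle more than one payload vector. The sketch below illustrates the general idea under stated assumptions and is not the contents of the patch. The struct request type, the MAX_PAYLOAD_VECTORS limit, and both function names are hypothetical. The point is that a request must record every payload vector and its count, rather than only the first one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#define MAX_PAYLOAD_VECTORS 16   /* hypothetical limit for this sketch */

/* Hypothetical request type that carries all payload vectors. */
struct request {
    struct iovec payload[MAX_PAYLOAD_VECTORS];
    int          payload_count;
};

/* Record all `count` payload vectors in the request.
 * Returns 0 on success, -1 if the arguments are out of range. */
static int request_set_payload(struct request *req,
                               const struct iovec *vec, int count)
{
    assert(req != NULL);

    if (count < 0 || count > MAX_PAYLOAD_VECTORS ||
        (count > 0 && vec == NULL))
        return -1;

    if (count > 0)
        memcpy(req->payload, vec, (size_t)count * sizeof(*vec));
    req->payload_count = count;
    return 0;
}

/* Total payload length across every recorded vector. */
static size_t request_payload_len(const struct request *req)
{
    size_t total = 0;

    for (int i = 0; i < req->payload_count; i++)
        total += req->payload[i].iov_len;
    return total;
}
```

A consumer that reads req->payload[1] after request_set_payload() sees the same base pointer and length the transport supplied, which is the property the description reports as violated for the second vector.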

