Bug 764186 (GLUSTER-2454)
| Summary: | rdma data corruption | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Raghavendra G <raghavendra> |
| Component: | rdma | Assignee: | Raghavendra G <raghavendra> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | mainline | CC: | gluster-bugs |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | RTP | Mount Type: | fuse |
| Documentation: | DNR | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Reported by "Beat Rubischon" <beat.rubischon>

<bug report>
Hello!

I found some memory corruption in the RDMA transport layer.

Setup is CentOS 5.5, Mellanox OFED 1.5.2 / OpenFabrics OFED 1.5.2, ConnectX-2 cards, GlusterFS 3.1.2 / Git Master Branch. Application is ANSYS CFX with transient cases, running with strange core counts like 6 or 12.

Symptoms are failure during the write out of the case. Errors are recorded in the brick's and client's logs:

node24:/var/log/glusterfs/home.log
[2011-02-04 15:41:19.688110] W [fuse-bridge.c:1761:fuse_writev_cbk] glusterfs-fuse: 29810266: WRITE => -1 (Bad address)

server2:/var/log/glusterfs/bricks/brick07.log
[2011-02-04 15:41:19.687733] E [posix.c:2504:posix_writev] home-posix: write failed: offset 538534184, Bad address

I was able to reproduce the error using a single brick and a single client. Running server and client on the same system didn't trigger the error; the data must pass over a wire to trigger the bug. Switching to TCP over IPoIB was a successful workaround.

It looks like a pointer in the iovec structure used by the writev is screwed up during the transport over RDMA. I can imagine that the debugging would be rather hard; hopefully you'll be able to find the root cause. Feel free to ask for additional logs or traces, I'll try to provide them.

Beat
</bug report>

PATCH: http://patches.gluster.com/patch/6250 in master (rpcsvc: Handle more than one payload vectors.)
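For anyone trying to see the symptom in isolation: the "Bad address" text in both logs is the message string for EFAULT, which writev(2) returns when an iov_base points outside the caller's address space. The following is a minimal sketch of that behaviour on Linux, not GlusterFS code. The helper name writev_errno and the scratch file path are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of GlusterFS): writes a single iovec to an
 * unlinked scratch file and returns 0 on success or the errno on failure.
 */
int writev_errno(void *base, size_t len)
{
    assert(len > 0);

    char path[] = "/tmp/iovec-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return errno;
    unlink(path); /* the file disappears once fd is closed */

    struct iovec iov = { .iov_base = base, .iov_len = len };

    errno = 0;
    ssize_t n = writev(fd, &iov, 1);
    int err = (n < 0) ? errno : 0;

    close(fd);
    return err;
}
```

Calling writev_errno with a valid buffer should return 0, while passing an unmapped address such as (void *)8 should return EFAULT. That is consistent with the reporter's guess that an iovec pointer was damaged before posix_writev was reached.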
Created attachment 439 [details]
/var/log/XFree86.0.log following seggie of XFree86 -configure

Debug messages revealed that the second payload vector passed to server-protocol from rpcsvc was corrupted. But the vector was fine in transport/rdma, which pointed to corruption happening in rpcsvc.
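To illustrate the class of problem the patch title describes (a payload split across more than one vector), here is a rough sketch of code that walks every iovec instead of assuming a single one. This is not the rpcsvc change itself. The names payload_len and payload_flatten are hypothetical, and flattening into one buffer is only one possible way to keep all vectors intact.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical helper: total number of bytes across all payload vectors. */
size_t payload_len(const struct iovec *vec, int count)
{
    size_t total = 0;
    for (int i = 0; i < count; i++)
        total += vec[i].iov_len;
    return total;
}

/*
 * Hypothetical helper: copies every vector, in order, into one heap buffer.
 * Returns NULL on allocation failure; the caller frees the result.
 */
char *payload_flatten(const struct iovec *vec, int count, size_t *out_len)
{
    assert(vec != NULL && count > 0 && out_len != NULL);

    size_t total = payload_len(vec, count);
    char *buf = malloc(total ? total : 1);
    if (buf == NULL)
        return NULL;

    size_t off = 0;
    for (int i = 0; i < count; i++) {
        if (vec[i].iov_len > 0) {
            memcpy(buf + off, vec[i].iov_base, vec[i].iov_len);
            off += vec[i].iov_len;
        }
    }

    *out_len = total;
    return buf;
}
```

Code that only looks at vec[0] would silently drop or mis-point the remaining vectors, which would match the observation that the second vector was the one found corrupted.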