Created attachment 342
Test: arequal.sh /usr /mnt/distribute/usr
Bug: checksums of regular files differed.
Configuration: found on both distribute and single point-to-point setups with all performance translators.
Consistency in reproducing the issue: the bug is not consistently reproducible.
Is issue found on sockets: test not run with sockets as transport.

One binary, git-repo-config, was found to be corrupted. git-repo-config is the file from the glusterfs mount point.

raghu@booradley:~/work/user-issues/bugs/rdma-data-corruption$ ls -lh git-repo-config local.git-repo-config
-rwxr-xr-x 1 raghu users 3.6M 2010-10-09 08:40 git-repo-config*
-rwxr-xr-x 1 raghu users 3.6M 2010-10-09 08:36 local.git-repo-config*

raghu@booradley:~/work/user-issues/bugs/rdma-data-corruption$ md5sum git-repo-config local.git-repo-config
6f2d845bc5c6e9f9f57a19c46fc9757a  git-repo-config
e44ec37902b419eb7e599e5a268da18b  local.git-repo-config

A diff of the hexdumps of these two files showed that a contiguous chunk of 131056 bytes (16 bytes less than the iobuf size) was zeroed out in the corrupted file. I've attached the diff. No rdma errors were found in either the client or the server logs.
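For anyone who wants to locate the damaged region without reading a hexdump diff, here is a minimal sketch that reports contiguous runs of differing bytes using cmp -l and awk. The function name diff_runs is hypothetical and not part of the original report. Note that bytes which are already zero in the good copy will not show up as differences, so a single zeroed chunk may be reported as several shorter runs.

```shell
#!/bin/bash
# Hypothetical helper (not from the original report): print each contiguous
# run of differing bytes between two files as
#   offset=<0-based start> length=<number of bytes>
diff_runs() {
    # cmp -l prints one line per differing byte: 1-based position, then the
    # two byte values in octal. Only the position (field 1) is used here.
    cmp -l "$1" "$2" | awk '
        NR == 1        { start = $1; prev = $1; next }
        $1 != prev + 1 { printf "offset=%d length=%d\n", start - 1, prev - start + 1
                         start = $1 }
                       { prev = $1 }
        END            { if (NR > 0)
                             printf "offset=%d length=%d\n", start - 1, prev - start + 1 }
    '
}

# Example invocation with the file names from this report:
#   diff_runs local.git-repo-config git-repo-config
```

A single reported run close to 131056 bytes long would match the zeroed chunk described above.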
The bug is easily reproducible with the following shell script:

#!/bin/bash
GLUSTER_MOUNT=/mnt/gluster2
prev="empty"
while true; do
    # Copy the same source file to the mount and checksum the copy.
    cp -f /usr/lib/locale/locale-archive "$GLUSTER_MOUNT"
    sum=$(md5sum "$GLUSTER_MOUNT/locale-archive")
    # Stop as soon as the checksum differs from the previous iteration.
    if [ "$prev" != "empty" ] && [ "$prev" != "$sum" ]; then
        echo "mismatch prev=$prev sum=$sum"
        break
    fi
    prev=$sum
    rm -f "$GLUSTER_MOUNT/locale-archive"
done

locale-archive is a file of around 50MB. As of now, the minimum configuration required to reproduce this bug is distributed replicate with write-behind as the only performance translator.
The minimum configuration required to reproduce the bug is a two-node replicate setup with write-behind as the only performance translator on rdma transport.
PATCH: http://patches.gluster.com/patch/5600 in master (rpc-transport: fix race-condition between rdma-read completion and updating the count of number of vectors to be passed to rpc.)
PATCH: http://patches.gluster.com/patch/5609 in master (rpc-transport/rdma: increment post->ctx.count in a loop doing rdma_read.)