Red Hat Bugzilla – Bug 763609
data corruption while running arequal.
Last modified: 2015-12-01 11:45:32 EST
Created attachment 342
Test: arequal.sh /usr /mnt/distribute/usr
Bug: checksum of regular files was different.
Configuration: found on both distribute and single point-to-point setups with all performance translators loaded.
Consistency in reproducing the issue: the bug is not consistently reproducible.
Is issue found on sockets: Test not run with sockets as transport.
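arequal.sh itself is not shown in this report. As a rough stand-in for the regular-file checksum comparison it performs, the following sketch lists files whose checksum differs between two trees. The helper names `tree_sums` and `compare_trees` are hypothetical, and md5sum is an assumed choice of checksum.

```shell
#!/bin/sh
# Hypothetical helper (not part of arequal): print "checksum  ./relative-path"
# for every regular file under a directory, sorted by path.
tree_sums() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# Hypothetical helper: show files whose checksum differs between two trees.
# Returns 0 when the trees match and non-zero otherwise.
compare_trees() {
    src_sums=$(mktemp) && dst_sums=$(mktemp) || return 2
    tree_sums "$1" > "$src_sums"
    tree_sums "$2" > "$dst_sums"
    diff "$src_sums" "$dst_sums"
    status=$?
    rm -f "$src_sums" "$dst_sums"
    return $status
}
```

For the test in this report the call would be `compare_trees /usr /mnt/distribute/usr`.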
One binary, git-repo-config, was found to be corrupted. git-repo-config is the copy taken from the glusterfs mount point.
raghu@booradley:~/work/user-issues/bugs/rdma-data-corruption$ ls -lh git-repo-config local.git-repo-config
-rwxr-xr-x 1 raghu users 3.6M 2010-10-09 08:40 git-repo-config*
-rwxr-xr-x 1 raghu users 3.6M 2010-10-09 08:36 local.git-repo-config*
raghu@booradley:~/work/user-issues/bugs/rdma-data-corruption$ md5sum git-repo-config local.git-repo-config
A diff of the hexdumps of these two files showed that a contiguous chunk of 131056 bytes (16 bytes less than the iobuf size) was zeroed out in the corrupted file. I've attached the diff.
No rdma errors were found in either the client or the server logs.
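A minimal sketch of how the damaged range can be located without reading a full hexdump diff. The helper name `diff_range` is hypothetical. It relies on `cmp -l`, which prints one line per differing byte with a 1-based offset. Bytes that were already zero in the good file do not count as differing, so `differing` can be smaller than `span`.

```shell
#!/bin/sh
# Hypothetical helper: report the byte range over which two files differ.
# cmp -l prints one line per differing byte: the 1-based offset, then both
# byte values in octal. Offsets are converted to 0-based here.
diff_range() {
    cmp -l "$1" "$2" | awk '
        NR == 1 { first = $1 }
        { last = $1 }
        END {
            if (NR == 0) { print "identical"; exit }
            printf "first=%d last=%d span=%d differing=%d\n",
                   first - 1, last - 1, last - first + 1, NR
        }'
}
```

Run as `diff_range local.git-repo-config git-repo-config`. For the corruption described here the span would be expected to come out at 131056, which is 131072 - 16.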
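The bug is easily reproducible with the following shell script:
prev="empty"  # assumed initial value, implied by the test below
while true; do
    cp -f /usr/lib/locale/locale-archive "$GLUSTER_MOUNT"
    sum=$(md5sum "$GLUSTER_MOUNT/locale-archive" | cut -d' ' -f1)  # checksum step assumed (md5sum)
    if [ "$prev" != "empty" -a "$prev" != "$sum" ]; then
        echo "mismatch prev=$prev sum=$sum"
    fi
    prev=$sum
    rm -f "$GLUSTER_MOUNT/locale-archive"
done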
locale-archive is a file of size around 50MB.
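The loop above compares each copy with the previous one and runs forever. A bounded variant that compares every copy against the checksum of the source file may be easier to script around. This is only a sketch: the function name `repro_copy` and its arguments are hypothetical, and md5sum is an assumed choice of checksum.

```shell
#!/bin/sh
# Hypothetical bounded variant of the reproduction loop.
# Usage: repro_copy SRC_FILE MOUNT_DIR ITERATIONS
# Prints one line per mismatch; returns 0 only if every copy matched the source.
repro_copy() {
    src=$1
    mount=$2
    iterations=$3
    dst="$mount/$(basename "$src")"
    want=$(md5sum < "$src" | cut -d' ' -f1)
    mismatches=0
    i=0
    while [ "$i" -lt "$iterations" ]; do
        cp -f "$src" "$dst"
        got=$(md5sum < "$dst" | cut -d' ' -f1)
        if [ "$got" != "$want" ]; then
            echo "mismatch on iteration $i: want=$want got=$got"
            mismatches=$((mismatches + 1))
        fi
        rm -f "$dst"
        i=$((i + 1))
    done
    [ "$mismatches" -eq 0 ]
}
```

For example, `repro_copy /usr/lib/locale/locale-archive "$GLUSTER_MOUNT" 100` makes 100 copies and reports each one whose checksum differs from the source.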
As of now, the minimum configuration required to reproduce this bug is distributed replicate with write-behind as the only performance translator.
Minimum configuration required to reproduce the bug is a two node replicate setup with write-behind as the only performance translator on rdma transport.
PATCH: http://patches.gluster.com/patch/5600 in master (rpc-transport: fix race-condition between rdma-read completion and updating the count of number of vectors to be passed to rpc.)
PATCH: http://patches.gluster.com/patch/5609 in master (rpc-transport/rdma: increment post->ctx.count in a loop doing rdma_read.)