Bug 763609 (GLUSTER-1877)

Summary: data corruption while running arequal.
Product: [Community] GlusterFS Reporter: Raghavendra G <raghavendra>
Component: rdma Assignee: Raghavendra G <raghavendra>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: low    
Version: mainline CC: gluster-bugs, rabhat
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: RTP Mount Type: fuse
Documentation: DNR CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Attachments:
Description Flags
diff of hexdump of git-repo-config none

Description Raghavendra G 2010-10-08 22:25:35 EDT
Created attachment 342
Comment 1 Raghavendra G 2010-10-09 01:23:55 EDT
Test: arequal.sh /usr /mnt/distribute/usr
Bug: checksum of regular files was different.
Configuration: found it on both distribute and single point to point setups with all performance translators.
Consistency in reproducing the issue: The bug is not consistently reproducible.
Is issue found on sockets: Test not run with sockets as transport.

One binary, git-repo-config, was found to be corrupted. In the listing below, git-repo-config is the copy from the glusterfs mount point.

raghu@booradley:~/work/user-issues/bugs/rdma-data-corruption$ ls -lh git-repo-config local.git-repo-config 
-rwxr-xr-x 1 raghu users 3.6M 2010-10-09 08:40 git-repo-config*
-rwxr-xr-x 1 raghu users 3.6M 2010-10-09 08:36 local.git-repo-config*

raghu@booradley:~/work/user-issues/bugs/rdma-data-corruption$ md5sum git-repo-config local.git-repo-config 
6f2d845bc5c6e9f9f57a19c46fc9757a  git-repo-config
e44ec37902b419eb7e599e5a268da18b  local.git-repo-config

A diff of the hexdumps of these two files showed that a contiguous chunk of 131056 bytes (16 bytes less than the iobuf size, i.e. 131072 - 16) was zeroed out in the corrupted file. I've attached the diff.

No rdma errors were found in either the client or the server logs.
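
The byte range of such a corruption can also be summarized without reading a hexdump diff by hand. The following is a minimal sketch, assuming GNU cmp and awk are available; the function name diff_range is hypothetical and is not part of arequal or glusterfs. It reports the first and last differing offsets (1-based, as printed by cmp -l), how many bytes differ, the span they cover, and how many of the differing bytes are non-zero in the second file.

```shell
#!/bin/bash
# Sketch: summarize the byte range in which two files differ.
# Usage: diff_range GOOD_FILE SUSPECT_FILE
diff_range() {
    # cmp -l prints one line per differing byte:
    # 1-based offset, byte value in file 1 (octal), byte value in file 2 (octal).
    cmp -l "$1" "$2" 2>/dev/null | awk '
        NR == 1 { first = $1 }
        { last = $1; n++; if ($3 != 0) nonzero++ }
        END {
            if (n == 0) { print "identical"; exit }
            printf "first=%d last=%d differing=%d span=%d nonzero_in_second=%d\n",
                   first, last, n, last - first + 1, nonzero + 0
        }'
}
```

For example, diff_range local.git-repo-config git-repo-config. A zeroed-out chunk would show up as nonzero_in_second=0, with span giving the size of the affected region. The differing count can be smaller than span, because bytes that were already zero in the good file do not count as differences.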
Comment 2 Raghavendra G 2010-10-27 21:14:04 EDT
The bug is easily reproducible with the following shell script:

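#!/bin/bash
# Repeatedly copy a large file to the mount and compare checksums across iterations.

GLUSTER_MOUNT=/mnt/gluster2
prev="empty"
while true; do
    cp -f /usr/lib/locale/locale-archive "$GLUSTER_MOUNT"
    sum=$(md5sum "$GLUSTER_MOUNT/locale-archive")
    # Stop as soon as a copy's checksum differs from the previous copy's.
    if [ "$prev" != "empty" ] && [ "$prev" != "$sum" ]; then
        echo "mismatch prev=$prev sum=$sum"
        break
    fi
    prev=$sum
    rm -f "$GLUSTER_MOUNT/locale-archive"
done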

locale-archive is a file of size around 50MB.
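
The loop above compares each copy with the previous copy, so it would not flag a run in which every copy is corrupted identically from the first iteration on. A possible variant, sketched here under the assumption that the local source file is trustworthy, compares each copy against the checksum of the source instead and stops after a bounded number of iterations. The function name check_copies and its arguments are hypothetical.

```shell
#!/bin/bash
# Sketch: copy SRC into DEST_DIR repeatedly and compare each copy's
# checksum with the checksum of the source file.
# Usage: check_copies SRC DEST_DIR MAX_ITERATIONS
check_copies() {
    local src=$1 dest_dir=$2 max=$3
    local name want got i
    name=$(basename "$src")
    # Reading from stdin keeps the file name out of md5sum's output.
    want=$(md5sum < "$src")
    for ((i = 1; i <= max; i++)); do
        cp -f "$src" "$dest_dir/" || return 2
        got=$(md5sum < "$dest_dir/$name")
        if [ "$got" != "$want" ]; then
            # Leave the bad copy in place so it can be inspected.
            echo "mismatch on iteration $i: want=$want got=$got"
            return 1
        fi
        rm -f "$dest_dir/$name"
    done
    echo "no mismatch in $max iterations"
}
```

With the paths used in this report, it could be invoked as check_copies /usr/lib/locale/locale-archive /mnt/gluster2 1000.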

As of now, the minimum configuration required to reproduce this bug is distributed replicate with write-behind as the only performance translator.
Comment 3 Raghavendra G 2010-10-27 21:51:19 EDT
Minimum configuration required to reproduce the bug is a two node replicate setup with write-behind as the only performance translator on rdma transport.
Comment 4 Anand Avati 2010-10-29 03:42:26 EDT
PATCH: http://patches.gluster.com/patch/5600 in master (rpc-transport: fix race-condition between rdma-read completion and updating the count of number of vectors to be passed to rpc.)
Comment 5 Anand Avati 2010-11-07 20:15:06 EST
PATCH: http://patches.gluster.com/patch/5609 in master (rpc-transport/rdma: increment post->ctx.count in a loop doing rdma_read.)