Bug 763609 (GLUSTER-1877) - data corruption while running arequal.
Summary: data corruption while running arequal.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-1877
Product: GlusterFS
Classification: Community
Component: rdma
Version: mainline
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-10-09 05:23 UTC by Raghavendra G
Modified: 2015-12-01 16:45 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Regression: RTP
Mount Type: fuse
Documentation: DNR
CRM:
Verified Versions:
Embargoed:


Attachments
diff of hexdump of git-repo-config (392.50 KB, application/octet-stream)
2010-10-09 02:25 UTC, Raghavendra G

Description Raghavendra G 2010-10-09 02:25:35 UTC
Created attachment 342

Comment 1 Raghavendra G 2010-10-09 05:23:55 UTC
Test: arequal.sh /usr /mnt/distribute/usr
Bug: the checksums of regular files differed.
Configuration: found on both distribute and single point-to-point setups, with all performance translators loaded.
Consistency in reproducing the issue: the bug is not consistently reproducible.
Is issue found on sockets: the test was not run with sockets as the transport.

One binary, git-repo-config, was found to be corrupted. In the listing below, git-repo-config is the file from the glusterfs mount point and local.git-repo-config is the local copy.

raghu@booradley:~/work/user-issues/bugs/rdma-data-corruption$ ls -lh git-repo-config local.git-repo-config 
-rwxr-xr-x 1 raghu users 3.6M 2010-10-09 08:40 git-repo-config*
-rwxr-xr-x 1 raghu users 3.6M 2010-10-09 08:36 local.git-repo-config*

raghu@booradley:~/work/user-issues/bugs/rdma-data-corruption$ md5sum git-repo-config local.git-repo-config 
6f2d845bc5c6e9f9f57a19c46fc9757a  git-repo-config
e44ec37902b419eb7e599e5a268da18b  local.git-repo-config

A diff of the hexdumps of these two files showed that one contiguous chunk of 131056 bytes (16 bytes less than the iobuf size) was zeroed out in the corrupted file. I've attached the diff.
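
The extent of a damaged region like this one can also be measured directly, without reading the full hexdump diff. The following is a minimal sketch, not something that was run for this report: `diff_span` is a hypothetical helper name, and it relies only on `cmp -l`, which prints the 1-based offset of every differing byte. Bytes that were already zero in the good file do not count as differences, so `differing` can be smaller than `span`.

```shell
#!/bin/bash

# Hypothetical helper (not part of arequal or GlusterFS).
# Usage: diff_span GOOD_FILE SUSPECT_FILE
# Prints the 0-based offsets of the first and last differing bytes,
# the length of the span between them, and the number of differing bytes.
diff_span() {
    # cmp -l emits one line per differing byte: "offset byte1 byte2",
    # with a 1-based decimal offset and octal byte values.
    cmp -l "$1" "$2" | awk '
        NR == 1 { first = $1 }
        { last = $1; n++ }
        END {
            if (n == 0) { print "identical"; exit }
            printf "first=%d last=%d span=%d differing=%d\n",
                   first - 1, last - 1, last - first + 1, n
        }'
}
```

Run as `diff_span local.git-repo-config git-repo-config`, this would be expected to report a span close to the 131056 bytes seen in the attached diff. That figure is 131072 - 16, which implies an iobuf size of 128 KiB.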

No rdma errors were found in either the client or the server logs.

Comment 2 Raghavendra G 2010-10-28 01:14:04 UTC
The bug is easily reproducible with the following shell script:

#!/bin/bash

# Repeatedly copy a large file onto the GlusterFS mount and stop as soon as
# its checksum differs from the one seen in the previous iteration.
GLUSTER_MOUNT=/mnt/gluster2
prev="empty"
while true; do
    cp -f /usr/lib/locale/locale-archive "$GLUSTER_MOUNT"
    sum=$(md5sum "$GLUSTER_MOUNT/locale-archive")
    if [ "$prev" != "empty" ] && [ "$prev" != "$sum" ]; then
        echo "mismatch prev=$prev sum=$sum"
        break
    fi
    prev=$sum
    rm -f "$GLUSTER_MOUNT/locale-archive"
done

locale-archive is a file of around 50 MB.

As of now, the minimum configuration required to reproduce this bug is distributed replicate with write-behind as the only performance translator.
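
The script in this comment compares each checksum with the one from the previous iteration. A stricter variant compares every copy against the checksum of the source file, so that a corrupted first copy is caught as well. The following is a minimal sketch under that assumption: `copy_and_verify` is a hypothetical helper name, and it was not used in the original testing.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original report).
# Usage: copy_and_verify SRC_FILE DST_DIR
# Copies SRC_FILE into DST_DIR once and compares the MD5 of the copy with
# the MD5 of the source. Returns 0 on match, 1 on mismatch, 2 on I/O error.
# A mismatching copy is left in place so it can be inspected afterwards.
copy_and_verify() {
    local src=$1 dst_dir=$2
    local dst want got
    dst="$dst_dir/$(basename "$src")"

    # Reading via stdin keeps the file name out of md5sum's output,
    # so the two results can be compared as plain strings.
    want=$(md5sum < "$src") || return 2
    cp -f "$src" "$dst" || return 2
    got=$(md5sum < "$dst") || return 2

    if [ "$want" != "$got" ]; then
        echo "mismatch: source=${want%% *} copy=${got%% *} file=$dst" >&2
        return 1
    fi
    rm -f "$dst"
}
```

A loop such as `while copy_and_verify /usr/lib/locale/locale-archive /mnt/gluster2; do :; done` would then stop at the first corrupted copy. The paths are the ones used in the script above.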

Comment 3 Raghavendra G 2010-10-28 01:51:19 UTC
The minimum configuration required to reproduce the bug is a two-node replicate setup on rdma transport, with write-behind as the only performance translator.

Comment 4 Anand Avati 2010-10-29 07:42:26 UTC
PATCH: http://patches.gluster.com/patch/5600 in master (rpc-transport: fix race-condition between rdma-read completion and updating the count of number of vectors to be passed to rpc.)

Comment 5 Anand Avati 2010-11-08 01:15:06 UTC
PATCH: http://patches.gluster.com/patch/5609 in master (rpc-transport/rdma: increment post->ctx.count in a loop doing rdma_read.)

