Bug 800326

Summary: Data corruption in stripe translator
Product: [Community] GlusterFS
Reporter: Alexander Bersenev <bay>
Component: stripe
Assignee: shishir gowda <sgowda>
Status: CLOSED WONTFIX
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.2.5
CC: gluster-bugs, nsathyan, shmohan
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-03-07 11:36:42 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- The vol file
- The output of dd
- Dump of log file
- Commands executed
- Patch to increase min stripe size to 16384

Description Alexander Bersenev 2012-03-06 10:26:44 UTC
Created attachment 567903 [details]
The vol file

Description of problem:
The stripe translator does not order the data properly and sometimes corrupts it.

Version-Release number of selected component (if applicable):
Tested on:
1. glusterfs 3.2.5 built on Nov 15 2011 08:43:14 (RHEL6)
2. glusterfs 3git built on Feb 13 2012 14:33:20 (Gentoo)

How reproducible:
Always.

Steps to Reproduce:
1. Create a new volume with stripe=4 and block-size=4096 with all bricks on one node. Turn off all caching and prefetch translators. My vol file is attached.
2. Mount it in some directory, for example in /gluster/fs/.
3. Create the test file in mounted filesystem:
perl -e 'print $_ x 4096 for(0..9,'a'..'z')' > s4096
4. Try to read it with large blocksize:
dd if=s4096 bs=1000000 2>/dev/null | hexdump -C
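For context, a rough sketch (not GlusterFS code) of how a stripe translator with these settings maps file offsets to bricks; the constants mirror the volume configured in step 1:

```python
# Illustrative only: round-robin striping with the settings from step 1.
STRIPE_COUNT = 4   # stripe=4
BLOCK_SIZE = 4096  # block-size=4096 bytes

def brick_for_offset(offset):
    """Return the index of the brick holding the byte at `offset`."""
    return (offset // BLOCK_SIZE) % STRIPE_COUNT

# The test file from step 3 repeats each of the 36 characters 0-9, a-z
# exactly 4096 times, so every stripe block holds a single character and
# any misordering of blocks is immediately visible in the hexdump.
assert brick_for_offset(0) == 0          # block 0 ('0' * 4096) -> brick 0
assert brick_for_offset(4096) == 1       # block 1 ('1' * 4096) -> brick 1
assert brick_for_offset(5 * 4096) == 1   # block 5 wraps around to brick 1
```

This is why the test file is built from repeated characters: a correct read produces the characters in order, and any scrambled stripe block changes the md5sum.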
  
Actual results:
The data is corrupted. The same read command gives different results each time:
# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
2146abf3b6cbc7e90a92aa55839da659  -
# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
f85c27c65320ad13906bf09710feaa7a  -
dd if=s4096 bs=1000000 2>/dev/null | md5sum 
3861b7a934d2d2bb77fdb6a9575d54de  -

If the block size is small, the data is not corrupted:
# dd if=s4096 bs=100 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -
# dd if=s4096 bs=100 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -
# dd if=s4096 bs=100 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -

Expected results:
The data is not corrupted.

Additional info:

Comment 1 Alexander Bersenev 2012-03-06 10:29:59 UTC
Created attachment 567906 [details]
The output of dd

Comment 2 shylesh 2012-03-06 12:11:12 UTC
Backend :xfs
OS: RHEL 6.1


1. created a stripe volume with count 4
 
Volume Name: stripe4
Type: Stripe
Volume ID: 9a43501a-d782-4f8a-9147-4dcf66a123b7
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: RHEL6.1:/export/sdb/stripe41
Brick2: RHEL6.1:/export/sdb/stripe42
Brick3: RHEL6.1:/export/sdb/stripe43
Brick4: RHEL6.1:/export/sdb/stripe44
Options Reconfigured:
diagnostics.count-fop-hits: off
diagnostics.latency-measurement: off
performance.stat-prefetch: off
cluster.stripe-block-size: 4MB


2. Mounted the volume and created a file 
  perl -e 'print $_ x 4096 for(0..9,'a'..'z')' > s4096



 
[root@RHEL6 mnt]# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -
[root@RHEL6 mnt]# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -
[root@RHEL6 mnt]# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -
[root@RHEL6 mnt]# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -
[root@RHEL6 mnt]# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -
[root@RHEL6 mnt]# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -
[root@RHEL6 mnt]# dd if=s4096 bs=1000000 2>/dev/null | md5sum 
015a232752a53b9195fd86562907bcea  -



We are not able to reproduce this issue. Could you please provide the logs?

Comment 3 Alexander Bersenev 2012-03-06 12:19:20 UTC
Please retry with a block size of 4096 bytes, not kilobytes.

Comment 4 shishir gowda 2012-03-07 07:30:03 UTC
We are looking into the bug.

In the meantime, can you please disable the io-cache and read-ahead xlators?
This fixes the problem for us.

Comment 5 Alexander Bersenev 2012-03-07 07:53:33 UTC
Created attachment 568177 [details]
Dump of log file.

I've turned off all translators (the log is in the attached file log.txt), but the problem does not disappear.

Comment 6 Alexander Bersenev 2012-03-07 08:40:40 UTC
Created attachment 568196 [details]
Commands executed

Here is a dump of executed commands beginning from volume creation.

Comment 7 shishir gowda 2012-03-07 10:37:15 UTC
Hi Alexander,

We found the issue to be related to iobufs.

We can send a maximum of GF_IOBREF_IOBUF_COUNT (16) iobufs in a response (a read, in this case). Each request tends to be 128k.

With stripe-block-size set to 4k, a 128k read needs 32 bufs, so we end up losing 16 of them.

Stripe was designed to handle block sizes of 128k or higher.

I will change the stripe-block-size minimum limit to 16k to prevent this scenario.

If needed, I can give you a patch which raises this buf count to 32.
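The arithmetic behind this explanation can be checked directly; the constants below come from comment 7 itself:

```python
# Back-of-the-envelope check of the iobuf exhaustion described above.
GF_IOBREF_IOBUF_COUNT = 16   # max iobufs per response (per comment 7)
READ_SIZE = 128 * 1024       # typical read request size, 128k

def iobufs_needed(stripe_block_size):
    # One iobuf per stripe block touched by a full-sized read request.
    return READ_SIZE // stripe_block_size

assert iobufs_needed(4096) == 32         # 4k blocks: 32 > 16, blocks are lost
assert iobufs_needed(16 * 1024) == 8     # proposed 16k minimum stays under 16
assert iobufs_needed(128 * 1024) == 1    # the block size stripe was designed for
```

With a 4k stripe block size, half of each 128k read's blocks exceed the iobuf limit and are dropped, which matches the observed corruption only at large dd block sizes: small reads (bs=100) never span enough stripe blocks to hit the limit.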

Comment 8 Alexander Bersenev 2012-03-07 11:10:35 UTC
With FUSE's hardcoded max_read = 131072, this seems to be an acceptable solution. I don't see data corruption anymore.

I think this bug can be closed.

Comment 9 shishir gowda 2012-03-07 11:36:42 UTC
As stripe is designed to work with larger block sizes, we will not fix this bug.

Comment 10 Alexander Bersenev 2012-03-09 05:37:56 UTC
Created attachment 568818 [details]
Patch to increase min stripe size to 16384

Here is a patch to change the stripe-block-size minimum limit to 16k.

As stripe is designed to work with larger block sizes and _doesn't work_ with lower values, please fix this bug by raising the minimum block size limit.