Description of problem:
=======================
In a replicate volume (1x2), when a file is in split-brain state, I/Os on the file still succeed, and self-heal happens from the brick with the larger file size to the other brick.

Version-Release number of selected component (if applicable):
===============================================================
glusterfs 3.4.0.32rhs built on Sep 6 2013 10:26:11

How reproducible:
====================
Every time

1. Create a replicate volume, set self-heal-daemon to off, and start the volume.

root@fan [Sep-07-2013-14:08:58] >gluster v info

Volume Name: vol_dis_1_rep_2
Type: Replicate
Volume ID: f5c43519-b5eb-4138-8219-723c064af71c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fan.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1_rep_2_b0
Brick2: mia.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1_rep_2_b1
Options Reconfigured:
server.allow-insecure: on
performance.stat-prefetch: off
performance.write-behind: off
cluster.self-heal-daemon: off

2. Create fuse, nfs and cifs mounts.

3. From all the mounts, execute the following script (pass a different file name from each mount point):

test_script.sh <filename>:
======================
#!/bin/bash
pwd=`pwd`
filename="${pwd}/$1"
(
echo "Time before flock : `date`"
flock -x 200
echo "Time after flock : `date`"
echo -e "\nWriting to file : $filename"
for i in `seq 1 1000`; do echo "Hello $i" >&200 ; sleep 1; done
echo "Time after the writes are successful : `date`"
)200>>$filename

4. While the writes are in progress, bring down brick-1.

5. After some time, bring back brick-1 and bring down brick-0 at almost the same time (a situation leading to split-brain).

6. Let the writes on the file progress for some time.

7. Bring brick-0 back online (split-brain state).

Actual Result:
=============
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Fuse and Cifs mount behavior:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1. Writes from the mount point are successful without reporting an I/O error.
2. Data is self-healed from whichever brick has the larger file size.
3. Once the self-heal is complete, the change-logs are cleared on the files.
4. Once the writes are complete, "cat testfile" is successful from the mount point.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
NFS Behavior
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1. Writes from the mount point are successful without reporting an I/O error.
2. Changelogs are not cleared.
3. Once the writes are complete, "cat testfile" from the mount gives an I/O error.

Expected results:
====================
When a file is in split-brain state, I/Os on it should fail.
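For completeness, the split-brain state itself can be confirmed on the servers by dumping the AFR changelog xattrs on both bricks: a file is in data split-brain when each copy carries a non-zero pending-data counter pointing at the other brick. A minimal sketch, assuming the brick paths from step 1 and the testfile written by test_script.sh:

# Run on the respective servers (fan and mia); each command dumps the
# extended attributes of the copy stored on that brick.
getfattr -d -e hex -m . /rhs/bricks/vol_dis_1_rep_2_b0/testfile
getfattr -d -e hex -m . /rhs/bricks/vol_dis_1_rep_2_b1/testfile
# Look at the trusted.afr.vol_dis_1_rep_2-client-* values: when each copy
# shows a non-zero data counter in the xattr named after the opposite brick's
# client, the replicas blame each other and the file is in data split-brain.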
SOS Reports: http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/1005485

fuse mount process info:
========================
root@darrel [Sep-07-2013-14:29:36] >ps -ef | grep gluster
root 2335 1 0 07:35 ? 00:00:11 /usr/sbin/glusterfs --volfile-id=/vol_dis_1_rep_2 --volfile-server=mia /mnt/gm1
Targeting for 3.0.0 (Denali) release.
This issue will be seen if post-op-delay is set to a non-zero value and the bricks go down and come back within the post-op-delay window. A patch for this has been sent upstream: http://review.gluster.com/#/c/5635/ but that patch causes performance degradation.
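For context, the delayed post-op window is controlled by a per-volume option, so it can be checked and shrunk while evaluating the trade-off described above. A minimal sketch, assuming the option is exposed as cluster.post-op-delay-secs on the installed build (verify the name and default with `gluster volume set help` first):

# Check that the option exists on this build and see its default:
gluster volume set help | grep -A3 post-op-delay
# Setting it to 0 removes the window in which a brick can go down and come
# back before the changelog post-op is flushed, at the cost of extra xattr
# (changelog) traffic and hence some performance:
gluster volume set vol_dis_1_rep_2 cluster.post-op-delay-secs 0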
The following test case was executed on "glusterfs 3.6.0.28 built on Sep 3 2014 10:13:12".

Case:
=======
1. Create a 2 x 2 distribute-replicate volume and start the volume. Set the data-self-heal volume option to "off".

2. Create 2 fuse and 2 nfs mounts from 2 clients.

3. Create 10 files from one of the mounts.

4. From 1 fuse and 1 nfs mount on each client, open fds on all 10 files and start writing to the fds:

exec 5>./file1
exec 6>./file2
exec 7>./file3
exec 8>./file4
exec 9>./file5
exec 10>./file6
exec 11>./file7
exec 12>./file8
exec 13>./file9
exec 14>./file10

while true ; do for i in `seq 5 14`; do echo "`date`" >&$i ; done ; done

5. From the other fuse mount and nfs mount on each client, cat the contents of the files and perform lookups on the files in a loop:

while true ; do find . | xargs stat ; done
while true ; do for i in `seq 1 10`; do cat file$i; done ; done

6. Bring down brick1 and brick3 (one brick per sub-volume).

7. Bring back the bricks after some time (service glusterd restart).

8. file1 ended up in split-brain state.

Actual result:
=============
Writes are still successful on the split-brain files.

[root@rhsauto006 ~]# gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 331cd4da-d234-480d-9152-a926e72369e7
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.36.234:/rhs/brick1/b1
Brick2: 10.70.36.236:/rhs/brick1/b2
Brick3: 10.70.36.237:/rhs/brick1/b3
Brick4: 10.70.36.244:/rhs/brick1/b4
Options Reconfigured:
cluster.data-self-heal: off
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
[root@rhsauto006 ~]#

[root@rhsauto006 ~]# gluster v heal testvol info split-brain
Gathering list of split brain entries on volume testvol has been successful

Brick 10.70.36.234:/rhs/brick1/b1
Number of entries: 0

Brick 10.70.36.236:/rhs/brick1/b2
Number of entries: 0

Brick 10.70.36.237:/rhs/brick1/b3
Number of entries: 2
at                     path on brick
-----------------------------------
2014-11-18 01:37:55 <gfid:5157d3c5-54fe-4573-a8d5-9dc58e10d3c7>
2014-11-18 01:37:57 <gfid:5157d3c5-54fe-4573-a8d5-9dc58e10d3c7>

Brick 10.70.36.244:/rhs/brick1/b4
Number of entries: 2
at                     path on brick
-----------------------------------
2014-11-18 01:37:58 /file1
2014-11-18 01:40:29 /file1
[root@rhsauto006 ~]#

[root@rhsauto007 ~]# getfattr -d -e hex -m . /rhs/brick1/b3/file1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b3/file1
trusted.afr.testvol-client-2=0x000000070000000000000000
trusted.afr.testvol-client-3=0x000000090000000000000000
trusted.gfid=0x5157d3c554fe4573a8d59dc58e10d3c7

[root@rhsauto014 ~]# getfattr -d -e hex -m . /rhs/brick1/b4/file1
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b4/file1
trusted.afr.testvol-client-2=0x00002e8f0000000000000000
trusted.afr.testvol-client-3=0x000000000000000000000000
trusted.gfid=0x5157d3c554fe4573a8d59dc58e10d3c7
[root@rhsauto014 ~]#

From fuse mount:
=================
[root@rhsauto001 fuse1]# ls -l
total 4896
-rw-r--r--. 1 root root 500830 Nov 18 07:13 file1
-rw-r--r--. 1 root root 500801 Nov 18 07:13 file10
-rw-r--r--. 1 root root 500830 Nov 18 07:13 file2
-rw-r--r--. 1 root root 500830 Nov 18 07:13 file3
-rw-r--r--. 1 root root 500830 Nov 18 07:13 file4
-rw-r--r--. 1 root root 500801 Nov 18 07:13 file5
-rw-r--r--. 1 root root 500801 Nov 18 07:13 file6
-rw-r--r--. 1 root root 500801 Nov 18 07:13 file7
-rw-r--r--. 1 root root 500801 Nov 18 07:13 file8
-rw-r--r--. 1 root root 500801 Nov 18 07:13 file9
-rw-r--r--. 1 root root     29 Nov 17 14:46 testfile

[root@rhsauto001 fuse1]# ls -lh file1
-rw-r--r--. 1 root root 490K Nov 18 07:13 file1

[root@rhsauto001 fuse1]# echo "Hello" > file1
[root@rhsauto001 fuse1]#
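Reading the two getfattr dumps above: each trusted.afr value is three 32-bit big-endian counters (pending data, metadata, and entry operations). The copy on b3 shows pending data operations against client-3 (the b4 replica) while the copy on b4 shows pending data operations against client-2 (the b3 replica), so each replica accuses the other, which is the data split-brain signature. A minimal sketch to pull the counters apart, assuming that standard AFR changelog layout and the brick path from the output above:

# Split each trusted.afr value on brick b3 into data/metadata/entry counters.
for x in trusted.afr.testvol-client-2 trusted.afr.testvol-client-3; do
    v=$(getfattr -n "$x" -e hex /rhs/brick1/b3/file1 2>/dev/null | awk -F= '/=0x/ {print $2}')
    v=${v#0x}
    echo "$x: data=0x${v:0:8} metadata=0x${v:8:8} entry=0x${v:16:8}"
done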
Tested with build "glusterfs-libs-3.7.1-16": once files are in split-brain state, writes still go through to the files because of performance.write-behind; after disabling that performance translator, the writes fail with an I/O error.
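Since write-behind absorbs the error on the cached write path, the EIO only reaches the application once the translator is disabled. A minimal sketch of the check, assuming the testvol volume from the earlier retest (the mount path /mnt/fuse1 is only illustrative):

gluster volume set testvol performance.write-behind off
# With write-behind disabled, a write to the split-brain file should now fail
# at the mount point instead of appearing to succeed:
echo "Hello" > /mnt/fuse1/file1     # expected: Input/output error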
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you asked us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/ If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.