+++ This bug was initially created as a clone of Bug #1141750 +++

Description of problem:
==========================
In a 2 x 2 distribute-replicate volume, created a file and started writing to it from multiple mounts. Bounced the same brick multiple times in the sub-volume that holds the file. After the writes completed and self-heal finished successfully, checked the md5sum of the file on both source and sink: the md5sums mismatch.

How reproducible:
==================
Tried once

Steps to Reproduce:
=======================
1. Create a 2 x 2 distribute-replicate volume. Start the volume.
2. From 2 client machines create 2 fuse and 2 nfs mounts.
3. From all the mount points start IO on the file:
   "dd if=/dev/urandom of=./testfile1 bs=1M count=20480"
4. While IO is in progress, bring down one brick from the sub-volume which contains the file.
5. Perform "replace-brick commit force" on the brought-down brick. (self-heal starts on the replaced brick)
6. Add 2 new bricks to the volume.
7. Start rebalance.
8. While rebalance is in progress, rename the file from one of the mount points (rename successful without any failures):
   for i in `seq 1 100`; do mv /mnt/1/testfile${i} /mnt/1/testfile`expr $i + 1` ; done
9. Add 2 new bricks to the volume.
10. Start rebalance.
11. While rebalance is in progress, rename the file from one of the mount points (rename successful without any failures):
    for i in `seq 101 200`; do mv /mnt/1/testfile${i} /mnt/1/testfile`expr $i + 1` ; done
12. Add 2 new bricks to the volume.
13. Start rebalance.
14. While rebalance is in progress, rename the file from one of the mount points:
    for i in `seq 201 400`; do mv /mnt/1/testfile${i} /mnt/1/testfile`expr $i + 1` ; done
15. Bring down the replaced brick where self-heal was in progress.
16. Rename the file from one of the mount points:
    for i in `seq 401 500`; do mv /mnt/1/testfile${i} /mnt/1/testfile`expr $i + 1` ; done
17. Bring the brick back online.
18. Stop dd from all the mounts.

Actual results:
=================
After some time, self-heal completed successfully.

[2014-09-15 11:11:43.364608] I [afr-self-heal-common.c:2869:afr_log_self_heal_completion_status] 0-vol1-replicate-1: foreground data self heal is successfully completed, data self heal from vol1-client-3 to sinks vol1-client-2, with 12232425472 bytes on vol1-client-2, 12934053888 bytes on vol1-client-3, data - Pending matrix: [ [ 522 522 ] [ 45970 419 ] ] on <gfid:a0f9dddf-9927-423f-b334-8d7dcd8ecc22>

Check the md5sum of the file on both the bricks
===============================================

Source extended attributes when dd was stopped:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client12 [Sep-15-2014-11:09:08] >getfattr -d -e hex -m . /rhs/device1/b4/testfile501
getfattr: Removing leading '/' from absolute path names
# file: rhs/device1/b4/testfile501
trusted.afr.vol1-client-2=0x0000b39b0000000000000000
trusted.afr.vol1-client-3=0x000001a30000000000000000
trusted.gfid=0xa0f9dddf9927423fb3348d7dcd8ecc22

Sink extended attributes when dd was stopped:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client11 [Sep-15-2014-11:09:08] >getfattr -d -e hex -m . /rhs/device1/b3_replaced/testfile501
getfattr: Removing leading '/' from absolute path names
# file: rhs/device1/b3_replaced/testfile501
trusted.afr.vol1-client-2=0x000000000000000000000000
trusted.afr.vol1-client-3=0x000000000000000000000000
trusted.gfid=0xa0f9dddf9927423fb3348d7dcd8ecc22

Source md5sum after dd was stopped:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client12 [Sep-15-2014-11:09:10] >md5sum /rhs/device1/b4/testfile501
a116972019a667785adaf3d50b276117 /rhs/device1/b4/testfile501

Sink md5sum after dd was stopped:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client11 [Sep-15-2014-11:09:10] >md5sum /rhs/device1/b3_replaced/testfile501
f2193f1062c6bc0db618d44c7096aa28 /rhs/device1/b3_replaced/testfile501

Source extended attributes and md5sum after self-heal:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client12 [Sep-15-2014-11:17:43] >getfattr -d -e hex -m . /rhs/device1/b4/testfile501
getfattr: Removing leading '/' from absolute path names
# file: rhs/device1/b4/testfile501
trusted.afr.vol1-client-2=0x000000000000000000000000
trusted.afr.vol1-client-3=0x000000000000000000000000
trusted.gfid=0xa0f9dddf9927423fb3348d7dcd8ecc22

root@rhs-client12 [Sep-15-2014-11:17:44] >md5sum /rhs/device1/b4/testfile501
a116972019a667785adaf3d50b276117 /rhs/device1/b4/testfile501

Sink extended attributes and md5sum after self-heal:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client11 [Sep-15-2014-11:17:43] >getfattr -d -e hex -m . /rhs/device1/b3_replaced/testfile501
getfattr: Removing leading '/' from absolute path names
# file: rhs/device1/b3_replaced/testfile501
trusted.afr.vol1-client-2=0x000000000000000000000000
trusted.afr.vol1-client-3=0x000000000000000000000000
trusted.gfid=0xa0f9dddf9927423fb3348d7dcd8ecc22

root@rhs-client11 [Sep-15-2014-11:17:44] >md5sum /rhs/device1/b3_replaced/testfile501
f2193f1062c6bc0db618d44c7096aa28 /rhs/device1/b3_replaced/testfile501

Expected results:
===================
After self-heal the md5sums should match.

--- Additional comment from Pranith Kumar K on 2014-09-18 00:33:06 EDT ---

Simpler test to re-create the bug:
0) Create a 1x2 replicate volume, start it and mount it.
1) Open a file 'a' from the mount and keep writing to it.
2) Bring one of the bricks down.
3) Rename the file '<mnt>/a' to '<mnt>/b'.
4) Wait for at least one write to complete while the brick is still down.
5) Restart the brick.
6) Wait until self-heal completes, then stop the writing from the mount point.

Root cause:
When the rename happens while the brick is down, then after the brick comes back up entry self-heal is triggered on the parent directory of the rename, in this case <mnt>. As part of this entry self-heal 1) file 'a' is deleted and 2) file 'b' is re-created. 0) In parallel to this, the writing fd needs to be re-opened on the file from the mount point. If the re-opening of the file in step 0) happens before step 1) of self-heal, this issue is observed: writes from the mount keep going to the file that was deleted, whereas self-heal happens on the file created in step 2), so the checksums mismatch.

One more manifestation of this issue is https://bugzilla.redhat.com/show_bug.cgi?id=1139599, where writes from the mount only grow the file on the 'always up' brick while the file on the other brick does not grow. This leads to split-brain because of the size mismatch despite an all-zero pending changelog.
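For convenience, the simpler reproduction above can be scripted roughly as follows. Volume name, server names, brick paths and the mount point are illustrative placeholders, and killing the brick process via pgrep is just one way to take a brick down:

# 0) Create a 1x2 replicate volume, start it and FUSE-mount it
gluster volume create testvol replica 2 server1:/bricks/b1 server2:/bricks/b2
gluster volume start testvol
mount -t glusterfs server1:/testvol /mnt/testvol

# 1) Keep writing to file 'a' through a single open fd
dd if=/dev/urandom of=/mnt/testvol/a bs=1M count=20480 &

# 2) Bring one of the bricks down by killing its brick process
kill -9 $(pgrep -f 'glusterfsd.*/bricks/b1')

# 3) Rename the file while the brick is down ...
mv /mnt/testvol/a /mnt/testvol/b
# 4) ... and let at least one write complete in the meantime
sleep 10

# 5) Restart the brick and let self-heal run
gluster volume start testvol force
gluster volume heal testvol

# 6) Once heal completes, stop the writer and compare the copies on the bricks
wait
md5sum /bricks/b1/b /bricks/b2/b    # a mismatch reproduces the bug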
REVIEW: http://review.gluster.org/13001 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/13001 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/13001 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#3) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/13001 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#4) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/13001 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#5) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/13001 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit 683c880a02086effc5009a8420289b445ea423f0
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Dec 17 17:41:08 2015 +0530

    cluster/afr: Fix data loss due to race between sh and ongoing write

    Problem:
    When IO is happening on a file and a brick goes down and comes back up
    during this time, the protocol/client translator attempts reopening of
    the fd on the gfid handle of the file. But if another client renames
    this file while the brick was down && writes were in progress on it,
    then once this brick is back up there can be a race between the
    reopening of the fd and entry self-heal replaying the effect of the
    rename() on the sink brick. If the reopening of the fd happens first,
    the application's writes continue to go into the data blocks associated
    with the gfid. Now entry self-heal deletes 'src' and creates the 'dst'
    file on the sink, marking dst as a 'newentry'. Data self-heal is also
    completed on 'dst' as a result and self-heal terminates. If at this
    point the application is still writing to this fd, all writes on the
    file after self-heal go into the data blocks associated with this fd,
    which are lost once the fd is closed. The result - the 'dst' file on
    the source and the sink are not the same and there is no pending heal
    on the file, leading to silent corruption on the sink.

    Fix:
    Leverage http://review.gluster.org/#/c/12816/ to ensure the gfid handle
    path gets saved in .glusterfs/unlink until the fd is closed on the
    file. During this time, when self-heal sends mknod() with the gfid of
    the file, do the following: link() the gfid handle under
    .glusterfs/unlink to the new path to be created in mknod(), and
    rename() the gfid handle to go back under .glusterfs/ab/cd/.

    Change-Id: I86ef1f97a76ffe11f32653bb995f575f7648f798
    BUG: 1292379
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/13001
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
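To see why the fix works: as long as an fd is still open on the unlinked file, its data blocks stay reachable through the handle preserved under .glusterfs/unlink, so the mknod() issued by entry self-heal can be satisfied by hard-linking that handle instead of allocating a new inode. A rough shell analogue of the idea on a brick backend (this is not the actual posix translator code; the GFID is taken from the log above, and the exact naming under .glusterfs/unlink is an assumption):

BRICK=/bricks/b2
GFID=a0f9dddf-9927-423f-b334-8d7dcd8ecc22
UNLINK_HANDLE=$BRICK/.glusterfs/unlink/$GFID     # handle kept while an fd is still open
GFID_HANDLE=$BRICK/.glusterfs/a0/f9/$GFID        # usual gfid-handle location

# Entry self-heal wants to mknod() the renamed name 'b' with the same gfid.
# Instead of creating a brand-new inode (which would orphan the writer's fd),
# link the preserved handle to the new path ...
ln "$UNLINK_HANDLE" "$BRICK/b"

# ... and move the gfid handle back under its usual .glusterfs/xx/yy directory.
mv "$UNLINK_HANDLE" "$GFID_HANDLE"

# The open fd, the gfid handle and the healed path now share one inode, so
# in-flight writes and healed data land in the same blocks.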
Tested with build glusterfs-server-3.7.5-13, and the following test passed, so marking this bug as verified:
1. Create a 1x2 volume, mount it on a client and continuously write to a file using dd (with self-heal daemon, data self-heal and entry self-heal disabled).
2. Bring down one of the bricks and rename the file from another mount or session.
3. Bring back the brick, enable self-heal, launch self-heal, and after the heal completes wait for some time and then kill the dd.
4. Now the md5sum matches on both bricks.
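A condensed sketch of that verification flow, reusing the illustrative testvol/brick names from the reproduction sketch above (the heal-related volume options are real GlusterFS options; everything else is a placeholder):

# Disable healing before starting the write
gluster volume set testvol cluster.self-heal-daemon off
gluster volume set testvol cluster.data-self-heal off
gluster volume set testvol cluster.entry-self-heal off

# ... start dd, kill one brick, rename the file (as in the reproduction sketch) ...

# Bring the brick back, re-enable healing and launch the heal
gluster volume start testvol force
gluster volume set testvol cluster.self-heal-daemon on
gluster volume set testvol cluster.data-self-heal on
gluster volume set testvol cluster.entry-self-heal on
gluster volume heal testvol
gluster volume heal testvol info    # wait until no entries remain

# Stop dd, then the md5sums on the two bricks should match
md5sum /bricks/b1/b /bricks/b2/b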
Sorry, wrongly updated. Ignore my previous comments.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user