Bug 1293265

Summary: md5sum of files mismatch after the self-heal is complete on the file
Product: [Community] GlusterFS
Component: replicate
Version: 3.7.6
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Reporter: Krutika Dhananjay <kdhananj>
Assignee: bugs <bugs>
CC: bugs, mzywusko, ravishankar, smohan, spandura
Fixed In Version: glusterfs-3.7.7
Doc Type: Bug Fix
Type: Bug
Clone Of: 1292379
Last Closed: 2016-04-19 07:51:26 UTC
Bug Depends On: 1141750, 1292379, 1293240
Bug Blocks: 1139599, 1154491, 1293239

Description Krutika Dhananjay 2015-12-21 09:25:22 UTC
+++ This bug was initially created as a clone of Bug #1292379 +++

+++ This bug was initially created as a clone of Bug #1141750 +++

Description of problem:
==========================
On a 2 x 2 distribute-replicate volume, a file was created and written to from multiple mounts. The same brick from the sub-volume holding the file was bounced (brought down and back up) multiple times. After the writes completed and self-heal finished successfully, the md5sum of the file was checked on both the source and the sink brick: the md5sums do not match.


How reproducible:
==================
Tried once

Steps to Reproduce:
=======================
1. Create a 2 x 2 distribute-replicate volume and start it (a consolidated sketch of the gluster CLI commands behind steps 1-13 follows step 18).

2. From 2 client machines, create 2 FUSE mounts and 2 NFS mounts.

3. From all the mount points, start IO on the file:
"dd if=/dev/urandom of=./testfile1 bs=1M count=20480"

4. While IO is in progress, bring down one brick from the sub-volume that contains the file.

5. Perform "replace-brick commit force" on the brick that was brought down (self-heal starts on the replaced brick).

6. Add 2 new bricks to the volume.

7. Start rebalance.

8. While rebalance is in progress, rename the file from one of the mount points:

"for i in `seq 1 100`; do mv /mnt/1/testfile${i} /mnt/1/testfile`expr $i + 1` ; done"

(renames completed without any failures)

9. Add 2 new bricks to the volume.

10. Start rebalance.

11. While rebalance is in progress, rename the file from one of the mount points:

"for i in `seq 101 200`; do mv /mnt/1/testfile${i} /mnt/1/testfile`expr $i + 1` ; done"

(renames completed without any failures)

12. Add 2 new bricks to the volume.

13. Start rebalance.

14. While rebalance is in progress, rename the file from one of the mount points:

"for i in `seq 201 400`; do mv /mnt/1/testfile${i} /mnt/1/testfile`expr $i + 1` ; done"

15. Bring down the replaced brick on which self-heal was in progress.

16. Rename the file from one of the mount points:

"for i in `seq 401 500`; do mv /mnt/1/testfile${i} /mnt/1/testfile`expr $i + 1` ; done"

17. Bring the brick back online.

18. Stop dd from all the mounts. 
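
For reference, a rough sketch of the gluster CLI commands behind steps 1-13. The volume name 'vol1' matches the self-heal log below; the host names and brick paths are assumptions, not taken from this report:

# step 1: create and start a 2 x 2 distribute-replicate volume
gluster volume create vol1 replica 2 host1:/bricks/b1 host2:/bricks/b2 host1:/bricks/b3 host2:/bricks/b4
gluster volume start vol1

# step 2: FUSE and NFS mounts on the clients
mount -t glusterfs host1:/vol1 /mnt/1
mount -t nfs -o vers=3 host1:/vol1 /mnt/2

# step 5: replace the brick that was brought down (self-heal starts on the new brick)
gluster volume replace-brick vol1 host2:/bricks/b4 host2:/bricks/b4_replaced commit force

# steps 6-7 (likewise 9-10 and 12-13): expand the volume and rebalance
gluster volume add-brick vol1 host1:/bricks/b5 host2:/bricks/b6
gluster volume rebalance vol1 start
gluster volume rebalance vol1 status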

Actual results:
=================
After some time, self-heal completed successfully.

[2014-09-15 11:11:43.364608] I [afr-self-heal-common.c:2869:afr_log_self_heal_completion_status] 0-vol1-replicate-1:  foreground data self heal  is successfully completed,  data self heal from vol1-client-3  to sinks  vol1-client-2, with 12232425472 bytes on vol1-client-2, 12934053888 bytes on vol1-client-3,  data - Pending matrix:  [ [ 522 522 ] [ 45970 419 ] ]  on <gfid:a0f9dddf-9927-423f-b334-8d7dcd8ecc22>


Check the md5sum of the file on both the bricks
===============================================

Source extended attributes when dd was stopped:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client12 [Sep-15-2014-11:09:08] >getfattr -d -e hex -m . /rhs/device1/b4/testfile501 
getfattr: Removing leading '/' from absolute path names
# file: rhs/device1/b4/testfile501
trusted.afr.vol1-client-2=0x0000b39b0000000000000000
trusted.afr.vol1-client-3=0x000001a30000000000000000
trusted.gfid=0xa0f9dddf9927423fb3348d7dcd8ecc22

Sink extended attributes when dd was stopped:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client11 [Sep-15-2014-11:09:08] >getfattr -d -e hex -m . /rhs/device1/b3_replaced/testfile501 
getfattr: Removing leading '/' from absolute path names
# file: rhs/device1/b3_replaced/testfile501
trusted.afr.vol1-client-2=0x000000000000000000000000
trusted.afr.vol1-client-3=0x000000000000000000000000
trusted.gfid=0xa0f9dddf9927423fb3348d7dcd8ecc22
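
Note: the trusted.afr.* values are AFR's pending changelog, twelve bytes holding three big-endian 32-bit counters for pending data, metadata and entry operations. The data counters in the source brick's xattrs above can be decoded with the shell's printf (hex values taken verbatim from the output above):

printf 'pending data ops on vol1-client-2: %d\n' 0x0000b39b    # 45979
printf 'pending data ops on vol1-client-3: %d\n' 0x000001a3    # 419

That is, the source still records tens of thousands of un-replayed writes against the sink (vol1-client-2), while the sink's own changelog is all-zero.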


Source md5sum after the dd is stopped:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client12 [Sep-15-2014-11:09:09] >
root@rhs-client12 [Sep-15-2014-11:09:10] >md5sum /rhs/device1/b4/testfile501
a116972019a667785adaf3d50b276117  /rhs/device1/b4/testfile501



Sink md5sum after the dd is stopped:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client11 [Sep-15-2014-11:09:10] >md5sum /rhs/device1/b3_replaced/testfile501
f2193f1062c6bc0db618d44c7096aa28  /rhs/device1/b3_replaced/testfile501

Source extended attributes and md5sum after self-heal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client12 [Sep-15-2014-11:17:43] >getfattr -d -e hex -m . /rhs/device1/b4/testfile501 
getfattr: Removing leading '/' from absolute path names
# file: rhs/device1/b4/testfile501
trusted.afr.vol1-client-2=0x000000000000000000000000
trusted.afr.vol1-client-3=0x000000000000000000000000
trusted.gfid=0xa0f9dddf9927423fb3348d7dcd8ecc22

root@rhs-client12 [Sep-15-2014-11:17:44] >md5sum /rhs/device1/b4/testfile501
a116972019a667785adaf3d50b276117  /rhs/device1/b4/testfile501
root@rhs-client12 [Sep-15-2014-11:18:57] >


Sink extended attributes and md5sum after self-heal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@rhs-client11 [Sep-15-2014-11:17:43] >getfattr -d -e hex -m . /rhs/device1/b3_replaced/testfile501 
getfattr: Removing leading '/' from absolute path names
# file: rhs/device1/b3_replaced/testfile501
trusted.afr.vol1-client-2=0x000000000000000000000000
trusted.afr.vol1-client-3=0x000000000000000000000000
trusted.gfid=0xa0f9dddf9927423fb3348d7dcd8ecc22

root@rhs-client11 [Sep-15-2014-11:17:44] >md5sum /rhs/device1/b3_replaced/testfile501
f2193f1062c6bc0db618d44c7096aa28  /rhs/device1/b3_replaced/testfile501
root@rhs-client11 [Sep-15-2014-11:19:02] >

Expected results:
===================
After self-heal, the md5sums should match.

--- Additional comment from Pranith Kumar K on 2014-09-18 00:33:06 EDT ---

Simpler test to re-create the bug (a shell sketch of these steps follows the list):
0) Create a 1x2 replicate volume, start it and mount it.
1) Open a file 'a' from the mount and keep writing to it.
2) Bring one of the bricks down.
3) Rename the file '<mnt>/a' to '<mnt>/b'.
4) Wait for at least one write to complete while the brick is still down.
5) Restart the brick.
6) Wait until self-heal completes, then stop the 'writing' from the mount point.
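
A minimal shell sketch of the above, assuming a 1x2 replicate volume named 'testvol' mounted at /mnt/testvol (volume name, mount point and the brick PID lookup are illustrative, not from this report):

# 1) keep writing to 'a' in the background
dd if=/dev/urandom of=/mnt/testvol/a bs=1M count=2048 &
# 2) bring one brick down by killing its glusterfsd (PID from 'gluster volume status testvol')
kill -9 <brick-pid>
# 3) rename the file while the brick is down
mv /mnt/testvol/a /mnt/testvol/b
# 4) give dd time to complete at least one write
sleep 10
# 5) restart the downed brick
gluster volume start testvol force
# 6) wait until 'gluster volume heal testvol info' shows no pending entries, then stop the writer
gluster volume heal testvol info
kill %1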

Root cause:
When a rename happens while the brick is down, then after the brick comes back up, entry self-heal is triggered on the parent directory in which the rename happened, in this case <mnt>. As part of this entry self-heal,
1) file 'a' is deleted and
2) file 'b' is re-created.

0) In parallel to this, the writing fd needs to be re-opened on the file from the mount point.

If the re-opening of the file in step 0) happens before step 1) of self-heal, this issue is observed. Writes from the mount keep going to the file that was deleted, whereas self-heal happens on the file created in step 2), so the checksums mismatch. Another manifestation of this issue is https://bugzilla.redhat.com/show_bug.cgi?id=1139599, where writes from the mount only grow the file on the 'always up' brick while the file on the other brick does not grow. This leads to split-brain because of the size mismatch combined with an all-zero pending changelog.
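
One simple way to observe this divergence while the writer is still running is to watch the file size on both bricks: on the always-up brick the file keeps growing, on the affected brick it stops. A sketch using the brick paths and file name from this report (assuming the brick hosts are reachable over ssh):

while true; do
    ssh rhs-client12 stat -c '%s bytes  %n' /rhs/device1/b4/testfile501
    ssh rhs-client11 stat -c '%s bytes  %n' /rhs/device1/b3_replaced/testfile501
    sleep 5
done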

--- Additional comment from Vijay Bellur on 2015-12-18 06:49:20 EST ---

REVIEW: http://review.gluster.org/13001 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Vijay Bellur on 2015-12-21 03:56:42 EST ---

REVIEW: http://review.gluster.org/13001 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 1 Vijay Bellur 2015-12-21 09:27:13 UTC
REVIEW: http://review.gluster.org/13035 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#1) for review on release-3.7 by Krutika Dhananjay (kdhananj)

Comment 2 Vijay Bellur 2015-12-21 10:01:24 UTC
REVIEW: http://review.gluster.org/13036 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#1) for review on release-3.7 by Krutika Dhananjay (kdhananj)

Comment 3 Vijay Bellur 2016-01-25 11:37:44 UTC
REVIEW: http://review.gluster.org/13036 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#2) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

Comment 4 Vijay Bellur 2016-01-27 02:46:17 UTC
REVIEW: http://review.gluster.org/13036 (cluster/afr: Fix data loss due to race between sh and ongoing write) posted (#3) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

Comment 5 Vijay Bellur 2016-01-27 05:05:47 UTC
REVIEW: http://review.gluster.org/13036 (cluster/afr: Fix data loss due to race between sh and ongoing write.) posted (#4) for review on release-3.7 by Krutika Dhananjay (kdhananj)

Comment 6 Vijay Bellur 2016-01-28 04:45:53 UTC
REVIEW: http://review.gluster.org/13036 (cluster/afr: Fix data loss due to race between sh and ongoing write.) posted (#5) for review on release-3.7 by Krutika Dhananjay (kdhananj)

Comment 7 Vijay Bellur 2016-01-28 14:19:01 UTC
REVIEW: http://review.gluster.org/13036 (cluster/afr: Fix data loss due to race between sh and ongoing write.) posted (#6) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

Comment 8 Vijay Bellur 2016-01-29 01:43:41 UTC
COMMIT: http://review.gluster.org/13036 committed in release-3.7 by Pranith Kumar Karampuri (pkarampu) 
------
commit b43aa481712dab5df813050119ba6c08f50dbfd9
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Dec 17 17:41:08 2015 +0530

    cluster/afr: Fix data loss due to race between sh and ongoing write.
    
            Backport of: http://review.gluster.org/#/c/13001/
    
    Problem:
    
    When IO is happening on a file and a brick goes down and comes back up
    during this time, the protocol/client translator attempts reopening of
    the fd on the gfid handle of the file. But if another client renames
    this file while the brick was down and writes were in progress on it,
    then once this brick is back up, there can be a race between reopening
    of the fd and entry self-heal replaying the effect of the rename() on
    the sink brick. If the reopening of the fd happens first, the
    application's writes continue to go into the data blocks associated
    with the gfid. Entry self-heal then deletes 'src' and creates the 'dst'
    file on the sink, marking dst as a 'newentry'. Data self-heal is also
    completed on 'dst' as a result and self-heal terminates. If at this
    point the application is still writing to this fd, all writes on the
    file after self-heal go into the data blocks associated with this fd,
    which are lost once the fd is closed. The result: the 'dst' file on the
    source and sink are not the same and there is no pending heal on the
    file, leading to silent corruption on the sink.
    
    Fix:
    
    Leverage http://review.gluster.org/#/c/12816/ to ensure the gfid handle
    path gets saved in .glusterfs/unlink until the fd is closed on the file.
    During this time, when self-heal sends mknod() with gfid of the file,
    do the following:
    link() the gfid handle under .glusterfs/unlink to the new path to be
    created in mknod() and
    rename() the gfid handle to go back under .glusterfs/ab/cd/.
    
    
    Change-Id: I5dc49c127ef0a1bf3cf4ce1b24610b1527f84d6f
    BUG: 1293265
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/13036
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
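
As a rough shell analogy of the fix (not the actual C implementation): when self-heal's mknod() arrives for a gfid whose fd is still open, the handle preserved under .glusterfs/unlink is hard-linked to the new name and then renamed back to its usual gfid-handle location, so the new name and the still-open fd share one inode. The brick path and the exact layout under .glusterfs/unlink are assumptions; the gfid is the one from this report:

BRICK=/rhs/device1/b3_replaced
GFID=a0f9dddf-9927-423f-b334-8d7dcd8ecc22
# link() the preserved handle to the path mknod() is asked to create ...
ln $BRICK/.glusterfs/unlink/$GFID $BRICK/testfile501
# ... then rename() the handle back under .glusterfs/ab/cd/
mv $BRICK/.glusterfs/unlink/$GFID $BRICK/.glusterfs/a0/f9/$GFID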

Comment 9 Kaushal 2016-04-19 07:51:26 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.7, please open a new bug report.

glusterfs-3.7.7 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-February/025292.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user