Bug 1491670

Summary: [afr] split-brain observed on T files post hardlink and rename in x3 volume
Product: [Community] GlusterFS
Reporter: Ravishankar N <ravishankar>
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: urgent
Docs Contact:
Priority: medium
Version: mainline
CC: bugs, nchilaka, rhinduja, rhs-bugs, sheggodu, storage-qa-internal
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.13.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1482812
Clones: 1496317 1496321 (view as bug list)
Environment:
Last Closed: 2017-12-08 17:40:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1482812
Bug Blocks: 1496317, 1496321

Description Ravishankar N 2017-09-14 11:46:34 UTC
+++ This bug was initially created as a clone of Bug #1482812 +++

Description of problem:
=======================

I have a 4x3 volume where bricks were brought down in random order, ensuring that 2 bricks of each replica stayed online at all times. However, I see a lot of split-brains on the system. When I looked into one of the files, the split-brain seems to be on the hashed linkto file of a hardlink, as follows (all bricks blaming each other). Also, the files are accessible (ls, stat, cat) from the mount and no EIO is seen.

getfattr from hashed subvolume:
===============================

[root@dhcp42-79 ~]#
[root@dhcp42-79 ~]# getfattr -d -e hex -m . /rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bdbea
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp42-79 ~]#

[root@dhcp42-79 ~]# getfattr -d -e hex -m . /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x599593a600020657
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp42-79 ~]#


[root@dhcp43-210 ~]# getfattr -d -e hex -m . /rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000010000000200000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000befab
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp43-210 ~]#

[root@dhcp42-79 ~]# ls -l /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
---------T. 4 root root 0 Aug 17 12:16 /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
[root@dhcp42-79 ~]#
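For reference, each trusted.afr.<volume>-client-N value above packs three big-endian 32-bit counters: pending data, metadata and entry operations that the brick holds against client N. A minimal bash sketch to decode one (decode_afr is a hypothetical helper, not a Gluster tool):

decode_afr() {
    # Strip the 0x prefix; the 24 hex digits are 3 x 8-digit counters.
    local v=${1#0x}
    printf 'data=%d metadata=%d entry=%d\n' \
        $((16#${v:0:8})) $((16#${v:8:8})) $((16#${v:16:8}))
}
decode_afr 0x000000020000000300000000   # -> data=2 metadata=3 entry=0

So on brick b1 above, client-1 is blamed for 2 pending data and 3 pending metadata operations; the other bricks carry similar non-zero counters against each other, which is what "all bricks blaming each other" means.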



getfattr from cached subvolume:
===============================


Note: The files are accessible (ls, stat, cat) from the mount and no EIO is seen. If a file is in split-brain, shouldn't reads return EIO? Maybe not here, because the split-brains are on the hashed linkto files of the hardlinks and there is no split-brain on the actual cached files of the hardlinks.


[root@dhcp41-217 ~]# ls -l /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
-rw-r--r--. 6 root root 9537 Aug 17 12:14 /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
[root@dhcp41-217 ~]# getfattr -d -e hex -m . /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bb080

[root@dhcp41-217 ~]#


IO Pattern while the bricks were brought down:
==============================================

for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,chown,create,hardlink,hardlink,symlink}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i <mnt> ; sleep 10 ; done

Order of bringing the bricks down:
==================================

Subvolume 0: bricks {1,2,9}
Subvolume 1: bricks {3,4,10}
Subvolume 2: bricks {5,6,11}
Subvolume 3: bricks {7,8,12}

=> Bricks brought down: 1, 11, 4, 12 => one from each subvolume, while IO is in progress
=> Brought the bricks back and waited for heal to complete
=> Brought the next set of bricks down: 5, 2, 10, 8 => one from each subvolume, while IO is in progress
=> Brought the bricks back and did not wait for heal to complete
=> Brought the final set of bricks down: 3, 6, 7, 9 => one from each subvolume, while IO is in progress
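A sketch of one down/up cycle under the names from this report (the pkill pattern is an assumption about how the bricks were killed; the gluster commands are standard):

pkill -f 'glusterfsd.*brick1/b1'     # take brick 1 offline while IO runs
# ... IO continues against the remaining two replicas ...
gluster volume start master force    # restart the killed brick process
gluster volume heal master info      # wait until no entries remain (or don't, per the steps above)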


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-41.el7rhgs.x86_64



How reproducible:
=================

2/2

--- Additional comment from Ravishankar N on 2017-08-18 04:29:32 EDT ---

Volume Name: master
Type: Distributed-Replicate
Volume ID: c9a04941-4045-4bc1-bb26-131f5634a792
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick1/b2
Brick3: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick3/b9
Brick4: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick1/b3
Brick5: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick1/b4
Brick6: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick3/b10
Brick7: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick2/b5
Brick8: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick2/b6
Brick9: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick3/b11
Brick10: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick2/b7
Brick11: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick2/b8
Brick12: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick3/b12
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
nfs.disable: off
cluster.enable-shared-storage: enable

--- Additional comment from Ravishankar N on 2017-08-18 11:07:39 EDT ---

I was able to hit this issue (T files ending up in split-brain) as follows:

1. Create a 2 x 3 volume and disable all heals:
Brick1: 127.0.0.2:/home/ravi/bricks/brick1
Brick2: 127.0.0.2:/home/ravi/bricks/brick2
Brick3: 127.0.0.2:/home/ravi/bricks/brick3

Brick4: 127.0.0.2:/home/ravi/bricks/brick4
Brick5: 127.0.0.2:/home/ravi/bricks/brick5
Brick6: 127.0.0.2:/home/ravi/bricks/brick6
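"Disable all heals" presumably means both the self-heal daemon and the client-side heals; a sketch, using the volume name testvol taken from the xattrs below:

gluster volume set testvol cluster.self-heal-daemon off
gluster volume set testvol cluster.data-self-heal off
gluster volume set testvol cluster.metadata-self-heal off
gluster volume set testvol cluster.entry-self-heal off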

2. Create a file and 3 hardlinks to it from fuse mount.
# tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── HLINK1
├── HLINK3
└── HLINK7

All of these files hashed to the first dht subvol, i.e. replicate-0.
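A sketch of step 2 (the file names are from the tree output; the creation commands themselves are assumed):

cd /mnt/fuse_mnt
touch FILE          # hashes to replicate-0
ln FILE HLINK1      # as do all three hardlinks
ln FILE HLINK3
ln FILE HLINK7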

3. Kill brick4, rename HLINK1 to an appropriate name so that it gets hashed to replicate-1 and a T file is created there.

4. Likewise rename HLINK3 and HLINK7 as well, killing brick5 and brick6 respectively each time, i.e. a different brick of the 2nd replica is down for each rename.

5. Now enable the self-heal daemon (shd) and let self-heals complete.
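Steps 3-5 as a shell sketch (the pkill pattern and the rename target are illustrative; the new name must hash to replicate-1):

pkill -f 'glusterfsd.*bricks/brick4'               # step 3: brick4 down
mv /mnt/fuse_mnt/HLINK1 /mnt/fuse_mnt/NEW-HLINK1   # T file created on bricks 5 and 6 only
gluster volume start testvol force                 # bring brick4 back
# step 4: repeat with brick5/HLINK3 and brick6/HLINK7, then:
gluster volume set testvol cluster.self-heal-daemon on   # step 5
gluster volume heal testvol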

6. File names from the mount after the renames:
[root@tuxpad ravi]# tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── NEW-HLINK1
├── NEW-HLINK3-NEW
└── NEW-HLINK7-NEW

7. The T files are now in split-brain:
[root@tuxpad ravi]# ll /home/ravi/bricks/brick{4..6}/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:59 /home/ravi/bricks/brick4/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick5/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick6/NEW-HLINK1
[root@tuxpad ravi]#
[root@tuxpad ravi]# getfattr -d -m . -e hex /home/ravi/bricks/brick{4..6}/NEW-HLINK1
getfattr: Removing leading '/' from absolute path names
# file: home/ravi/bricks/brick4/NEW-HLINK1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick5/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick6/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

Heal-info also shows the T files to be in split-brain.
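That check would presumably be:

gluster volume heal testvol info split-brain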

Comment 1 Worker Ant 2017-09-14 12:11:13 UTC
REVIEW: https://review.gluster.org/18283 (afr: auto-resolve split-brains for zero-byte files) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 2 Worker Ant 2017-09-16 04:06:04 UTC
REVIEW: https://review.gluster.org/18283 (afr: auto-resolve split-brains for zero-byte files) posted (#2) for review on master by Ravishankar N (ravishankar)

Comment 3 Worker Ant 2017-09-26 04:04:23 UTC
COMMIT: https://review.gluster.org/18283 committed in master by Ravishankar N (ravishankar) 
------
commit 1719cffa911c5287715abfdb991bc8862f0c994e
Author: Ravishankar N <ravishankar>
Date:   Thu Sep 14 11:29:15 2017 +0530

    afr: auto-resolve split-brains for zero-byte files
    
    Problems:
    As described in BZ 1491670, renaming hardlinks can result in data/mdata
    split-brain of the DHT link-to files (T files) without any mismatch of
    data and metadata.
    
    As described in BZ 1486063, for a zero-byte file with only dirty bits
    set, arbiter brick will likely be chosen as the source brick.
    
    Fix:
    For zero byte files in split-brain, pick first brick as
    a) data source if file size is zero on all bricks.
    b) metadata source if metadata is the same on all bricks
    
    In arbiter case, if file size is zero on all bricks and there are no
    pending afr xattrs, pick 1st brick as data source.
    
    Change-Id: I0270a9a2f97c3b21087e280bb890159b43975e04
    BUG: 1491670
    Signed-off-by: Ravishankar N <ravishankar>
    Reported-by: Rahul Hinduja <rhinduja>
    Reported-by: Mabi <mabi>
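Before this fix, such T files had to be resolved by hand. Since the data (zero bytes) and metadata are identical on every brick, clearing the pending afr xattrs on the bricks is the manual equivalent of what the patch automates; a sketch against the repro paths above (pick the client indices each brick actually blames):

setfattr -n trusted.afr.testvol-client-4 -v 0x000000000000000000000000 /home/ravi/bricks/brick4/NEW-HLINK1
setfattr -n trusted.afr.testvol-client-5 -v 0x000000000000000000000000 /home/ravi/bricks/brick4/NEW-HLINK1
# ...likewise client-3/5 on brick5 and client-3/4 on brick6, then run 'gluster volume heal testvol'.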

Comment 4 Ravishankar N 2017-09-26 08:36:53 UTC
Sending an addendum to the patch in comment #3.

Comment 5 Worker Ant 2017-09-26 08:37:24 UTC
REVIEW: https://review.gluster.org/18391 (afr: don't check for file size in afr_mark_source_sinks_if_file_empty) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 6 Worker Ant 2017-09-26 08:41:26 UTC
REVIEW: https://review.gluster.org/18391 (afr: don't check for file size in afr_mark_source_sinks_if_file_empty) posted (#2) for review on master by Ravishankar N (ravishankar)

Comment 7 Worker Ant 2017-09-27 03:03:44 UTC
COMMIT: https://review.gluster.org/18391 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 24637d54dcbc06de8a7de17c75b9291fcfcfbc84
Author: Ravishankar N <ravishankar>
Date:   Tue Sep 26 14:03:52 2017 +0530

    afr: don't check for file size in afr_mark_source_sinks_if_file_empty
    
    ... for AFR_METADATA_TRANSACTION and just mark source and sinks if
    metadata is the same.
    
    Change-Id: I69e55d3c842c7636e3538d1b57bc4deca67bed05
    BUG: 1491670
    Signed-off-by: Ravishankar N <ravishankar>

Comment 8 Shyamsundar 2017-12-08 17:40:53 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/