Bug 1482812 - [afr] split-brain observed on T files post hardlink and rename in x3 volume
Summary: [afr] split-brain observed on T files post hardlink and rename in x3 volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Ravishankar N
QA Contact: Vijay Avuthu
URL:
Whiteboard: rebase
Depends On:
Blocks: 1491670 1496317 1496321 1503134
 
Reported: 2017-08-18 06:53 UTC by Rahul Hinduja
Modified: 2018-09-20 04:50 UTC
CC: 3 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1491670
Environment:
Last Closed: 2018-09-04 06:35:11 UTC
Embargoed:


Attachments: None


Links:
Red Hat Product Errata RHSA-2018:2607 (last updated 2018-09-04 06:37:32 UTC)

Description Rahul Hinduja 2017-08-18 06:53:03 UTC
Description of problem:
=======================

I have a 4x3 volume where bricks were brought down in random order, always keeping 2 of the 3 bricks of each subvolume online. However, I see a lot of split-brains on the system. When I looked into one of the files, the split-brain seems to be on the hashed linkto file of a hardlink, with all bricks blaming each other, as shown below. Also, the files are accessible (ls, stat, cat) from the mount and no EIO is seen.

getfattr from hashed subvolume:
===============================

[root@dhcp42-79 ~]#
[root@dhcp42-79 ~]# getfattr -d -e hex -m . /rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bdbea
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp42-79 ~]#
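
For reference when reading these dumps: each trusted.afr.<volume>-client-<N> value is a 12-byte changelog, i.e. three big-endian 32-bit counters of pending data, metadata and entry operations that this brick holds against client N, and trusted.glusterfs.dht.linkto is the hex-encoded name of the cached subvolume. A minimal decoding sketch (the decode_afr helper is hypothetical, not part of this report):

# Pending (data, metadata, entry) counts from a trusted.afr.* value:
decode_afr() {
    local v=${1#0x}
    printf 'data=%d metadata=%d entry=%d\n' \
        $((16#${v:0:8})) $((16#${v:8:8})) $((16#${v:16:8}))
}
decode_afr 0x000000020000000300000000    # -> data=2 metadata=3 entry=0

# The linkto value is plain ASCII:
echo 6d61737465722d7265706c69636174652d3300 | xxd -r -p    # -> master-replicate-3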

[root@dhcp42-79 ~]# getfattr -d -e hex -m . /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x599593a600020657
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp42-79 ~]#


[root@dhcp43-210 ~]# getfattr -d -e hex -m . /rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000010000000200000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000befab
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp43-210 ~]#

[root@dhcp42-79 ~]# ls -l /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
---------T. 4 root root 0 Aug 17 12:16 /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
[root@dhcp42-79 ~]#



getfattr from cached subvolume:
===============================


Note: The files are accessible (ls, stat, cat) from the mount and no EIO is seen. If a file is in split-brain, shouldn't it be reported as EIO? Maybe it is because these split-brains are on the hashed linkto files of the hardlinks, and there is no split-brain on the actual cached files.
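
The split-brain entries can be listed without touching the files from the mount, via the standard heal-info CLI (volume name 'master' inferred from the trusted.afr.master-client-* keys above):

gluster volume heal master info split-brain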


[root@dhcp41-217 ~]# ls -l /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
-rw-r--r--. 6 root root 9537 Aug 17 12:14 /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
[root@dhcp41-217 ~]# getfattr -d -e hex -m . /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bb080

[root@dhcp41-217 ~]#


IO Pattern while the bricks were brought down:
==============================================

for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,chown,create,hardlink,hardlink,symlink}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i <mnt> ; sleep 10 ; done

Order of bringing the bricks down:
==================================

Subvolume 0: bricks {1,2,9}
Subvolume 1: bricks {3,4,10}
Subvolume 2: bricks {5,6,11}
Subvolume 3: bricks {7,8,12}

=> Brought down bricks 1, 11, 4, 12 => one from each subvolume, while IO is in progress
=> Brought the bricks back and waited for heal to complete
=> Brought down the second set of bricks: 5, 2, 10, 8 => one from each subvolume, while IO is in progress
=> Brought the bricks back and did NOT wait for heal to complete
=> Brought down the final set of bricks: 3, 6, 7, 9 => one from each subvolume, while IO is in progress
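
The bricks were taken offline by killing the brick processes and brought back afterwards; roughly (a sketch, with the PID as a placeholder):

gluster volume status master         # the Pid of each brick is the last column
kill -15 <pid-of-brick>              # take that brick offline
# ... IO continues ...
gluster volume start master force    # restart the killed brick(s)
gluster volume heal master info      # watch the pending-heal counts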


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-41.el7rhgs.x86_64



How reproducible:
=================

2/2

Comment 4 Ravishankar N 2017-08-18 15:07:39 UTC
I was able to hit this issue (T files ending up in split-brain) like so:

1. Create a 2 x 3 volume and disable all heals (a command sketch follows the brick list):
Brick1: 127.0.0.2:/home/ravi/bricks/brick1
Brick2: 127.0.0.2:/home/ravi/bricks/brick2
Brick3: 127.0.0.2:/home/ravi/bricks/brick3

Brick4: 127.0.0.2:/home/ravi/bricks/brick4
Brick5: 127.0.0.2:/home/ravi/bricks/brick5
Brick6: 127.0.0.2:/home/ravi/bricks/brick6
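
"Disable all heals" presumably means turning off both the self-heal daemon and the client-side heals; with the standard volume options that would be (a minimal sketch, assuming the volume is named testvol as in the xattrs below):

gluster volume set testvol cluster.self-heal-daemon off
gluster volume set testvol cluster.data-self-heal off
gluster volume set testvol cluster.metadata-self-heal off
gluster volume set testvol cluster.entry-self-heal off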

2. Create a file and 3 hardlinks to it from the fuse mount.
#tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── HLINK1
├── HLINK3
└── HLINK7

All of these files hashed to the first dht subvol, i.e. replicate-0.

3. Kill brick4, rename HLINK1 to an appropriate name so that it gets hashed to replicate-1 and a T file is created there.

4. Likewise rename HLINK3 and HLINK7 as well, killing brick5 and brick6 respectively each time, i.e. a different brick of the 2nd replica is down each time.

5. Now enable shd and let self-heals complete; steps 3-5 are sketched as commands below.
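
Steps 3-5 as commands, approximately (a sketch; the brick PIDs are placeholders, the new names are the ones shown in step 6, and each killed brick is assumed to be restarted with a forced volume start before the next one is killed):

kill -9 <pid-of-brick4>
mv /mnt/fuse_mnt/HLINK1 /mnt/fuse_mnt/NEW-HLINK1
gluster volume start testvol force              # bring brick4 back
kill -9 <pid-of-brick5>
mv /mnt/fuse_mnt/HLINK3 /mnt/fuse_mnt/NEW-HLINK3-NEW
gluster volume start testvol force              # bring brick5 back
kill -9 <pid-of-brick6>
mv /mnt/fuse_mnt/HLINK7 /mnt/fuse_mnt/NEW-HLINK7-NEW
gluster volume start testvol force              # bring brick6 back
gluster volume set testvol cluster.self-heal-daemon on
gluster volume heal testvol                     # kick off an index heal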

6. File names from the mount after rename:
[root@tuxpad ravi]# tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── NEW-HLINK1
├── NEW-HLINK3-NEW
└── NEW-HLINK7-NEW

7. The T files are now in split-brain:
[root@tuxpad ravi]# ll /home/ravi/bricks/brick{4..6}/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:59 /home/ravi/bricks/brick4/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick5/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick6/NEW-HLINK1
[root@tuxpad ravi]#
[root@tuxpad ravi]# getfattr -d -m . -e hex /home/ravi/bricks/brick{4..6}/NEW-HLINK1
getfattr: Removing leading '/' from absolute path names
# file: home/ravi/bricks/brick4/NEW-HLINK1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick5/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick6/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

Heal-info also shows the T files to be in split-brain.
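
Decoding these values as in the description (first 4 bytes data, next 4 metadata, last 4 entry): every one of them is data=1, metadata=2, entry=0 pending, i.e. each brick blames the other two, which is exactly the symmetric pattern that heal-info classifies as split-brain. A quick check (sketch):

gluster volume heal testvol info split-brain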

Comment 7 Ravishankar N 2017-09-14 12:12:19 UTC
Upstream patch: https://review.gluster.org/#/c/18283/1

Comment 8 Ravishankar N 2017-09-27 04:17:38 UTC
(In reply to Ravishankar N from comment #7)
> Upstream patch: https://review.gluster.org/#/c/18283/1

There is also a follow-up patch: https://review.gluster.org/#/c/18391/ (so 2 patches in total for this bug)

Comment 11 Vijay Avuthu 2018-05-06 11:33:41 UTC
Update:
========

Build Used : glusterfs-3.12.2-8.el7rhgs.x86_64

Scenario:

1) Create a 2 x 3 distributed-replicated volume and disable all heals.
2) Create a file and 3 hardlinks to it from the fuse mount. All of these files hash to the first dht subvol, i.e. replicate-0.
3) Kill brick4, rename HLINK1 to an appropriate name so that it gets hashed to replicate-1 and a T file is created there.
4) Likewise rename HLINK3 and HLINK7 as well, killing brick5 and brick6 respectively each time, i.e. a different brick of the 2nd replica is down each time.

e.g., after renaming:

[root@dhcp35-125 ~]# tree /mnt/23/
/mnt/23/
├── FILE
├── NEW-HLINK1
├── NEW-HLINK3-NEW
└── NEW-HLINK7-NEW

5) Now enable shd and let self-heals complete.
6) Heal should complete without split-brains.

All files were healed after enabling shd.

e.g., from the 2nd dht subvol node:

[root@dhcp35-163 ~]# ls -lrt /bricks/brick0/testvol_distributed-replicated_brick3/
total 12
---------T. 4 root root 0 May  6 07:02 NEW-HLINK7-NEW
---------T. 4 root root 0 May  6 07:02 NEW-HLINK3-NEW
---------T. 4 root root 0 May  6 07:02 NEW-HLINK1
[root@dhcp35-163 ~]#
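
A stricter check than the listing (a sketch; the volume name testvol_distributed-replicated is inferred from the brick path, and the getfattr would be run on each node for its local bricks) is that the afr pending counters on the healed T files are all zero again and heal-info reports nothing:

getfattr -d -m trusted.afr -e hex \
    /bricks/brick0/testvol_distributed-replicated_brick*/NEW-HLINK1
gluster volume heal testvol_distributed-replicated info split-brain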

Comment 13 errata-xmlrpc 2018-09-04 06:35:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

