Description of problem:
=======================
I have a 4x3 volume where bricks were brought down in random order, ensuring that 2 bricks stayed online at all times. However, I see a lot of split-brains on the system. When I looked into one of the files, the split-brain seems to be coming from the hashed linkto file of a hardlink, as follows (all bricks blaming each other). Also, the files are accessible (ls, stat, cat) from the mount and no EIO is seen.

getfattr from subvolume:
========================
[root@dhcp42-79 ~]# getfattr -d -e hex -m . /rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bdbea
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp42-79 ~]# getfattr -d -e hex -m . /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x599593a600020657
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp43-210 ~]# getfattr -d -e hex -m . /rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000010000000200000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000befab
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp42-79 ~]# ls -l /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
---------T. 4 root root 0 Aug 17 12:16 /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN

getfattr from cached subvolume:
===============================
Note: The files are accessible (ls, stat, cat) from the mount and no EIO is seen. If the file is in split-brain, shouldn't it be shown as EIO?
Maybe it is because these split-brains are on the hashed linkto files of the hardlinks, and there is no split-brain on the actual cached file of the hardlinks.

[root@dhcp41-217 ~]# ls -l /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
-rw-r--r--. 6 root root 9537 Aug 17 12:14 /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
[root@dhcp41-217 ~]# getfattr -d -e hex -m . /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bb080
[root@dhcp41-217 ~]#

IO Pattern while the bricks were brought down:
==============================================
for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,chown,create,hardlink,hardlink,symlink}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i <mnt> ; sleep 10 ; done

Order of bringing the bricks down:
==================================
Subvolume 0: bricks {1,2,9}
Subvolume 1: bricks {3,4,10}
Subvolume 2: bricks {5,6,11}
Subvolume 3: bricks {7,8,12}

=> Bricks 1, 11, 4, 12 (one from each subvolume) were brought down while IO was in progress
=> Brought the bricks back and waited for heal to complete
=> Brought the second set of bricks down: 5, 2, 10, 8 (one from each subvolume) while IO was in progress
=> Brought the bricks back and did not wait for heal to complete
=> Brought the final set of bricks down: 3, 6, 7, 9 (one from each subvolume) while IO was in progress

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.8.4-41.el7rhgs.x86_64

How reproducible:
=================
2/2
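As an aside on reading the output above: each trusted.afr.<client> value packs three big-endian 32-bit counters, recording pending data, metadata, and entry operations that this brick says are owed to that client's brick. A minimal bash sketch to decode them (decode_afr is a hypothetical helper, not part of any gluster tooling; it assumes the standard AFR changelog layout):

```shell
# Hypothetical helper: split a trusted.afr.* hex value (without the 0x
# prefix) into its three big-endian 32-bit pending-operation counters.
decode_afr() {
    local hex=$1
    printf 'data=%d metadata=%d entry=%d\n' \
        $((16#${hex:0:8})) $((16#${hex:8:8})) $((16#${hex:16:8}))
}

# Value of trusted.afr.master-client-1 from the first brick above:
decode_afr 000000020000000300000000   # -> data=2 metadata=3 entry=0
```

Non-zero data/metadata counts on every brick, each pointing at the other bricks, is exactly the "all bricks blaming each other" pattern described in the problem statement.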
I was able to hit this issue (T files ending up in split-brain) like so:

1. Create a 2 x 3 volume and disable all heals:
Brick1: 127.0.0.2:/home/ravi/bricks/brick1
Brick2: 127.0.0.2:/home/ravi/bricks/brick2
Brick3: 127.0.0.2:/home/ravi/bricks/brick3
Brick4: 127.0.0.2:/home/ravi/bricks/brick4
Brick5: 127.0.0.2:/home/ravi/bricks/brick5
Brick6: 127.0.0.2:/home/ravi/bricks/brick6

2. Create a file and 3 hardlinks to it from the fuse mount.
# tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── HLINK1
├── HLINK3
└── HLINK7
All of these files hashed to the first dht subvol, i.e. replicate-0.

3. Kill brick4, then rename HLINK1 to an appropriate name so that it gets hashed to replicate-1 and a T file is created there.

4. Likewise rename HLINK3 and HLINK7 as well, killing brick5 and brick6 respectively each time, i.e. a different brick of the 2nd replica is down each time.

5. Now enable shd and let self-heals complete.

6. File names from the mount after the renames:
[root@tuxpad ravi]# tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── NEW-HLINK1
├── NEW-HLINK3-NEW
└── NEW-HLINK7-NEW

7. The T files are now in split-brain:
[root@tuxpad ravi]# ll /home/ravi/bricks/brick{4..6}/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:59 /home/ravi/bricks/brick4/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick5/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick6/NEW-HLINK1

[root@tuxpad ravi]# getfattr -d -m . -e hex /home/ravi/bricks/brick{4..6}/NEW-HLINK1
getfattr: Removing leading '/' from absolute path names
# file: home/ravi/bricks/brick4/NEW-HLINK1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick5/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick6/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

Heal-info also shows the T files to be in split-brain.
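The trusted.glusterfs.dht.linkto values in the output above are just NUL-terminated subvolume names encoded as hex, telling DHT where the real (cached) file lives. A small bash sketch to read them (decode_hex is a hypothetical helper, not a gluster tool):

```shell
# Hypothetical helper: turn a hex-encoded, NUL-terminated string
# (as printed by getfattr -e hex) back into readable text.
decode_hex() {
    local hex=$1 out="" i
    for ((i = 0; i < ${#hex}; i += 2)); do
        [ "${hex:i:2}" = "00" ] && break      # stop at the NUL terminator
        printf -v out '%s\x'"${hex:i:2}" "$out"
    done
    printf '%s\n' "$out"
}

decode_hex 74657374766f6c2d7265706c69636174652d3000   # -> testvol-replicate-0
decode_hex 6d61737465722d7265706c69636174652d3300     # -> master-replicate-3
```

So the T files in split-brain here are pure DHT linkto files pointing at another replica set, which is consistent with the observation that reads through the mount still succeed.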
Upstream patch: https://review.gluster.org/#/c/18283/1
(In reply to Ravishankar N from comment #7)
> Upstream patch: https://review.gluster.org/#/c/18283/1

There is also a follow-up patch: https://review.gluster.org/#/c/18391/ (so 2 patches in total for this bug)
Update:
========
Build used: glusterfs-3.12.2-8.el7rhgs.x86_64

Scenario:
1) Create a 2 * 3 distribute-replicate volume and disable all heals.
2) Create a file and 3 hardlinks to it from the fuse mount. All of these files hashed to the first dht subvol, i.e. replicate-0.
3) Kill brick4, then rename HLINK1 to an appropriate name so that it gets hashed to replicate-1 and a T file is created there.
4) Likewise rename HLINK3 and HLINK7 as well, killing brick5 and brick6 respectively each time, i.e. a different brick of the 2nd replica is down each time.
e.g., after renaming:
[root@dhcp35-125 ~]# tree /mnt/23/
/mnt/23/
├── FILE
├── NEW-HLINK1
├── NEW-HLINK3-NEW
└── NEW-HLINK7-NEW
5) Now enable shd and let self-heals complete.
6) Heal should complete without split-brains.

All files are healed after enabling shd, e.g. from a node of the 2nd dht subvol:
[root@dhcp35-163 ~]# ls -lrt /bricks/brick0/testvol_distributed-replicated_brick3/
total 12
---------T. 4 root root 0 May 6 07:02 NEW-HLINK7-NEW
---------T. 4 root root 0 May 6 07:02 NEW-HLINK3-NEW
---------T. 4 root root 0 May 6 07:02 NEW-HLINK1
[root@dhcp35-163 ~]#
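For anyone re-running this verification: "disable all heals" in step 1 can be done with the volume options below, and step 5 re-enables the self-heal daemon. This is a sketch against a live cluster, assuming the volume is named testvol as in the repro above; it is not runnable outside a gluster setup.

```shell
# Step 1: disable the self-heal daemon and all client-side heals
# for the test volume (hypothetical volume name: testvol).
gluster volume set testvol cluster.self-heal-daemon off
gluster volume set testvol cluster.data-self-heal off
gluster volume set testvol cluster.metadata-self-heal off
gluster volume set testvol cluster.entry-self-heal off

# Step 5: turn shd back on and let heals complete, then check
# that heal info reports no entries in split-brain.
gluster volume set testvol cluster.self-heal-daemon on
gluster volume heal testvol info
```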
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607