+++ This bug was initially created as a clone of Bug #1491670 +++
+++ This bug was initially created as a clone of Bug #1482812 +++

Description of problem:
=======================
I have a 4x3 volume whose bricks were brought down in random order, always keeping 2 bricks of each replica online. However, I see a lot of split-brains on the system. When I looked into one of the files, the split-brain turns out to be on the hashed linkto file of a hardlink, as shown below (all bricks blaming each other). Also, the files are accessible (ls, stat, cat) from the mount and no EIO is seen.

getfattr from hashed subvolume:
===============================
[root@dhcp42-79 ~]# getfattr -d -e hex -m . /rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b1/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bdbea
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp42-79 ~]# getfattr -d -e hex -m . /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000000000000000000000
trusted.afr.master-client-1=0x000000020000000300000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x599593a600020657
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp43-210 ~]# getfattr -d -e hex -m . /rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b2/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.master-client-0=0x000000010000000200000000
trusted.afr.master-client-8=0x000000010000000200000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000befab
trusted.glusterfs.dht.linkto=0x6d61737465722d7265706c69636174652d3300

[root@dhcp42-79 ~]# ls -l /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
---------T. 4 root root 0 Aug 17 12:16 /rhs/brick3/b9/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN

Note: The files are accessible (ls, stat, cat) from the mount and no EIO is seen. If the file were in split-brain, shouldn't it be reported with EIO? Maybe it is because these split-brains are on the hashed (linkto) files of the hardlinks and there is no split-brain on the actual cached file.
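Each trusted.afr.<volume>-client-N value above packs three big-endian 32-bit counters: pending data, metadata and entry operations, in that order. The helper below is not part of the bug report, only a hedged bash convenience sketch for reading the hex dumps pasted above:

# Hedged helper: assumes the standard AFR layout of three 4-byte big-endian
# counters (data | metadata | entry) as seen in the getfattr output above.
decode_afr() {
    local hex=${1#0x}
    printf 'data=%d metadata=%d entry=%d\n' \
        "$((16#${hex:0:8}))" "$((16#${hex:8:8}))" "$((16#${hex:16:8}))"
}
decode_afr 0x000000020000000300000000   # -> data=2 metadata=3 entry=0
decode_afr 0x000000010000000200000000   # -> data=1 metadata=2 entry=0

With non-zero data and metadata counters present on every online brick of the replica, each brick blames the others, which is why heal info reports these linkto files as split-brained even though all copies are zero-byte T files.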
getfattr from cached subvolume:
===============================
[root@dhcp41-217 ~]# ls -l /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
-rw-r--r--. 6 root root 9537 Aug 17 12:14 /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN

[root@dhcp41-217 ~]# getfattr -d -e hex -m . /rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/b8/f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x0eab4b2eb0844a039a9b10fa50ed2e07
trusted.glusterfs.c9a04941-4045-4bc1-bb26-131f5634a792.xtime=0x59958f84000bb080

IO pattern while the bricks were brought down:
==============================================
for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,chown,create,hardlink,hardlink,symlink}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i <mnt> ; sleep 10 ; done

Order of bringing the bricks down:
==================================
Subvolume 0: bricks {1,2,9}
Subvolume 1: bricks {3,4,10}
Subvolume 2: bricks {5,6,11}
Subvolume 3: bricks {7,8,12}

=> Brought down the first set of bricks: 1, 11, 4, 12 (one from each subvolume) while IO is in progress
=> Brought the bricks back and waited for heal to complete
=> Brought down the second set of bricks: 5, 2, 10, 8 (one from each subvolume) while IO is in progress
=> Brought the bricks back and did not wait for heal to complete
=> Brought down the final set of bricks: 3, 6, 7, 9 (one from each subvolume) while IO is in progress

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.8.4-41.el7rhgs.x86_64

How reproducible:
=================
2/2

--- Additional comment from Ravishankar N on 2017-08-18 04:29:32 EDT ---

Volume Name: master
Type: Distributed-Replicate
Volume ID: c9a04941-4045-4bc1-bb26-131f5634a792
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick1/b2
Brick3: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick3/b9
Brick4: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick1/b3
Brick5: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick1/b4
Brick6: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick3/b10
Brick7: dhcp42-79.lab.eng.blr.redhat.com:/rhs/brick2/b5
Brick8: dhcp43-210.lab.eng.blr.redhat.com:/rhs/brick2/b6
Brick9: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick3/b11
Brick10: dhcp42-74.lab.eng.blr.redhat.com:/rhs/brick2/b7
Brick11: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick2/b8
Brick12: dhcp41-217.lab.eng.blr.redhat.com:/rhs/brick3/b12
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
nfs.disable: off
cluster.enable-shared-storage: enable

--- Additional comment from Ravishankar N on 2017-08-18 11:07:39 EDT ---

I was able to hit this issue (T files ending up in split-brain) like so:
1. Create a 2 x 3 volume and disable all heals:
Brick1: 127.0.0.2:/home/ravi/bricks/brick1
Brick2: 127.0.0.2:/home/ravi/bricks/brick2
Brick3: 127.0.0.2:/home/ravi/bricks/brick3
Brick4: 127.0.0.2:/home/ravi/bricks/brick4
Brick5: 127.0.0.2:/home/ravi/bricks/brick5
Brick6: 127.0.0.2:/home/ravi/bricks/brick6

2. Create a file and 3 hardlinks to it from the fuse mount:
# tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── HLINK1
├── HLINK3
└── HLINK7
All of these files hashed to the first DHT subvol, i.e. replicate-0.

3. Kill brick4, then rename HLINK1 to a name that hashes to replicate-1, so that a T file is created there.

4. Likewise rename HLINK3 and HLINK7 as well, killing brick5 and brick6 respectively each time, i.e. a different brick of the 2nd replica is down for each rename.

5. Now enable shd and let self-heals complete.

6. File names from the mount after the renames:
[root@tuxpad ravi]# tree /mnt/fuse_mnt/
/mnt/fuse_mnt/
├── FILE
├── NEW-HLINK1
├── NEW-HLINK3-NEW
└── NEW-HLINK7-NEW

7. The T files are now in split-brain:
[root@tuxpad ravi]# ll /home/ravi/bricks/brick{4..6}/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:59 /home/ravi/bricks/brick4/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick5/NEW-HLINK1
---------T. 4 root root 0 Aug 18 12:58 /home/ravi/bricks/brick6/NEW-HLINK1

[root@tuxpad ravi]# getfattr -d -m . -e hex /home/ravi/bricks/brick{4..6}/NEW-HLINK1
getfattr: Removing leading '/' from absolute path names
# file: home/ravi/bricks/brick4/NEW-HLINK1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick5/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-5=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

# file: home/ravi/bricks/brick6/NEW-HLINK1
security.selinux=0x73797374656d5f753a6f626a6563745f723a757365725f686f6d655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-3=0x000000010000000200000000
trusted.afr.testvol-client-4=0x000000010000000200000000
trusted.gfid=0xb4bc9ec1f7a44fa3958d82b54fc2b495
trusted.glusterfs.dht.linkto=0x74657374766f6c2d7265706c69636174652d3000

Heal-info also shows the T files to be in split-brain.
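For convenience, steps 1-5 above can be scripted roughly as follows. This is only a hedged sketch: the volume name (testvol), brick paths, mount point and rename targets are taken from the comment, and it assumes each killed brick is brought back with "volume start force" before the next one is taken down. The exact names that hash to replicate-1 depend on DHT, so they may need adjusting on a different setup.

# 1. Disable the self-heal daemon and the client-side heals.
gluster volume set testvol cluster.self-heal-daemon off
gluster volume set testvol cluster.data-self-heal off
gluster volume set testvol cluster.metadata-self-heal off
gluster volume set testvol cluster.entry-self-heal off

# 2. Create a file and three hardlinks; all of them hash to replicate-0.
touch /mnt/fuse_mnt/FILE
ln /mnt/fuse_mnt/FILE /mnt/fuse_mnt/HLINK1
ln /mnt/fuse_mnt/FILE /mnt/fuse_mnt/HLINK3
ln /mnt/fuse_mnt/FILE /mnt/fuse_mnt/HLINK7

# 3/4. For each hardlink: kill one brick of the 2nd replica, rename the link so
# it hashes to replicate-1 (creating the T file there), then restart the brick.
# The brick PID can also be read from 'gluster volume status testvol'.
kill -9 "$(pgrep -f /home/ravi/bricks/brick4)"
mv /mnt/fuse_mnt/HLINK1 /mnt/fuse_mnt/NEW-HLINK1
gluster volume start testvol force
# ...repeat with brick5 + HLINK3 -> NEW-HLINK3-NEW, and brick6 + HLINK7 -> NEW-HLINK7-NEW...

# 5. Re-enable heals, trigger the self-heal daemon and check the result.
gluster volume set testvol cluster.self-heal-daemon on
gluster volume heal testvol
gluster volume heal testvol info split-brain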
--- Additional comment from Worker Ant on 2017-09-14 08:11:13 EDT ---

REVIEW: https://review.gluster.org/18283 (afr: auto-resolve split-brains for zero-byte files) posted (#1) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Worker Ant on 2017-09-16 00:06:04 EDT ---

REVIEW: https://review.gluster.org/18283 (afr: auto-resolve split-brains for zero-byte files) posted (#2) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Worker Ant on 2017-09-26 00:04:23 EDT ---

COMMIT: https://review.gluster.org/18283 committed in master by Ravishankar N (ravishankar)
------
commit 1719cffa911c5287715abfdb991bc8862f0c994e
Author: Ravishankar N <ravishankar>
Date:   Thu Sep 14 11:29:15 2017 +0530

    afr: auto-resolve split-brains for zero-byte files

    Problems:
    As described in BZ 1491670, renaming hardlinks can result in data/mdata
    split-brain of the DHT link-to files (T files) without any mismatch of
    data and metadata.
    As described in BZ 1486063, for a zero-byte file with only dirty bits
    set, arbiter brick will likely be chosen as the source brick.

    Fix:
    For zero byte files in split-brain, pick first brick as
    a) data source if file size is zero on all bricks.
    b) metadata source if metadata is the same on all bricks.
    In arbiter case, if file size is zero on all bricks and there are no
    pending afr xattrs, pick 1st brick as data source.

    Change-Id: I0270a9a2f97c3b21087e280bb890159b43975e04
    BUG: 1491670
    Signed-off-by: Ravishankar N <ravishankar>
    Reported-by: Rahul Hinduja <rhinduja>
    Reported-by: Mabi <mabi>

--- Additional comment from Ravishankar N on 2017-09-26 04:36:53 EDT ---

Sending an addendum to the patch in comment #3.

--- Additional comment from Worker Ant on 2017-09-26 04:37:24 EDT ---

REVIEW: https://review.gluster.org/18391 (afr: don't check for file size in afr_mark_source_sinks_if_file_empty) posted (#1) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Worker Ant on 2017-09-26 04:41:26 EDT ---

REVIEW: https://review.gluster.org/18391 (afr: don't check for file size in afr_mark_source_sinks_if_file_empty) posted (#2) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Worker Ant on 2017-09-26 23:03:44 EDT ---

COMMIT: https://review.gluster.org/18391 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit 24637d54dcbc06de8a7de17c75b9291fcfcfbc84
Author: Ravishankar N <ravishankar>
Date:   Tue Sep 26 14:03:52 2017 +0530

    afr: don't check for file size in afr_mark_source_sinks_if_file_empty

    ... for AFR_METADATA_TRANSACTION and just mark source and sinks if
    metadata is the same.

    Change-Id: I69e55d3c842c7636e3538d1b57bc4deca67bed05
    BUG: 1491670
    Signed-off-by: Ravishankar N <ravishankar>
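In practice, the heuristic described in the commit messages above only fires when the split-brained copies are genuinely identical. A hedged way to double-check that for a suspect linkto file (the brick paths and file name below are the ones from the description of this report) is to compare size and metadata across the replica:

# A file qualifies for the auto-resolution described above when it is zero
# bytes on every brick and its mode/uid/gid match everywhere.
f=f/new_test/thread8/level05/level15/hardlink_to_files/59958f84%%OR3N7XRPBN
for b in /rhs/brick1/b1 /rhs/brick1/b2 /rhs/brick3/b9; do
    stat -c '%n size=%s mode=%a uid=%u gid=%g' "$b/$f"
done

If every line reports size=0 and identical mode/uid/gid, a build containing the patch picks the first brick as source during self-heal instead of leaving the file flagged as split-brained.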
REVIEW: https://review.gluster.org/18402 (afr: auto-resolve split-brains for zero-byte files) posted (#1) for review on release-3.10 by Ravishankar N (ravishankar)
COMMIT: https://review.gluster.org/18402 committed in release-3.10 by Shyamsundar Ranganathan (srangana)
------
commit f5998f07dfd21d06a4119416ca79db50232b50d4
Author: Ravishankar N <ravishankar>
Date:   Wed Sep 27 10:32:36 2017 +0530

    afr: auto-resolve split-brains for zero-byte files

    Backport of https://review.gluster.org/#/c/18283/

    Problems:
    As described in BZ 1491670, renaming hardlinks can result in data/mdata
    split-brain of the DHT link-to files (T files) without any mismatch of
    data and metadata.
    As described in BZ 1486063, for a zero-byte file with only dirty bits
    set, arbiter brick will likely be chosen as the source brick.

    Fix:
    For zero byte files in split-brain, pick first brick as
    a) data source if file size is zero on all bricks.
    b) metadata source if metadata is the same on all bricks.
    In arbiter case, if file size is zero on all bricks and there are no
    pending afr xattrs, pick 1st brick as data source.

    Change-Id: I0270a9a2f97c3b21087e280bb890159b43975e04
    BUG: 1496321
    Signed-off-by: Ravishankar N <ravishankar>
    Reported-by: Rahul Hinduja <rhinduja>
    Reported-by: Mabi <mabi>
REVIEW: https://review.gluster.org/18421 (afr: don't check for file size in afr_mark_source_sinks_if_file_empty) posted (#1) for review on release-3.10 by Ravishankar N (ravishankar)
COMMIT: https://review.gluster.org/18421 committed in release-3.10 by Shyamsundar Ranganathan (srangana)
------
commit ccec52220f4d2fb148c1bb1573a11b1727af7a0c
Author: Ravishankar N <ravishankar>
Date:   Tue Sep 26 14:03:52 2017 +0530

    afr: don't check for file size in afr_mark_source_sinks_if_file_empty

    ... for AFR_METADATA_TRANSACTION and just mark source and sinks if
    metadata is the same.

    (cherry picked from commit 24637d54dcbc06de8a7de17c75b9291fcfcfbc84)

    Change-Id: I69e55d3c842c7636e3538d1b57bc4deca67bed05
    BUG: 1496321
    Signed-off-by: Ravishankar N <ravishankar>
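Once a build containing both backports is installed (glusterfs-3.10.6 or later on this branch), a hedged way to verify on the setup from this report is simply to re-trigger heal on the volume and confirm that the zero-byte linkto files disappear from the split-brain listing:

glusterfs --version
gluster volume heal master
gluster volume heal master info split-brain    # the entry count per brick should drop to zero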
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.10.6, please open a new bug report.

glusterfs-3.10.6 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-October/000084.html
[2] https://www.gluster.org/pipermail/gluster-users/