Description of problem:
When the data bricks in an arbiter volume are brought down in a cyclic manner, I see that the arbiter brick becomes the source for heal. This should not happen, as the arbiter brick contains only metadata.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-5.el7rhgs.x86_64

How reproducible:
Hit it once

Steps to Reproduce:
1. Install the HC stack on arbiter volumes.
2. Start doing I/O on the VMs.
3. While I/O is going on, bring down one of the bricks; after some time, bring that brick up and bring down another data brick.
4. After some time, bring up the down brick. I observed that a few VMs got paused and the arbiter brick became the source for the other two bricks.

Actual results:
VMs are getting paused and the arbiter brick becomes the source for the other two bricks.

Expected results:
The arbiter brick should not become the source for the other two bricks, as it does not hold any data.

Additional info:
Volume info :
================
[root@rhsqa-grafton1 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 09d43f7c-a6a2-4f4d-b781-c36e53a48bca
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.79:/rhgs/brick2/data
Brick2: 10.70.36.80:/rhgs/brick2/data
Brick3: 10.70.36.81:/rhgs/brick2/data (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
performance.strict-o-direct: on
network.ping-timeout: 30
user.cifs: off
cluster.granular-entry-heal: on

gluster volume heal info output on data volume:
==============================================
[root@rhsqa-grafton1 ~]# gluster volume heal data info
Brick 10.70.36.79:/rhgs/brick2/data
/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
Status: Connected
Number of entries: 1

Brick 10.70.36.80:/rhgs/brick2/data
/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
Status: Connected
Number of entries: 1

Brick 10.70.36.81:/rhgs/brick2/data
/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
Status: Connected
Number of entries: 1

fattrs on the first node:
============================
[root@rhsqa-grafton1 ~]# getfattr -d -m . -e hex /rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.data-client-1=0x0000156a0000000000000000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x0200000000000000583ebacf000b18d6
trusted.gfid=0x46744dafdde147758967c233e249f707
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000035af0000000000000000000000000000001c1f600000000000000000

fattrs on the second node:
=============================
getfattr -d -m . -e hex /rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.data-client-0=0x000000010000000000000000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x0300000000000000583ecac6000e17f7
trusted.gfid=0x46744dafdde147758967c233e249f707
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000035af0000000000000000000000000000001c1f600000000000000000

fattrs on the third node:
==============================
[root@rhsqa-grafton3 ~]# getfattr -d -m . -e hex /rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.data-client-0=0x000000010000000000000000
trusted.afr.data-client-1=0x0000156a0000000000000000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x0200000000000000583eb09a000b126b
trusted.gfid=0x46744dafdde147758967c233e249f707
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000035af0000000000000000000000000000001c1f600000000000000000
sosreports can be found at the link:
=====================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1401969/
Is this similar to https://bugzilla.redhat.com/show_bug.cgi?id=1361518 (Files not able to heal after arbiter and data bricks were rebooted)?
(In reply to nchilaka from comment #4)
> is this similar to https://bugzilla.redhat.com/show_bug.cgi?id=1361518 -
> Files not able to heal after arbiter and data bricks were rebooted ?

No. That bug is about zero-byte files, where the arbiter is used as the source brick when entry self-heal creates a new entry.
I do see that the two data bricks in the volume blame each other (the first brick says that the second one needs healing, and the second one says that the first one needs healing).
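To make the mutual blame concrete, here is a minimal Python sketch that decodes the trusted.afr.data-client-N values pasted in the description. It assumes each value is three big-endian 32-bit counters (data, metadata, entry) and treats a brick as a possible data source only if no brick holds a non-zero data counter against it. The function names are made up for illustration, and real AFR source selection takes more into account (for example trusted.afr.dirty), so read it as an aid for interpreting the xattrs rather than as the actual heal logic.

```python
import struct

def decode_afr(hex_value):
    """Return (data, metadata, entry) pending counters from a hex xattr value."""
    text = hex_value[2:] if hex_value.startswith("0x") else hex_value
    return struct.unpack(">III", bytes.fromhex(text))

# Accusing brick index -> {blamed brick index: xattr value}, copied from the
# getfattr output in this report. Indexes 0 and 1 are the data bricks
# (client-0, client-1); index 2 is the arbiter.
xattrs = {
    0: {1: "0x0000156a0000000000000000"},
    1: {0: "0x000000010000000000000000"},
    2: {0: "0x000000010000000000000000", 1: "0x0000156a0000000000000000"},
}

def data_sources(xattrs, brick_count=3):
    """Bricks that no other brick blames for pending data operations."""
    blamed = set()
    for entries in xattrs.values():
        for target, value in entries.items():
            if decode_afr(value)[0] > 0:
                blamed.add(target)
    return [brick for brick in range(brick_count) if brick not in blamed]

print(decode_afr("0x0000156a0000000000000000"))  # (5482, 0, 0)
print(data_sources(xattrs))                      # [2]
```

With these values both data bricks are blamed, so the only brick left unblamed is index 2, the arbiter, which matches what heal info shows.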
Ravi, can you provide doc text with workaround?
Tested with RHGS 3.3.0 interim build (glusterfs-3.8.4-28.el7rhgs). I could hit this issue consistently, along with the other issue of split-brain on an arbiter volume (BZ 1384983).

A very simple test is to:
1. Create an arbiter volume 1 x (2+1) with bricks brick1, brick2, arbiter
2. FUSE mount it on any RHEL 7 client
3. Run some app (dd, truncate, etc.) on a single file
4. Kill brick2
5. Sleep for 3 seconds
6. Bring up brick2, sleep for 3 seconds, kill arbiter
7. Sleep for 3 seconds
8. Bring up arbiter, sleep for 3 seconds, kill brick1
9. Sleep for 3 seconds
10. Continue with step 4

When the above steps are repeated, I ended up either in a split-brain (BZ 1384983) or with the arbiter becoming the source of heal.
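The kill/restart cycle in steps 4 to 10 can be written down as a small schedule generator. This is only a sketch of the ordering: the function name and the action labels are hypothetical, nothing here talks to gluster, and it assumes brick1 is brought back up (followed by the same 3 second pause) before the loop returns to step 4, which the steps above do not state explicitly.

```python
from itertools import cycle, islice

def cyclic_schedule(bricks, rounds, pause=3):
    """Yield (action, argument) steps: kill a brick, pause, restart it, pause."""
    for brick in islice(cycle(bricks), rounds):
        yield ("kill", brick)
        yield ("sleep", pause)
        # Assumption: every brick is restarted before the next one is killed.
        yield ("start", brick)
        yield ("sleep", pause)

# Order taken from the steps above: brick2, then arbiter, then brick1.
steps = list(cyclic_schedule(["brick2", "arbiter", "brick1"], rounds=4))
for step in steps:
    print(step)
```

A real harness would map each action onto whatever mechanism the test bed uses to stop and restart a brick process.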
There is a race which leads to this situation. It happens when eager-lock is on, which allows two writes to happen in parallel on an FD. The first write fails on one brick, and before its post-op marks the pending xattrs, another write comes in parallel. This second write does the inode refresh and gets the readables. Since the xattrs have not been marked on disk yet, the refresh sees both data bricks as readable and sets that in the inode context. The in-flight split-brain check sees both data bricks as readable and allows the second write. This write fails on the other brick and succeeds on the previously failed brick. Now one write has failed on the first data brick and the other has failed on the second data brick. The post-op then completes for both writes and marks pending on both bricks, leading to the arbiter becoming the source.
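The ordering problem can be replayed with a toy model in Python. This is not AFR code: the class, its methods, and the way the in-flight check is expressed are all simplifications invented for illustration. The only thing it tries to capture is that the second write's refresh runs before the first write's post-op has marked anything on disk.

```python
DATA_BRICKS = (0, 1)  # index 2 is the arbiter

class Inode:
    def __init__(self):
        # pending[accuser][target] stands in for the on-disk pending xattrs.
        self.pending = {brick: {} for brick in (0, 1, 2)}
        self.readable = set()

    def _blamed(self):
        return {t for entries in self.pending.values() for t in entries}

    def refresh(self):
        """Recompute readable data bricks from what is on disk right now."""
        self.readable = {b for b in DATA_BRICKS if b not in self._blamed()}

    def survives_inflight_check(self, failed_on):
        """Toy check: a readable data brick must remain after the failure."""
        return bool(self.readable - {failed_on})

    def post_op(self, failed_on):
        """Bricks where the write succeeded blame the brick where it failed."""
        for brick in self.pending:
            if brick != failed_on:
                count = self.pending[brick].get(failed_on, 0)
                self.pending[brick][failed_on] = count + 1

    def sources(self):
        return [b for b in self.pending if b not in self._blamed()]

def run(racy):
    inode = Inode()
    # Write 1 fails on data brick 0.
    if not racy:
        inode.post_op(failed_on=0)   # post-op lands before write 2 refreshes
    inode.refresh()                  # write 2 does its inode refresh
    allowed = inode.survives_inflight_check(failed_on=1)  # write 2 fails on brick 1
    if racy:
        inode.post_op(failed_on=0)   # write 1's post-op lands only now
    if allowed:
        inode.post_op(failed_on=1)
    return allowed, inode.sources()

print(run(racy=True))    # (True, [2])
print(run(racy=False))   # (False, [1, 2])
```

In the racy ordering the second write is let through and only the arbiter (index 2) is left unblamed. In the non-racy ordering the refresh already sees brick 0 as bad, so a write that then fails on brick 1 does not pass the check and brick 1 remains a valid source.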
Upstream patch: https://review.gluster.org/#/c/18049/
Upstream patch: https://review.gluster.org/#/c/19045/
Tested with RHGS 3.4.0 nightly build (glusterfs-3.12.2-16.el7rhgs) with the following steps:
1. Created a 1 x (2+1) arbitrated replicate volume and used it as a storage domain in RHV.
2. Created a few VMs with their boot disks on this domain.
3. Ran some I/O inside the VMs.
4. Killed the first brick and waited for 10 minutes.
5. Brought the brick back up and waited until self-heal was complete.
6. Repeated steps 4 and 5 for the second and third bricks.
7. Repeated steps 4, 5 and 6 for 100 iterations.

All worked well. The arbiter never became the source of heal.
Have updated the doc text. Kindly review and confirm
Made a small change. Rest looks good to me.
have updated the doc text. Kindly review and confirm
In the last sentence, please change "arbiter bricks are not considered as source" to "arbiter brick will not be marked as source", because we decide whether a brick is a source or a sink based on the pending changelogs set on the file. With this fix, we no longer allow the data pending part of the pending changelog xattrs to be set if it is an arbiter brick, which was happening before. Considering a brick as source and marking it as source are two different things.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607