Description of problem:
Using sanlock on a gluster mount with replica 3 (quorum auto) leads to a split-brain.

Version-Release number of selected component (if applicable):
glusterfs-3.4.2-1.el6.x86_64
sanlock-2.8-1.el6.x86_64

How reproducible:
~100%

Steps to Reproduce:
1. create a volume with replica 3 across 3 nodes (quorum-type=auto, ping-timeout=10):

Volume Name: glusterfs1
Type: Replicate
Volume ID: f0b13eee-1087-49ab-aae7-48543fa78f2c
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.124.81:/srv/glusterfs/glusterfs1/brick1
Brick2: 192.168.124.82:/srv/glusterfs/glusterfs1/brick1
Brick3: 192.168.124.83:/srv/glusterfs/glusterfs1/brick1
Options Reconfigured:
cluster.quorum-type: auto
network.ping-timeout: 10

2. mount the volume from all 3 nodes with:

# mount -t glusterfs 192.168.124.81:/glusterfs1 /mnt

3. create a new lockspace:

# touch /mnt/lockspace1.ls
# chown sanlock:sanlock /mnt/lockspace1.ls
# sanlock init -s lockspace1:0:/mnt/lockspace1.ls:0

4. add the lockspace on each host (with a different host id: 1, 2, 3) and wait until it is acquired:

[root@node1 ~]# sanlock add_lockspace -s lockspace1:1:/mnt/lockspace1.ls:0
add_lockspace
add_lockspace done 0

[root@node2 ~]# sanlock add_lockspace -s lockspace1:2:/mnt/lockspace1.ls:0
add_lockspace
add_lockspace done 0

[root@node3 ~]# sanlock add_lockspace -s lockspace1:3:/mnt/lockspace1.ls:0
add_lockspace
add_lockspace done 0

# sanlock direct dump /mnt/lockspace1.ls
  offset   lockspace                                     resource  timestamp  own  gen lver
00000000  lockspace1 830b815d-6f55-4237-af11-f4af1bc888c3.node1.i 0000000915 0001 0001
00000512  lockspace1 7a14c9fa-a13e-4c7a-9e05-e50b4766db4f.node2.i 0000000920 0002 0001
00001024  lockspace1 c2996bb0-5519-482e-9c89-4e4103ce2fe5.node3.i 0000000924 0003 0001

5. optionally you can monitor the lockspace status (on the 3 hosts) with:

# watch -d -n 1 "sanlock client host_status -D | grep timestamp"

6. block the connectivity on node 1:

[root@node1 ~]# iptables -A INPUT -i glusternet1 -j REJECT ; iptables -A OUTPUT -o glusternet1 -j REJECT

7. wait until "sanlock client host_status -D" no longer reports anything on node 1, roughly 1 minute (node 2 and 3 should be fine)

8. inspect the lockspace1.ls file on each node:

[root@node1 ~]# getfattr -d -m. -e hex /srv/glusterfs/glusterfs1/brick1/lockspace1.ls
getfattr: Removing leading '/' from absolute path names
# file: srv/glusterfs/glusterfs1/brick1/lockspace1.ls
security.selinux=0x73797374656d5f753a6f626a6563745f723a7661725f743a733000
trusted.afr.glusterfs1-client-0=0x000000000000000000000000
trusted.afr.glusterfs1-client-1=0x000000010000000000000000
trusted.afr.glusterfs1-client-2=0x000000010000000000000000
trusted.gfid=0x21c2a41a19e7446eaf870ed1a5961a7e

[root@node2 ~]# getfattr -d -m. -e hex /srv/glusterfs/glusterfs1/brick1/lockspace1.ls
getfattr: Removing leading '/' from absolute path names
# file: srv/glusterfs/glusterfs1/brick1/lockspace1.ls
security.selinux=0x73797374656d5f753a6f626a6563745f723a7661725f743a733000
trusted.afr.glusterfs1-client-0=0x000000130000000000000000
trusted.afr.glusterfs1-client-1=0x000000000000000000000000
trusted.afr.glusterfs1-client-2=0x000000000000000000000000
trusted.gfid=0x21c2a41a19e7446eaf870ed1a5961a7e

[root@node3 ~]# getfattr -d -m. -e hex /srv/glusterfs/glusterfs1/brick1/lockspace1.ls
getfattr: Removing leading '/' from absolute path names
# file: srv/glusterfs/glusterfs1/brick1/lockspace1.ls
security.selinux=0x73797374656d5f753a6f626a6563745f723a7661725f743a733000
trusted.afr.glusterfs1-client-0=0x000000140000000000000000
trusted.afr.glusterfs1-client-1=0x000000010000000000000000
trusted.afr.glusterfs1-client-2=0x000000010000000000000000
trusted.gfid=0x21c2a41a19e7446eaf870ed1a5961a7e

9. restore the connectivity for node 1:

[root@node1 ~]# iptables -D INPUT -i glusternet1 -j REJECT ; iptables -D OUTPUT -o glusternet1 -j REJECT

10. reads and writes are now failing on all hosts (it takes a few seconds); you can quickly verify it with (try multiple times):

[root@node1 ~]# sha1sum /mnt/lockspace1.ls
sha1sum: /mnt/lockspace1.ls: Input/output error

[root@node2 ~]# sha1sum /mnt/lockspace1.ls
sha1sum: /mnt/lockspace1.ls: Input/output error

[root@node3 ~]# sha1sum /mnt/lockspace1.ls
sha1sum: /mnt/lockspace1.ls: Input/output error

11. in roughly 1 minute the lockspaces will disappear on node 2 and 3 as well

Actual results:
Split-brain that needs manual intervention to be resolved.

Expected results:
Split-brain should be avoided or automatically resolved.
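As an aid when reading the getfattr output in step 8: each trusted.afr.<volume>-client-N value packs three big-endian 32-bit pending counters (data, metadata, entry operations) that this brick holds against client N. A small bash helper can decode them (the function name decode_afr is my own; only the on-disk layout is taken from AFR):

```shell
#!/bin/bash
# Decode a trusted.afr.<volume>-client-N extended attribute value.
# The 12-byte value holds three big-endian 32-bit pending counters:
# data operations, metadata operations, entry operations.
decode_afr() {
    local v=${1#0x}                  # strip the leading 0x
    printf 'data=%d metadata=%d entry=%d\n' \
        "$((16#${v:0:8}))" "$((16#${v:8:8}))" "$((16#${v:16:8}))"
}

# Example: node2's pending counters against client-0 from step 8:
decode_afr 0x000000130000000000000000   # data=19 metadata=0 entry=0
```

A non-zero data counter means this brick recorded writes that the blamed brick missed; when bricks blame each other in both directions (node1 blames client-1/client-2 while node2 and node3 blame client-0, as above), the file is in split-brain.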
I also tried adding the quorum-count option, but the issue is still present.

Volume Name: glusterfs1
Type: Replicate
Volume ID: f0b13eee-1087-49ab-aae7-48543fa78f2c
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.124.81:/srv/glusterfs/glusterfs1/brick1
Brick2: 192.168.124.82:/srv/glusterfs/glusterfs1/brick1
Brick3: 192.168.124.83:/srv/glusterfs/glusterfs1/brick1
Options Reconfigured:
cluster.quorum-count: 2
cluster.quorum-type: auto
network.ping-timeout: 10
(In reply to Federico Simoncelli from comment #1)
> I tried also adding the quorum-count option but the issue is still present.
>
> Volume Name: glusterfs1
> Type: Replicate
> Volume ID: f0b13eee-1087-49ab-aae7-48543fa78f2c
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.124.81:/srv/glusterfs/glusterfs1/brick1
> Brick2: 192.168.124.82:/srv/glusterfs/glusterfs1/brick1
> Brick3: 192.168.124.83:/srv/glusterfs/glusterfs1/brick1
> Options Reconfigured:
> cluster.quorum-count: 2
> cluster.quorum-type: auto
> network.ping-timeout: 10

quorum-count is used in conjunction with quorum-type, and a quorum-type of 'auto' overrides quorum-count. So set the quorum-type to 'fixed' and then set the quorum-count to '2'.
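Concretely, the suggestion above would be applied with the following volume-set commands (a sketch only; volume name taken from the report, and these require a running gluster cluster):

```shell
# quorum-type 'auto' makes gluster ignore quorum-count; switch to
# 'fixed' so that the explicit count of 2 takes effect.
gluster volume set glusterfs1 cluster.quorum-type fixed
gluster volume set glusterfs1 cluster.quorum-count 2
```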
Same behavior with:

Volume Name: glusterfs1
Type: Replicate
Volume ID: f0b13eee-1087-49ab-aae7-48543fa78f2c
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.124.81:/srv/glusterfs/glusterfs1/brick1
Brick2: 192.168.124.82:/srv/glusterfs/glusterfs1/brick1
Brick3: 192.168.124.83:/srv/glusterfs/glusterfs1/brick1
Options Reconfigured:
cluster.quorum-count: 2
cluster.quorum-type: fixed
network.ping-timeout: 10
I was able to re-create the bug with the steps given in the bug description. The issue happens because quorum checks are done only when the fop is issued; if the fop then fails in a way that amounts to a quorum failure, that case is not handled in the _cbk path. I need to think about how to fix this issue. Will update as soon as I come up with a good way to fix it.
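To illustrate the failure mode described above in isolation (this is not actual AFR code; all function names here are made up, and the even-replica "first brick counts extra" rule of auto quorum is not modeled): quorum is evaluated only before the fop is wound, so when replies from a majority of bricks fail, the surviving brick still marks pending changelogs against the others, producing the mutual blame seen in step 8. The fix amounts to re-checking quorum on the replies and failing the fop instead:

```shell
#!/bin/bash
# Toy model: evaluate quorum again on the replies (the _cbk path) and
# fail the fop, without marking changelogs, when the successful replies
# no longer form a quorum. All names are hypothetical.
quorum_met() {              # $1 = number of successes, $2 = replica count
    [ "$1" -gt $(( $2 / 2 )) ]
}

afr_fop_done() {            # one argument per brick: 1 = reply ok, 0 = failed
    local replicas=$# ok=0 r
    for r in "$@"; do ok=$(( ok + r )); done
    if quorum_met "$ok" "$replicas"; then
        echo "success: mark pending changelog against the $(( replicas - ok )) failed brick(s)"
    else
        echo "fail fop, skip changelog: only $ok of $replicas replies succeeded"
    fi
}

afr_fop_done 1 1 1    # all bricks replied
afr_fop_done 1 0 0    # partition: wind-time quorum passed, replies did not
```

In the buggy version the second case would still take the "mark pending changelog" branch on the one surviving brick, so both sides of the partition end up blaming each other.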
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#1) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7600 (cluster/afr: Fix bugs in quorum implementation) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#2) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
Tested the fix posted by Pranith in comment 7.

Before testing the fix, I installed 3.4.3 bits to confirm the bug. Here is the sequence of steps:

1. Created 5 VMs with fed20: 3 machines used as glusterfs nodes and 2 other nodes used for mounting the volume.
2. Installed glusterfs-3.4.3 from the fedora repo and also installed glusterfs-debuginfo packages from the fedora debuginfo repo.
3. Created a replicate volume with 3 bricks.
4. Enabled client-side quorum on the volume:
   > gluster volume set <vol-name> cluster.quorum-type auto
5. Started the volume and mounted it from the 2 fed20 machines (say mount1 and mount2).
6. From mount1, attached gdb to the mount process and set a breakpoint on the function "afr_have_quorum", which is where the quorum calculation and verification were done.
7. Created a file from mount1, and after the quorum calculation (tracing through gdb), blocked incoming and outgoing data traffic to brick2 and brick3. This was done to simulate a network partition; the result is that mount1 can't see brick2 and brick3.
8. Checked the changelogs on brick1 [it was 0-1-1].
9. Blocked traffic from mount2 to brick1:
   > iptables -A OUTPUT -d <destination> -p all -j REJECT
10. Continued editing the same file from mount2. The changelogs on brick2 and brick3 correspond to 1-0-0.
11. Flushed the iptables rules on the nodes on which they were added in the earlier steps.
12. Performed a heal on a gluster node:
    > gluster volume heal <vol-name> full
13. Checked for split-brain info:
    > gluster volume heal <vol-name> info split-brain

The result of all the above steps is that split-brain was evident.

I repeated the same steps with the fix, with one small change:

1. New breakpoint at the function "afr_has_fop_quorum", where the quorum calculation and verification are done for the FOP on the call path.

The observations are:

1. After step 8 (see above for the step), the changelogs on brick1 were 1-1-1.
2. After step 10, the changelogs on brick2 and brick3 corresponded to 1-0-0.
3. After flushing the iptables rules and healing the volume, no split-brain was witnessed.
I repeated testing the fix with the test steps as mentioned in comment 0, and I was not running into split-brain. I repeated the test a number of times (around 15) and I have not witnessed a split-brain.

I have a question here for Federico: when the mount goes read-only after quorum is no longer met, and I then restore quorum after sufficient time (say after 15 minutes), I see that "sanlock client host_status -D" is not giving any output. I see the following in the log (/var/log/sanlock.log):

<snip>
2014-04-30 14:55:53-0400 71911 [8377]: s4 all pids clear
2014-04-30 14:55:54-0400 71911 [30873]: ls1 aio collect 1 0x7fa37c0008c0:0x7fa37c0008d0:0x7fa37c101000 result -30:0 match res
2014-04-30 14:55:54-0400 71911 [30873]: s4 delta_renew write error -30
2014-04-30 14:55:54-0400 71911 [30873]: s4 renewal error -30 delta_length 0 last_success 71831
</snip>

But there are no split-brains, and I confirmed the same. Is that the expected behavior with sanlock?
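For reference when reading the log above: the -30 in "delta_renew write error -30" is -EROFS (errno 30, "Read-only file system" on Linux), which is consistent with the client-quorum enforcement having made the mount read-only, so sanlock could not renew its delta lease. This can be checked quickly from a shell:

```shell
# errno 30 on Linux is EROFS ("Read-only file system")
python3 -c 'import errno, os; print(errno.errorcode[30], os.strerror(30))'
# prints: EROFS Read-only file system
```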
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#3) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#4) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#5) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/7513 committed in release-3.5 by Niels de Vos (ndevos)
------
commit 46bf591eb87d743baacd30a85b99fd47db4fb055
Author: Pranith Kumar K <pkarampu>
Date:   Mon Apr 21 03:13:58 2014 +0530

    cluster/afr: Fix bugs in quorum implementation

    - Have common place to perform quorum fop wind check
    - Check if fop succeeded in a way that matches quorum
      to avoid marking changelog in split-brain.

    Change-Id: I663072ece0e1de6e1ee9fccb03e1b6c968793bc5
    BUG: 1066996
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/7513
    Reviewed-by: Ravishankar N <ravishankar>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Niels de Vos <ndevos>
REVIEW: http://review.gluster.org/7600 (cluster/afr: Fix bugs in quorum implementation) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/7600 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit eb04dab9992f8f5d4b2d45e1ca10032fededcff1
Author: Pranith Kumar K <pkarampu>
Date:   Tue Apr 29 05:49:21 2014 +0530

    cluster/afr: Fix bugs in quorum implementation

    - Have common place to perform quorum fop wind check
    - Check if fop succeeded in a way that matches quorum
      to avoid marking changelog in split-brain.

    BUG: 1066996
    Change-Id: Ibc5b80e01dc206b2abbea2d29e26f3c60ff4f204
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/7600
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Ravishankar N <ravishankar>
I just verified this using glusterfs-3.6.0.4-1.el6rhs and ovirt from the master repository. I haven't been able to reproduce the issue. I'll do some more testing. Thanks!
The first (and last?) Beta for GlusterFS 3.5.1 has been released [1]. Please verify if the release solves this bug report for you.

In case the glusterfs-3.5.1beta release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-May/040377.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.1, please reopen this bug report.

glusterfs-3.5.1 has been announced on the Gluster Users mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-June/040723.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user