Description of problem:
Using sanlock on a gluster mount with replica 3 (quorum auto) leads to a split-brain.

Version-Release number of selected component (if applicable):
glusterfs-3.4.2-1.el6.x86_64
sanlock-2.8-1.el6.x86_64

How reproducible:
~100%

Steps to Reproduce:
1. create a volume with replica 3 across 3 nodes (quorum-type=auto, ping-timeout=10):

Volume Name: glusterfs1
Type: Replicate
Volume ID: f0b13eee-1087-49ab-aae7-48543fa78f2c
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.124.81:/srv/glusterfs/glusterfs1/brick1
Brick2: 192.168.124.82:/srv/glusterfs/glusterfs1/brick1
Brick3: 192.168.124.83:/srv/glusterfs/glusterfs1/brick1
Options Reconfigured:
cluster.quorum-type: auto
network.ping-timeout: 10

2. mount the volume from all 3 nodes with:

# mount -t glusterfs 192.168.124.81:/glusterfs1 /mnt

3. create a new lockspace:

# touch /mnt/lockspace1.ls
# chown sanlock:sanlock /mnt/lockspace1.ls
# sanlock init -s lockspace1:0:/mnt/lockspace1.ls:0

4. add the lockspace on each host (with a different host id: 1, 2, 3) and wait until it is acquired:

[root@node1 ~]# sanlock add_lockspace -s lockspace1:1:/mnt/lockspace1.ls:0
add_lockspace
add_lockspace done 0

[root@node2 ~]# sanlock add_lockspace -s lockspace1:2:/mnt/lockspace1.ls:0
add_lockspace
add_lockspace done 0

[root@node3 ~]# sanlock add_lockspace -s lockspace1:3:/mnt/lockspace1.ls:0
add_lockspace
add_lockspace done 0

# sanlock direct dump /mnt/lockspace1.ls
  offset   lockspace                                     resource  timestamp  own  gen lver
00000000  lockspace1 830b815d-6f55-4237-af11-f4af1bc888c3.node1.i 0000000915 0001 0001
00000512  lockspace1 7a14c9fa-a13e-4c7a-9e05-e50b4766db4f.node2.i 0000000920 0002 0001
00001024  lockspace1 c2996bb0-5519-482e-9c89-4e4103ce2fe5.node3.i 0000000924 0003 0001

5. optionally you can monitor the lockspace status (on the 3 hosts) with:

# watch -d -n 1 "sanlock client host_status -D | grep timestamp"

6. block the connectivity on node 1:

[root@node1 ~]# iptables -A INPUT -i glusternet1 -j REJECT ; iptables -A OUTPUT -o glusternet1 -j REJECT

7. wait until "sanlock client host_status -D" no longer reports anything on node 1, roughly 1 minute (node 2 and 3 should be fine)

8. inspect the lockspace1.ls file on each node:

[root@node1 ~]# getfattr -d -m. -e hex /srv/glusterfs/glusterfs1/brick1/lockspace1.ls
getfattr: Removing leading '/' from absolute path names
# file: srv/glusterfs/glusterfs1/brick1/lockspace1.ls
security.selinux=0x73797374656d5f753a6f626a6563745f723a7661725f743a733000
trusted.afr.glusterfs1-client-0=0x000000000000000000000000
trusted.afr.glusterfs1-client-1=0x000000010000000000000000
trusted.afr.glusterfs1-client-2=0x000000010000000000000000
trusted.gfid=0x21c2a41a19e7446eaf870ed1a5961a7e

[root@node2 ~]# getfattr -d -m. -e hex /srv/glusterfs/glusterfs1/brick1/lockspace1.ls
getfattr: Removing leading '/' from absolute path names
# file: srv/glusterfs/glusterfs1/brick1/lockspace1.ls
security.selinux=0x73797374656d5f753a6f626a6563745f723a7661725f743a733000
trusted.afr.glusterfs1-client-0=0x000000130000000000000000
trusted.afr.glusterfs1-client-1=0x000000000000000000000000
trusted.afr.glusterfs1-client-2=0x000000000000000000000000
trusted.gfid=0x21c2a41a19e7446eaf870ed1a5961a7e

[root@node3 ~]# getfattr -d -m. -e hex /srv/glusterfs/glusterfs1/brick1/lockspace1.ls
getfattr: Removing leading '/' from absolute path names
# file: srv/glusterfs/glusterfs1/brick1/lockspace1.ls
security.selinux=0x73797374656d5f753a6f626a6563745f723a7661725f743a733000
trusted.afr.glusterfs1-client-0=0x000000140000000000000000
trusted.afr.glusterfs1-client-1=0x000000010000000000000000
trusted.afr.glusterfs1-client-2=0x000000010000000000000000
trusted.gfid=0x21c2a41a19e7446eaf870ed1a5961a7e

9. restore the connectivity for node 1:

[root@node1 ~]# iptables -D INPUT -i glusternet1 -j REJECT ; iptables -D OUTPUT -o glusternet1 -j REJECT

10. reads and writes are now failing on all hosts (it takes a few seconds); you can quickly verify it with (try multiple times):

[root@node1 ~]# sha1sum /mnt/lockspace1.ls
sha1sum: /mnt/lockspace1.ls: Input/output error

[root@node2 ~]# sha1sum /mnt/lockspace1.ls
sha1sum: /mnt/lockspace1.ls: Input/output error

[root@node3 ~]# sha1sum /mnt/lockspace1.ls
sha1sum: /mnt/lockspace1.ls: Input/output error

11. in roughly 1 minute the lockspaces will disappear on node 2 and 3 as well

Actual results:
Split-brain that needs manual intervention to be resolved.

Expected results:
Split-brain should be avoided or automatically resolved.
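As an aid when reading the getfattr output in step 8: each trusted.afr.<volume>-client-N value packs three big-endian 32-bit pending counters (data, metadata, entry operations) that this brick holds against client N. A small bash helper can decode them (the function name decode_afr is my own; only the on-disk layout is taken from AFR):

```shell
#!/bin/bash
# Decode a trusted.afr.<volume>-client-N extended attribute value.
# The 12-byte value holds three big-endian 32-bit pending counters:
# data operations, metadata operations, entry operations.
decode_afr() {
    local v=${1#0x}                  # strip the leading 0x
    printf 'data=%d metadata=%d entry=%d\n' \
        "$((16#${v:0:8}))" "$((16#${v:8:8}))" "$((16#${v:16:8}))"
}

# Example: node2's pending counters against client-0 from step 8:
decode_afr 0x000000130000000000000000   # data=19 metadata=0 entry=0
```

A non-zero data counter means this brick recorded writes that the blamed brick missed; when bricks blame each other in both directions (node1 blames client-1/client-2 while node2 and node3 blame client-0, as above), the file is in split-brain.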
I also tried adding the quorum-count option, but the issue is still present.

Volume Name: glusterfs1
Type: Replicate
Volume ID: f0b13eee-1087-49ab-aae7-48543fa78f2c
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.124.81:/srv/glusterfs/glusterfs1/brick1
Brick2: 192.168.124.82:/srv/glusterfs/glusterfs1/brick1
Brick3: 192.168.124.83:/srv/glusterfs/glusterfs1/brick1
Options Reconfigured:
cluster.quorum-count: 2
cluster.quorum-type: auto
network.ping-timeout: 10
(In reply to Federico Simoncelli from comment #1)
> I tried also adding the quorum-count option but the issue is still present.
>
> Volume Name: glusterfs1
> Type: Replicate
> Volume ID: f0b13eee-1087-49ab-aae7-48543fa78f2c
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.124.81:/srv/glusterfs/glusterfs1/brick1
> Brick2: 192.168.124.82:/srv/glusterfs/glusterfs1/brick1
> Brick3: 192.168.124.83:/srv/glusterfs/glusterfs1/brick1
> Options Reconfigured:
> cluster.quorum-count: 2
> cluster.quorum-type: auto
> network.ping-timeout: 10

quorum-count is used in conjunction with quorum-type, and a quorum-type of 'auto' overrides quorum-count. So set the quorum-type to 'fixed' and then set the quorum-count to '2'.
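Concretely, the suggestion above would be applied with the following volume-set commands (a sketch only; volume name taken from the report, and these require a running gluster cluster):

```shell
# quorum-type 'auto' makes gluster ignore quorum-count; switch to
# 'fixed' so that the explicit count of 2 takes effect.
gluster volume set glusterfs1 cluster.quorum-type fixed
gluster volume set glusterfs1 cluster.quorum-count 2
```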
Same behavior with:

Volume Name: glusterfs1
Type: Replicate
Volume ID: f0b13eee-1087-49ab-aae7-48543fa78f2c
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.124.81:/srv/glusterfs/glusterfs1/brick1
Brick2: 192.168.124.82:/srv/glusterfs/glusterfs1/brick1
Brick3: 192.168.124.83:/srv/glusterfs/glusterfs1/brick1
Options Reconfigured:
cluster.quorum-count: 2
cluster.quorum-type: fixed
network.ping-timeout: 10
I was able to re-create the bug with the steps given in the bug description. The issue happens because quorum checks are done only when the fop is issued; if the fop then fails in a way that amounts to a quorum failure, that case is not handled in the _cbk path. I need to think about how to fix this issue. Will update as soon as I come up with a good way to fix it.
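To illustrate the failure mode described above in isolation (this is not actual AFR code; all function names here are made up, and the even-replica "first brick counts extra" rule of auto quorum is not modeled): quorum is evaluated only before the fop is wound, so when replies from a majority of bricks fail, the surviving brick still marks pending changelogs against the others, producing the mutual blame seen in step 8. The fix amounts to re-checking quorum on the replies and failing the fop instead:

```shell
#!/bin/bash
# Toy model: evaluate quorum again on the replies (the _cbk path) and
# fail the fop, without marking changelogs, when the successful replies
# no longer form a quorum. All names are hypothetical.
quorum_met() {              # $1 = number of successes, $2 = replica count
    [ "$1" -gt $(( $2 / 2 )) ]
}

afr_fop_done() {            # one argument per brick: 1 = reply ok, 0 = failed
    local replicas=$# ok=0 r
    for r in "$@"; do ok=$(( ok + r )); done
    if quorum_met "$ok" "$replicas"; then
        echo "success: mark pending changelog against the $(( replicas - ok )) failed brick(s)"
    else
        echo "fail fop, skip changelog: only $ok of $replicas replies succeeded"
    fi
}

afr_fop_done 1 1 1    # all bricks replied
afr_fop_done 1 0 0    # partition: wind-time quorum passed, replies did not
```

In the buggy version the second case would still take the "mark pending changelog" branch on the one surviving brick, so both sides of the partition end up blaming each other.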
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#1) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7600 (cluster/afr: Fix bugs in quorum implementation) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#2) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
Tested the fix posted by Pranith in comment 7.

Before testing the fix, I installed 3.4.3 bits to confirm the bug. Here is the sequence of steps:

1. Created 5 VMs with fed20: 3 machines used as glusterfs nodes and 2 other nodes used for mounting the volume.
2. Installed glusterfs-3.4.3 from the fedora repo and also installed glusterfs-debuginfo packages from the fedora debuginfo repo.
3. Created a replicate volume with 3 bricks.
4. Enabled client-side quorum on the volume:
   > gluster volume set <vol-name> cluster.quorum-type auto
5. Started the volume and mounted it from the 2 fed20 machines (say mount1 and mount2).
6. From mount1, attached gdb to the mount process and set a breakpoint on the function "afr_have_quorum", which is where the quorum calculation and verification were done.
7. Created a file from mount1, and after the quorum calculation (tracing through gdb), blocked incoming and outgoing data traffic to brick2 and brick3. This was done to simulate a network partition; the result is that mount1 can't see brick2 and brick3.
8. Checked the changelogs on brick1 [it was 0-1-1].
9. Blocked traffic from mount2 to brick1:
   > iptables -A OUTPUT -d <destination> -p all -j REJECT
10. Continued editing the same file from mount2. The changelogs on brick2 and brick3 correspond to 1-0-0.
11. Flushed the iptables rules on the nodes on which they were added in the earlier steps.
12. Performed a heal on a gluster node:
    > gluster volume heal <vol-name> full
13. Checked for split-brain info:
    > gluster volume heal <vol-name> info split-brain

The result of all the above steps is that split-brain was evident.

I repeated the same steps with the fix, with one small change:

1. New breakpoint at the function "afr_has_fop_quorum", where the quorum calculation and verification are done for the FOP on the call path.

The observations are:

1. After step 8 (see above for the step), the changelogs on brick1 were 1-1-1.
2. After step 10, the changelogs on brick2 and brick3 corresponded to 1-0-0.
3. After flushing the iptables rules and healing the volume, no split-brain was witnessed.
I repeated testing the fix with the test steps as mentioned in comment 0, and I was not running into split-brain. I repeated the test a number of times (around 15) and I have not witnessed a split-brain.

I have a question here for Federico: when the mount goes read-only after quorum is no longer met, and I then restore quorum after sufficient time (say after 15 minutes), I see that "sanlock client host_status -D" is not giving any output. I see the following in the log (/var/log/sanlock.log):

<snip>
2014-04-30 14:55:53-0400 71911 [8377]: s4 all pids clear
2014-04-30 14:55:54-0400 71911 [30873]: ls1 aio collect 1 0x7fa37c0008c0:0x7fa37c0008d0:0x7fa37c101000 result -30:0 match res
2014-04-30 14:55:54-0400 71911 [30873]: s4 delta_renew write error -30
2014-04-30 14:55:54-0400 71911 [30873]: s4 renewal error -30 delta_length 0 last_success 71831
</snip>

But there are no split-brains, and I confirmed the same. Is that the expected behavior with sanlock?
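For reference when reading the log above: the -30 in "delta_renew write error -30" is -EROFS (errno 30, "Read-only file system" on Linux), which is consistent with the client-quorum enforcement having made the mount read-only, so sanlock could not renew its delta lease. This can be checked quickly from a shell:

```shell
# errno 30 on Linux is EROFS ("Read-only file system")
python3 -c 'import errno, os; print(errno.errorcode[30], os.strerror(30))'
# prints: EROFS Read-only file system
```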
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#3) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#4) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7513 (cluster/afr: Fix bugs in quorum implementation) posted (#5) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/7513 committed in release-3.5 by Niels de Vos (ndevos)
------
commit 46bf591eb87d743baacd30a85b99fd47db4fb055
Author: Pranith Kumar K <pkarampu>
Date:   Mon Apr 21 03:13:58 2014 +0530

    cluster/afr: Fix bugs in quorum implementation

    - Have common place to perform quorum fop wind check
    - Check if fop succeeded in a way that matches quorum
      to avoid marking changelog in split-brain.

    Change-Id: I663072ece0e1de6e1ee9fccb03e1b6c968793bc5
    BUG: 1066996
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/7513
    Reviewed-by: Ravishankar N <ravishankar>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Niels de Vos <ndevos>
REVIEW: http://review.gluster.org/7600 (cluster/afr: Fix bugs in quorum implementation) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/7600 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit eb04dab9992f8f5d4b2d45e1ca10032fededcff1
Author: Pranith Kumar K <pkarampu>
Date:   Tue Apr 29 05:49:21 2014 +0530

    cluster/afr: Fix bugs in quorum implementation

    - Have common place to perform quorum fop wind check
    - Check if fop succeeded in a way that matches quorum
      to avoid marking changelog in split-brain.

    BUG: 1066996
    Change-Id: Ibc5b80e01dc206b2abbea2d29e26f3c60ff4f204
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/7600
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Ravishankar N <ravishankar>
I just verified this using glusterfs-3.6.0.4-1.el6rhs and ovirt from the master repository. I haven't been able to reproduce the issue. I'll do some more testing. Thanks!
The first (and last?) Beta for GlusterFS 3.5.1 has been released [1]. Please verify if the release solves this bug report for you.

In case the glusterfs-3.5.1beta release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-May/040377.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.1, please reopen this bug report.

glusterfs-3.5.1 has been announced on the Gluster Users mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-June/040723.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user