Bug 765265 (GLUSTER-3533)

Summary: Go read-only if quorum not met
Product: [Community] GlusterFS
Reporter: Jeff Darcy <jdarcy>
Component: replicate
Assignee: Jeff Darcy <jdarcy>
Status: CLOSED CURRENTRELEASE
QA Contact: Raghavendra Bhat <rabhat>
Severity: low
Priority: medium
Version: mainline
CC: amarts, gluster-bugs, vijay
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.4.0
Verified Versions: glusterfs-3.3.0qa45
Doc Type: Bug Fix
Last Closed: 2013-07-24 17:27:08 UTC
Bug Blocks: 817967

Description Jeff Darcy 2011-09-09 15:45:13 UTC
The coincidence of an IRC conversation and Chip Salzenberg's whining (http://chip.typepad.com/weblog/2011/09/why-glusterfs-is-glusterfsckd-too.html) on the same day got me to thinking about split-brain situations, and it seems like we could make one small improvement with potentially large effects.  Many systems of this sort avoid split-brain inconsistency by enforcing quorum; after a network partition (or series of single-node failures) only a group containing a majority of the total retains the ability to write.  Since there can only be one such group at a time, this prevents writes occurring on both sides of the partition and (I believe) would avoid all but the most exotic split-brain problems.

It should be fairly easy to add a check at the top of each modifying operation in AFR, such that if a majority of the subvolumes are not available then EROFS is returned before attempting the operation.  "Majority" here could be exactly half if that includes the first subvolume, retaining the "there can be only one" property while supporting the common and important case of N=2.  I believe the list of currently-available subvolumes is already maintained in afr_notify, and I'll even volunteer to add the checks plus option processing to turn the new behavior on/off.  Any reason I shouldn't?
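For illustration, here is a minimal C sketch of that check. It is not the actual AFR code; the names (quorum_met, check_quorum_before_write, child_up) are made up for this example, and the rule is exactly the one described above: a strict majority, or exactly half as long as the first subvolume is part of it.

#include <stdbool.h>
#include <stdio.h>
#include <errno.h>

/* Quorum is met if more than half of the subvolumes are up, or exactly
 * half as long as the first subvolume is one of them. */
static bool
quorum_met (const bool *child_up, int child_count)
{
        int up = 0;

        for (int i = 0; i < child_count; i++)
                if (child_up[i])
                        up++;

        if (2 * up > child_count)
                return true;                    /* strict majority */
        if (2 * up == child_count && child_up[0])
                return true;                    /* tie broken by first subvolume */
        return false;
}

/* Hypothetical check at the top of each modifying FOP: fail fast with
 * EROFS instead of letting the write proceed on a minority side. */
static int
check_quorum_before_write (const bool *child_up, int child_count)
{
        return quorum_met (child_up, child_count) ? 0 : -EROFS;
}

int
main (void)
{
        bool up[2] = { true, false };   /* N=2, only the first subvolume up */

        printf ("first-only:  %d\n", check_quorum_before_write (up, 2)); /* 0 */
        up[0] = false; up[1] = true;    /* only the second subvolume up */
        printf ("second-only: %d\n", check_quorum_before_write (up, 2)); /* -EROFS */
        return 0;
}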

Comment 1 Jeff Darcy 2011-09-20 10:55:22 UTC
FWIW, I've pushed my local patch for this to Gerrit.  It's not fully baked, but it probably does a better job than mere words of explaining where I think we need to go with this.

http://review.gluster.com/#change,473

Comment 2 Vijay Bellur 2011-09-21 07:49:55 UTC
Some questions:

1) Given that one of the use cases for AFR is to handle N-1 failures, would it be better to make this behavior optional? Or have the quorum number configurable with a default value of 1? 

2) How do we expect to handle split-brains that may arise when a modify FOP is allowed and one or more child-down events are sensed only later, before the modify FOP reaches the server? The chance of this happening is not negligible, given that we need the ping-timeout interval to elapse before a server is considered unreachable, unless an RPC disconnection is sensed.

Comment 3 Jeff Darcy 2011-09-21 10:15:36 UTC
(In reply to comment #2)
> Some questions:
> 
> 1) Given that one of the use cases for AFR is to handle N-1 failures, would it
> be better to make this behavior optional? Or have the quorum number
> configurable with a default value of 1? 
> 
> 2) How do we expect to handle split-brains that may arise when a modify FOP is
> allowed and one or more child-down events are sensed only later, before the
> modify FOP reaches the server? The chance of this happening is not negligible,
> given that we need the ping-timeout interval to elapse before a server is
> considered unreachable, unless an RPC disconnection is sensed.

(1) Yes, it absolutely should be optional.  Joe actually suggested there should be three options: no quorum enforcement, quorum enforcement for writes, quorum enforcement for everything.

(2) If a modify FOP is allowed (according to quorum rules) and subsequently fails, that ends up being the same case as if quorum had never been enforced.  This does mean there's still a slight chance of split brain, but it should be much reduced - the window is approximately 30 seconds to detect a partition vs. potentially hours (even days) that the partition might persist.
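As a rough sketch of the three enforcement levels mentioned in (1), again with made-up names rather than the shipped option values:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical three-way quorum option suggested above; the enum and
 * function names are illustrative only. */
typedef enum {
        QUORUM_NONE = 0,        /* never enforce quorum            */
        QUORUM_WRITES,          /* enforce only for modifying FOPs */
        QUORUM_ALL              /* enforce for every FOP           */
} quorum_mode_t;

static bool
should_enforce (quorum_mode_t mode, bool is_modifying_fop)
{
        switch (mode) {
        case QUORUM_ALL:
                return true;
        case QUORUM_WRITES:
                return is_modifying_fop;
        default:
                return false;
        }
}

int
main (void)
{
        printf ("writes mode, read  FOP: %d\n", should_enforce (QUORUM_WRITES, false)); /* 0 */
        printf ("writes mode, write FOP: %d\n", should_enforce (QUORUM_WRITES, true));  /* 1 */
        return 0;
}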

Comment 4 Anand Avati 2011-11-20 12:34:38 UTC
CHANGE: http://review.gluster.com/473 (Change-Id: I2f123ef93989862aa796903a45682981d5d7fc3c) merged in master by Vijay Bellur (vijay)

Comment 5 Raghavendra Bhat 2012-06-04 10:29:42 UTC
Checked with glusterfs-3.3.0qa45: quorum enforcement works properly, with EROFS being propagated for any modify operation once quorum is lost.

root@hyperspace:/mnt/client# gluster volume set mirror quorum-type auto
Set volume successful
root@hyperspace:/mnt/client# gluster volume set mirror quorum-count 2
Set volume successful
root@hyperspace:/mnt/client# cd
root@hyperspace:~# cd -
/mnt/client
root@hyperspace:/mnt/client# 
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# dd if=/dev/urandom of=k bs=10k count=22
dd: opening `k': Read-only file system
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# dd if=k of=/tmp/kkk bs=10k count=22
dd: opening `k': No such file or directory
root@hyperspace:/mnt/client# touch new
touch: cannot touch `new': Read-only file system
root@hyperspace:/mnt/client# gluster volume info mirror
 
Volume Name: mirror
Type: Replicate
Volume ID: 3382aaa7-37d0-4fab-bd3c-dc9a7a350acf
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: hyperspace:/mnt/sda7/export3
Brick2: hyperspace:/mnt/sda8/export3
Brick3: hyperspace:/mnt/sda7/last35
Options Reconfigured:
cluster.quorum-type: auto
cluster.quorum-count: 2
features.lock-heal: on
features.quota: on
features.limit-usage: /:22GB
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
geo-replication.indexing: on
performance.stat-prefetch: on
root@hyperspace:/mnt/client#
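For completeness, a small hypothetical probe that automates the manual check above: with quorum lost, creating a file on the mount should fail with EROFS. The mount path /mnt/client is taken from the transcript; the program itself is only a sketch, not part of the test suite.

#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int
main (void)
{
        /* Try to create a file on the (assumed) client mount. */
        int fd = open ("/mnt/client/quorum-probe", O_CREAT | O_WRONLY, 0644);

        if (fd < 0 && errno == EROFS) {
                printf ("quorum enforcement active: got EROFS as expected\n");
                return 0;
        }
        if (fd >= 0) {
                close (fd);
                unlink ("/mnt/client/quorum-probe");
        }
        printf ("volume still writable: quorum not enforced or still met\n");
        return 1;
}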