The coincidence of an IRC conversation and Chip Salzenberg's whining (http://chip.typepad.com/weblog/2011/09/why-glusterfs-is-glusterfsckd-too.html) on the same day got me thinking about split-brain situations, and it seems we could make one small improvement with potentially large effects.

Many systems of this sort avoid split-brain inconsistency by enforcing quorum: after a network partition (or a series of single-node failures), only a group containing a majority of the total nodes retains the ability to write. Since there can be only one such group at a time, this prevents writes from occurring on both sides of the partition and (I believe) would avoid all but the most exotic split-brain problems.

It should be fairly easy to add a check at the top of each modifying operation in AFR, so that if a majority of the subvolumes are not available, EROFS is returned before the operation is attempted. "Majority" here could be exactly half if that half includes the first subvolume, retaining the "there can be only one" property while still supporting the common and important case of N=2. I believe the list of currently-available subvolumes is already maintained in afr_notify, and I'll even volunteer to add the checks plus the option processing to turn the new behavior on/off. Any reason I shouldn't?
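For illustration, the proposed check could look roughly like the sketch below. This is not the actual AFR code; the function name and the boolean-array representation of subvolume state are assumptions made for the example. It shows the two-part rule: a strict majority always has quorum, and exactly half has quorum only if it includes the first subvolume, which preserves the "there can be only one" property for N=2.

```c
#include <stdbool.h>

/* Hypothetical sketch of the proposed quorum rule (not actual AFR code).
 * up[i] is true if subvolume i is currently reachable. */
static bool
afr_have_quorum (const bool *up, int child_count)
{
        int i, up_count = 0;

        for (i = 0; i < child_count; i++)
                if (up[i])
                        up_count++;

        if (2 * up_count > child_count)
                return true;    /* strict majority of subvolumes */
        if (2 * up_count == child_count && up[0])
                return true;    /* exactly half, tie-break on first subvol */
        return false;           /* no quorum: modifying FOPs get EROFS */
}
```

With N=2 the tie-break means the side holding the first subvolume keeps write access after a partition while the other side does not, so at most one side can ever write.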
FWIW, I've pushed my local patch for this to Gerrit. It's not fully baked, but it probably does a better job than mere words can of explaining where I think we need to go with this. http://review.gluster.com/#change,473
Some questions:

1) Given that one of the use cases for AFR is to handle N-1 failures, would it be better to make this behavior optional? Or to have the quorum number configurable, with a default value of 1?

2) How do we expect to handle split-brains that may arise when a modify FOP is allowed but one or more children going down is sensed only later, before the modify FOP reaches the server? The chance of this happening is not negligible, given that detecting an unreachable server requires the full ping-timeout interval unless an RPC disconnection is sensed.
(In reply to comment #2)
> Some questions:
>
> 1) Given that one of the use cases for AFR is to handle N-1 failures, would it
> be better to make this behavior optional? Or have the quorum number
> configurable with a default value of 1?
>
> 2) How do we expect to handle split-brains that may arise out of the situation
> where a modify FOP is allowed and child down(s) is/are sensed later before the
> modify FOP reaches the server? The chance of this happening is not very low
> given that we require ping-timeout interval to determine a server being
> unreachable unless a rpc disconnection is sensed.

(1) Yes, it absolutely should be optional. Joe actually suggested there should be three options: no quorum enforcement, quorum enforcement for writes only, and quorum enforcement for everything.

(2) If a modify FOP is allowed (according to quorum rules) and subsequently fails, that ends up being the same case as if quorum had never been enforced. This does mean there's still a slight chance of split-brain, but it should be much reduced: the window is approximately 30 seconds to detect a partition, versus the hours (or even days) that the partition might persist.
CHANGE: http://review.gluster.com/473 (Change-Id: I2f123ef93989862aa796903a45682981d5d7fc3c) merged in master by Vijay Bellur (vijay)
Checked with glusterfs-3.3.0qa45: quorum enforcement works properly, with the EROFS error being propagated for any modify operation.

root@hyperspace:/mnt/client# gluster volume set mirror quorum-type auto
Set volume successful
root@hyperspace:/mnt/client# gluster volume set mirror quorum-count 2
Set volume successful
root@hyperspace:/mnt/client# cd
root@hyperspace:~# cd -
/mnt/client
root@hyperspace:/mnt/client#
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# dd if=/dev/urandom of=k bs=10k count=22
dd: opening `k': Read-only file system
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# dd if=k of=/tmp/kkk bs=10k count=22
dd: opening `k': No such file or directory
root@hyperspace:/mnt/client# touch new
touch: cannot touch `new': Read-only file system
root@hyperspace:/mnt/client# gluster volume info mirror

Volume Name: mirror
Type: Replicate
Volume ID: 3382aaa7-37d0-4fab-bd3c-dc9a7a350acf
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: hyperspace:/mnt/sda7/export3
Brick2: hyperspace:/mnt/sda8/export3
Brick3: hyperspace:/mnt/sda7/last35
Options Reconfigured:
cluster.quorum-type: auto
cluster.quorum-count: 2
features.lock-heal: on
features.quota: on
features.limit-usage: /:22GB
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
geo-replication.indexing: on
performance.stat-prefetch: on
root@hyperspace:/mnt/client#