The coincidence of an IRC conversation and Chip Salzenberg's whining (http://chip.typepad.com/weblog/2011/09/why-glusterfs-is-glusterfsckd-too.html) on the same day got me thinking about split-brain situations, and it seems we could make one small improvement with potentially large effects.

Many systems of this sort avoid split-brain inconsistency by enforcing quorum: after a network partition (or a series of single-node failures), only a group containing a majority of the total nodes retains the ability to write. Since there can be only one such group at a time, this prevents writes from occurring on both sides of the partition and (I believe) would avoid all but the most exotic split-brain problems.

It should be fairly easy to add a check at the top of each modifying operation in AFR, so that if a majority of the subvolumes are not available, EROFS is returned before the operation is attempted. "Majority" here could be exactly half if that half includes the first subvolume, retaining the "there can be only one" property while still supporting the common and important case of N=2. I believe the list of currently-available subvolumes is already maintained in afr_notify, and I'll even volunteer to add the checks plus the option processing to turn the new behavior on/off. Any reason I shouldn't?
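For illustration, the proposed check could look roughly like the sketch below. This is not the actual AFR code; the function name and the boolean-array representation of subvolume state are assumptions made for the example. It shows the two-part rule: a strict majority always has quorum, and exactly half has quorum only if it includes the first subvolume, which preserves the "there can be only one" property for N=2.

```c
#include <stdbool.h>

/* Hypothetical sketch of the proposed quorum rule (not actual AFR code).
 * up[i] is true if subvolume i is currently reachable. */
static bool
afr_have_quorum (const bool *up, int child_count)
{
        int i, up_count = 0;

        for (i = 0; i < child_count; i++)
                if (up[i])
                        up_count++;

        if (2 * up_count > child_count)
                return true;    /* strict majority of subvolumes */
        if (2 * up_count == child_count && up[0])
                return true;    /* exactly half, tie-break on first subvol */
        return false;           /* no quorum: modifying FOPs get EROFS */
}
```

With N=2 the tie-break means the side holding the first subvolume keeps write access after a partition while the other side does not, so at most one side can ever write.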
FWIW, I've pushed my local patch for this to Gerrit. It's not fully baked, but it probably does a better job than mere words can of explaining where I think we need to go with this. http://review.gluster.com/#change,473
Some questions:

1) Given that one of the use cases for AFR is to handle N-1 failures, would it be better to make this behavior optional? Or to have the quorum number configurable, with a default value of 1?

2) How do we expect to handle split-brains that may arise when a modify FOP is allowed but one or more children going down is sensed only later, before the modify FOP reaches the server? The chance of this happening is not negligible, given that detecting an unreachable server requires the full ping-timeout interval unless an RPC disconnection is sensed.
(In reply to comment #2)
> Some questions:
>
> 1) Given that one of the use cases for AFR is to handle N-1 failures, would it
> be better to make this behavior optional? Or have the quorum number
> configurable with a default value of 1?
>
> 2) How do we expect to handle split-brains that may arise out of the situation
> where a modify FOP is allowed and child down(s) is/are sensed later before the
> modify FOP reaches the server? The chance of this happening is not very low
> given that we require ping-timeout interval to determine a server being
> unreachable unless a rpc disconnection is sensed.

(1) Yes, it absolutely should be optional. Joe actually suggested there should be three options: no quorum enforcement, quorum enforcement for writes only, and quorum enforcement for everything.

(2) If a modify FOP is allowed (according to quorum rules) and subsequently fails, that ends up being the same case as if quorum had never been enforced. This does mean there's still a slight chance of split-brain, but it should be much reduced: the window is approximately 30 seconds to detect a partition, versus the hours (or even days) that the partition might persist.
CHANGE: http://review.gluster.com/473 (Change-Id: I2f123ef93989862aa796903a45682981d5d7fc3c) merged in master by Vijay Bellur (vijay)
Checked with glusterfs-3.3.0qa45: quorum enforcement works properly, with the EROFS error being propagated for any modify operation.

root@hyperspace:/mnt/client# gluster volume set mirror quorum-type auto
Set volume successful
root@hyperspace:/mnt/client# gluster volume set mirror quorum-count 2
Set volume successful
root@hyperspace:/mnt/client# cd
root@hyperspace:~# cd -
/mnt/client
root@hyperspace:/mnt/client#
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# dd if=/dev/urandom of=k bs=10k count=22
dd: opening `k': Read-only file system
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# ls
root@hyperspace:/mnt/client# dd if=k of=/tmp/kkk bs=10k count=22
dd: opening `k': No such file or directory
root@hyperspace:/mnt/client# touch new
touch: cannot touch `new': Read-only file system
root@hyperspace:/mnt/client# gluster volume info mirror

Volume Name: mirror
Type: Replicate
Volume ID: 3382aaa7-37d0-4fab-bd3c-dc9a7a350acf
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: hyperspace:/mnt/sda7/export3
Brick2: hyperspace:/mnt/sda8/export3
Brick3: hyperspace:/mnt/sda7/last35
Options Reconfigured:
cluster.quorum-type: auto
cluster.quorum-count: 2
features.lock-heal: on
features.quota: on
features.limit-usage: /:22GB
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
geo-replication.indexing: on
performance.stat-prefetch: on
root@hyperspace:/mnt/client#