Description of problem:

In a 2x2 setup with a server quorum ratio of 100%, if a network disruption happens between just two machines, quorum is lost only between those two machines, and they end up killing glusterfsd on their respective nodes. The cluster, however, remains operational in a degraded state despite the quorum loss.

For example, consider the setup:

[root@rhs-client19 ~]# gluster volume info

Volume Name: quo
Type: Distributed-Replicate
Volume ID: 96852dd0-e8f6-48f8-94e2-ef80e8c70778
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/home/A
Brick2: rhs-client20.lab.eng.blr.redhat.com:/home/B
Brick3: rhs-client21.lab.eng.blr.redhat.com:/home/C
Brick4: rhs-client23.lab.eng.blr.redhat.com:/home/D
Options Reconfigured:
cluster.server-quorum-type: server
cluster.server-quorum-ratio: 100
global-option-version: 35

[root@rhs-client19 ~]# gluster peer status
Number of Peers: 3

Hostname: rhs-client20.lab.eng.blr.redhat.com
Uuid: b7f33530-25c1-406c-8c76-2c5feabaf7b0
State: Peer in Cluster (Disconnected)

Hostname: rhs-client21.lab.eng.blr.redhat.com
Uuid: 5b315725-90dd-41f9-abe8-827d27db8210
State: Peer in Cluster (Connected)

Hostname: rhs-client23.lab.eng.blr.redhat.com
Uuid: 230ae9f2-310e-49a6-b9f6-440bb5962da3
State: Peer in Cluster (Connected)

===

[root@rhs-client20 ~]# gluster peer status
Number of Peers: 3

Hostname: rhs-client21.lab.eng.blr.redhat.com
Uuid: 5b315725-90dd-41f9-abe8-827d27db8210
State: Peer in Cluster (Connected)

Hostname: rhs-client23.lab.eng.blr.redhat.com
Uuid: 230ae9f2-310e-49a6-b9f6-440bb5962da3
State: Peer in Cluster (Connected)

Hostname: 10.70.36.43
Uuid: 772396e0-ccae-4b64-99f9-84f7e836d101
State: Peer in Cluster (Disconnected)

============

[root@rhs-client21 ~]# gluster peer status
Number of Peers: 3

Hostname: rhs-client23.lab.eng.blr.redhat.com
Uuid: 230ae9f2-310e-49a6-b9f6-440bb5962da3
State: Peer in Cluster (Connected)

Hostname: 10.70.36.43
Uuid: 772396e0-ccae-4b64-99f9-84f7e836d101
State: Peer in Cluster (Connected)

Hostname: rhs-client20.lab.eng.blr.redhat.com
Uuid: b7f33530-25c1-406c-8c76-2c5feabaf7b0
State: Peer in Cluster (Connected)

===============

[root@rhs-client23 ~]# gluster peer status
Number of Peers: 3

Hostname: rhs-client20.lab.eng.blr.redhat.com
Uuid: b7f33530-25c1-406c-8c76-2c5feabaf7b0
State: Peer in Cluster (Connected)

Hostname: rhs-client21.lab.eng.blr.redhat.com
Uuid: 5b315725-90dd-41f9-abe8-827d27db8210
State: Peer in Cluster (Connected)

Hostname: 10.70.36.43
Uuid: 772396e0-ccae-4b64-99f9-84f7e836d101
State: Peer in Cluster (Connected)

==================

In the above scenario, machines 19 and 20 are disconnected from each other, and the quorum ratio is 100%, so the disconnect between 19 and 20 breaks quorum. But the brick processes are killed only on 19 and 20, since the other nodes do not see this disconnect. I am not sure whether this is a bug or a limitation; either way, the mount point stays active, with the other replica pair still serving I/O.
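To reproduce the partition between exactly two nodes without pulling cables, firewall rules on one of them are enough. This is a minimal sketch, assuming iptables is in use and glusterd's default network timeouts; the hostnames match the setup above:

[root@rhs-client19 ~]# iptables -A INPUT  -s rhs-client20.lab.eng.blr.redhat.com -j DROP
[root@rhs-client19 ~]# iptables -A OUTPUT -d rhs-client20.lab.eng.blr.redhat.com -j DROP

Once the glusterd processes on 19 and 20 notice the disconnect, "gluster volume status quo" on each node should show the bricks down on 19 and 20 only, while 21 and 23 keep serving.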
Sac, I discussed this bug with Vijay. We are in agreement that this situation cannot be handled with the current implementation. Let's keep the bug open for now. We are not going to fix it for 2.0.z, though. Pranith.
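The reason the outcome looks inconsistent is that server quorum is evaluated locally: each glusterd checks the ratio against its own view of peer connectivity. The following is only an illustration of that per-node arithmetic (not actual glusterd code), reusing the commands from the report:

connected=$(gluster peer status | grep -c 'Peer in Cluster (Connected)')
total=$(gluster peer status | awk '/Number of Peers/ {print $4}')
# the local node itself counts toward quorum, hence the +1 on both sides
if [ $(( (connected + 1) * 100 / (total + 1) )) -lt 100 ]; then
    echo "local view: quorum lost, bricks on this node get killed"
fi

On 19 and 20 this comes out to (2+1)/4 = 75% < 100%, so each kills its bricks; on 21 and 23 it comes out to 4/4 = 100%, so they do nothing. No single node has a global view of the broken 19-20 link, which is why the current implementation cannot act on it cluster-wide.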
The product version of Red Hat Storage on which this issue was reported has reached End Of Life (EOL) [1], hence this bug report is being closed. If the issue is still observed on a current version of Red Hat Storage, please file a new bug report against the current version.

[1] https://rhn.redhat.com/errata/RHSA-2014-0821.html