Description of problem:
=======================
Had a 5-node cluster (n1, n2, n3, n4 and n5) with one distributed volume and server quorum enabled. Stopped glusterd on 3 nodes (n3, n4 and n5) and checked the volume status on node n1: the bricks were offline. Then restarted glusterd on that node (n1) and checked the volume status again: this time the bricks were online.

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.5-15

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Have a 5-node cluster with one distributed volume
2. Enable server quorum
3. Bring down 3 nodes (e.g. n3, n4 and n5)
4. Check the volume status on node 1 (n1) // bricks will be in offline state
5. Restart glusterd on node 1
6. Check the volume status // bricks will be in online state

Actual results:
===============
Bricks are online when server quorum is not met

Expected results:
=================
Bricks should be in offline state when server quorum is not met

Additional info:
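The reproduction steps above can be sketched as a shell session. This is a hedged sketch, not the reporter's exact commands: the volume name distvol is taken from the status output later in this report, node names follow the description, and systemctl assumes a systemd-managed glusterd. The commands are wrapped in a guarded function so the sketch is a no-op on a machine without the gluster CLI.

```shell
#!/bin/sh
# Hedged sketch of the reproduction steps. Assumes a 5-node pool (n1..n5)
# already peered and a distributed volume named distvol (name taken from
# this report); glusterd is assumed to be managed by systemd.

repro() {
    # Step 2: enable server-side quorum on the volume.
    gluster volume set distvol cluster.server-quorum-type server

    # Step 3: on n3, n4 and n5 (run there, not on n1):
    #   systemctl stop glusterd

    # Step 4: on n1, bricks should now show Online "N".
    gluster volume status distvol

    # Steps 5-6: restart glusterd on n1 and re-check; with the bug
    # present, the local brick comes back with Online "Y" even though
    # quorum is still lost.
    systemctl restart glusterd
    sleep 5
    gluster volume status distvol
}

# Guard: only meaningful on a cluster node with the gluster CLI installed.
if command -v gluster >/dev/null 2>&1; then
    repro
else
    echo "gluster CLI not found; run this on a cluster node"
fi
```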
I have tested the same and found a possible hint: while restarting, glusterd finds that server quorum is not met and kills the brick. This is evident from the logs and the glusterfsd PID:

<snip>
[2016-01-13 13:53:58.238048] C [MSGID: 106002] [glusterd-server-quorum.c:351:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume distvol. Stopping local bricks.
[2016-01-13 13:53:58.238707] D [MSGID: 0] [glusterd-utils.c:5611:glusterd_brick_stop] 0-management: About to stop glusterfs for brick dhcp37-152.lab.eng.blr.redhat.com:/rhs/brick1/b1
[2016-01-13 13:53:58.238836] D [MSGID: 0] [glusterd-utils.c:1531:glusterd_service_stop] 0-management: Stopping gluster brick running in pid: 7653
[2016-01-13 13:53:58.238902] D [MSGID: 0] [glusterd-utils.c:4952:glusterd_set_brick_status] 0-glusterd: Setting brick dhcp37-152.lab.eng.blr.redhat.com:/rhs/brick1/b1 status to stopped
[2016-01-13 13:53:58.239078] D [MSGID: 0] [glusterd-utils.c:5622:glusterd_brick_stop] 0-management: returning 0
</snip>

From the above snippet, you can see that pid 7653 is killed. However, the gluster volume status output shows a different pid:

# gluster volume status distvol
Status of volume: distvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-152.lab.eng.blr.redhat.com:/rh
s/brick1/b1                                 49153     0          Y       7894
NFS Server on localhost                     2049      0          Y       7879
NFS Server on dhcp37-53.lab.eng.blr.redhat.
com                                         2049      0          Y       14428

Task Status of Volume distvol
------------------------------------------------------------------------------
There are no active volume tasks

This means the brick was somehow started again after glusterd killed it.
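The PID mismatch above can be checked mechanically. A minimal sketch: the two sample lines (the glusterd log line and the brick's status line) are embedded here verbatim from this report so the snippet is self-contained; against a live cluster you would grep the real glusterd log and the real `gluster volume status` output instead.

```shell
#!/bin/sh
# Hypothetical cross-check of the two PIDs: the PID glusterd logged
# killing vs. the PID shown by `gluster volume status`. Sample lines are
# embedded from this bug report so the snippet runs standalone.

killed_log='[glusterd-utils.c:1531:glusterd_service_stop] 0-management: Stopping gluster brick running in pid: 7653'
status_line='s/brick1/b1                                 49153     0          Y       7894'

# PID glusterd killed: everything after "pid: " in the log line.
killed_pid=$(printf '%s\n' "$killed_log" | awk -F'pid: ' '{print $2}')

# PID currently reported for the brick: last column of the status line.
status_pid=$(printf '%s\n' "$status_line" | awk '{print $NF}')

if [ "$killed_pid" != "$status_pid" ]; then
    echo "mismatch: glusterd killed pid $killed_pid but brick now runs as pid $status_pid"
fi
```

With the sample lines above, this prints a mismatch between 7653 and 7894, which is exactly the symptom: a new brick process appeared after glusterd killed the old one.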
An upstream patch is posted: http://review.gluster.org/13236
The fix is now available in the rhgs-3.1.3 branch, hence moving the state to Modified.
Verified this bug using the build "glusterfs-3.7.9-1". Repeated the reproduction steps mentioned in the description section; the fix works properly, and bricks do not start after a glusterd restart when server quorum is not met. Moving to Verified based on the above.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240