Description of problem:
------------------------
While testing server-side quorum for volumes, some nodes in the cluster were taken down. After the nodes were taken down, services in the cluster showed erratic behavior. For example, the volume status service started flapping between WARNING and CRITICAL states, the host representing the cluster itself went DOWN, and the volume self-heal service also started flapping between WARNING and CRITICAL states. Volume utilization was CRITICAL, and cluster utilization was UNKNOWN.

Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Seen once.

Steps to Reproduce:
1. Create a couple of volumes with server-side quorum configured and start monitoring them (see the example sketch under Additional info below).
2. Take down enough nodes in the cluster that quorum is lost.

Actual results:
The cluster services behave as described above.

Expected results:
If some nodes in the cluster are down, only the quorum service should turn CRITICAL; the other services should not be affected.

Additional info:
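For reference, the quorum setup in step 1 can be done roughly as follows. This is only a sketch: the volume name, hostnames, brick paths, and the 51% ratio are hypothetical, not necessarily the exact values used during testing.

    # Create and start a 2x2 distributed-replicate volume
    # (hostnames and brick paths are hypothetical).
    gluster volume create testvol replica 2 \
        rhs1:/bricks/b1 rhs2:/bricks/b1 rhs3:/bricks/b2 rhs4:/bricks/b2
    gluster volume start testvol

    # Enable server-side quorum for the volume.
    gluster volume set testvol cluster.server-quorum-type server

    # Optionally set the cluster-wide quorum ratio (percentage of peers
    # that must be up for the bricks to stay online).
    gluster volume set all cluster.server-quorum-ratio 51%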
Another observation: the status of the cluster auto-configuration service changes to WARNING, with the status information reading 'null', when a couple of nodes are powered off. It returns to OK when the nodes are brought back up.
Patch - http://review.gluster.org/#/c/8061/
Verified as fixed in gluster-nagios-addons-0.1.4-1.el6rhs.x86_64 and nagios-server-addons-0.1.4-1.el6rhs.x86_64.

Performed the following steps:
1. Created a cluster of 7 RHS nodes, created a distributed-replicate volume with server-side quorum enabled and server-quorum-ratio set to 80%.
2. Brought down 2 of the RHS nodes, causing quorum to be lost for the volume.

The following results were seen:
- Cluster - Quorum service was CRITICAL, as quorum was lost for the volume.
- Volume Utilization was UNKNOWN, as the volume was down because quorum was not met.
- Volume Status was CRITICAL, as all bricks of the volume were down owing to quorum not being met.
- Volume Self-Heal was in WARNING state, as the self-heal status could not be determined.
- Cluster Utilization was UNKNOWN, as volume utilization was unknown.

Marking as VERIFIED.
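For context on the quorum arithmetic in step 2: with a 7-node cluster and cluster.server-quorum-ratio set to 80%, taking down 2 nodes leaves 5 of 7 peers up (about 71%), which is below the ratio, so glusterd takes the volume's bricks offline. A rough sketch of the commands involved follows; the volume name is hypothetical, and stopping glusterd is only a stand-in for powering the nodes off.

    # Raise the cluster-wide server-quorum ratio to 80%
    # (applies to all volumes with server-side quorum enabled).
    gluster volume set all cluster.server-quorum-ratio 80%

    # Simulate a node going down on two of the seven RHS nodes (run on each).
    service glusterd stop

    # 5 of 7 peers remain (~71%), below the 80% ratio, so the bricks are
    # stopped and Nagios reports the quorum service as CRITICAL.
    gluster volume status testvol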
One more observation: the host representing the cluster itself in the Nagios UI is shown as DOWN because all volumes are CRITICAL, which is expected behavior.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1277.html