Bug 1109025

Summary: [Nagios] Cluster services show weird behavior when some nodes in the cluster were taken down
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shruti Sampat <ssampat>
Component: gluster-nagios-addons
Assignee: Nishanth Thomas <nthomas>
Status: CLOSED ERRATA
QA Contact: Shruti Sampat <ssampat>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.0
CC: esammons, kmayilsa, nthomas, rhsc-qe-bugs
Target Milestone: ---
Target Release: RHGS 3.0.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: gluster-nagios-addons-0.1.4-1.el6rhs, nagios-server-addons-0.1.4-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-09-22 19:11:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Shruti Sampat 2014-06-13 05:22:10 UTC
Description of problem:
------------------------

While testing server-side quorum for volumes, some nodes in the cluster were taken down. After that, several monitoring services in the cluster behaved unexpectedly.

For example, the volume status and volume self-heal services started flapping between warning and critical states. The host representing the cluster itself went down. Volume utilization was critical and cluster utilization was unknown.

Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Saw it once.

Steps to Reproduce:
1. Create a couple of volumes with server quorum configured and start monitoring these volumes.
2. Take down some nodes in the cluster such that quorum is lost.

Actual results:
The services of the cluster behave as described above.

Expected results:
If some nodes in the cluster are down, only the quorum service should be critical; other services should not be affected.

Additional info:

Comment 1 Shruti Sampat 2014-06-16 07:12:07 UTC
Another observation: the status of the cluster auto-configuration service changes to WARNING, with the status information reading 'null', when a couple of nodes are powered off. It returns to OK when the nodes are brought back up.

Comment 2 Kanagaraj 2014-06-16 10:05:19 UTC
Patch - http://review.gluster.org/#/c/8061/

Comment 4 Shruti Sampat 2014-06-18 10:50:10 UTC
Verified as fixed in gluster-nagios-addons-0.1.4-1.el6rhs.x86_64, nagios-server-addons-0.1.4-1.el6rhs.x86_64

Performed the following steps -

1. Created a cluster of 7 RHS nodes, created a distributed-replicate volume with server-side quorum enabled and server-quorum-ratio set to 80%.
2. Brought down 2 of the RHS nodes, causing quorum to be lost for the volume.
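
The quorum arithmetic behind these two steps can be sketched as follows. This is only an illustration, assuming that server quorum is met when the percentage of active nodes is at least the configured server-quorum-ratio; the function names quorum_met and min_active_nodes are hypothetical and are not part of gluster or the Nagios add-ons.

```python
import math


def quorum_met(total_nodes: int, active_nodes: int, ratio_percent: float) -> bool:
    """Assumed rule: quorum holds if active nodes reach the ratio percentage."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    # Cross-multiply instead of dividing, to avoid float rounding at the boundary.
    return active_nodes * 100 >= ratio_percent * total_nodes


def min_active_nodes(total_nodes: int, ratio_percent: float) -> int:
    """Smallest number of active nodes that satisfies the assumed rule."""
    return math.ceil(total_nodes * ratio_percent / 100)


if __name__ == "__main__":
    # Values from the verification steps: 7 nodes, ratio 80%, 2 nodes down.
    print(quorum_met(7, 5, 80))     # False: 5 of 7 is about 71.4%, below 80%
    print(min_active_nodes(7, 80))  # 6
```

Under this assumption, a 7-node cluster with an 80% ratio tolerates one node being down, and losing a second node drops it below quorum.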

The following results were seen -

Cluster - Quorum service was critical as quorum was lost for the volume.
Volume Utilization was unknown as the volume was down, because of quorum not being met.
Volume status was critical as all bricks of the volume were down, owing to quorum not being met.
Volume Self-Heal was in warning state as self-heal status could not be determined.
Cluster utilization was unknown as volume utilization was unknown.

Marking as VERIFIED.

Comment 5 Shruti Sampat 2014-06-18 10:56:45 UTC
One more observation: the host representing the cluster in the Nagios UI is down because all volumes are critical, which is expected behavior.
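
The states verified in comments 4 and 5 can be summarized in a small sketch. It uses the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The roll-up functions cluster_utilization and cluster_host_is_up are hypothetical and only mirror the behavior described in this report; they are not the actual add-on code.

```python
from enum import IntEnum


class State(IntEnum):
    """Standard Nagios plugin return codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# Service states reported in comment 4 after quorum was lost.
OBSERVED_ON_QUORUM_LOSS = {
    "Cluster - Quorum": State.CRITICAL,
    "Volume Utilization": State.UNKNOWN,
    "Volume Status": State.CRITICAL,
    "Volume Self-Heal": State.WARNING,
}


def cluster_utilization(volume_utilization_states):
    """Hypothetical roll-up: UNKNOWN if any volume is UNKNOWN, else the worst state."""
    states = list(volume_utilization_states)
    if not states or State.UNKNOWN in states:
        return State.UNKNOWN
    return max(states)


def cluster_host_is_up(volume_status_states):
    """Hypothetical roll-up: the cluster host is down only if every volume is CRITICAL."""
    states = list(volume_status_states)
    all_critical = bool(states) and all(s == State.CRITICAL for s in states)
    return not all_critical


if __name__ == "__main__":
    vol_util = [OBSERVED_ON_QUORUM_LOSS["Volume Utilization"]]
    vol_status = [OBSERVED_ON_QUORUM_LOSS["Volume Status"]]
    print(cluster_utilization(vol_util).name)  # UNKNOWN
    print(cluster_host_is_up(vol_status))      # False
```

With the single quorum-lost volume from the verification run, the sketch yields an UNKNOWN cluster utilization and a down cluster host, matching the observations above.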

Comment 6 errata-xmlrpc 2014-09-22 19:11:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1277.html