Bug 1109025

Summary:	[Nagios] Cluster services show weird behavior when some nodes in the cluster were taken down
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Shruti Sampat <ssampat>
Component:	gluster-nagios-addons	Assignee:	Nishanth Thomas <nthomas>
Status:	CLOSED ERRATA	QA Contact:	Shruti Sampat <ssampat>
Severity:	high	Docs Contact:
Priority:	high
Version:	rhgs-3.0	CC:	esammons, kmayilsa, nthomas, rhsc-qe-bugs
Target Milestone:	---
Target Release:	RHGS 3.0.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	gluster-nagios-addons-0.1.4-1.el6rhs, nagios-server-addons-0.1.4-1.el6rhs	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-09-22 19:11:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Shruti Sampat 2014-06-13 05:22:10 UTC

Description of problem:
------------------------

While testing server-side quorum for volumes, some nodes in the cluster were taken down. After that, several Nagios services for the cluster reported unexpected or unstable states.

For example, the volume status service and the volume self-heal service both started flapping between warning and critical states. The host representing the cluster itself went down. Volume utilization was critical, and cluster utilization was unknown.

Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Saw it once.

Steps to Reproduce:
1. Create a couple of volumes with server quorum configured and start monitoring these volumes.
2. Take down some nodes in the cluster such that quorum is lost.

Actual results:
The services of the cluster behave as described above.

Expected results:
If some nodes in the cluster are down, only the quorum service should be critical; other services should not be affected.

Additional info:

Comment 1 Shruti Sampat 2014-06-16 07:12:07 UTC

Another observation: the status of the cluster auto-configuration service changes to WARNING, with status information reading 'null', when a couple of nodes are powered off. It returns to OK when the nodes are brought back up.

Comment 2 Kanagaraj 2014-06-16 10:05:19 UTC

Patch - http://review.gluster.org/#/c/8061/

Comment 4 Shruti Sampat 2014-06-18 10:50:10 UTC

Verified as fixed in gluster-nagios-addons-0.1.4-1.el6rhs.x86_64, nagios-server-addons-0.1.4-1.el6rhs.x86_64

Performed the following steps -

1. Created a cluster of 7 RHS nodes, created a distributed-replicate volume with server-side quorum enabled and server-quorum-ratio set to 80%.
2. Brought down 2 of the RHS nodes, causing quorum to be lost for the volume.
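
The arithmetic behind step 2 can be sketched as follows. This is a minimal illustration only, assuming that server quorum is met when the percentage of active nodes is at least the configured server-quorum-ratio; the function name quorum_met is hypothetical and is not taken from glusterd or the Nagios addons.

```python
def quorum_met(total_nodes: int, active_nodes: int, ratio_percent: int) -> bool:
    """Return True if the active nodes satisfy the quorum ratio.

    Assumption (not taken from the glusterd source): quorum holds when
    active_nodes / total_nodes >= ratio_percent / 100. Integer arithmetic
    is used to avoid floating-point rounding.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return active_nodes * 100 >= ratio_percent * total_nodes


# Setup from this comment: 7 nodes, ratio 80%, 2 nodes brought down.
total, down, ratio = 7, 2, 80
active = total - down
print(f"{active}/{total} nodes active, quorum met: {quorum_met(total, active, ratio)}")
```

Under this assumption, 5 of 7 active nodes is roughly 71%, which is below 80%, so quorum is lost. With only 1 node down (6 of 7, roughly 86%) quorum would still hold.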

The following results were seen -

Cluster - Quorum service was critical as quorum was lost for the volume.
Volume Utilization was unknown as the volume was down, because of quorum not being met.
Volume status was critical as all bricks of the volume were down, owing to quorum not being met.
Volume Self-Heal was in warning state as self-heal status could not be determined.
Cluster utilization was unknown as volume utilization was unknown.
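
For anyone repeating this verification, the results above can be written down as a small checklist. The sketch below uses the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The dictionary keys are labels taken from this comment, not the exact service descriptions used in the Nagios configuration, and the helper mismatches is hypothetical.

```python
from enum import IntEnum


class NagiosState(IntEnum):
    """Standard Nagios plugin return codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# States reported in this comment after quorum was lost.
# Keys are descriptive labels, not real Nagios service names.
EXPECTED_AFTER_QUORUM_LOSS = {
    "Cluster - Quorum": NagiosState.CRITICAL,
    "Volume Utilization": NagiosState.UNKNOWN,
    "Volume Status": NagiosState.CRITICAL,
    "Volume Self-Heal": NagiosState.WARNING,
    "Cluster Utilization": NagiosState.UNKNOWN,
}


def mismatches(observed: dict) -> dict:
    """Return {label: (observed_state, expected_state)} for every
    service whose observed state differs from the expected one."""
    return {
        label: (observed.get(label), expected)
        for label, expected in EXPECTED_AFTER_QUORUM_LOSS.items()
        if observed.get(label) != expected
    }


# Example: volume status reports WARNING instead of CRITICAL,
# as it did intermittently while flapping in the original report.
observed = dict(EXPECTED_AFTER_QUORUM_LOSS)
observed["Volume Status"] = NagiosState.WARNING
for label, (got, want) in mismatches(observed).items():
    print(f"{label}: observed {got.name}, expected {want.name}")
```

An empty result from mismatches means the observed states match the ones listed in this comment.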

Marking as VERIFIED.

Comment 5 Shruti Sampat 2014-06-18 10:56:45 UTC

One more observation: the host representing the cluster in the Nagios UI is down because all volumes are critical, which is expected behavior.

Comment 6 errata-xmlrpc 2014-09-22 19:11:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1277.html