Description of problem:
------------------------
While testing server-side quorum for volumes, some nodes in the cluster were taken down. After the nodes were taken down, services in the cluster showed erratic behavior. For example, the volume status service started flapping between WARNING and CRITICAL states, the host representing the cluster itself went DOWN, and the volume self-heal service also started flapping between WARNING and CRITICAL states. Volume utilization was CRITICAL, and cluster utilization was UNKNOWN.

Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Seen once.

Steps to Reproduce:
1. Create a couple of volumes with server-side quorum configured and start monitoring them (see the example sketch under Additional info below).
2. Take down enough nodes in the cluster that quorum is lost.

Actual results:
The cluster services behave as described above.

Expected results:
If some nodes in the cluster are down, only the quorum service should turn CRITICAL; the other services should not be affected.

Additional info:
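For reference, the quorum setup in step 1 can be done roughly as follows. This is only a sketch: the volume name, hostnames, brick paths, and the 51% ratio are hypothetical, not necessarily the exact values used during testing.

    # Create and start a 2x2 distributed-replicate volume
    # (hostnames and brick paths are hypothetical).
    gluster volume create testvol replica 2 \
        rhs1:/bricks/b1 rhs2:/bricks/b1 rhs3:/bricks/b2 rhs4:/bricks/b2
    gluster volume start testvol

    # Enable server-side quorum for the volume.
    gluster volume set testvol cluster.server-quorum-type server

    # Optionally set the cluster-wide quorum ratio (percentage of peers
    # that must be up for the bricks to stay online).
    gluster volume set all cluster.server-quorum-ratio 51%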
Another observation: the status of the cluster auto-configuration service changes to WARNING, with the status information reading 'null', when a couple of nodes are powered off. It returns to OK when the nodes are brought back up.
Patch - http://review.gluster.org/#/c/8061/
Verified as fixed in gluster-nagios-addons-0.1.4-1.el6rhs.x86_64 and nagios-server-addons-0.1.4-1.el6rhs.x86_64.

Performed the following steps:
1. Created a cluster of 7 RHS nodes, created a distributed-replicate volume with server-side quorum enabled and server-quorum-ratio set to 80%.
2. Brought down 2 of the RHS nodes, causing quorum to be lost for the volume.

The following results were seen:
- Cluster - Quorum service was CRITICAL, as quorum was lost for the volume.
- Volume Utilization was UNKNOWN, as the volume was down because quorum was not met.
- Volume Status was CRITICAL, as all bricks of the volume were down owing to quorum not being met.
- Volume Self-Heal was in WARNING state, as the self-heal status could not be determined.
- Cluster Utilization was UNKNOWN, as volume utilization was unknown.

Marking as VERIFIED.
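For context on the quorum arithmetic in step 2: with a 7-node cluster and cluster.server-quorum-ratio set to 80%, taking down 2 nodes leaves 5 of 7 peers up (about 71%), which is below the ratio, so glusterd takes the volume's bricks offline. A rough sketch of the commands involved follows; the volume name is hypothetical, and stopping glusterd is only a stand-in for powering the nodes off.

    # Raise the cluster-wide server-quorum ratio to 80%
    # (applies to all volumes with server-side quorum enabled).
    gluster volume set all cluster.server-quorum-ratio 80%

    # Simulate a node going down on two of the seven RHS nodes (run on each).
    service glusterd stop

    # 5 of 7 peers remain (~71%), below the 80% ratio, so the bricks are
    # stopped and Nagios reports the quorum service as CRITICAL.
    gluster volume status testvol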
One more observation: the host representing the cluster itself in the Nagios UI is shown as DOWN because all volumes are CRITICAL, which is expected behavior.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1277.html