Bug 1136205

Summary:	[Nagios] Volume status is seen to be in warning status with status information "null" when glusterd is stopped on one RHS node.
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Shruti Sampat <ssampat>
Component:	gluster-nagios-addons	Assignee:	Nishanth Thomas <nthomas>
Status:	CLOSED ERRATA	QA Contact:	Shruti Sampat <ssampat>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	rhgs-3.0	CC:	bkunal, dpati, fharshav, kmayilsa, nthomas, psriniva, rhsc-qe-bugs, rnachimu, sharne, vumrao
Target Milestone:	---	Keywords:	ZStream
Target Release:	RHGS 3.0.3
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	nagios-server-addons-0.1.9-1.el6rhs	Doc Type:	Bug Fix
Doc Text:	Previously, the Nagios plug-in sent the volume status request to the Red Hat Storage node without converting the Nagios host name to the respective IP Address. When the glusterd service was stopped on one of the nodes in a Red Hat Storage Trusted Storage Pool, the volume status displayed a warning and the status information was empty. With this fix, the error scenarios are handled properly and the system ensures that the glusterd service starts before it sends such a request to a Red Hat Storage node.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-01-15 13:49:17 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1109843
Bug Blocks:	1087818

Description Shruti Sampat 2014-09-02 07:26:34 UTC

Description of problem:
-----------------------

When glusterd was stopped on one node in a cluster being monitored, the volume status of one of the volumes in the cluster was seen to be in the warning state, with "null" in the status information. One of the bricks of this volume was present on the node where glusterd was stopped.

Occasionally the volume status service was seen to be unknown, with the status information displaying the message "Invalid host name rhs.4" (BZ #1109843)

Sometimes the volume status service was OK, with the status information reading "OK: Volume : DISTRIBUTE type - All bricks are Up"
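
The Doc Text for this bug attributes the problem to the plug-in sending the volume status request without first converting the Nagios host name to an IP address. A minimal sketch of such a conversion with explicit error handling is shown here. It is not the actual plug-in code: the function name resolve_nagios_host and the injectable lookup parameter are hypothetical, and the error text simply mirrors the "Invalid host name" message quoted above.

```python
import socket


def resolve_nagios_host(host_name, lookup=socket.gethostbyname):
    """Resolve a Nagios host name to an IP address.

    Returns (ip_address, None) on success and (None, error_message)
    on failure, so the caller never has to report an empty status.
    The name and signature are illustrative assumptions.
    """
    try:
        return lookup(host_name), None
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        return None, "Invalid host name %s" % host_name
```

A caller could report UNKNOWN with the returned message when resolution fails, instead of forwarding the unresolved name to the storage node.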

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

gluster-nagios-addons-0.1.10-2.el6rhs.x86_64
nagios-server-addons-0.1.6-1.el6rhs.noarch

How reproducible:
Saw it once.

Steps to Reproduce:

1. Set up a cluster of 4 RHS nodes and configure it to be monitored by a Nagios server that is set up outside the RHS cluster.

2. Create a distribute volume with one brick each on 2 of the servers in the cluster.

3. Bring down glusterd on one of the nodes in the cluster. This node should have one of the bricks created above.

4. Observe the volume status service for this volume. 

Actual results:

The volume status service is seen to be flapping between OK, warning and unknown states as explained above.


Expected results:

The volume status service should not be in the warning state.

Additional info:

Comment 2 Shalaka 2014-09-20 09:12:28 UTC

Please review and sign off on the edited doc text.

Comment 5 Kanagaraj 2014-10-15 10:15:10 UTC

Q1. Why are Host and Address shown as Eskan, when Eskan is nothing but a cluster name?
ANS: In Nagios, the cluster is represented as a dummy with the cluster name as its name.

Q2. Is this the bug for the NULL issue, that is, "Additional Info: NULL"? Am I correct here?
ANS: SELinux in Enforcing mode can cause this issue. Moving SELinux to Permissive mode should solve this problem.

Q3. How can the customer stop these messages from filling up their inboxes? Is there any workaround?
ANS: Messages/Notifications can be disabled using the Nagios UI. But it's worth checking the SELinux status before attempting this.

Comment 6 Kanagaraj 2014-10-15 10:19:04 UTC

Please read the first answer in Comment #5 as

ANS: In Nagios, the cluster is represented as a dummy host with the cluster name as its name. This is done by the auto-discovery script.

Comment 7 Vikhyat Umrao 2014-10-15 10:41:12 UTC

Thanks, Kanagaraj, for your quick response. It will help a lot.
I will get back to you if anything else is needed from the customer's end.

Comment 8 Kanagaraj 2014-10-20 08:01:42 UTC

Regarding Comment #5:

Nagios needs to be restarted ("service nagios restart") after moving SELinux to permissive mode.

Vikhyat, please ask the customer to restart Nagios if not already done.
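
The workaround from Comment #5 and this comment (check the SELinux mode, move it to permissive if needed, then restart Nagios) could be scripted roughly as follows. This is only a sketch under stated assumptions: it relies on the standard "getenforce" and "setenforce" utilities and on the "service nagios restart" command quoted above, and the helper names parse_selinux_mode, workaround_commands and apply_workaround are hypothetical. It must be run as root to have any effect.

```python
import subprocess


def parse_selinux_mode(getenforce_output):
    """Normalize the text printed by `getenforce` to lower case."""
    mode = getenforce_output.strip().lower()
    if mode not in ("enforcing", "permissive", "disabled"):
        raise ValueError("unexpected getenforce output: %r" % getenforce_output)
    return mode


def workaround_commands(mode):
    """Return the commands (as argument lists) to run for a given mode."""
    commands = []
    if mode == "enforcing":
        # `setenforce 0` switches to permissive mode until the next reboot.
        commands.append(["setenforce", "0"])
    # Nagios has to be restarted after the mode change.
    commands.append(["service", "nagios", "restart"])
    return commands


def apply_workaround():
    """Query the current mode and run the resulting commands."""
    output = subprocess.check_output(["getenforce"]).decode()
    for command in workaround_commands(parse_selinux_mode(output)):
        subprocess.check_call(command)
```

Keeping the command list separate from its execution makes it possible to review what would be run before touching the Nagios server.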

Comment 13 Ramesh N 2014-11-05 11:30:02 UTC

Moving back to the assigned state, as there are some scenarios which are not covered in the bug.

Comment 14 Shruti Sampat 2014-11-27 11:28:19 UTC

Verified as fixed in nagios-server-addons-0.1.9-1.el6rhs

Tested with RHS+Nagios in a 4 node RHS cluster in the following scenarios -

1. glusterd stopped on one of the nodes, on which one of the bricks of a volume resided. Volume status was OK with status information 

"OK: Volume : DISTRIBUTE type - All bricks are Up "

2. On a cluster with server quorum enabled, brought down glusterd, causing quorum to be lost. This issue was not observed in this case either. The volume status of the volume with server quorum enabled was critical, with status information -

"CRITICAL: Volume : REPLICATE type - All bricks are down"

3. Stopped nrpe service on one node. Volume status shows appropriate status information in this case too.

Marking as verified.
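
The status lines verified above can be illustrated with a small sketch of how a volume status check might map brick states to a Nagios result. It assumes the standard Nagios plug-in exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name volume_status, the boolean brick list, and the WARNING and UNKNOWN message wording are assumptions made for illustration, not the behaviour of the shipped plug-in. The point of the sketch is that every branch yields non-empty status information.

```python
# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING",
                CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def volume_status(vol_type, brick_states):
    """Map brick states to (exit_code, status_information).

    brick_states is a list of booleans (True means the brick is up),
    or None/empty when the status could not be fetched at all.
    """
    if not brick_states:
        # Never return empty text; report why no data is available.
        code, detail = UNKNOWN, "Volume status could not be retrieved"
    elif all(brick_states):
        code, detail = OK, "All bricks are Up"
    elif not any(brick_states):
        code, detail = CRITICAL, "All bricks are down"
    else:
        # Assumed wording for the partially degraded case.
        code, detail = WARNING, "Some bricks are down"
    message = "%s: Volume : %s type - %s" % (STATUS_NAMES[code], vol_type, detail)
    return code, message
```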

Comment 15 Pavithra 2014-12-17 06:29:19 UTC

Nishanth,
Can you please review the edited doc text for technical accuracy and sign off?

Comment 17 errata-xmlrpc 2015-01-15 13:49:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html

Comment 18 Red Hat Bugzilla 2023-09-14 02:46:48 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days