Bug 1136205 - [Nagios] Volume status is seen to be in warning status with status information "null" when glusterd is stopped on one RHS node.
Summary: [Nagios] Volume status is seen to be in warning status with status information "null" when glusterd is stopped on one RHS node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-nagios-addons
Version: rhgs-3.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.3
Assignee: Nishanth Thomas
QA Contact: Shruti Sampat
URL:
Whiteboard:
Depends On: 1109843
Blocks: 1087818
 
Reported: 2014-09-02 07:26 UTC by Shruti Sampat
Modified: 2023-09-14 02:46 UTC (History)
10 users

Fixed In Version: nagios-server-addons-0.1.9-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, the Nagios plug-in sent the volume status request to the Red Hat Storage node without converting the Nagios host name to the respective IP Address. When the glusterd service was stopped on one of the nodes in a Red Hat Storage Trusted Storage Pool, the volume status displayed a warning and the status information was empty. With this fix, the error scenarios are handled properly and the system ensures that the glusterd service starts before it sends such a request to a Red Hat Storage node.
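As an illustration only, a minimal Python sketch of the behaviour the doc text describes: resolve the Nagios host name to an IP address and confirm glusterd is reachable before requesting volume status. The function names and the reachability check are hypothetical and are not the actual gluster-nagios-addons code.

import socket

GLUSTERD_PORT = 24007  # default glusterd management port


def resolve_host(nagios_host_name):
    """Convert the Nagios host name to an IP address (raises socket.gaierror on failure)."""
    return socket.gethostbyname(nagios_host_name)


def glusterd_reachable(ip_address, timeout=5):
    """Return True if something is listening on the glusterd management port."""
    try:
        sock = socket.create_connection((ip_address, GLUSTERD_PORT), timeout)
        sock.close()
        return True
    except socket.error:
        return False


def volume_status(nagios_host_name):
    """Ask a node for volume status only after the basic checks pass."""
    ip_address = resolve_host(nagios_host_name)
    if not glusterd_reachable(ip_address):
        # Report a clear message instead of the empty/"null" status
        # information described in this bug.
        return "WARNING: glusterd is not reachable on %s" % nagios_host_name
    # ... issue the actual volume status query against ip_address here ...
    return "OK"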
Clone Of:
Environment:
Last Closed: 2015-01-15 13:49:17 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1109843 0 high CLOSED [Nagios] Volume utilization is unknown with status information "Invalid host name <hostname-of-RHS-node>" when glusterd ... 2023-09-14 02:10:06 UTC
Red Hat Product Errata RHBA-2015:0039 0 normal SHIPPED_LIVE Red Hat Storage Console 3.0 enhancement and bug fix update #3 2015-01-15 18:46:40 UTC

Internal Links: 1109843

Description Shruti Sampat 2014-09-02 07:26:34 UTC
Description of problem:
-----------------------

When glusterd is stopped on one node in a cluster being monitored, the volume status of one of the volumes in the cluster was seen to be in the warning state with "null" as the status information. One of the bricks of this volume was present on the node where glusterd was stopped.

Occasionally the volume status service was seen to be unknown, with the status information displaying the message "Invalid host name rhs.4" (BZ #1109843)

Sometimes the volume status service was OK, with the status information reading "OK: Volume : DISTRIBUTE type - All bricks are Up"

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

gluster-nagios-addons-0.1.10-2.el6rhs.x86_64
nagios-server-addons-0.1.6-1.el6rhs.noarch

How reproducible:
Saw it once.

Steps to Reproduce:

1. Set up a cluster of 4 RHS nodes and configure it to be monitored by a Nagios server that is set up outside the RHS cluster.

2. Create a distribute volume with one brick each on 2 of the servers in the cluster.

3. Bring down glusterd on one of the nodes in the cluster; this node should host one of the bricks created above (see the sketch after this list).

4. Observe the volume status service for this volume. 
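For illustration, a rough Python sketch of steps 2 and 3 above. Host names, brick paths, and the volume name are placeholders, and each command must be run on an appropriate node of the cluster.

import subprocess


def run(cmd):
    """Print and execute a command, failing loudly on a non-zero exit status."""
    print(" ".join(cmd))
    subprocess.check_call(cmd)


# Step 2: create and start a distribute volume with one brick each on two of
# the four servers (run on any node of the trusted storage pool).
run(["gluster", "volume", "create", "dist-vol",
     "rhs-node1:/bricks/dist-vol/b1",
     "rhs-node2:/bricks/dist-vol/b2"])
run(["gluster", "volume", "start", "dist-vol"])

# Step 3: stop glusterd on one of the nodes hosting a brick (run on that node,
# e.g. rhs-node1).
run(["service", "glusterd", "stop"])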

Actual results:

The volume status service is seen to be flapping between the OK, warning, and unknown states, as explained above.


Expected results:

The volume status service should not be in the warning state.

Additional info:

Comment 2 Shalaka 2014-09-20 09:12:28 UTC
Please review and sign-off the edited doc text.

Comment 5 Kanagaraj 2014-10-15 10:15:10 UTC
Q1. Why are the Host and Address shown as Eskan, when Eskan is nothing but a cluster name?
ANS: In Nagios, a cluster is represented as a dummy named after the cluster.

Q2. Is this the bug for the NULL issue, i.e. where the Additional Info shows NULL? Am I correct here?
ANS: SELinux in Enforcing mode can cause this issue. Moving SELinux to Permissive mode should solve this problem.

Q3. How can the customer stop these messages from filling up their inboxes? Is there any workaround?
ANS: Messages/notifications can be disabled using the Nagios UI, but it is worth checking the SELinux status before attempting this.

Comment 6 Kanagaraj 2014-10-15 10:19:04 UTC
Pls read the first answer in Comment #5 as

ANS: In Nagios, a cluster is represented as a dummy host named after the cluster. This is done by the auto-discovery script.

Comment 7 Vikhyat Umrao 2014-10-15 10:41:12 UTC
Thanks Kanagaraj, for your quick response; it will help a lot.
I will get back to you if anything else is needed from the customer end.

Comment 8 Kanagaraj 2014-10-20 08:01:42 UTC
Further to Comment #5:

Nagios needs to be restarted ("service nagios restart") after moving SELinux to Permissive mode.

Vikhyat, pls ask the customer to restart if not already done.
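
A minimal Python sketch of the workaround from Comments #5 and #8, assuming the standard getenforce/setenforce and service commands are available on the node running Nagios. This is only an illustration of the workaround, not part of the fix.

import subprocess


def selinux_mode():
    """Return the current SELinux mode as reported by getenforce."""
    return subprocess.check_output(["getenforce"]).decode().strip()


def apply_workaround():
    if selinux_mode() == "Enforcing":
        # Switch SELinux to Permissive for the running system
        # (not persistent across reboots).
        subprocess.check_call(["setenforce", "0"])
    # Comment 8: Nagios must be restarted after the SELinux change.
    subprocess.check_call(["service", "nagios", "restart"])


if __name__ == "__main__":
    apply_workaround()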

Comment 13 Ramesh N 2014-11-05 11:30:02 UTC
Moving back to the assigned state as there are some scenarios that are not covered in the bug.

Comment 14 Shruti Sampat 2014-11-27 11:28:19 UTC
Verified as fixed in nagios-server-addons-0.1.9-1.el6rhs

Tested with RHS+Nagios in a 4-node RHS cluster in the following scenarios -

1. glusterd stopped on one of the nodes, on which one of the bricks of a volume resided. Volume status was OK with status information 

"OK: Volume : DISTRIBUTE type - All bricks are Up "

2. On a cluster with server quorum enabled, brought down glusterd, causing quorum to be lost. This issue was not observed in this case either. Volume status of the volume with server quorum enabled was critical with status information -

"CRITICAL: Volume : REPLICATE type - All bricks are down"

3. Stopped nrpe service on one node. Volume status shows appropriate status information in this case too.

Marking as verified.

Comment 15 Pavithra 2014-12-17 06:29:19 UTC
Nishanth,
Can you please review the edited doc text for technical accuracy and sign off?

Comment 17 errata-xmlrpc 2015-01-15 13:49:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html

Comment 18 Red Hat Bugzilla 2023-09-14 02:46:48 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

