Description of problem:
-----------------------
When glusterd is stopped on one node in a cluster being monitored, the volume status service of one of the volumes in the cluster was seen in the warning state, with "null" in the status information. One of the bricks of this volume was present on the node where glusterd was stopped.

Occasionally the volume status service was seen in the unknown state, with the status information displaying the message "Invalid host name rhs.4" (BZ #1109843).

Sometimes the volume status service was OK, with the status information reading "OK: Volume : DISTRIBUTE type - All bricks are Up".

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
gluster-nagios-addons-0.1.10-2.el6rhs.x86_64
nagios-server-addons-0.1.6-1.el6rhs.noarch

How reproducible:
Saw it once.

Steps to Reproduce:
1. Set up a cluster of 4 RHS nodes and configure it to be monitored by a Nagios server that is set up outside the RHS cluster.
2. Create a distribute volume with one brick each on 2 of the servers in the cluster.
3. Bring down glusterd on one of the nodes in the cluster. This node should have one of the bricks created above.
4. Observe the volume status service for this volume.

Actual results:
The volume status service is seen to be flapping between the OK, warning and unknown states, as explained above.

Expected results:
The volume status service should not be in the warning state.

Additional info:
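The reproduction steps above could be sketched roughly as follows. This is only an illustration: the volume name, node names and brick path are hypothetical placeholders that do not come from this report, and the script prints the commands next to the node they would be run on instead of executing anything.

```shell
#!/bin/sh
# Sketch of steps 2-4. VOLNAME, NODE1, NODE2 and BRICK are hypothetical
# placeholders, not values taken from this bug report.
VOLNAME="${VOLNAME:-distvol}"
NODE1="${NODE1:-rhs-node1}"
NODE2="${NODE2:-rhs-node2}"
BRICK="${BRICK:-/bricks/distvol}"

# Print a command together with the node it is meant to be run on.
step() {
    node="$1"
    shift
    echo "[$node] $*"
}

reproduce() {
    # Step 2: without a replica or stripe count, gluster creates a distribute volume.
    step "$NODE1" gluster volume create "$VOLNAME" "$NODE1:$BRICK" "$NODE2:$BRICK"
    step "$NODE1" gluster volume start "$VOLNAME"
    # Step 3: stop glusterd on a node that hosts one of the bricks.
    step "$NODE2" service glusterd stop
    # Step 4: compare the CLI view from a surviving node with what Nagios reports.
    step "$NODE1" gluster volume status "$VOLNAME"
}

reproduce
```

The volume status service itself is observed in the Nagios UI; the last command only gives the gluster CLI view for comparison.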
Please review and sign-off the edited doc text.
Q1. Why are Host and Address shown as Eskan, when Eskan is nothing but the cluster name?
ANS: In Nagios, a cluster is represented as a dummy with the cluster name as its name.

Q2. For the NULL issue, this is the bug where Additional Info shows NULL. Am I correct here?
ANS: SELinux in Enforcing mode can cause this issue. Moving SELinux to Permissive mode should solve this problem.

Q3. How can the customer stop these messages from filling up their inboxes? Is there any workaround?
ANS: Messages/notifications can be disabled using the Nagios UI, but it's worth checking the SELinux status before attempting this.
Please read the first answer in Comment #5 as:
ANS: In Nagios, a cluster is represented as a dummy host with the cluster name as its name. This is done by the auto-discovery script.
Thanks, Kanagaraj, for your quick response; it will help a lot. I will get back to you if anything else is needed from the customer's end.
Regarding Comment #5: Nagios needs to be restarted ("service nagios restart") after moving SELinux to Permissive mode. Vikhyat, please ask the customer to restart it if that has not already been done.
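A minimal sketch of that workaround, assuming it is applied as root on the host where Nagios runs (the thread does not state the host explicitly). The DRY_RUN switch and the helper names are illustrative and not part of any Nagios or gluster tooling; by default the commands are only printed.

```shell
#!/bin/sh
# Sketch of the SELinux workaround discussed above.
# DRY_RUN defaults to 1, so the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

selinux_workaround() {
    run getenforce               # show the current mode: Enforcing, Permissive or Disabled
    run setenforce 0             # switch to Permissive; this does not persist across a reboot
    run service nagios restart   # restart Nagios after the mode change
}

selinux_workaround
```

Because `setenforce 0` lasts only until the next reboot, a persistent change would additionally require setting `SELINUX=permissive` in /etc/selinux/config.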
Moving back to the assigned state, as there are some scenarios that are not covered in the bug.
Verified as fixed in nagios-server-addons-0.1.9-1.el6rhs

Tested with RHS+Nagios in a 4 node RHS cluster in the following scenarios:

1. glusterd stopped on one of the nodes, on which one of the bricks of a volume resided. Volume status was OK with status information "OK: Volume : DISTRIBUTE type - All bricks are Up".
2. On a cluster with server quorum enabled, brought down glusterd, causing quorum to be lost. This issue was not observed in this case either. Volume status of the volume with server quorum enabled was critical with status information "CRITICAL: Volume : REPLICATE type - All bricks are down".
3. Stopped the nrpe service on one node. Volume status shows appropriate status information in this case too.

Marking as verified.
Nishanth, can you please review the edited doc text for technical accuracy and sign off?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0039.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days