Description of problem:
------------------------
When glusterd is stopped on a couple of nodes in the cluster, the status of volume utilization changes to unknown. The status information of the service reads "Invalid host name rhs.7", where rhs.7 is one of the nodes in the cluster on which glusterd is stopped.

Quorum for this volume was not met, so all bricks were down. Hence an unknown volume utilization status is expected, but the status information should read something like "Failed to get utilization information".

Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. In a cluster of 7 nodes, bring glusterd down on 2 nodes, causing quorum to be lost on the volumes and the bricks to be killed.

Actual results:
Volume utilization is unknown with status information "Invalid host name rhs.7"

Expected results:
Volume utilization should have proper status information.

Additional info:
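To illustrate the expected behaviour, here is a minimal Python sketch of a check that hides the low-level error and reports a descriptive UNKNOWN status instead. It is not the gluster-nagios-addons code: `utilization_status` and the `fetch` callable are hypothetical names, and only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def utilization_status(fetch):
    """Return (exit_code, message) for a volume utilization check.

    `fetch` is a hypothetical callable that returns the used percentage,
    or raises an exception when the data cannot be collected.
    """
    try:
        used_pct = fetch()
    except Exception:
        # Do not leak the low-level error (for example a host lookup
        # failure) into the service status information.
        return UNKNOWN, "UNKNOWN: Failed to get utilization information"
    return OK, "OK: utilization %.1f%%" % used_pct
```

With this shape, a failure such as "Invalid host name rhs.7" raised while collecting data would still yield an UNKNOWN state, but with status information that describes what the service could not do.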
This issue is unlikely to happen often after the fix for Bug 1109025, but it still needs to be documented.
Please add doc text for the known issue
FYI, this issue is seen even with the fix for BZ #1109025, and even while glusterd is running.
Please review and signoff edited doc text.
doc_text looks good
Hi, This issue is also seen with the volume quota monitoring service when the volume is stopped. Maybe the doc text needs to be changed to include this too. Right now it seems specific to volume utilization.
Hi,

Another situation where I saw this issue is while testing the quota timeout value using the -t option (BZ #1094614).

Performed the following steps to cause the quota list command to not return within 1 second, and thus the timeout to occur (timeout was set to 1 second using the -t option) -
1. Created 2000 directories on the mount of the volume.
2. Configured quota limits on all 2000 directories. Now the quota list command takes over 1 second to return the information.

While quota was being configured on the directories, the status of the quota service was UNKNOWN with the status information "Invalid host name rhs.5" (rhs.5 is one of the hosts in the cluster being monitored).

After a while the status of the service was CRITICAL with status information "CHECK_NRPE: Socket timeout after 1 seconds."
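As a rough illustration of the timeout scenario, the Python sketch below bounds a slow command and reports the timeout explicitly instead of surfacing an unrelated error. It is not how check_nrpe or the quota plugin are implemented: `run_with_timeout` is a hypothetical helper, and mapping a timeout to UNKNOWN is an assumption made for the example (in the test above the service ended up CRITICAL).

```python
import subprocess
import sys

UNKNOWN = 3  # standard Nagios plugin exit code for "unknown"


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, text), reporting a timeout explicitly."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        return UNKNOWN, "UNKNOWN: command did not return within %s seconds" % timeout_s
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a slow quota list call: sleep longer than the timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, 1))
```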
This issue is also seen when quota is enabled for a volume and the volume is stopped. The status information of quota status service displays "Invalid host name 'rhs.4' ", rhs.4 being the name of one of the hosts in the cluster.
Moving back to assigned state as there are some scenarios which are not covered in the bug.
Verified as fixed in nagios-server-addons-0.1.9-1.el6rhs

Tested with RHS+Nagios in a cluster of 4 nodes.

Verified in the following scenarios -
1. Stopped nrpe on one of the nodes.
2. Stopped glusterd on a couple of nodes.
3. Powered off one of the nodes.

In all of the above scenarios, volume utilization was unknown with the following status information -
UNKNOWN: Failed to get the Volume Utilization Data

Also tested with the volume quota service, as mentioned in Comment #6 and Comment #7 -
1. Status of the volume quota service when the volume was stopped was warning with status information -
QUOTA: Quota status could not be determined. quota command failed : Volume is stopped, start volume before executing quota command.
2. Unable to reproduce with the scenario mentioned in Comment #7

Marking as verified.
Hi Nishanth, Can you please review the edited doc text for technical accuracy and sign off?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0039.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days