Bug 1109843 - [Nagios] Volume utilization is unknown with status information "Invalid host name <hostname-of-RHS-node>" when glusterd is stopped
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nagios-server-addons
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.3
Assignee: Nishanth Thomas
QA Contact: Shruti Sampat
URL:
Whiteboard:
Depends On:
Blocks: 1087818 1136205
 
Reported: 2014-06-16 13:14 UTC by Shruti Sampat
Modified: 2023-09-14 02:10 UTC
CC: 11 users

Fixed In Version: nagios-server-addons-0.1.9-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, if the host used for discovery was detached from the Red Hat Storage trusted storage pool, all hosts would be removed from the Nagios configuration when auto-discovery was performed. With this fix, auto-config does not remove any configuration details if the host used for discovery is detached from the Red Hat Storage trusted storage pool (see the sketch after these fields).
Clone Of:
Environment:
Last Closed: 2015-01-15 13:48:24 UTC
Embargoed:
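
The guard described in the doc text above can be pictured with a short sketch (all names here are hypothetical illustrations, not the actual nagios-server-addons code):

def sync_hosts(discovered_peers, configured_hosts):
    """Decide which hosts to keep in the Nagios configuration.

    discovered_peers: peers reported by the host used for discovery.
    configured_hosts: hosts currently in the Nagios configuration.
    """
    if not discovered_peers:
        # The discovery host is detached from the trusted storage pool
        # (or unreachable) and reports no peers; keep the existing
        # configuration instead of removing every host.
        return list(configured_hosts)
    # Normal case: the configuration follows the pool membership.
    return sorted(set(discovered_peers))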




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1136205 0 urgent CLOSED [Nagios] Volume status is seen to be in warning status with status information "null" when glusterd is stopped on one RH... 2023-09-14 02:46:48 UTC
Red Hat Product Errata RHBA-2015:0039 0 normal SHIPPED_LIVE Red Hat Storage Console 3.0 enhancement and bug fix update #3 2015-01-15 18:46:40 UTC

Internal Links: 1136205

Description Shruti Sampat 2014-06-16 13:14:38 UTC
Description of problem:
------------------------

When glusterd is stopped on a couple of nodes in the cluster, the status of volume utilization changes to UNKNOWN. The status information of the service reads "Invalid host name rhs.7".

rhs.7 is one of the nodes in the cluster where glusterd is stopped.

Quorum for this volume was not met, so all the bricks were down. Hence, volume utilization should have been UNKNOWN, but the status information should have read something like "Failed to get utilization information".
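
In other words, the check should follow the usual Nagios plugin convention of printing a single status line and exiting with code 3 (UNKNOWN). A minimal sketch of that expected behavior (the fetch function, host, and volume names are illustrative stand-ins, not the actual plugin code):

import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def fetch_volume_utilization(host, volume):
    # Stand-in for the real NRPE/gluster query, which can fail when
    # glusterd is down on the target host.
    raise IOError("glusterd not reachable on %s" % host)

try:
    used_pct = fetch_volume_utilization("rhs.7", "dist-vol")
except IOError:
    # Report a clear message instead of leaking "Invalid host name ...".
    print("UNKNOWN: Failed to get utilization information")
    sys.exit(UNKNOWN)

print("OK: volume utilization %.1f%%" % used_pct)
sys.exit(OK)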

Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. In a cluster of 7 nodes, bring glusterd down on 2 nodes, causing quorum to be lost on the volumes and bricks to be killed.

Actual results:
Volume utilization is UNKNOWN with status information "Invalid host name rhs.7".

Expected results:
Volume utilization should have proper status information.

Additional info:

Comment 1 Dusmant 2014-06-17 12:58:16 UTC
This issue is unlikely to happen often after the fix for Bug 1109025, but it needs to be documented.

Comment 2 Shalaka 2014-06-18 05:58:20 UTC
Please add doc text for the known issue.

Comment 3 Shruti Sampat 2014-06-18 09:14:28 UTC
FYI, this issue is seen even with the fix for BZ #1109025, and even while glusterd is running.

Comment 4 Shalaka 2014-06-24 16:53:57 UTC
Please review and sign off on the edited doc text.

Comment 5 Kanagaraj 2014-06-25 04:28:26 UTC
Doc text looks good.

Comment 6 Shruti Sampat 2014-07-15 06:24:39 UTC
Hi,

This issue is also seen with the volume quota monitoring service, when the volume is stopped. Maybe the doc text needs to be changed to include this too; right now it seems specific to volume utilization.

Comment 7 Shruti Sampat 2014-07-15 11:42:25 UTC
Hi,

Another situation where I saw this issue is while testing quota timeout value using the -t option (BZ #1094614)

Performed the following steps to cause the quota list command to not return within 1 second, and thus the timeout to occur (the timeout was set to 1 second using the -t option; a rough setup script follows at the end of this comment) -

1. Created 2000 directories on the mount point of the volume.
2. Configured quota limits on all 2000 directories.
Now the quota list command takes over 1 second to return the information.

While quota was being configured on the directories, the status of the quota service was UNKNOWN, with the status information "Invalid host name rhs.5" (rhs.5 is one of the hosts in the cluster being monitored).

After a while the status of the service was CRITICAL with status information "CHECK_NRPE: Socket timeout after 1 seconds."
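
For reference, the setup from steps 1 and 2 above can be scripted roughly as below. The volume name and mount path are placeholders, and the quota command uses the stock gluster CLI syntax (gluster volume quota <VOLNAME> limit-usage <path> <size>):

# Rough sketch of the setup in steps 1 and 2 above. VOLUME and MOUNT are
# placeholders; adjust for the cluster under test.
import os
import subprocess

VOLUME = "dist-vol"       # placeholder volume name
MOUNT = "/mnt/dist-vol"   # placeholder FUSE mount of the volume

for i in range(2000):
    name = "dir%04d" % i
    os.mkdir(os.path.join(MOUNT, name))    # step 1: create the directory
    subprocess.check_call([                # step 2: set a quota limit on it
        "gluster", "volume", "quota", VOLUME,
        "limit-usage", "/" + name, "10MB",
    ])

With the limits in place, "gluster volume quota <VOLNAME> list" takes longer than the 1-second timeout set via the -t option, which is what produces the CHECK_NRPE socket timeout above.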

Comment 9 Shruti Sampat 2014-08-12 14:59:21 UTC
This issue is also seen when quota is enabled for a volume and the volume is stopped. The status information of the quota status service displays "Invalid host name 'rhs.4'", rhs.4 being the name of one of the hosts in the cluster.

Comment 11 Ramesh N 2014-11-05 11:31:39 UTC
Moving back to the assigned state, as there are some scenarios that are not covered in the bug.

Comment 12 Shruti Sampat 2014-11-27 08:37:03 UTC
Verified as fixed in nagios-server-addons-0.1.9-1.el6rhs.

Tested with RHS + Nagios in a cluster of 4 nodes. Verified the following scenarios - 

1. Stopped nrpe on one of the nodes.
2. Stopped glusterd on a couple of nodes.
3. Powered off one of the nodes.

In all of the above scenarios, volume utilization was UNKNOWN with the following status information -

UNKNOWN: Failed to get the Volume Utilization Data 

Also tested with volume quota service, as mentioned in Comment #6 and Comment #7 -

1. The status of the volume quota service when the volume was stopped was WARNING, with the status information - 

QUOTA: Quota status could not be determined. quota command failed : Volume is stopped, start volume before executing quota command.

2. Unable to reproduce with the scenario mentioned in Comment #7.

Marking as verified.
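
For reference, the verified outputs above follow the standard Nagios convention of encoding the state both in the leading token of the status line and in the plugin's exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A small illustrative helper, not taken from the plugin source:

NAGIOS_STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

def exit_code_for(status_line):
    """Derive the exit code from a status line's leading token,
    defaulting to UNKNOWN for unrecognized prefixes."""
    prefix = status_line.split(":", 1)[0].strip()
    return NAGIOS_STATES.get(prefix, NAGIOS_STATES["UNKNOWN"])

# The verified volume utilization output maps to exit code 3 (UNKNOWN):
assert exit_code_for("UNKNOWN: Failed to get the Volume Utilization Data") == 3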

Comment 13 Pavithra 2014-12-24 09:05:34 UTC
Hi Nishanth,

Can you please review the edited doc text for technical accuracy and sign off?

Comment 15 errata-xmlrpc 2015-01-15 13:48:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html

Comment 16 Red Hat Bugzilla 2023-09-14 02:10:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.

