1128007 – [Nagios] - When all the nodes in a cluster are down, cluster status shows 'UP' with status information as 'OK:None of the volumes are in critical state'


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nagios-server-addons
Sub Component:
Version:	rhgs-3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	RHGS 3.0.3
Assignee:	Ramesh N
QA Contact:	RamaKasturi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-08-08 05:59 UTC by RamaKasturi
Modified:	2015-05-13 17:41 UTC
CC List:	5 users
Fixed In Version:	nagios-server-addons-0.1.9-1.el6rhs
Doc Type:	Bug Fix
Doc Text:	Previously, when all the nodes in a Red Hat Storage trusted storage pool were offline, all the volumes moved to an "UNKNOWN" state, yet the cluster status was displayed as UP with the message 'OK:None of the volumes are in critical state'. With this fix, the status of every volume is considered when computing the status of the Red Hat Storage trusted storage pool.
Clone Of:
Environment:
Last Closed:	2015-01-15 13:49:12 UTC
Embargoed:
Dependent Products:

Attachments
Screenshot when all the nodes in the cluster are down. (153.00 KB, image/png) 2014-08-08 06:00 UTC, RamaKasturi
Status of services in the cluster, when all the nodes are down (201.42 KB, image/png) 2014-11-05 07:15 UTC, RamaKasturi

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2015:0039	0	normal	SHIPPED_LIVE	Red Hat Storage Console 3.0 enhancement and bug fix update #3	2015-01-15 18:46:40 UTC

Description RamaKasturi 2014-08-08 05:59:42 UTC

Description of problem:
Setup is nagios on external server + RHS.
When all the nodes in a cluster go down, the cluster status shows as "UP" with status information "OK: None of the volumes are in critical state".

Version-Release number of selected component (if applicable):
nagios-server-addons-0.1.5-1.el6rhs.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install Nagios on a RHEL server.
2. Run discovery.py.
3. Shut down all the nodes in the cluster.

Actual results:
Cluster status shows 'UP' with status information as 'OK: None of the volumes are in critical state'

Expected results:
Cluster status should be 'UNKNOWN' with status information as 'None of the hosts in the cluster are up'

Additional info:

Comment 1 RamaKasturi 2014-08-08 06:00:56 UTC

Created attachment 925083
Screenshot when all the nodes in the cluster are down.

Comment 2 Kanagaraj 2014-08-18 10:06:12 UTC

The cluster state is an aggregation of the states of the volumes inside the cluster.

As per the current code, the cluster state will be:
CRITICAL - If all volumes in the cluster are in the CRITICAL state
WARNING - If some volumes are in the CRITICAL state and the others are in a NON-CRITICAL state (OK, WARNING, UNKNOWN, PENDING)
OK - If all the volumes are in a NON-CRITICAL state (OK, WARNING, UNKNOWN, PENDING)
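
The aggregation described above can be sketched roughly as follows. This is only an illustration of the three rules as stated in this comment, not the code from nagios-server-addons; the function name `current_cluster_state` and the handling of an empty volume list are assumptions.

```python
# Hypothetical sketch of the pre-fix aggregation rules listed above.
# Names are illustrative and not taken from nagios-server-addons.

def current_cluster_state(volume_states):
    """Return the cluster state for a list of volume state strings."""
    critical = sum(1 for state in volume_states if state == "CRITICAL")
    if volume_states and critical == len(volume_states):
        return "CRITICAL"
    if critical > 0:
        return "WARNING"
    # Every non-critical state (OK, WARNING, UNKNOWN, PENDING) ends up here.
    # Assumption: an empty volume list is also treated as OK.
    return "OK"


# All volumes UNKNOWN (every node is down) still aggregates to OK,
# which is the behaviour reported in this bug.
print(current_cluster_state(["UNKNOWN", "UNKNOWN"]))
```

Because UNKNOWN is lumped in with the non-critical states, a cluster whose volumes are all unreachable is indistinguishable from a healthy one.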

Fixing this bug would require considering all possible volume states and determining the cluster state from them, perhaps as follows:

CRITICAL - If all volumes are in the CRITICAL state
WARNING - If some volumes are in the CRITICAL state, or all or some volumes are in the WARNING state
UNKNOWN - If all the volumes are in the UNKNOWN state
PENDING - If all the volumes are in the PENDING state
OK - If all the volumes are in the OK state

This change will affect the existing flow and introduce new flows.

Comment 3 Kanagaraj 2014-08-19 07:42:50 UTC

The PENDING state is internal to Nagios and cannot be set from outside. So, contrary to the proposal in Comment 2, the cluster cannot be put in the PENDING state.

Comment 4 Dusmant 2014-08-20 05:24:50 UTC

Further analysis from Kanagaraj
--------------------------------
Found a Nagios document that describes the mappings.
http://nagios.sourceforge.net/docs/3_0/hostchecks.html


Plugin Result    Preliminary Host State
OK               UP
WARNING          UP or DOWN*
UNKNOWN          DOWN
CRITICAL         DOWN

Following this mapping, the cluster can be marked as DOWN if all the volumes are in the CRITICAL or UNKNOWN state.
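
As a rough illustration of the table above, the mapping from a plugin result to a preliminary host state can be written as a small lookup. The function name `preliminary_host_state` and the `aggressive` parameter are hypothetical; the parameter stands in for the condition marked with `*` in the table, which in the Nagios 3 documentation appears to depend on the `use_aggressive_host_checking` option. The exit codes 0 to 3 are the standard Nagios plugin return codes.

```python
# Hypothetical sketch of the plugin-result to host-state mapping shown above.

PLUGIN_RESULTS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def preliminary_host_state(exit_code, aggressive=False):
    """Map a plugin exit code to a preliminary host state (UP or DOWN)."""
    # Assumption: any exit code outside 0..3 is treated like UNKNOWN.
    result = PLUGIN_RESULTS.get(exit_code, "UNKNOWN")
    if result == "OK":
        return "UP"
    if result == "WARNING":
        # The "UP or DOWN*" row: DOWN only when aggressive checking is on.
        return "DOWN" if aggressive else "UP"
    return "DOWN"  # UNKNOWN and CRITICAL


print(preliminary_host_state(3))
```

Since the cluster is modelled as a Nagios host, a check that returns UNKNOWN or CRITICAL is enough for Nagios to show the cluster as DOWN.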

Comment 6 RamaKasturi 2014-11-05 07:15:57 UTC

Created attachment 953931
Status of services in the cluster, when all the nodes are down

Comment 7 Ramesh N 2014-11-05 10:06:53 UTC

Based on Comments 3, 4 and 5, the new cluster states and state information will be as follows.

Cluster State   State Information
UP              "OK : None of the Volumes in the cluster are in Critical State"
UP              "OK : No Volumes present in the cluster"
UP              "WARNING : Some Volumes in the cluster are in Critical State"
DOWN            "CRITICAL: All Volumes in the cluster are in Critical State"
DOWN            "CRITICAL: All Volumes in the cluster are in unknown State"

Comment 8 Ramesh N 2014-11-06 04:27:05 UTC

Upstream patch : http://review.gluster.org/#/c/9053/

Comment 9 Ramesh N 2014-11-12 08:28:48 UTC

With the fix, the cluster state and state information will be as follows.

Cluster State   State Information
UP              "OK : None of the Volumes in the cluster are in Critical State"
UP              "OK : No Volumes present in the cluster"
UP              "WARNING : Some Volumes in the cluster are in Critical State"
UP              "WARNING : Some Volumes in the cluster are in Unknown State"
UP              "WARNING : Some Volumes in the cluster are in Warning State"
UP              "WARNING : All Volumes in the cluster are in Warning State"
DOWN            "CRITICAL: All Volumes in the cluster are in Critical State"
DOWN            "CRITICAL: All Volumes in the cluster are in Unknown State"
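
One way to read the table above as code is sketched below. It is not the upstream patch: the function name `cluster_state` is hypothetical, and the order in which mixed cases are resolved (Critical before Unknown before Warning) is an assumption, since the table does not say which message wins when several "Some Volumes" rows apply at once.

```python
# Hypothetical sketch of the post-fix state table; not the upstream patch.

def cluster_state(volume_states):
    """Return (cluster state, state information) for a list of volume states."""
    total = len(volume_states)
    if total == 0:
        return "UP", "OK : No Volumes present in the cluster"

    counts = {name: volume_states.count(name)
              for name in ("CRITICAL", "UNKNOWN", "WARNING")}

    # The cluster is DOWN only when every volume is CRITICAL or every volume
    # is UNKNOWN.
    for name in ("CRITICAL", "UNKNOWN"):
        if counts[name] == total:
            return "DOWN", ("CRITICAL: All Volumes in the cluster are in "
                            "%s State" % name.capitalize())

    if counts["WARNING"] == total:
        return "UP", "WARNING : All Volumes in the cluster are in Warning State"

    # Assumed precedence for mixed clusters: Critical, then Unknown, then Warning.
    for name in ("CRITICAL", "UNKNOWN", "WARNING"):
        if counts[name] > 0:
            return "UP", ("WARNING : Some Volumes in the cluster are in "
                          "%s State" % name.capitalize())

    return "UP", "OK : None of the Volumes in the cluster are in Critical State"


# The scenario from this bug: every node is down, so every volume is UNKNOWN.
print(cluster_state(["UNKNOWN", "UNKNOWN"]))
```

With every volume in the UNKNOWN state, this sketch yields DOWN rather than UP, which is the outcome the fix is meant to produce.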

Comment 10 RamaKasturi 2014-11-21 10:26:10 UTC

Verified and works fine with build nagios-server-addons-0.1.9-1.el6rhs.

When all the nodes in the cluster go down, the cluster status is displayed as "DOWN" with status information "CRITICAL : All Volumes in the cluster are in Unknown state".

Comment 11 Pavithra 2014-12-24 09:04:15 UTC

Hi Ramesh,

Can you review the edited doc text for technical accuracy and sign off?

Comment 12 Ramesh N 2014-12-24 11:43:57 UTC

Doc text looks good.

Comment 14 errata-xmlrpc 2015-01-15 13:49:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html
