Bug 1128007 - [Nagios] - When all the nodes in a cluster are down, cluster status shows 'UP' with status information as 'OK:None of the volumes are in critical state'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nagios-server-addons
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.0.3
Assignee: Ramesh N
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-08-08 05:59 UTC by RamaKasturi
Modified: 2015-05-13 17:41 UTC (History)

Fixed In Version: nagios-server-addons-0.1.9-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, when all the nodes in a Red Hat Storage trusted storage pool were offline, all the volumes moved to the "UNKNOWN" state, yet the cluster status was displayed as UP with the message 'OK:None of the volumes are in critical state'. With this fix, the states of all volumes are considered when computing the status of the Red Hat Storage trusted storage pool.
Clone Of:
Environment:
Last Closed: 2015-01-15 13:49:12 UTC


Attachments
Screenshot when all the nodes in the cluster are down. (153.00 KB, image/png)
2014-08-08 06:00 UTC, RamaKasturi
Status of services in the cluster, when all the nodes are down (201.42 KB, image/png)
2014-11-05 07:15 UTC, RamaKasturi


Links
System: Red Hat Product Errata
ID: RHBA-2015:0039
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Storage Console 3.0 enhancement and bug fix update #3
Last Updated: 2015-01-15 18:46:40 UTC

Description RamaKasturi 2014-08-08 05:59:42 UTC
Description of problem:
The setup is Nagios on an external server plus RHS.
When all the nodes in a cluster go down, the cluster status shows as "UP" with status information "OK: None of the volumes are in critical state".

Version-Release number of selected component (if applicable):
nagios-server-addons-0.1.5-1.el6rhs.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install Nagios on a RHEL server.
2. Run discovery.py.
3. Shut down all the nodes in the cluster.

Actual results:
Cluster status shows 'UP' with status information as 'OK: None of the volumes are in critical state'

Expected results:
Cluster status should be 'UNKNOWN' with status information as 'None of the hosts in the cluster are up'

Additional info:

Comment 1 RamaKasturi 2014-08-08 06:00:56 UTC
Created attachment 925083 [details]
Screenshot when all the nodes in the cluster are down.

Comment 2 Kanagaraj 2014-08-18 10:06:12 UTC
The cluster state is an aggregation of the states of the volumes inside the cluster.

As per the current code, the cluster state will be:
CRITICAL - if all volumes in the cluster are in the CRITICAL state
WARNING - if some volumes are in the CRITICAL state and the others are in a NON-CRITICAL state (OK, WARNING, UNKNOWN, PENDING)
OK - if all the volumes are in a NON-CRITICAL state (OK, WARNING, UNKNOWN, PENDING)
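
To make the failure mode concrete, here is a minimal sketch of the aggregation described above. It is not the actual nagios-server-addons code: the function name `old_cluster_state` and the string constants are hypothetical, and treating an empty volume list as OK is an assumption. It shows why a cluster whose volumes are all UNKNOWN still comes out as OK.

```python
# State names as plain strings (chosen for this sketch only).
OK, WARNING, CRITICAL, UNKNOWN, PENDING = "OK", "WARNING", "CRITICAL", "UNKNOWN", "PENDING"


def old_cluster_state(volume_states):
    """Pre-fix aggregation as described in this comment (hypothetical helper)."""
    critical_count = sum(1 for state in volume_states if state == CRITICAL)
    if volume_states and critical_count == len(volume_states):
        return CRITICAL  # every volume is CRITICAL
    if critical_count > 0:
        return WARNING  # only some volumes are CRITICAL
    # No CRITICAL volume: UNKNOWN and PENDING count as non-critical here.
    # Assumption: an empty volume list also falls through to OK.
    return OK


# All nodes down: every volume check reports UNKNOWN, yet the result is OK.
print(old_cluster_state([UNKNOWN, UNKNOWN, UNKNOWN]))
```

Because UNKNOWN is lumped in with the non-critical states, the all-nodes-down case is indistinguishable from a healthy cluster.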

Fixing this bug would require considering all possible states of the volumes and determining the cluster state from them. Maybe something like the following:

CRITICAL - if all volumes are in the CRITICAL state
WARNING - if some volumes are in the CRITICAL state, or all/some volumes are in the WARNING state
UNKNOWN - if all the volumes are in the UNKNOWN state
PENDING - if all the volumes are in the PENDING state
OK - if all the volumes are in the OK state

This change will affect the existing flow and will introduce new flows.

Comment 3 Kanagaraj 2014-08-19 07:42:50 UTC
The PENDING state is internal to Nagios and cannot be set from outside. So the cluster PENDING state proposed in Comment 2 is not possible.

Comment 4 Dusmant 2014-08-20 05:24:50 UTC
Further analysis from Kanagaraj
--------------------------------
Found a Nagios document that describes the mappings:
http://nagios.sourceforge.net/docs/3_0/hostchecks.html


Plugin Result    Preliminary Host State
OK               UP
WARNING          UP or DOWN*
UNKNOWN          DOWN
CRITICAL         DOWN

Going by this mapping, the cluster can be marked as DOWN if all the volumes are in the CRITICAL or UNKNOWN state.
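
The mapping in the table can be sketched as a small lookup. This is illustrative only: the function name and the `aggressive_host_checking` parameter are hypothetical, and the reading of the asterisk (WARNING becomes DOWN only when the Nagios option `use_aggressive_host_checking` is enabled) is taken from the linked Nagios 3 page as I understand it. The exit codes 0 to 3 are the standard Nagios plugin return codes.

```python
# Standard Nagios plugin exit codes.
PLUGIN_RESULTS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def preliminary_host_state(exit_code, aggressive_host_checking=False):
    """Translate a host check plugin exit code into UP or DOWN (sketch)."""
    result = PLUGIN_RESULTS[exit_code]
    if result == "OK":
        return "UP"
    if result == "WARNING":
        # The "UP or DOWN*" row: the outcome depends on the configuration.
        return "DOWN" if aggressive_host_checking else "UP"
    return "DOWN"  # UNKNOWN and CRITICAL


for code in sorted(PLUGIN_RESULTS):
    print(PLUGIN_RESULTS[code], "->", preliminary_host_state(code))
```

Since the cluster is modelled as a Nagios host, a check that returns CRITICAL or UNKNOWN for it is what makes the cluster show up as DOWN.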

Comment 6 RamaKasturi 2014-11-05 07:15:57 UTC
Created attachment 953931 [details]
Status of services in the cluster, when all the nodes are down

Comment 7 Ramesh N 2014-11-05 10:06:53 UTC
Based on comments 3, 4, and 5, the following will be the new cluster state and state information.

Cluster State   State Information
UP              "OK : None of the Volumes in the cluster are in Critical State"
UP              "OK : No Volumes present in the cluster"
UP              "WARNING : Some Volumes in the cluster are in Critical State"
DOWN            "CRITICAL: All Volumes in the cluster are in Critical State"
DOWN            "CRITICAL: All Volumes in the cluster are in unknown State"

Comment 8 Ramesh N 2014-11-06 04:27:05 UTC
Upstream patch : http://review.gluster.org/#/c/9053/

Comment 9 Ramesh N 2014-11-12 08:28:48 UTC
The following will be the cluster state and state information with the fix.

Cluster State   State Information
UP              "OK : None of the Volumes in the cluster are in Critical State"
UP              "OK : No Volumes present in the cluster"
UP              "WARNING : Some Volumes in the cluster are in Critical State"
UP              "WARNING : Some Volumes in the cluster are in Unknown State"
UP              "WARNING : Some Volumes in the cluster are in Warning State"
UP              "WARNING : All Volumes in the cluster are in Warning State"
DOWN            "CRITICAL: All Volumes in the cluster are in Critical State"
DOWN            "CRITICAL: All Volumes in the cluster are in Unknown State"
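
One way to read the table above as code is sketched below. This is not the upstream patch: the function name `cluster_status` is hypothetical, and the precedence among the "Some Volumes" rows (Critical, then Unknown, then Warning) is an assumption, since the table does not say which message wins when several apply. The message strings are copied from the table.

```python
OK, WARNING, CRITICAL, UNKNOWN = "OK", "WARNING", "CRITICAL", "UNKNOWN"


def cluster_status(volume_states):
    """Return (cluster state, state information) for a list of volume states."""
    if not volume_states:
        return "UP", "OK : No Volumes present in the cluster"
    distinct = set(volume_states)
    if distinct == {CRITICAL}:
        return "DOWN", "CRITICAL: All Volumes in the cluster are in Critical State"
    if distinct == {UNKNOWN}:
        return "DOWN", "CRITICAL: All Volumes in the cluster are in Unknown State"
    if distinct == {WARNING}:
        return "UP", "WARNING : All Volumes in the cluster are in Warning State"
    # Assumed precedence for mixed states: Critical, then Unknown, then Warning.
    if CRITICAL in distinct:
        return "UP", "WARNING : Some Volumes in the cluster are in Critical State"
    if UNKNOWN in distinct:
        return "UP", "WARNING : Some Volumes in the cluster are in Unknown State"
    if WARNING in distinct:
        return "UP", "WARNING : Some Volumes in the cluster are in Warning State"
    return "UP", "OK : None of the Volumes in the cluster are in Critical State"


# The scenario from this bug: all nodes down, so every volume is UNKNOWN.
print(cluster_status([UNKNOWN, UNKNOWN]))
```

With this reading, the all-nodes-down case from the description yields DOWN instead of UP.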

Comment 10 RamaKasturi 2014-11-21 10:26:10 UTC
Verified and works fine with build nagios-server-addons-0.1.9-1.el6rhs.

When all the nodes in the cluster go down, the cluster status is displayed as "DOWN" with status information "CRITICAL : All Volumes in the cluster are in Unknown state".

Comment 11 Pavithra 2014-12-24 09:04:15 UTC
Hi Ramesh,

Can you review the edited doc text for technical accuracy and sign off?

Comment 12 Ramesh N 2014-12-24 11:43:57 UTC
Doc text looks good.

Comment 14 errata-xmlrpc 2015-01-15 13:49:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html

