Bug 1128007

Summary: [Nagios] - When all the nodes in a cluster are down, cluster status shows 'UP' with status information as 'OK:None of the volumes are in critical state'
Product: Red Hat Gluster Storage [Red Hat Storage]
Reporter: RamaKasturi <knarra>
Component: nagios-server-addons
Assignee: Ramesh N <rnachimu>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: medium
Docs Contact:
Priority: high
Version: rhgs-3.0
CC: asrivast, dpati, psriniva, rhsc-qe-bugs, rnachimu
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.0.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: nagios-server-addons-0.1.9-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, when all the nodes in a Red Hat Storage trusted storage pool were offline, all the volumes moved to the "UNKNOWN" state and the cluster status was displayed as UP with the message 'OK: None of the volumes are in critical state'. With this fix, the states of all volumes are considered while computing the status of the Red Hat Storage trusted storage pool.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-01-15 13:49:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Screenshot when all the nodes in the cluster are down.
none
Status of services in the cluster, when all the nodes are down none

Description RamaKasturi 2014-08-08 05:59:42 UTC
Description of problem:
Setup is nagios on external server + RHS.
When all the nodes in a cluster go down, the cluster status shows as "UP" with status information "OK: None of the volumes are in critical state".

Version-Release number of selected component (if applicable):
nagios-server-addons-0.1.5-1.el6rhs.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install nagios on an RHEL server.
2. Run discovery.py.
3. Shut down all the nodes in the cluster.

Actual results:
Cluster status shows 'UP' with status information as 'OK: None of the volumes are in critical state'

Expected results:
Cluster status should be 'UNKNOWN' with status information as 'None of the hosts in the cluster are up'

Additional info:

Comment 1 RamaKasturi 2014-08-08 06:00:56 UTC
Created attachment 925083 [details]
Screenshot when all the nodes in the cluster are down.

Comment 2 Kanagaraj 2014-08-18 10:06:12 UTC
Cluster state is an aggregation of the states of the volumes inside the cluster.

As per the current code, the cluster state will be:
CRITICAL - if all volumes in the cluster are in CRITICAL state
WARNING - if some volumes are in CRITICAL state and the others are in a NON-CRITICAL state (OK, WARNING, UNKNOWN, PENDING)
OK - if all the volumes are in a NON-CRITICAL state (OK, WARNING, UNKNOWN, PENDING)

Fixing this bug would require considering all possible states of the volumes and determining the cluster state based on them. Maybe something like the following:

CRITICAL - if all volumes are in CRITICAL state
WARNING - if some volumes are in CRITICAL state, or all/some volumes are in WARNING state
UNKNOWN - if all the volumes are in UNKNOWN state
PENDING - if all the volumes are in PENDING state
OK - if all the volumes are in OK state

This change will affect the existing flow and will introduce newer flows.
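The proposed aggregation rule above can be sketched roughly as follows. This is only an illustration of the logic in this comment, not the actual nagios-server-addons code; the function name and the fallback-to-OK for mixed non-critical states are assumptions.

```python
# Hypothetical sketch of the proposed cluster-state aggregation from this
# comment; NOT the actual plugin code. State names mirror the Nagios
# service states discussed above.
CRITICAL, WARNING, UNKNOWN, PENDING, OK = (
    "CRITICAL", "WARNING", "UNKNOWN", "PENDING", "OK"
)

def aggregate_cluster_state(volume_states):
    """Derive the cluster state from the states of all volumes."""
    if not volume_states:
        return OK  # no volumes, so nothing can be critical
    if all(s == CRITICAL for s in volume_states):
        return CRITICAL
    # Some (but not all) volumes critical, or any volume in warning
    if any(s in (CRITICAL, WARNING) for s in volume_states):
        return WARNING
    if all(s == UNKNOWN for s in volume_states):
        return UNKNOWN
    if all(s == PENDING for s in volume_states):
        return PENDING
    # Remaining mixed non-critical combinations default to OK (assumption)
    return OK
```

For example, a pool where every volume reports CRITICAL would aggregate to CRITICAL, while one CRITICAL volume among healthy ones would aggregate to WARNING.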

Comment 3 Kanagaraj 2014-08-19 07:42:50 UTC
The PENDING state is internal to Nagios and cannot be changed from outside. So, per Comment 2, it is not possible to have the cluster in PENDING state.

Comment 4 Dusmant 2014-08-20 05:24:50 UTC
Further analysis from Kanagaraj
--------------------------------
Found a Nagios document that describes the mappings:
http://nagios.sourceforge.net/docs/3_0/hostchecks.html


Plugin Result	Preliminary Host State
OK	        UP
WARNING	        UP or DOWN*
UNKNOWN	        DOWN
CRITICAL	DOWN

Going by this mapping, the cluster can be marked as DOWN if all the volumes are in CRITICAL or UNKNOWN state.
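The plugin-result to preliminary-host-state mapping quoted from the Nagios documentation above can be expressed as a simple lookup. Illustrative only; the dictionary name is an assumption, and the WARNING case in Nagios depends on the `use_aggressive_host_checking` option.

```python
# The Nagios plugin-result -> preliminary host-state mapping quoted above,
# expressed as a lookup table (illustration only, not plugin code).
PRELIMINARY_HOST_STATE = {
    "OK": "UP",
    "WARNING": "UP or DOWN",  # DOWN only with use_aggressive_host_checking
    "UNKNOWN": "DOWN",
    "CRITICAL": "DOWN",
}
```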

Comment 6 RamaKasturi 2014-11-05 07:15:57 UTC
Created attachment 953931 [details]
Status of services in the cluster, when all the nodes are down

Comment 7 Ramesh N 2014-11-05 10:06:53 UTC
Based on comments 3, 4, and 5, the following will be the new cluster state and state information.

Cluster State                           State Information
UP              "OK : None of the Volumes in the cluster are in Critical State"
UP              "OK : No Volumes present in the cluster"
UP              "WARNING : Some Volumes in the cluster are in Critical State"
DOWN            "CRITICAL: All Volumes in the cluster are in Critical State"
DOWN            "CRITICAL: All Volumes in the cluster are in unknown State"

Comment 8 Ramesh N 2014-11-06 04:27:05 UTC
Upstream patch : http://review.gluster.org/#/c/9053/

Comment 9 Ramesh N 2014-11-12 08:28:48 UTC
Following will be the cluster state and state information with the fix.
 
Cluster State                State Information
UP      "OK : None of the Volumes in the cluster are in Critical State"
UP      "OK : No Volumes present in the cluster"
UP      "WARNING : Some Volumes in the cluster are in Critical State"
UP      "WARNING : Some Volumes in the cluster are in Unknown State"
UP      "WARNING : Some Volumes in the cluster are in Warning State"
UP      "WARNING : All Volumes in the cluster are in Warning State"
DOWN    "CRITICAL: All Volumes in the cluster are in Critical State"
DOWN    "CRITICAL: All Volumes in the cluster are in Unknown State"
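The table above can be sketched as a single function returning the cluster state and status message. This is a minimal illustration of the fixed mapping described in this comment, not the actual upstream patch; the function name and the order in which the "all" cases are checked before the "some" cases are assumptions.

```python
# Hypothetical sketch of the fixed cluster-status computation from this
# comment; NOT the actual upstream patch. Returns a
# (cluster_state, state_information) pair matching the table above.
def cluster_status(volume_states):
    if not volume_states:
        return ("UP", "OK : No Volumes present in the cluster")
    if all(s == "CRITICAL" for s in volume_states):
        return ("DOWN", "CRITICAL: All Volumes in the cluster are in Critical State")
    if all(s == "UNKNOWN" for s in volume_states):
        return ("DOWN", "CRITICAL: All Volumes in the cluster are in Unknown State")
    if all(s == "WARNING" for s in volume_states):
        return ("UP", "WARNING : All Volumes in the cluster are in Warning State")
    if any(s == "CRITICAL" for s in volume_states):
        return ("UP", "WARNING : Some Volumes in the cluster are in Critical State")
    if any(s == "UNKNOWN" for s in volume_states):
        return ("UP", "WARNING : Some Volumes in the cluster are in Unknown State")
    if any(s == "WARNING" for s in volume_states):
        return ("UP", "WARNING : Some Volumes in the cluster are in Warning State")
    return ("UP", "OK : None of the Volumes in the cluster are in Critical State")
```

In the scenario from this bug, all nodes are down, so every volume reports UNKNOWN and the cluster is marked DOWN rather than UP.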

Comment 10 RamaKasturi 2014-11-21 10:26:10 UTC
Verified and works fine with build nagios-server-addons-0.1.9-1.el6rhs.

When all the nodes in the cluster go down, the cluster status is displayed as "DOWN" with status information "CRITICAL : All Volumes in the cluster are in Unknown state".

Comment 11 Pavithra 2014-12-24 09:04:15 UTC
Hi Ramesh,

Can you review the edited doc text for technical accuracy and sign off?

Comment 12 Ramesh N 2014-12-24 11:43:57 UTC
Doc text looks good.

Comment 14 errata-xmlrpc 2015-01-15 13:49:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html