Description of problem:
The setup is Nagios on an external server monitoring RHS. When all the nodes in a cluster go down, the cluster status shows as "UP" with the status information "OK: None of the volumes are in critical state".

Version-Release number of selected component (if applicable):
nagios-server-addons-0.1.5-1.el6rhs.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install Nagios on a RHEL server.
2. Run discovery.py.
3. Shut down all the nodes in the cluster.

Actual results:
Cluster status shows 'UP' with status information 'OK: None of the volumes are in critical state'.

Expected results:
Cluster status should be 'UNKNOWN' with status information 'None of the hosts in the cluster are up'.

Additional info:
Created attachment 925083
Screenshot when all the nodes in the cluster are down.
Cluster state is an aggregation of the states of the volumes inside the cluster. As per the current code, the cluster state will be:

CRITICAL - if all volumes in the cluster are in CRITICAL state
WARNING - if some volumes are in CRITICAL state and the others are in a NON-CRITICAL state (OK, WARNING, UNKNOWN, PENDING)
OK - if all the volumes are in a NON-CRITICAL state (OK, WARNING, UNKNOWN, PENDING)

Fixing this bug would require considering all possible states of the volumes and determining the cluster state from them. Maybe something like the following:

CRITICAL - if all volumes are in CRITICAL state
WARNING - if some volumes are in CRITICAL state, or all/some volumes are in WARNING state
UNKNOWN - if all the volumes are in UNKNOWN state
PENDING - if all the volumes are in PENDING state
OK - if all the volumes are in OK state

This change will affect the existing flow and will introduce new flows.
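To make the cause of the bug concrete, here is a minimal sketch of the current aggregation as described above. It is not the actual nagios-server-addons code: the function name `current_cluster_state` is hypothetical, and treating an empty volume list as OK is an assumption.

```python
# Hypothetical sketch of the aggregation rules described in this comment;
# not taken from the nagios-server-addons source.

def current_cluster_state(volume_states):
    """Return the cluster state for a list of volume state names."""
    critical_count = sum(1 for state in volume_states if state == "CRITICAL")

    # All volumes CRITICAL -> cluster CRITICAL.
    if volume_states and critical_count == len(volume_states):
        return "CRITICAL"

    # Some (but not all) volumes CRITICAL -> cluster WARNING.
    if critical_count > 0:
        return "WARNING"

    # Everything else (OK, WARNING, UNKNOWN, PENDING) is lumped together
    # as NON-CRITICAL. An empty list also lands here (assumption).
    return "OK"


# When every node is down, every volume reports UNKNOWN,
# and the cluster is still reported as OK.
print(current_cluster_state(["UNKNOWN", "UNKNOWN"]))
```

Because UNKNOWN is treated as NON-CRITICAL, a cluster whose volumes are all UNKNOWN falls through to OK, which matches the behaviour reported in this bug.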
The PENDING state is internal to Nagios and cannot be set from outside. So the PENDING cluster state suggested in Comment 2 is not possible.
Further analysis from Kanagaraj
--------------------------------
Found a Nagios document that describes the mappings:
http://nagios.sourceforge.net/docs/3_0/hostchecks.html

Plugin Result    Preliminary Host State
OK               UP
WARNING          UP or DOWN*
UNKNOWN          DOWN
CRITICAL         DOWN

Going this way, the cluster can be marked as DOWN if all the volumes are in CRITICAL or UNKNOWN state.
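The mapping above can be sketched as a small lookup. This is an illustration only: the function name and the `aggressive` parameter are hypothetical. As far as I recall from the linked Nagios 3 documentation, the asterisk refers to the `use_aggressive_host_checking` option, which makes a WARNING result count as DOWN; `aggressive` stands in for that option here.

```python
# Illustrative mapping from plugin result to preliminary host state.
# The `aggressive` flag is a stand-in (assumption) for the condition
# marked with an asterisk in the table.

def preliminary_host_state(plugin_result, aggressive=False):
    """Map a plugin result name to UP or DOWN."""
    if plugin_result == "WARNING":
        # WARNING is the only result whose mapping depends on configuration.
        return "DOWN" if aggressive else "UP"

    fixed_mapping = {
        "OK": "UP",
        "UNKNOWN": "DOWN",
        "CRITICAL": "DOWN",
    }
    return fixed_mapping[plugin_result]


for result in ("OK", "WARNING", "UNKNOWN", "CRITICAL"):
    print(result, "->", preliminary_host_state(result))
```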
Created attachment 953931
Status of services in the cluster, when all the nodes are down
Based on comments 3, 4 and 5, the following will be the new cluster state and state information.

Cluster State    State Information
UP               "OK : None of the Volumes in the cluster are in Critical State"
UP               "OK : No Volumes present in the cluster"
UP               "WARNING : Some Volumes in the cluster are in Critical State"
DOWN             "CRITICAL: All Volumes in the cluster are in Critical State"
DOWN             "CRITICAL: All Volumes in the cluster are in unknown State"
Upstream patch : http://review.gluster.org/#/c/9053/
The following will be the cluster state and state information with the fix.

Cluster State    State Information
UP               "OK : None of the Volumes in the cluster are in Critical State"
UP               "OK : No Volumes present in the cluster"
UP               "WARNING : Some Volumes in the cluster are in Critical State"
UP               "WARNING : Some Volumes in the cluster are in Unknown State"
UP               "WARNING : Some Volumes in the cluster are in Warning State"
UP               "WARNING : All Volumes in the cluster are in Warning State"
DOWN             "CRITICAL: All Volumes in the cluster are in Critical State"
DOWN             "CRITICAL: All Volumes in the cluster are in Unknown State"
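For reference, the state table above could be expressed roughly as follows. This is a sketch, not the code from the upstream patch: the function name `cluster_state` is hypothetical, and the precedence used for mixed states (CRITICAL, then UNKNOWN, then WARNING) is an assumption, since the table does not say which message wins when several apply.

```python
# Hypothetical sketch of the state table in this comment; not the patch itself.

def cluster_state(volume_states):
    """Return (cluster_state, state_information) for a list of volume states."""
    unique = set(volume_states)

    if not unique:
        return "UP", "OK : No Volumes present in the cluster"

    # Uniform states: every volume reports the same thing.
    if unique == {"CRITICAL"}:
        return "DOWN", "CRITICAL: All Volumes in the cluster are in Critical State"
    if unique == {"UNKNOWN"}:
        return "DOWN", "CRITICAL: All Volumes in the cluster are in Unknown State"
    if unique == {"WARNING"}:
        return "UP", "WARNING : All Volumes in the cluster are in Warning State"

    # Mixed states. The order checked here is an assumption.
    for state in ("CRITICAL", "UNKNOWN", "WARNING"):
        if state in unique:
            message = "WARNING : Some Volumes in the cluster are in %s State"
            return "UP", message % state.capitalize()

    return "UP", "OK : None of the Volumes in the cluster are in Critical State"


# All nodes down: every volume is UNKNOWN, so the cluster is DOWN.
print(cluster_state(["UNKNOWN", "UNKNOWN"]))
```

With this ordering, a cluster whose volumes are a mix of CRITICAL and UNKNOWN is reported as UP with a "Some Volumes ... Critical" warning; whether that case should instead be DOWN is not settled by the table.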
Verified; works fine with build nagios-server-addons-0.1.9-1.el6rhs. When all the nodes in the cluster go down, the cluster status is displayed as "DOWN" with status information "CRITICAL : All Volumes in the cluster are in Unknown state".
Hi Ramesh, Can you review the edited doc text for technical accuracy and sign off?
Doc text looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0039.html