Bug 1141171

Summary: [Nagios] Quorum service is seen to be OK when it should actually be in CRITICAL state
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shruti Sampat <ssampat>
Component: gluster-nagios-addons
Assignee: Sahina Bose <sabose>
Status: CLOSED ERRATA
QA Contact: Shruti Sampat <ssampat>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.0
CC: asrivast, dpati, psriniva, rhsc-qe-bugs, rnachimu, sabose, sgraf, sharne
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.0.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: gluster-nagios-addons-0.1.12-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, the quorum service displayed an incorrect status. With this fix, a buffering issue is resolved and the quorum service displays the appropriate status.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-01-15 13:49:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Shruti Sampat 2014-09-12 11:33:44 UTC
Description of problem:
------------------------

When a node in the cluster was brought down to cause quorum to be lost, the status of the cluster-quorum service changed to CRITICAL. The state then changed to OK while the node was still down. The state should not have changed to OK, as quorum was still lost.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

nagios-server-addons-0.1.6-1.el6rhs.noarch
gluster-nagios-addons-0.1.10-2.el6rhs.x86_64

How reproducible:
-----------------

Intermittently

Steps to Reproduce:
-------------------

1. Create a cluster of 2 RHS nodes and start monitoring it using nagios.
2. Create 2 volumes of distribute-replicate type, enable server quorum, and set the quorum ratio to 80%.
3. Bring down one of the nodes, so quorum is lost.
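
The arithmetic behind step 3 can be sketched as follows. This is a minimal illustration, assuming quorum is met when the share of active nodes is at least the configured ratio; the function name `server_quorum_met` is hypothetical and is not part of gluster or gluster-nagios-addons.

```python
def server_quorum_met(active_nodes: int, total_nodes: int, ratio_percent: float) -> bool:
    """Return True if the share of active nodes is at least the quorum ratio.

    Assumption: quorum is met when active/total >= ratio/100.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    # Compare without division: active/total >= ratio/100
    return active_nodes * 100 >= ratio_percent * total_nodes


# Two-node cluster with an 80% quorum ratio, as in the steps above.
print(server_quorum_met(2, 2, 80))  # both nodes up
print(server_quorum_met(1, 2, 80))  # one node brought down
```

With one of two nodes down, only 50% of the nodes are active, which is below 80%, so the quorum service is expected to stay CRITICAL for as long as the node is down.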

Actual results:
----------------

The state of the service changed to CRITICAL, and then to OK.

Expected results:
------------------

The state of the service should have remained CRITICAL, as quorum was lost.

Additional info:

Comment 1 Sahina Bose 2014-09-15 11:30:45 UTC
http://review.gluster.org/8740

Comment 2 Sahina Bose 2014-10-29 05:20:49 UTC
The issue was caused by messages buffered in the plugin that processes the syslog messages. It occurs when quorum is regained and then lost again.

The earlier "quorum regained" message was still in the buffer and was re-sent to Nagios when the "quorum lost" messages were received. This behaviour is fixed with the attached patch.
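
The class of bug described here can be illustrated with a toy model. This is only a sketch, not the plugin's actual code: the class `QuorumForwarder`, the event strings, and the rule that the oldest buffered event is the one reported are all assumptions made for illustration. It does not attempt to reproduce the exact timing seen in the report.

```python
from typing import List, Optional

OK = "OK"
CRITICAL = "CRITICAL"

# Hypothetical mapping from syslog events to Nagios states.
EVENT_TO_STATE = {"quorum regained": OK, "quorum lost": CRITICAL}


class QuorumForwarder:
    """Toy forwarder: buffers syslog events and reports the buffer head to Nagios."""

    def __init__(self, clear_after_send: bool) -> None:
        self.clear_after_send = clear_after_send
        self.buffer: List[str] = []
        self.nagios_state: Optional[str] = None  # last state Nagios received

    def on_syslog_event(self, event: str) -> None:
        self.buffer.append(event)
        # Assumption: the oldest buffered event is the one that gets reported.
        self.nagios_state = EVENT_TO_STATE[self.buffer[0]]
        if self.clear_after_send:
            # Fixed behaviour in this model: drop events once they are reported.
            self.buffer.clear()


def final_state(clear_after_send: bool) -> Optional[str]:
    """Replay 'regained, then lost again' and return what Nagios ends up with."""
    forwarder = QuorumForwarder(clear_after_send)
    for event in ("quorum regained", "quorum lost"):
        forwarder.on_syslog_event(event)
    return forwarder.nagios_state


print(final_state(clear_after_send=False))  # stale message wins: OK
print(final_state(clear_after_send=True))   # buffer cleared: CRITICAL
```

In this model, the stale "quorum regained" entry is reported again when the "quorum lost" event arrives, so Nagios shows OK even though quorum is lost. Clearing the buffer after each send leaves the service in CRITICAL.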

Comment 3 Shruti Sampat 2014-11-14 11:56:43 UTC
Verified as fixed in gluster-nagios-addons-0.1.12-1.el6rhs.x86_64

Quorum service remains in critical state when quorum is lost.

Comment 4 Shalaka 2014-11-27 06:24:39 UTC
Please add doc text for this bug.

Comment 5 Pavithra 2014-12-17 06:34:24 UTC
Hi Sahina,

Can you please review the edited doc text for technical accuracy and sign off?

Comment 6 Sahina Bose 2014-12-24 09:12:06 UTC
Looks good

Comment 8 errata-xmlrpc 2015-01-15 13:49:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html