Bug 1081900

Summary: [Nagios] [RFE] Alerting mechanism for split-brain from Nagios
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Prasanth <pprakash>
Component: gluster-nagios-addons
Assignee: Sahina Bose <sabose>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.0
CC: annair, asrivast, divya, dpati, knarra, nlevinki, nsathyan, rhs-bugs, sabose, sdharane, ssaha
Target Milestone: ---
Keywords: FutureFeature, Reopened
Target Release: RHGS 3.1.0
Flags: divya: needinfo+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: gluster-nagios-addons-0.2.1-1
Doc Type: Enhancement
Doc Text:
Previously, there was no way to alert the user when split-brain was detected on a replicate volume. As a result, users were unaware of the issue and could not take timely corrective action. With this enhancement, the Nagios plugin for self-heal monitoring reports whether any entries are in split-brain state. The plugin has been renamed from "Volume Self-heal" to "Volume Split-brain status".
Story Points: ---
Clone Of: 1033197
Environment:
Last Closed: 2015-07-29 05:25:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1100563
Bug Blocks: 1033197, 1202842

Description Prasanth 2014-03-28 07:28:22 UTC
+++ This bug was initially created as a clone of Bug #1033197 +++

Description of problem:

Split-brains are inevitable in the field, either because of network issues or because of bugs in the software stack.

There is currently no way for storage administrators to be notified of split-brain situations so that they can take remedial action.

This is an RFE (Request For Enhancement) to provide a mechanism that alerts storage administrators to split-brain situations. Furthermore, a mechanism needs to be provided for storage administrators to diagnose the situation, identify the root cause and take remedial action. This latter part is perhaps a different RFE, but it is combined here until we have a complete assessment of the entire request.
 
Version-Release number of selected component (if applicable):

RHSC 2.1 and RHS 2.1 


Additional info:

Alerts should be generated in case of split-brains in 

- client facing network
- server side network 
- or combinations of the above 

If there is a loss of connectivity between the management network (where RHSC is located) and clients and/or servers, an alert to that effect also needs to be in place.

--- Additional comment from RHEL Product and Program Management on 2013-11-21 12:24:43 EST ---

Since this issue was entered in bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

Comment 2 Dusmant 2014-04-01 05:15:07 UTC

*** This bug has been marked as a duplicate of bug 1033197 ***

Comment 3 Prasanth 2014-04-01 10:36:53 UTC
Not sure why this bug was closed as duplicate as it was created specifically for having the feature included in Nagios as per the last Bug triage:

-----------------------------------------------
As discussed in the triage meeting, a new bug is now opened to track this feature through Nagios. ( Currently Alerts would not be shown in RHSC. They will be shown only in Nagios UI )
---------Note from triage meeting--------------
1033197 - Out, for now. A different bug will be created for monitoring split-brain using Nagios. (Bug 1081900 opened for the same)
-----------------------------------------------

Hence re-opening it.

Comment 4 Shubhendu Tripathi 2014-05-21 16:02:51 UTC
Currently there is way in gluster to identify a split brain and so in Nagios UI there is no way to alert the case of a split brain.
Currently in Nagios the split brain scenario is being identified based on the quorum check for the volume.

Comment 5 Shubhendu Tripathi 2014-05-22 14:44:19 UTC
Small correction in the comment earlier. Please read as below -

"Currently there is NO way in gluster to identify a split brain and so in Nagios UI there is no way to alert the case of a split brain.
Currently in Nagios the split brain scenario is being identified based on the quorum check for the volume."

Sorry for the typo.

Comment 6 Dusmant 2014-05-29 07:21:47 UTC
As discussed with Alok, Vijay and other key stakeholders over e-mail, I am taking this bug out of the Denali release.

Comment 7 Sahina Bose 2015-02-10 05:14:12 UTC
We will be taking the following in for Everglades:

1. Alerting when files are in split brain (using "gluster volume heal <VOLNAME> info split-brain")

2. When there's a network split-brain this is currently alerted using the Cluster-quorum plugin (this plugin will alert the administrator when volumes have lost quorum as long as server side quorum is turned on)

Comment 8 Sahina Bose 2015-03-02 09:20:07 UTC
Patches http://review.gluster.org/9782 and http://review.gluster.org/9783 have been posted.

Comment 10 RamaKasturi 2015-06-19 13:23:06 UTC
Verified and works fine with gluster-nagios-addons-0.2.3-1.el6rhs.x86_64.

Currently, when Nagios detects that a split-brain has occurred, it sets the "Volume Split-brain status - <vol_name>" service to CRITICAL and shows the number of files that are in split-brain.

When no split-brain is detected, "Volume Split-brain status - <vol_name>" remains in the OK state with the status information "No split brain state entries found".

When the volume is stopped or deleted, "Volume Split-brain status - <vol_name>" displays the status WARNING with the status information "split brain status could not be determined".
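
The three states described above can be sketched roughly as follows. This is an illustration only, not the gluster-nagios-addons code: the function name split_brain_status and the wording of the CRITICAL message are assumptions (the exact CRITICAL text is not quoted in this bug). The OK and WARNING messages are the ones reported above, and the numeric values are the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def split_brain_status(entry_count):
    """Map a split-brain entry count to (exit_code, status_message).

    entry_count is None when the count could not be obtained,
    for example because the volume is stopped or deleted.
    Hypothetical helper; not taken from the actual plugin.
    """
    if entry_count is None:
        return WARNING, "split brain status could not be determined"
    if entry_count > 0:
        # Assumed wording: the real plugin's CRITICAL text is not quoted in this bug.
        return CRITICAL, "%d entries found in split-brain state" % entry_count
    return OK, "No split brain state entries found"
```

A plugin wrapper would print the message and exit with the returned code, which is how Nagios picks up the service state.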

Comment 11 RamaKasturi 2015-06-19 13:27:49 UTC
Email and SNMP notifications are sent when the split-brain status changes to critical and when it returns to normal.

Comment 12 Divya 2015-07-26 05:09:48 UTC
Sahina,

Please review the edited doc text and sign off.

Comment 13 Sahina Bose 2015-07-27 05:04:16 UTC
Acked

Comment 16 errata-xmlrpc 2015-07-29 05:25:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-1494.html