Bug 1081900 - [Nagios] [RFE] Alerting mechanism for split-brain from Nagios
Summary: [Nagios] [RFE] Alerting mechanism for split-brain from Nagios
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-nagios-addons
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Sahina Bose
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On: 1100563
Blocks: 1033197 1202842
Reported: 2014-03-28 07:28 UTC by Prasanth
Modified: 2015-07-29 05:25 UTC

Fixed In Version: gluster-nagios-addons-0.2.1-1
Doc Type: Enhancement
Doc Text:
Previously, there was no way to alert the user when split-brain was detected on a replicate volume. As a result, users were unaware of the issue and could not take timely corrective action. With this enhancement, the Nagios plugin for self-heal monitoring reports whether any entries are in split-brain state. The plugin has been renamed from "Volume Self-heal" to "Volume Split-brain status".
Clone Of: 1033197
Environment:
Last Closed: 2015-07-29 05:25:34 UTC
Embargoed:
divya: needinfo+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2015:1494 0 normal SHIPPED_LIVE Red Hat Gluster Storage Console 3.1 Enhancement and bug fixes 2015-07-29 09:24:02 UTC

Description Prasanth 2014-03-28 07:28:22 UTC
+++ This bug was initially created as a clone of Bug #1033197 +++

Description of problem:

Split-brains are inevitable in the field, either because of network issues or because of bugs in the software stack.

There is no way currently for storage administrators to be notified of split-brain situations so that they can take remedial action. 

This is an RFE (Request For Enhancement) to provide a mechanism that alerts storage administrators to split-brain situations. Furthermore, a mechanism needs to be provided for storage administrators to diagnose the situation, identify the root cause and take remedial action. This latter part is perhaps a different RFE, but it is combined here until we have a complete assessment of the entire request.
 
Version-Release number of selected component (if applicable):

RHSC 2.1 and RHS 2.1 


Additional info:

Alerts should be generated in case of split-brains in 

- client facing network
- server side network 
- or combinations of the above 

If connectivity is lost between the management network (where RHSC is located) and the clients and/or servers, an alert to that effect also needs to be in place.

--- Additional comment from RHEL Product and Program Management on 2013-11-21 12:24:43 EST ---

Since this issue was entered in bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

Comment 2 Dusmant 2014-04-01 05:15:07 UTC

*** This bug has been marked as a duplicate of bug 1033197 ***

Comment 3 Prasanth 2014-04-01 10:36:53 UTC
Not sure why this bug was closed as duplicate as it was created specifically for having the feature included in Nagios as per the last Bug triage:

-----------------------------------------------
As discussed in the triage meeting, a new bug is now opened to track this feature through Nagios. (Currently, alerts would not be shown in RHSC. They will be shown only in the Nagios UI.)
---------Note from triage meeting--------------
1033197 - Out, for now. A different bug will be created for monitoring split-brain using Nagios. (Bug 1081900 opened for the same)
-----------------------------------------------

Hence re-opening it.

Comment 4 Shubhendu Tripathi 2014-05-21 16:02:51 UTC
Currently there is way in gluster to identify a split brain and so in Nagios UI there is no way to alert the case of a split brain.
Currently in Nagios the split brain scenario is being identified based on the quorum check for the volume.

Comment 5 Shubhendu Tripathi 2014-05-22 14:44:19 UTC
Small correction in the comment earlier. Please read as below -

"Currently there is NO way in gluster to identify a split brain and so in Nagios UI there is no way to alert the case of a split brain.
Currently in Nagios the split brain scenario is being identified based on the quorum check for the volume."

Sorry for the typo.

Comment 6 Dusmant 2014-05-29 07:21:47 UTC
As discussed with Alok, Vijay and other key stakeholders over e-mail, I am taking this bug out of the Denali release.

Comment 7 Sahina Bose 2015-02-10 05:14:12 UTC
We will be taking the following in for Everglades:

1. Alerting when files are in split brain (using the "gluster volume heal <VOLNAME> info split-brain" command)

2. When there's a network split-brain this is currently alerted using the Cluster-quorum plugin (this plugin will alert the administrator when volumes have lost quorum as long as server side quorum is turned on)
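To illustrate item 1, here is a minimal sketch of how a check could count split-brain entries from the text printed by the heal info command (spelled "gluster volume heal <VOLNAME> info split-brain" in the gluster CLI). This is not the code from the posted patches. The function name, the sample output and the assumed "Number of entries" line format are illustrative assumptions, and the real output may differ between gluster releases. The sketch sums the per-brick counts, so a file reported by both bricks of a replica pair is counted twice.

```python
import re

# Assumed format of the per-brick summary line; real gluster output
# may differ between releases.
_COUNT_RE = re.compile(r"^Number of entries(?: in split-brain)?:\s*(\d+)$")


def count_split_brain_entries(heal_info_output):
    """Sum the per-brick entry counts found in the given command output.

    Returns None when no count line is present, so that callers can tell
    "zero entries" apart from "status could not be determined".
    """
    total = None
    for line in heal_info_output.splitlines():
        match = _COUNT_RE.match(line.strip())
        if match:
            total = (total or 0) + int(match.group(1))
    return total


# Hypothetical sample output for a two-brick replicate volume.
SAMPLE = """\
Brick server1:/bricks/b1
/dir/file1
Number of entries in split-brain: 1

Brick server2:/bricks/b1
/dir/file1
/dir/file2
Number of entries in split-brain: 2
"""

if __name__ == "__main__":
    print(count_split_brain_entries(SAMPLE))
```

Returning None rather than 0 when no count line is found keeps the "volume stopped or unreachable" case distinct from a healthy volume.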

Comment 8 Sahina Bose 2015-03-02 09:20:07 UTC
Patches http://review.gluster.org/9782 and http://review.gluster.org/9783 posted.

Comment 10 RamaKasturi 2015-06-19 13:23:06 UTC
Verified and works fine with gluster-nagios-addons-0.2.3-1.el6rhs.x86_64.

Currently, when Nagios detects that a split brain has occurred, it sets the Volume Split-brain status - <vol_name> service to CRITICAL and shows the number of files in split brain.

When there is no split brain detected, Volume Split-brain status - <vol_name> remains in OK state with status information as "No split brain state entries found".

When the volume is stopped or deleted, Volume Split-brain status - <vol_name> displays the status as WARNING with status information as "split brain status could not be determined".
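The three verified behaviours can be summarised as a small mapping from an entry count to a Nagios state. The sketch below is illustrative only and is not the plugin's actual code. The exit codes follow the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL), the OK and WARNING messages are the ones quoted in this comment, and the function name and the CRITICAL message wording are assumptions.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def split_brain_status(entry_count):
    """Map a split-brain entry count to (exit_code, status_text).

    entry_count is None when the status could not be determined,
    for example because the volume is stopped or deleted.
    """
    if entry_count is None:
        return WARNING, "split brain status could not be determined"
    if entry_count == 0:
        return OK, "No split brain state entries found"
    # Hypothetical wording; the real plugin's CRITICAL text is not
    # quoted in this bug report.
    return CRITICAL, "%d entries found in split-brain state" % entry_count


if __name__ == "__main__":
    for count in (None, 0, 3):
        code, text = split_brain_status(count)
        print(code, text)
```

A real check would print the status text and exit with the returned code so that Nagios can pick up the state.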

Comment 11 RamaKasturi 2015-06-19 13:27:49 UTC
Email and SNMP notifications are sent when the split-brain status changes to CRITICAL and when it returns to normal.

Comment 12 Divya 2015-07-26 05:09:48 UTC
Sahina,

Please review the edited doc text and sign-off.

Comment 13 Sahina Bose 2015-07-27 05:04:16 UTC
Acked

Comment 16 errata-xmlrpc 2015-07-29 05:25:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-1494.html

