+++ This bug was initially created as a clone of Bug #1033197 +++

Description of problem:
Split-brains are inevitable in the field, either because of network issues or because of bugs in the software stack. Storage administrators currently have no way to be notified of split-brain situations so that they can take remedial action. This is an RFE (Request For Enhancement) to provide a mechanism that alerts storage administrators to split-brain situations. Furthermore, a mechanism needs to be provided for storage administrators to diagnose the situation, identify the root cause and take remedial action. This latter part is perhaps a different RFE, but it is combined here until we have a complete assessment of the entire request.

Version-Release number of selected component (if applicable):
RHSC 2.1 and RHS 2.1

Additional info:
Alerts should be generated in case of split-brains in
- the client-facing network
- the server-side network
- or combinations of the above

If connectivity is lost between the management network (where RHSC is located) and the clients and/or servers, an alert to that effect also needs to be in place.

--- Additional comment from RHEL Product and Program Management on 2013-11-21 12:24:43 EST ---

Since this issue was entered in bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.
*** This bug has been marked as a duplicate of bug 1033197 ***
Not sure why this bug was closed as a duplicate, as it was created specifically for having the feature included in Nagios as per the last bug triage:
-----------------------------------------------
As discussed in the triage meeting, a new bug is now opened to track this feature through Nagios. (Currently alerts would not be shown in RHSC. They will be shown only in the Nagios UI.)
---------Note from triage meeting--------------
1033197 - Out, for now. A different bug will be created for monitoring split-brain using Nagios. (Bug 1081900 opened for the same)
-----------------------------------------------
Hence re-opening it.
Currently there is way in gluster to identify a split brain and so in Nagios UI there is no way to alert the case of a split brain. Currently in Nagios the split brain scenario is being identified based on the quorum check for the volume.
Small correction in the comment earlier. Please read as below - "Currently there is NO way in gluster to identify a split brain and so in Nagios UI there is no way to alert the case of a split brain. Currently in Nagios the split brain scenario is being identified based on the quorum check for the volume." Sorry for the typo.
As discussed with Alok, Vijay and other key stakeholders over e-mail, I am taking this bug out of the Denali release.
We will be taking the following in for Everglades:
1. Alerting when files are in split-brain (using "gluster volume heal <VOLNAME> info split-brain").
2. Alerting on a network split-brain. This is currently covered by the Cluster-quorum plugin, which alerts the administrator when volumes have lost quorum, as long as server-side quorum is turned on.
Patches http://review.gluster.org/9782 and http://review.gluster.org/9783 posted
Verified and works fine with gluster-nagios-addons-0.2.3-1.el6rhs.x86_64. Currently, when Nagios detects that split-brain has occurred, it sets the "Volume Split-Brain status - <vol_name>" service to CRITICAL and shows the number of files that are in split-brain. When no split-brain is detected, "Volume Split-brain status - <vol_name>" remains in the OK state with the status information "No split brain state entries found". When the volume is stopped or deleted, "Volume Split-brain status - <vol_name>" shows the status WARNING with the status information "split brain status could not be determined".
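For illustration only, the three states described above can be sketched as a small Nagios-style check in Python. This is not the gluster-nagios-addons code. The function names, the assumed "Number of entries" summary line in the CLI output, and the wording of the CRITICAL message are all hypothetical. Only the OK and WARNING messages are taken from the comment above.

```python
import re
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumption: each brick section of the CLI output ends with a summary line
# such as "Number of entries in split-brain: 2" or "Number of entries: 2".
ENTRY_COUNT = re.compile(
    r"^Number of entries(?: in split-brain)?:[ \t]*(\d+)[ \t]*$", re.MULTILINE
)


def run_heal_info(volume):
    """Return the CLI output as text, or None if the command could not run or failed."""
    cmd = ["gluster", "volume", "heal", volume, "info", "split-brain"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError:
        return None
    return result.stdout if result.returncode == 0 else None


def split_brain_status(heal_output):
    """Map CLI output (None when the query failed) to (exit_code, message)."""
    counts = ENTRY_COUNT.findall(heal_output) if heal_output is not None else []
    if not counts:
        # Covers a stopped or deleted volume and any output we cannot parse.
        return WARNING, "split brain status could not be determined"
    # Per-brick counts are summed, so a file listed under two bricks counts twice.
    entries = sum(int(n) for n in counts)
    if entries > 0:
        # Hypothetical message text; the comment above only says a count is shown.
        return CRITICAL, "%d entries found in split-brain state" % entries
    return OK, "No split brain state entries found"
```

A wrapper script would call split_brain_status(run_heal_info(volume)), print the message, and exit with the returned code so that Nagios can pick up the service state.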
Email and SNMP notifications are sent when the split-brain status changes to CRITICAL and when it returns to normal.
Sahina, Please review the edited doc text and sign-off.
Acked
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-1494.html