Bug 1329936 - Heal info plugin shows Critical state when files are healing, which is misleading
Summary: Heal info plugin shows Critical state when files are healing, which is mislea...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nagios-server-addons
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Sahina Bose
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On:
Blocks: 1311817
 
Reported: 2016-04-25 05:46 UTC by Sweta Anandpara
Modified: 2016-06-23 05:27 UTC

Fixed In Version: gluster-nagios-addons-0.2.7-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-23 05:27:56 UTC
Target Upstream Version:


Attachments
screenshot of nagios UI (224.82 KB, image/png)
2016-04-25 05:46 UTC, Sweta Anandpara
Server and client logs (23.31 KB, application/vnd.oasis.opendocument.text)
2016-05-12 11:58 UTC, Sweta Anandpara


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1242 0 normal SHIPPED_LIVE Red Hat Gluster Storage Console 3.1 update 3 bug fixes 2016-06-23 09:01:46 UTC

Description Sweta Anandpara 2016-04-25 05:46:04 UTC
Created attachment 1150258: screenshot of nagios UI

Description of problem:
-----------------------

A new plugin, 'volume heal info' (RFE BZ 1312207), has been added to display the status/progress of self-heal running in the background. It currently reports the following states:

OK: No split-brain entries found. All files are synced.
WARNING: Unsynced entries are found, or command execution fails.
CRITICAL: Self-heal is in progress

The state that should be flagged 'critical' is shown as 'warning', and the state that should be shown as 'warning' is flagged 'critical'.


LOGIC that should ideally go behind 'warning': 

When self-heal is in progress, the system is not necessarily broken. Work is going on in the background to fix it, and manual intervention is NOT needed. This state should ideally be a 'WARNING', which tells the user: 'Nothing to be alarmed about, but do monitor.'

LOGIC that should ideally go behind 'critical':

When command execution fails (leaving the files unsynced), the system is not in a healthy state and can no longer fix things by itself. The user/admin has to do something in the backend to make things right. This is the state that should be flagged CRITICAL, which says: 'Go immediately and fix it.'

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-server 3.7.9-2
Nagios-server-addons 0.2.4-1

How reproducible: Always

Comment 2 Sahina Bose 2016-04-27 06:03:45 UTC
With the "heal info" command, there is no way to determine whether heal is in progress. At any time, we can only determine the entries needing heal.
Ideally, if the entries needing heal do not decrease over time, the plugin should go to the Critical state. However, changing state based on trends is not possible; the admin has to monitor the plugin trend graph once the plugin state is Warning.
So, in effect, the plugin states are:

OK - no files need healing
WARNING - there are files requiring heal, or the command could not be executed due to NRPE/other errors
UNKNOWN - command execution failed due to transaction in progress
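
The state mapping listed above can be sketched as follows. This is only an illustration and not the actual gluster-nagios-addons code: the function name `heal_info_state`, its parameters, and the error label `"transaction_in_progress"` are hypothetical. Only the numeric exit codes (0 to 3) are the standard Nagios plugin values.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def heal_info_state(unsynced_entries, command_error=None):
    """Map a heal info result to a (state, message) pair.

    Hypothetical helper: 'unsynced_entries' is the number of entries
    needing heal; 'command_error' is None on success, or an assumed
    label describing why the command could not be executed.
    """
    if command_error == "transaction_in_progress":
        # Another transaction holds the lock: the state cannot be determined.
        return UNKNOWN, "Command execution failed: transaction in progress"
    if command_error is not None:
        # NRPE or other errors are reported as a warning, not critical.
        return WARNING, "Command could not be executed: %s" % command_error
    if unsynced_entries > 0:
        return WARNING, "%d unsynced entries found" % unsynced_entries
    return OK, "No files need healing"
```

Note that no branch returns CRITICAL: a single "heal info" sample cannot show whether the number of entries is decreasing, so escalation is left to the admin watching the trend graph.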

Comment 3 Sahina Bose 2016-04-27 06:16:26 UTC
Moving this out of 3.1.3 as per comment 2.
Once we review the states expected, will either close it or implement changes.

Comment 5 Sahina Bose 2016-05-04 06:56:33 UTC
After reviewing the current implementation of "gluster volume heal info": the output returns "Possibly undergoing heal" in 2 cases
1. The file is actually undergoing heal
2. The heal info command is executed simultaneously on 2 nodes, each of which acquires a lock on the file.

Moving the plugin state to "Critical" in such cases is misleading to the user. If files are undergoing heal, this is expected, and the user only needs to be warned of this case, similar to the warning about unsynced entries. The plugin status needs to be changed.
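
A minimal sketch of the proposed behaviour, assuming the entry lines of the heal info output have already been extracted into a list of strings. The function `classify_entries` and the sample entry format are hypothetical; only the "Possibly undergoing heal" string is taken from the command output quoted above. This is not the change submitted for review.

```python
OK, WARNING = 0, 1  # standard Nagios exit codes for these two states

# String returned by "gluster volume heal info", as quoted in this comment.
HEALING_MARKER = "Possibly undergoing heal"


def classify_entries(entry_lines):
    """Return (state, message) for a list of heal info entry lines.

    Entries marked as possibly undergoing heal are counted like any
    other unsynced entry, so they produce a warning and never critical.
    """
    entries = [line.strip() for line in entry_lines if line.strip()]
    if not entries:
        return OK, "No unsynced entries found"
    healing = sum(1 for entry in entries if HEALING_MARKER in entry)
    message = "%d unsynced entries found" % len(entries)
    if healing:
        message += " (%d possibly undergoing heal)" % healing
    return WARNING, message
```

For example, `classify_entries(["/file1 - Possibly undergoing heal", "/file2"])` yields a warning that mentions both entries, whichever of the two cases caused the marker to appear.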

Comment 6 Sahina Bose 2016-05-04 07:20:26 UTC
http://review.gluster.org/#/c/14200/

Comment 9 Sweta Anandpara 2016-05-12 11:58:37 UTC
Created attachment 1156640: Server and client logs

Comment 10 Sweta Anandpara 2016-05-12 11:59:30 UTC
Tested and verified this on the build glusterfs 3.7.9-4, with nagios-server-addons 0.2.5-1 and gluster-nagios-addons 0.2.7-1.

Had a replica 2 and a replica 3 volume, killed a brick using 'kill 15' and created large file(s) from an nfs/fuse mount. Verified that 'volume heal info' goes to 'warning', saying 'unsynced entries found'. The cli command 'gluster volume heal <volname> info' lists the number of files that are out of sync.

Started the volume using the force option, thereby restarting the brick process and in turn triggering self-heal of the file(s) in the brick that had just come up. The nagios web UI continues to show the service 'volume heal info' as 'warning', as opposed to the 'critical' that used to be shown before. When the healing completes, the service transitions to green.

Moving this BZ to verified in 3.1.3. Detailed logs are attached.

Comment 12 errata-xmlrpc 2016-06-23 05:27:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1242

