Bug 1329936

Summary: Heal info plugin shows Critical state when files are healing, which is misleading
Product: Red Hat Gluster Storage
Reporter: Sweta Anandpara <sanandpa>
Component: nagios-server-addons
Assignee: Sahina Bose <sabose>
Status: CLOSED ERRATA
QA Contact: Sweta Anandpara <sanandpa>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.1
CC: asrivast, rhinduja, sankarshan
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.1.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: gluster-nagios-addons-0.2.7-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-23 05:27:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1311817
Attachments:
Description                 Flags
screenshot of nagios UI     none
Server and client logs      none

Description Sweta Anandpara 2016-04-25 05:46:04 UTC
Created attachment 1150258: screenshot of nagios UI

Description of problem:
-----------------------

A new plugin, 'volume heal info' (RFE BZ 1312207), has been added. It displays the status/progress of the self-heal running in the background. The states it currently reports are:

OK: No split-brain entries found. All files are synced.
WARNING: Unsynced entries are found, or command execution fails.
CRITICAL: Self-heal is in progress

The two non-OK states are swapped: the state that should be flagged 'critical' shows as 'warning', and the state that should be shown as 'warning' shows as 'critical'.


LOGIC that should ideally go behind 'warning': 

When self-heal is in progress, the system is not necessarily broken. Work is going on in the background to fix it, and manual intervention is NOT needed. This state should ideally be a 'WARNING', which tells the user: 'Nothing to be alarmed about, but do monitor.'

LOGIC that should ideally go behind 'critical':

When the command execution fails (leaving the files unsynced), the system is not in a healthy state and can no longer fix things by itself. The user/admin has to do *something* in the backend to make things right. This is the state that should be flagged CRITICAL, which says: 'Go immediately and fix it.'
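
The mapping proposed above can be sketched as follows. This is only an illustration of the reporter's reasoning, not the plugin's actual code: the function name `proposed_state` and its two boolean inputs are hypothetical. The numeric values are the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def proposed_state(heal_in_progress, command_failed):
    """Map the two conditions described above to a Nagios state.

    Hypothetical helper for illustration; not the plugin's actual code.
    """
    if command_failed:
        # The system cannot recover by itself: manual intervention needed.
        return CRITICAL
    if heal_in_progress:
        # Background work is fixing things: monitor, but no action needed.
        return WARNING
    return OK
```

A failed command takes precedence over an ongoing heal in this sketch, since it is the condition that calls for immediate action.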

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-server 3.7.9-2
Nagios-server-addons 0.2.4-1

How reproducible: Always

Comment 2 Sahina Bose 2016-04-27 06:03:45 UTC
With the "heal info" command, there's no way to determine whether a heal is in progress. At any time, we can only determine the entries needing heal.
Ideally, if the entries needing heal do not decrease over time, the plugin should go to the Critical state. However, changing state based on trends is not possible; the admin has to monitor the plugin trend graph once the plugin state is Warning.
So, in effect, the states of the plugin are:

OK - no files need healing
WARNING - there are files requiring heal, or the command could not be executed due to nrpe/other errors
UNKNOWN - command execution failed due to transaction in progress
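
As a rough sketch, the three states listed in this comment could be expressed like this. The result labels (`SUCCESS`, `TRANSACTION_IN_PROGRESS`, `OTHER_ERROR`) and the function `plugin_state` are hypothetical names chosen for illustration; the real plugin may structure this differently.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical labels for how the heal info command ended.
SUCCESS = "success"
TRANSACTION_IN_PROGRESS = "transaction_in_progress"
OTHER_ERROR = "other_error"  # nrpe or any other execution error


def plugin_state(result, entries_needing_heal=0):
    """Return the Nagios state for one run of the heal info check."""
    if result == TRANSACTION_IN_PROGRESS:
        # The command failed because another transaction was in progress.
        return UNKNOWN
    if result == OTHER_ERROR:
        # The command could not be executed at all.
        return WARNING
    if entries_needing_heal > 0:
        # Files still require heal.
        return WARNING
    return OK
```

Note that nothing here returns CRITICAL: a single run only sees the current entry count, so it cannot tell a stalled heal from one that is progressing.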

Comment 3 Sahina Bose 2016-04-27 06:16:26 UTC
Moving this out of 3.1.3 as per comment 2.
Once we review the states expected, will either close it or implement changes.

Comment 5 Sahina Bose 2016-05-04 06:56:33 UTC
After reviewing the current implementation of "gluster volume heal info": the output returns "Possibly undergoing heal" in 2 cases:
1. The file is actually undergoing heal.
2. The heal info command is executed simultaneously on 2 nodes, which acquires a lock on the file.

Moving the plugin state to "Critical" in such cases is misleading to the user. If files are undergoing heal, this is expected, and the user only needs to be warned, as with the warning about unsynced entries. The plugin status needs to be changed.
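
A minimal sketch of the intended behaviour, assuming the heal info output contains one "Number of entries: N" line per brick and marks affected files with "Possibly undergoing heal". The sample output, the function name `heal_info_state` and the message wording are assumptions made for illustration; this is not the code from the actual patch.

```python
import re

# Standard Nagios plugin exit codes used in this sketch.
OK, WARNING = 0, 1


def heal_info_state(output):
    """Return (exit_code, message) for heal info output.

    Assumes one 'Number of entries: N' line per brick section.
    """
    counts = re.findall(r"^Number of entries:\s*(\d+)", output, re.MULTILINE)
    total = sum(int(n) for n in counts)
    healing = "Possibly undergoing heal" in output
    if total == 0 and not healing:
        return OK, "No unsynced entries found"
    if healing:
        # Expected while self-heal runs, so warn instead of going critical.
        return WARNING, "%d unsynced entries found, some possibly undergoing heal" % total
    return WARNING, "%d unsynced entries found" % total


# Assumed sample output; the real format may differ between versions.
sample = """Brick host1:/bricks/b1
/file1 - Possibly undergoing heal
Number of entries: 1

Brick host2:/bricks/b1
Number of entries: 0
"""
```

Both cases in which "Possibly undergoing heal" appears end up as WARNING here, since the output alone does not distinguish a real heal from a lock taken by a concurrent heal info run.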

Comment 6 Sahina Bose 2016-05-04 07:20:26 UTC
http://review.gluster.org/#/c/14200/

Comment 9 Sweta Anandpara 2016-05-12 11:58:37 UTC
Created attachment 1156640: Server and client logs

Comment 10 Sweta Anandpara 2016-05-12 11:59:30 UTC
Tested and verified this on the build glusterfs 3.7.9-4, with nagios-server-addons 0.2.5-1 and gluster-nagios-addons 0.2.7-1.

Had a replica 2 and a replica 3 volume, killed a brick using 'kill -15', and created large file(s) from an nfs/fuse mount. Verified that the 'volume heal info' service goes to 'warning', saying 'unsynced entries found'. The CLI command 'gluster volume heal <volname> info' lists the number of files that are out of sync.

Started the volume using the force option, thereby restarting the brick process, which in turn triggers self-heal to heal the file(s) on the brick that has just come up. The nagios web UI continues to show the service 'volume heal info' as 'warning', as opposed to the 'critical' that was shown before. When the healing completes, the service transitions to green.

Moving this BZ to verified in 3.1.3. Detailed logs are attached.
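
For anyone repeating the verification, the two gluster CLI invocations used in the steps above can be built as argument lists like this. The helper names `heal_info_cmd` and `force_start_cmd` are hypothetical, and the sketch only builds the commands; it does not run them.

```python
def heal_info_cmd(volname):
    """Arguments for 'gluster volume heal <volname> info'."""
    return ["gluster", "volume", "heal", volname, "info"]


def force_start_cmd(volname):
    """Arguments for 'gluster volume start <volname> force',
    which restarts a brick process that was killed."""
    return ["gluster", "volume", "start", volname, "force"]
```

On a gluster node, each list could be passed to `subprocess.check_output` to run the command and capture its output.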

Comment 12 errata-xmlrpc 2016-06-23 05:27:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1242