Created attachment 1150258 [details]
screenshot of nagios UI

Description of problem:
-----------------------
A new plugin 'volume heal info' (RFE BZ 1312207) has been added, which displays the status/progress of self-heal running in the background. The states that it presently goes to:

OK: No split-brain entries found. All files are synced.
WARNING: When there are unsynced entries found, command execution fails.
CRITICAL: Self-heal is in progress

The state that should be flagged 'critical' shows as 'warning', and the state that should be shown as 'warning' shows as 'critical'.

LOGIC that should ideally go behind 'warning':
When self-heal is in progress, the system is not necessarily broken. It means that there is work going on in the background to fix it, and manual intervention is NOT needed. This state should ideally be a 'WARNING', which says to the user: 'Nothing to be alarmed about. But do monitor.'

LOGIC that should ideally go behind 'critical':
When the command execution fails (resulting in a state where the files continue to remain unsynced), that signifies that the system is not in a healthy state and can no longer fix things by itself. The user/admin would have to do something in the backend to make things right. This is the state where it should flag CRITICAL, which says: 'Go immediately and fix it.'

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-server 3.7.9-2
Nagios-server-addons 0.2.4-1

How reproducible: Always
With the "heal info" command, there is no way to determine whether a heal is in progress. At any time, we can only determine the entries needing heal. Ideally, if the entries needing heal do not decrease over time, the plugin should go to the Critical state. However, changing state based on trends is not possible; the admin has to monitor the plugin trend graph once the plugin state is warning.

So, in effect, the states of the plugin are:
OK - no files need healing
WARNING - there are files requiring heal, or the command could not be executed due to nrpe/other errors
UNKNOWN - command execution failed due to a transaction in progress
Moving this out of 3.1.3 as per comment 2. Once we review the states expected, will either close it or implement changes.
After reviewing the current implementation of "gluster volume heal info": the output returns "Possibly undergoing heal" in 2 cases:
1. The file is actually undergoing heal.
2. The heal info command is executed simultaneously on 2 nodes, which acquires a lock on the file.

Moving the plugin state to "Critical" in such cases is misleading to the user. If files are undergoing heal, this is expected, and the user only needs to be warned of this case, similar to the warning about unsynced entries. The plugin status needs to be changed.
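
For illustration only, the state mapping discussed in this bug could be sketched roughly as below. This is not the code from gluster-nagios-addons. The function name classify_heal_info is hypothetical, and the sketch assumes that heal info prints a "Number of entries: N" line per brick and that a lock failure mentions "transaction is in progress". Treating unparsable output as UNKNOWN is also an assumption. The numeric values are the standard Nagios plugin exit codes.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_heal_info(exit_status, output):
    """Map one 'heal info' run to a (nagios_state, message) pair.

    Hypothetical helper: the output format and error text matched here
    are assumptions, not taken from the actual plugin.
    """
    if exit_status != 0:
        # Assumed error text for a cluster-wide lock held by another command.
        if "transaction is in progress" in output.lower():
            return UNKNOWN, "heal info failed: another transaction in progress"
        # nrpe or other execution errors only warrant a warning.
        return WARNING, "heal info command could not be executed"

    # Assumed format: one 'Number of entries: N' line per brick.
    counts = [int(n) for n in re.findall(r"Number of entries:\s*(\d+)", output)]
    if not counts:
        # Assumption: output that cannot be parsed is reported as UNKNOWN.
        return UNKNOWN, "could not parse heal info output"

    total = sum(counts)
    if total == 0:
        return OK, "No unsynced entries found"

    # Entries marked 'Possibly undergoing heal' are expected during
    # self-heal, so they stay at WARNING rather than CRITICAL.
    healing = output.count("Possibly undergoing heal")
    return WARNING, "%d unsynced entries found (%d possibly undergoing heal)" % (
        total, healing)
```

Note that nothing in this sketch returns CRITICAL: since "heal info" alone cannot tell whether the number of entries is shrinking, the trend still has to be watched on the plugin graph.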
http://review.gluster.org/#/c/14200/
Created attachment 1156640 [details] Server and client logs
Tested and verified this on the build glusterfs 3.7.9-4, with nagios-server-addons 0.2.5-1 and gluster-nagios-addons 0.2.7-1.

Had a replica2 and a replica3 volume, killed a brick using 'kill 15', and created large file(s) from an nfs/fuse mount. Verified that 'volume heal info' goes to 'warning', saying 'unsynced entries found'. The cli command 'gluster volume heal <volname> info' lists the number of files that are out of sync.

Started the volume using the force option, thereby restarting the brick process, which in turn triggers self-heal to heal the file(s) in the brick that has just come up. The nagios web UI continues to show the service 'volume heal info' as 'warning', as opposed to the 'critical' that used to be shown before. When the healing completes, the service transitions to green.

Moving this BZ to verified in 3.1.3. Detailed logs are attached.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1242