Bug 1109683

Summary: [Nagios] Volume self-heal service "CHECK_NRPE: Socket timeout after 10 seconds." when there are a lot of entries to heal
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shruti Sampat <ssampat>
Component: gluster-nagios-addons
Assignee: Sahina Bose <sabose>
Status: CLOSED CANTFIX
QA Contact: RHS-C QE <rhsc-qe-bugs>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.0
CC: asriram, rhsc-qe-bugs, sabose, sankarshan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
When a volume has a large number of files to heal, the "volume self heal info" command takes a long time to return results and the NRPE plug-in times out, as the default timeout is 10 seconds. Workaround: increase the timeout to 10 minutes by passing the -t option to the command, as below: $USER1$/gluster/check_vol_server.py $ARG1$ $ARG2$ -o self-heal -t 600
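The workaround above is applied in the Nagios command definition that invokes the plug-in. A minimal sketch follows; the command object name and file path are illustrative (your actual commands.cfg entry may differ), but the command line itself is the one given in the doc text:

```
# Illustrative Nagios command definition applying the 10-minute timeout
# (object name and file location are assumptions, not from the bug report)
define command {
    command_name    check_vol_self_heal
    command_line    $USER1$/gluster/check_vol_server.py $ARG1$ $ARG2$ -o self-heal -t 600
}
```

After editing the definition, restart or reload the Nagios service so the new timeout takes effect.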
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-01-30 11:12:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1087818

Description Shruti Sampat 2014-06-16 06:45:56 UTC
Description of problem:
------------------------

When there are a lot of files to be healed, the command "gluster volume heal <vol-name> info" takes a long time to return, due to which the check_nrpe call for the volume self-heal service times out.

This causes the service to be in critical state.

For example, there were about 190062 entries to be healed on my setup, and the heal info command took about 20 minutes to run.
 
Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Saw it once.

Steps to Reproduce:
1. Create a distributed-replicate volume (2x2), start it and mount it on a client.
2. On the mount point perform kernel untar as follows -
# for i in {1..100}; do mkdir dir$i; tar xJf linux-3.0-rc1.tar.xz -C dir$i & done
3. Bring down one brick from each replica pair.
4. After a while bring the bricks up and stop the I/O at the mount point.
5. Observe the status of the volume self-heal service on the Nagios UI.

Actual results:
The volume self-heal service is in critical state because of the NRPE socket timeout.

Expected results:
The service should not be critical; a running self-heal is not something the admin should be alarmed about unless the heal fails, which is not the case here.

Additional info:

Comment 1 Dusmant 2014-06-17 13:10:34 UTC
This issue is with RHS and RHSC cannot address this. We need to document it and determine what timeout interval would suffice.

Comment 2 Shalaka 2014-06-18 05:58:13 UTC
Please add doc text for the known issue

Comment 3 Shalaka 2014-06-24 16:37:58 UTC
Please review and sign off the edited doc text.

Comment 4 Shruti Sampat 2014-07-25 07:36:20 UTC
Hi,

The self-heal status monitoring service remains in critical state for as long as the self-heal info command takes more than 10 seconds to return. After a while, if the command returns within 10 seconds (because there are fewer entries to heal), the service should ideally be in warning state. And finally, when there are 0 entries to heal, the service should be OK.

I see that sometimes the self-heal status monitoring service remains in critical state even when the command returns in less than 10 seconds. The Nagios server checks heal info once every 10 minutes, so if the command was taking more than 10 minutes to execute at one point, and the entry count then drops quickly to 0 before the next check, the user may never see the warning state; the service transitions from critical straight to OK without ever reaching warning.
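The state transitions the comment above describes can be sketched as a small function. This is a hypothetical model of the behavior the reporter expects, not the actual plug-in logic; the function name, signature, and threshold constant are all illustrative:

```python
# Hypothetical sketch of the expected Nagios state for one check cycle:
# CRITICAL when heal info cannot answer before the NRPE socket timeout,
# WARNING while entries remain to heal, OK once the count reaches zero.
# Names and thresholds are assumptions, not taken from check_vol_server.py.

NRPE_TIMEOUT = 10  # seconds; the default check_nrpe socket timeout


def self_heal_state(exec_time, unsynced_entries):
    """Return the expected service state given the heal info run time
    (seconds) and the number of entries still to be healed."""
    if exec_time >= NRPE_TIMEOUT:
        return "CRITICAL"  # heal info did not return before the socket timed out
    if unsynced_entries > 0:
        return "WARNING"   # heal still in progress, but the check responded
    return "OK"            # nothing left to heal


# With a 10-minute poll interval, a volume can go from a >10-minute
# execution time straight to 0 entries between two polls, so the
# WARNING state may never be observed:
print(self_heal_state(1200, 190062))  # CRITICAL
print(self_heal_state(2, 0))          # OK
```

This also illustrates why the -t 600 workaround only addresses the timeout, not the missing intermediate warning state between polls.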

Comment 5 Sahina Bose 2014-07-25 14:28:49 UTC
(In reply to Shalaka from comment #3)
> Please review and sign off the edited doc text.

Shalaka,

The doc text needs to capture that the Nagios service goes into critical state because the "volume self heal info" command takes a long time to execute when there are a large number of files to heal.

Comment 8 Sahina Bose 2018-01-30 11:12:42 UTC
Thank you for your report. However, this bug is being closed as it is logged against gluster-nagios monitoring, for which no further development is being undertaken.