Bug 1109683 - [Nagios] Volume self-heal service "CHECK_NRPE: Socket timeout after 10 seconds." when there are a lot of entries to heal
Status: CLOSED CANTFIX
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-nagios-addons
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Sahina Bose
QA Contact: RHS-C QE
Keywords: ZStream
Depends On:
Blocks: 1087818
Reported: 2014-06-16 02:45 EDT by Shruti Sampat
Modified: 2018-01-30 06:12 EST
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
When a volume has a large number of files to heal, the "volume self heal info" command takes a long time to return results, and the NRPE plug-in times out because its default timeout is 10 seconds. Workaround: increase the timeout with the -t option of the check command, for example to 10 minutes (600 seconds): $USER1$/gluster/check_vol_server.py $ARG1$ $ARG2$ -o self-heal -t 600 (see the sketch after this field list).
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-01-30 06:12:42 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
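The Doc Text workaround above corresponds to a Nagios command definition along the following lines. This is a sketch only: the command name and configuration file path shown here are assumptions, not taken from the shipped packages.

# Assumed location: /etc/nagios/gluster/gluster-commands.cfg
define command{
        command_name    check_vol_heal_status
        command_line    $USER1$/gluster/check_vol_server.py $ARG1$ $ARG2$ -o self-heal -t 600
        }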
Description Shruti Sampat 2014-06-16 02:45:56 EDT
Description of problem:
------------------------

When there are a lot of files to be healed, the command "gluster volume heal <vol-name> info" takes a long time to return, which causes the check_nrpe check for the volume self-heal service to time out.

This causes the service to be in critical state.

For example, there were about 190062 entries to be healed on my setup, and the heal info command took about 20 minutes to run.
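The delay can be confirmed by timing the command directly on a storage node (the volume name is a placeholder):

# time gluster volume heal <vol-name> info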
 
Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Saw it once.

Steps to Reproduce:
1. Create a distributed-replicate volume (2x2), start it, and mount it on a client (see the sketch after these steps).
2. On the mount point, run kernel untars in parallel as follows -
# for i in {1..100}; do mkdir dir$i; tar xJf linux-3.0-rc1.tar.xz -C dir$i & done
3. Bring down one brick from each replica pair.
4. After a while bring the bricks up and stop the I/O at the mount point.
5. Observe the status of the volume self-heal service on the Nagios UI.
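A sketch of step 1, assuming hypothetical host names and brick paths:

gluster volume create testvol replica 2 \
        host1:/rhs/brick1 host2:/rhs/brick1 \
        host3:/rhs/brick1 host4:/rhs/brick1
gluster volume start testvol
mount -t glusterfs host1:/testvol /mnt/testvol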

Actual results:
The volume self-heal service goes to the critical state because of an NRPE socket timeout.

Expected results:
The service should not be critical; a running self-heal is not something the admin should be alarmed about unless the heal fails, which is not the case here.

Additional info:
Comment 1 Dusmant 2014-06-17 09:10:34 EDT
This issue is with RHS, and RHSC cannot address it. We need to document it and determine what time interval would suffice.
Comment 2 Shalaka 2014-06-18 01:58:13 EDT
Please add doc text for the known issue.
Comment 3 Shalaka 2014-06-24 12:37:58 EDT
Please review and sign off the edited doc text.
Comment 4 Shruti Sampat 2014-07-25 03:36:20 EDT
Hi,

The self-heal status monitoring service remains in the critical state for as long as the self-heal info command takes more than 10 seconds to return. After a while, if the command returns within 10 seconds (because there are fewer entries to heal), the service should ideally be in the warning state. And finally, when there are 0 entries to heal, the service should be OK.

I see that sometimes the self-heal status monitoring service remains in the critical state even when the command returns in less than 10 seconds. The Nagios server checks heal info once every 10 minutes, so if the command was taking more than 10 minutes to execute at one point and the entry count then drops to 0 before the next check, the user may never see the warning state; the service transitions from critical to OK without ever reaching warning.
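The transitions described above amount to the following logic. This is a minimal shell sketch for illustration, not the actual gluster-nagios-addons plug-in; the 10-second timeout mirrors the NRPE default.

#!/bin/sh
# heal_state.sh <volume> - hypothetical helper, for illustration only.
VOL=$1
# A heal-info run that times out (or fails) is reported as CRITICAL.
OUT=$(timeout 10 gluster volume heal "$VOL" info) || {
    echo "CRITICAL: heal info timed out or failed"
    exit 2
}
# Sum the per-brick "Number of entries:" counts from the command output.
ENTRIES=$(printf '%s\n' "$OUT" | awk -F: '/^Number of entries:/ {n += $2} END {print n + 0}')
if [ "$ENTRIES" -eq 0 ]; then
    echo "OK: no entries to heal"
    exit 0
fi
# Entries still pending, but the command returned in time: WARNING.
echo "WARNING: $ENTRIES entries to heal"
exit 1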
Comment 5 Sahina Bose 2014-07-25 10:28:49 EDT
(In reply to Shalaka from comment #3)
> Please review and sign off the edited doc text.

Shalaka,

The doc text needs to capture that the Nagios service goes to the critical state when the "volume self heal info" command takes a long time to execute because there are a large number of files to heal.
Comment 8 Sahina Bose 2018-01-30 06:12:42 EST
Thank you for your report. However, this bug is being closed as it is logged against gluster-nagios monitoring, for which no further development is being undertaken.
