Red Hat Bugzilla – Bug 1109683
[Nagios] Volume self-heal service "CHECK_NRPE: Socket timeout after 10 seconds." when there are a lot of entries to heal
Last modified: 2018-01-30 06:12:42 EST
Description of problem:
------------------------
When there are a lot of files to be healed, the command "gluster volume heal <vol-name> info" takes a long time to return, so check_nrpe for the volume self-heal service times out. This puts the service in a critical state. For example, there were about 190062 entries to be healed on my setup, and the heal info command took about 20 minutes to run.

Version-Release number of selected component (if applicable):
gluster-nagios-addons-0.1.2-1.el6rhs.x86_64

How reproducible:
Saw it once.

Steps to Reproduce:
1. Create a distributed-replicate volume (2x2), start it and mount it on a client.
2. On the mount point, perform a kernel untar as follows:
   # for i in {1..100}; do mkdir dir$i; tar xJf linux-3.0-rc1.tar.xz -C dir$i & done
3. Bring down one brick from each replica pair.
4. After a while, bring the bricks up and stop the I/O at the mount point.
5. Observe the status of the volume self-heal service on the Nagios UI.

Actual results:
The volume self-heal service is critical because of the NRPE socket timeout.

Expected results:
The service should not be critical, because a running self-heal is not something the admin should be alarmed about unless the heal fails, which is not the case here.

Additional info:
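The failure described above comes down to a time budget: the check allows 10 seconds, while the heal info command can run for minutes. The following is a minimal Python sketch of a wrapper that tells "the command ran out of time" apart from "the command failed". It is not the gluster-nagios-addons plugin code; the helper name `run_with_budget` and its return values are hypothetical and chosen only for illustration.

```python
import subprocess
import sys


def run_with_budget(cmd, budget_seconds):
    """Run cmd under a time budget (hypothetical helper, for illustration only).

    Returns one of:
      ("timeout", None)   the command did not finish within budget_seconds
      ("failed", rc)      the command finished with a non-zero exit code rc
      ("ok", stdout)      the command finished successfully
    """
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=budget_seconds,  # child is killed when the budget expires
        )
    except subprocess.TimeoutExpired:
        return ("timeout", None)
    if proc.returncode != 0:
        return ("failed", proc.returncode)
    return ("ok", proc.stdout)
```

On a storage node, the call for the command from this report would look like `run_with_budget(["gluster", "volume", "heal", "<vol-name>", "info"], 10)`, with `<vol-name>` replaced by the real volume name. Treating the "timeout" result differently from "failed" is one way to avoid raising a critical alert for a heal that is merely slow.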
This issue is with RHS, and RHSC cannot address it. We need to document it and see what time interval would probably suffice.
Please add doc text for the known issue
Please review and sign off the edited doc text.
Hi,

The self-heal status monitoring service remains in critical state for as long as the self-heal info command takes more than 10 seconds to return. After a while, if the command returns within 10 seconds (because there are fewer entries to heal), the service should ideally be in warning state. Finally, when there are 0 entries to heal, the service should be OK.

I see that the self-heal status monitoring service sometimes remains in critical state even when the command returns in less than 10 seconds.

The Nagios server checks heal info once every 10 minutes. So if the command was taking more than 10 minutes to execute at one point, and the count then drops quickly to 0 entries before the next check by the Nagios server, the user may never see the warning state of the service: it would transition from critical to OK without ever reaching the warning state.
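The ideal state transitions described in this comment can be sketched as a small Python function. This is an illustration of the described behaviour, not the plugin's actual code; the function names are hypothetical. The numeric values are the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def heal_service_state(timed_out, entries_to_heal):
    """Map one check result to a service state (hypothetical helper).

    timed_out:        True if heal info did not return within the check timeout
    entries_to_heal:  number of pending entries; ignored when timed_out is True
    """
    if timed_out:
        return CRITICAL
    if entries_to_heal > 0:
        return WARNING
    return OK


def observed_states(samples):
    """States seen by the server, one per periodic check (hypothetical helper)."""
    return [heal_service_state(timed_out, entries) for timed_out, entries in samples]


# Two consecutive checks, 10 minutes apart: the first times out, and the heal
# completes before the second, so no check ever lands in the WARNING window.
print(observed_states([(True, None), (False, 0)]))  # [2, 0]
```

Because the server only samples the state at each check, any window in which entries remain but the command returns quickly is visible only if a check happens to fall inside it.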
(In reply to Shalaka from comment #3)
> Please review and sign off the edited doc text.

Shalaka,

The doc text needs to capture that the Nagios service goes to critical state when the "volume self heal info" command takes time to execute when there are a large number of files.
Thank you for your report. However, this bug is being closed as it's logged against gluster-nagios monitoring for which no further new development is being undertaken.