Bug 1027559 - AFR: self-heal daemon crawler not obeying the 'cluster.heal-timeout' time interval
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: RHGS 2.1.2
Assignee: Ravishankar N
QA Contact: spandura
URL:
Whiteboard:
Depends On:
Blocks: 1028663
 
Reported: 2013-11-07 05:09 UTC by Ravishankar N
Modified: 2015-05-13 16:31 UTC (History)

Fixed In Version: glusterfs-3.4.0.43.1u2rhs-1
Doc Type: Bug Fix
Doc Text:
Cause: The timer function did not correctly update the delta sleep time after which the task should be called again.
Consequence: The AFR self-heal daemon did not obey the 'cluster.heal-timeout' interval and was called almost every second.
Fix: Corrected the timer function that calculates this delta time.
Result: The self-heal daemon now obeys the 'cluster.heal-timeout' interval.
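
The Doc Text does not show the timer code itself. Purely as an illustration, and not the GlusterFS implementation, the sketch below shows one common way a periodic timer can compute the delta to sleep before the next call: track an absolute deadline and subtract the current time from it. The function names (remaining_delay, next_deadline) and the sample clock values are hypothetical.

```python
def remaining_delay(deadline, now):
    """Return the seconds left until `deadline`, never a negative value."""
    return max(0.0, deadline - now)


def next_deadline(previous_deadline, interval, now):
    """Advance the deadline by whole intervals until it lies in the future."""
    deadline = previous_deadline + interval
    while deadline <= now:
        deadline += interval
    return deadline


# Hypothetical values: a 600-second interval (the default heal-timeout
# mentioned in this report) and an arbitrary clock reading of 1000.0.
interval = 600.0
start = 1000.0

deadline = next_deadline(start, interval, now=start)
# One second later the timer should still wait for the rest of the
# interval instead of firing again immediately.
delay = remaining_delay(deadline, now=start + 1.0)
```

If the delta is not recomputed this way after each run, a stale or too-small value can make the task fire far more often than the configured interval, which matches the symptom described in the Consequence above.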
Clone Of:
Clones: 1028663
Environment:
Last Closed: 2014-02-25 08:01:24 UTC
Target Upstream Version:




Links
System: Red Hat Product Errata
ID: RHEA-2014:0208
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Storage 2.1 enhancement and bug fix update #2
Last Updated: 2014-02-25 12:20:30 UTC

Description Ravishankar N 2013-11-07 05:09:50 UTC
Description of problem:
The afr_start_crawl() function is called before the "cluster.heal-timeout" interval expires (600 seconds by default).

Version-Release number of selected component (if applicable):
RHS 2.1

How reproducible:
Always

Steps to Reproduce:
1. Create and start a 1x2 replica volume using 2 different nodes.
2. gluster v set <VOLNAME> diagnostics.client-log-level DEBUG
3. gluster v set <VOLNAME> cluster.heal-timeout 300
4. tailf /var/log/glusterfs/glustershd.log (on either of the nodes)

[2013-11-07 10:32:06.058154] D [afr-self-heald.c:1233:afr_start_crawl] 0-testvol-replicate-0: starting crawl 1 for testvol-client-0
.
.
.
[2013-11-07 10:32:07.059428] D [afr-self-heald.c:1233:afr_start_crawl] 0-testvol-replicate-0: starting crawl 1 for testvol-client-0

Actual results:

The time interval between 2 successive invocations of afr_start_crawl() is just one second.
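
To measure this interval rather than read it off by eye, a small script can extract the timestamps of the "starting crawl" lines and print the gaps between them. This is a minimal sketch that assumes the log line layout shown above; the helper name crawl_intervals and the regular expression are illustrative and not part of GlusterFS.

```python
import re
from datetime import datetime

# Matches DEBUG lines emitted from afr_start_crawl, in the layout shown in
# this report: "[YYYY-MM-DD HH:MM:SS.ffffff] D [...:afr_start_crawl] ...".
CRAWL_LINE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]"
    r".*afr_start_crawl\].*starting crawl"
)


def crawl_intervals(lines):
    """Return the seconds elapsed between successive crawl starts."""
    stamps = []
    for line in lines:
        match = CRAWL_LINE.match(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            )
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# The two log lines quoted in the steps to reproduce.
sample = [
    "[2013-11-07 10:32:06.058154] D [afr-self-heald.c:1233:afr_start_crawl] "
    "0-testvol-replicate-0: starting crawl 1 for testvol-client-0",
    "[2013-11-07 10:32:07.059428] D [afr-self-heald.c:1233:afr_start_crawl] "
    "0-testvol-replicate-0: starting crawl 1 for testvol-client-0",
]

intervals = crawl_intervals(sample)
print(intervals)  # a single gap of roughly one second
```

To run it against a real log, pass an open file object for /var/log/glusterfs/glustershd.log in place of the sample list.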

Expected results:
The crawler must start only once in "cluster.heal-timeout" seconds.

Additional info:

Comment 2 Ravishankar N 2013-11-11 04:50:07 UTC
Downstream review URL: https://code.engineering.redhat.com/gerrit/#/c/15473/

Comment 3 spandura 2013-12-24 10:17:30 UTC
Verified the bug on the build "glusterfs 3.4.0.52rhs built on Dec 19 2013 12:20:16". The bug is fixed. Moving the bug to the Verified state.

Cases Verified on 1 x 3 replicate volume:
===========================================
1. Set the heal-timeout to 120s. Observed the crawl happening every 2 minutes.

2. Set the heal-timeout to 60s. Observed the crawl happening every 1 minute.

3. Killed a brick process and created a lot of files and directories. Observed the crawl happening every 1 minute. Brought the brick back online, and self-heal was also successful.
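
Checks like the ones above can also be expressed as a simple pass/fail test over measured crawl gaps. The sketch below is hypothetical: the function name crawls_obey_timeout and the 5-second tolerance are assumptions chosen for illustration, not values taken from this verification.

```python
def crawls_obey_timeout(intervals, heal_timeout, tolerance=5.0):
    """Return True if every measured gap is within `tolerance` seconds of
    the configured heal-timeout. An empty list of gaps returns False,
    since nothing was observed."""
    return bool(intervals) and all(
        abs(gap - heal_timeout) <= tolerance for gap in intervals
    )


# Hypothetical gaps, in seconds, for a heal-timeout of 120.
healthy = crawls_obey_timeout([120.2, 119.8, 120.1], heal_timeout=120)

# The one-second gap from the original report, with heal-timeout 300.
buggy = crawls_obey_timeout([1.0], heal_timeout=300)
```

Here healthy is True and buggy is False, which separates the fixed behaviour from the one originally reported.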

Comment 5 errata-xmlrpc 2014-02-25 08:01:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html

