Bug 1027559

Summary: AFR: self-heal daemon crawler not obeying the 'cluster.heal-timeout' time interval
Product: Red Hat Gluster Storage
Reporter: Ravishankar N <ravishankar>
Component: glusterfs
Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA
QA Contact: spandura
Severity: high
Docs Contact:
Priority: medium
Version: 2.1
CC: grajaiya, spandura, vagarwal, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0.43.1u2rhs-1
Doc Type: Bug Fix
Doc Text:
Cause: The timer function did not correctly update the delta sleep time after which the task should be called again.
Consequence: The AFR self-heal daemon did not obey the 'cluster.heal-timeout' interval and was invoked almost every second.
Fix: The timer function now calculates this delta time correctly.
Result: The self-heal daemon now obeys the 'cluster.heal-timeout' interval.
Story Points: ---
Clone Of:
: 1028663
Environment:
Last Closed: 2014-02-25 08:01:24 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1028663    

Description Ravishankar N 2013-11-07 05:09:50 UTC
Description of problem:
The afr_start_crawl() function is called before the "cluster.heal-timeout" interval (600 seconds by default) has expired.

Version-Release number of selected component (if applicable):
RHS 2.1

How reproducible:
Always

Steps to Reproduce:
1. Create and start a 1x2 replica volume using 2 different nodes.
2. gluster v set <VOLNAME> diagnostics.client-log-level DEBUG
3. gluster v set <VOLNAME> cluster.heal-timeout 300
4. tailf /var/log/glusterfs/glustershd.log (on either of the nodes)

[2013-11-07 10:32:06.058154] D [afr-self-heald.c:1233:afr_start_crawl] 0-testvol-replicate-0: starting crawl 1 for testvol-client-0
.
.
.
[2013-11-07 10:32:07.059428] D [afr-self-heald.c:1233:afr_start_crawl] 0-testvol-replicate-0: starting crawl 1 for testvol-client-0

Actual results:

The time interval between two successive invocations of afr_start_crawl() is just one second, even though cluster.heal-timeout is set to 300.
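
The gap can be measured directly from the log instead of being read off by eye. The following is a minimal sketch, assuming the glustershd.log line format matches the two sample lines above; the regular expression and the crawl_intervals() helper are illustrative names, not part of GlusterFS.

```python
import re
from datetime import datetime

# Assumed line format, taken from the sample log lines in this report:
# [<timestamp>] D [<file>:<line>:afr_start_crawl] <xlator>: starting crawl N for <child>
CRAWL_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\].*"
    r":afr_start_crawl\].*starting crawl \d+ for (?P<child>\S+)"
)


def crawl_intervals(lines):
    """Return {child: [seconds between successive crawl starts]}."""
    stamps = {}
    for line in lines:
        m = CRAWL_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            stamps.setdefault(m.group("child"), []).append(ts)
    # Pair each timestamp with the next one for the same child.
    return {
        child: [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
        for child, ts in stamps.items()
    }


sample = [
    "[2013-11-07 10:32:06.058154] D [afr-self-heald.c:1233:afr_start_crawl] "
    "0-testvol-replicate-0: starting crawl 1 for testvol-client-0",
    "[2013-11-07 10:32:07.059428] D [afr-self-heald.c:1233:afr_start_crawl] "
    "0-testvol-replicate-0: starting crawl 1 for testvol-client-0",
]
intervals = crawl_intervals(sample)
print(intervals)
```

Intervals are grouped per child so that crawls for different bricks are not compared against each other. With the fix in place, each value should be close to the configured cluster.heal-timeout.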

Expected results:
The crawler must start only once every "cluster.heal-timeout" seconds.
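
The Doc Text attributes the bug to the timer function not correctly updating the delta sleep time. The sketch below is not the GlusterFS timer code; it only illustrates the expected behaviour under the assumption that the daemon wakes on a fixed tick and should start a crawl only when the full interval since the last crawl has elapsed. remaining_delay() and simulate_crawls() are hypothetical names.

```python
def remaining_delay(last_run, interval, now):
    """Seconds left until the next run is due, never negative."""
    return max(0.0, (last_run + interval) - now)


def simulate_crawls(interval, total, tick=1.0):
    """Wake every `tick` seconds and record the times at which a crawl starts.

    A crawl starts only when the remaining delay has reached zero, so the
    wake-up frequency does not affect how often the crawl runs.
    """
    runs = [0.0]  # assume the first crawl starts at t = 0
    now = 0.0
    while now + tick <= total:
        now += tick
        if remaining_delay(runs[-1], interval, now) == 0.0:
            runs.append(now)
    return runs


runs = simulate_crawls(interval=300.0, total=900.0)
gaps = [b - a for a, b in zip(runs, runs[1:])]
print(runs, gaps)
```

The delay is always derived from the time of the last run plus the interval, so waking up early cannot shorten the gap between two crawls.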

Additional info:

Comment 2 Ravishankar N 2013-11-11 04:50:07 UTC
Downstream review URL: https://code.engineering.redhat.com/gerrit/#/c/15473/

Comment 3 spandura 2013-12-24 10:17:30 UTC
Verified the bug on the build "glusterfs 3.4.0.52rhs built on Dec 19 2013 12:20:16". The bug is fixed. Moving the bug to the Verified state.

Cases Verified on 1 x 3 replicate volume:
===========================================
1. Set heal-timeout to 120 seconds. Observed the crawl happening every 2 minutes.

2. Set heal-timeout to 60 seconds. Observed the crawl happening every 1 minute.

3. Killed a brick process and created a large number of files and directories. Observed the crawl happening every 1 minute. Brought the brick back online, and self-heal completed successfully.

Comment 5 errata-xmlrpc 2014-02-25 08:01:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html