Bug 1027559

Summary:	AFR: self-heal daemon crawler not obeying the 'cluster.heal-timeout' time interval
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Ravishankar N <ravishankar>
Component:	glusterfs	Assignee:	Ravishankar N <ravishankar>
Status:	CLOSED ERRATA	QA Contact:	spandura
Severity:	high	Docs Contact:
Priority:	medium
Version:	2.1	CC:	grajaiya, spandura, vagarwal, vbellur
Target Milestone:	---	Keywords:	ZStream
Target Release:	RHGS 2.1.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.4.0.43.1u2rhs-1	Doc Type:	Bug Fix
Doc Text:	Cause: The timer function did not correctly update the delta sleep time after which the task should be called again. Consequence: The AFR self-heal daemon did not obey the 'cluster.heal-timeout' interval and was invoked almost every second. Fix: The timer function that calculates this delta time has been corrected. Result: The self-heal daemon now obeys the 'cluster.heal-timeout' interval.	Story Points:	---
Clone Of:
Clones:	1028663		Environment:
Last Closed:	2014-02-25 08:01:24 UTC	Type:	Bug
Bug Blocks:	1028663

Description Ravishankar N 2013-11-07 05:09:50 UTC

Description of problem:
The afr_start_crawl() function is called before the "cluster.heal-timeout" interval (600 seconds by default) has expired.

Version-Release number of selected component (if applicable):
RHS 2.1

How reproducible:
Always

Steps to Reproduce:
1. Create and start a 1x2 replica volume using 2 different nodes.
2. gluster v set <VOLNAME> diagnostics.client-log-level DEBUG
3. gluster v set <VOLNAME> cluster.heal-timeout 300
4. tailf /var/log/glusterfs/glustershd.log (on either of the nodes)

[2013-11-07 10:32:06.058154] D [afr-self-heald.c:1233:afr_start_crawl] 0-testvol-replicate-0: starting crawl 1 for testvol-client-0
.
.
.
[2013-11-07 10:32:07.059428] D [afr-self-heald.c:1233:afr_start_crawl] 0-testvol-replicate-0: starting crawl 1 for testvol-client-0

Actual results:

The time interval between 2 successive invocations of afr_start_crawl() is just one second.

Expected results:
The crawler must start only once every "cluster.heal-timeout" seconds.
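
The gap between actual and expected results can be checked mechanically from the log. The following is a minimal sketch, assuming the glustershd.log line format shown in the excerpt above; the function names crawl_intervals and early_crawls and the tolerance parameter are hypothetical helpers, not part of glusterfs.

```python
import re
from datetime import datetime

# Assumption: crawl-start lines look like the excerpt in this report, e.g.
# [2013-11-07 10:32:06.058154] D [afr-self-heald.c:1233:afr_start_crawl]
#   0-testvol-replicate-0: starting crawl 1 for testvol-client-0
CRAWL_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\].*afr_start_crawl\]"
    r".*starting crawl \d+ for (\S+)"
)


def crawl_intervals(lines):
    """Map each child name to the seconds between its successive crawl starts."""
    last_seen = {}
    intervals = {}
    for line in lines:
        match = CRAWL_RE.match(line)
        if not match:
            continue  # not a crawl-start line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        child = match.group(2)
        if child in last_seen:
            gap = (stamp - last_seen[child]).total_seconds()
            intervals.setdefault(child, []).append(gap)
        last_seen[child] = stamp
    return intervals


def early_crawls(lines, heal_timeout, tolerance=1.0):
    """List (child, gap) pairs where a crawl started before heal_timeout elapsed.

    tolerance (hypothetical) allows for small scheduling jitter, in seconds.
    """
    found = []
    for child, gaps in crawl_intervals(lines).items():
        for gap in gaps:
            if gap < heal_timeout - tolerance:
                found.append((child, gap))
    return found
```

To use it, pass the lines of /var/log/glusterfs/glustershd.log together with the configured timeout (300 in the steps above). With the two log lines from this report, the measured gap is about one second, so the pair is reported as an early crawl.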

Additional info:
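
For illustration only, the expected behaviour amounts to sleeping for the time remaining until the task is next due, rather than waking it on every tick. The sketch below is a generic simulation under that assumption. It is not the glusterfs timer code, and seconds_until_due and simulate_runs are hypothetical names.

```python
def seconds_until_due(now, last_run, interval):
    """Delta to sleep before the task is due again (never negative)."""
    return max(0.0, (last_run + interval) - now)


def simulate_runs(interval, total_time, start=0.0):
    """Return the simulated times at which a periodic task would run.

    The clock is simulated: "sleeping" just advances `now` by the delta.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    now = start
    last_run = start
    runs = [now]  # assume the first run happens immediately
    while True:
        delay = seconds_until_due(now, last_run, interval)
        if now + delay > start + total_time:
            break
        now += delay  # sleep for the full remaining delta
        runs.append(now)
        last_run = now
    return runs
```

With interval set to 300 and a 900-second window, the simulated task runs at 0, 300, 600 and 900 seconds, which is the spacing expected from the crawler when cluster.heal-timeout is 300.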

Comment 2 Ravishankar N 2013-11-11 04:50:07 UTC

Downstream review URL: https://code.engineering.redhat.com/gerrit/#/c/15473/

Comment 3 spandura 2013-12-24 10:17:30 UTC

Verified the bug on the build "glusterfs 3.4.0.52rhs built on Dec 19 2013 12:20:16". The bug is fixed. Moving the bug to the Verified state.

Cases Verified on 1 x 3 replicate volume:
===========================================
1. Set the heal-timeout to 120 seconds. Observed the crawl happening every 2 minutes.

2. Set the heal-timeout to 60 seconds. Observed the crawl happening every 1 minute.

3. Killed a brick process and created a lot of files and directories. Observed the crawl happening every 1 minute. Brought the brick back online, and self-heal was also successful.

Comment 5 errata-xmlrpc 2014-02-25 08:01:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html