1027559 – AFR: self-heal daemon crawler not obeying the 'cluster.heal-timeout' time interval

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterfs
Sub Component:
Version:	2.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	RHGS 2.1.2
Assignee:	Ravishankar N
QA Contact:	spandura
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1028663

Reported:	2013-11-07 05:09 UTC by Ravishankar N
Modified:	2015-05-13 16:31 UTC
CC List:	4 users
Fixed In Version:	glusterfs-3.4.0.43.1u2rhs-1
Doc Type:	Bug Fix
Doc Text:	Cause: The timer function did not correctly update the delta sleep time after which the task should be called again.
Consequence: The AFR self-heal daemon did not obey the 'cluster.heal-timeout' interval and was invoked almost every second.
Fix: Corrected the timer function that calculates this delta time.
Result: The self-heal daemon now obeys the 'cluster.heal-timeout' interval.
Clone Of:
Clones:	1028663
Environment:
Last Closed:	2014-02-25 08:01:24 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2014:0208	0	normal	SHIPPED_LIVE	Red Hat Storage 2.1 enhancement and bug fix update #2	2014-02-25 12:20:30 UTC

Description Ravishankar N 2013-11-07 05:09:50 UTC

Description of problem:
The afr_start_crawl() function is called again before the "cluster.heal-timeout" interval (600 seconds by default) has expired.

Version-Release number of selected component (if applicable):
RHS 2.1

How reproducible:
Always

Steps to Reproduce:
1. Create and start a 1x2 replica volume using 2 different nodes.
2. gluster v set <VOLNAME> diagnostics.client-log-level DEBUG
3. gluster v set <VOLNAME> cluster.heal-timeout 300
4. tailf /var/log/glusterfs/glustershd.log (on either of the nodes)

[2013-11-07 10:32:06.058154] D [afr-self-heald.c:1233:afr_start_crawl] 0-testvol-replicate-0: starting crawl 1 for testvol-client-0
.
.
.
[2013-11-07 10:32:07.059428] D [afr-self-heald.c:1233:afr_start_crawl] 0-testvol-replicate-0: starting crawl 1 for testvol-client-0

Actual results:

The time interval between 2 successive invocations of afr_start_crawl() is just one second.

Expected results:
The crawler must start only once in "cluster.heal-timeout" seconds.
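
The interval between crawl starts can also be measured from the log rather than read by eye. The following is a minimal Python sketch, assuming the log lines have the same shape as the excerpt above; the helper names crawl_intervals() and early_crawls() are made up for this illustration and are not part of GlusterFS. It does not separate crawls by target (for example testvol-client-0 versus another client), so filter the lines first if several targets are logged.

```python
import re
import sys
from datetime import datetime

# Matches a crawl-start line shaped like the excerpt in this report, e.g.
# [2013-11-07 10:32:06.058154] D [afr-self-heald.c:1233:afr_start_crawl] ... starting crawl ...
CRAWL_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\].*afr_start_crawl\].*starting crawl"
)


def crawl_intervals(lines):
    """Return the seconds elapsed between successive crawl-start log lines."""
    stamps = []
    for line in lines:
        match = CRAWL_RE.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


def early_crawls(lines, heal_timeout):
    """Return the intervals that are shorter than heal_timeout seconds."""
    return [delta for delta in crawl_intervals(lines) if delta < heal_timeout]


if __name__ == "__main__":
    # Usage (hypothetical script name): python crawl_check.py <logfile> <heal-timeout>
    with open(sys.argv[1]) as log:
        for delta in early_crawls(log, float(sys.argv[2])):
            print("crawl started after only %.3f seconds" % delta)
```

Run against /var/log/glusterfs/glustershd.log with a timeout of 300, any line printed indicates a crawl that started earlier than expected. For the two log lines quoted in this report the measured interval is about 1.001 seconds.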

Additional info:

Comment 2 Ravishankar N 2013-11-11 04:50:07 UTC

Downstream review URL: https://code.engineering.redhat.com/gerrit/#/c/15473/
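
The Doc Text for this bug attributes the behaviour to the timer function not correctly updating the delta after which the task is called again. The actual patch is not reproduced in this report, so the following Python sketch is only a toy illustration of that class of bug, not the GlusterFS code: run_timer() and its rearm flag are hypothetical names, and the one-second tick loop is an assumption made for the simulation.

```python
def run_timer(interval, total_ticks, rearm):
    """Simulate a timer thread that wakes once per second.

    Returns the tick numbers (seconds) at which the callback fired.
    With rearm=True the next expiry is pushed out by `interval` after
    each firing; with rearm=False the stale expiry is kept, which is
    the hypothetical bug being illustrated.
    """
    expiry = interval
    fired = []
    for now in range(1, total_ticks + 1):
        if now >= expiry:
            fired.append(now)
            if rearm:
                # Correct behaviour: schedule the next call a full interval away.
                expiry = now + interval
            # Buggy behaviour: expiry is left in the past, so the
            # callback fires again on every subsequent tick.
    return fired


if __name__ == "__main__":
    print(run_timer(300, 900, rearm=True))
    print(run_timer(300, 303, rearm=False))
```

In this simulation the re-armed timer fires once per interval, while the stale one fires every second after the first expiry, which resembles the one-second spacing seen in the log above.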

Comment 3 spandura 2013-12-24 10:17:30 UTC

Verified the bug on the build "glusterfs 3.4.0.52rhs built on Dec 19 2013 12:20:16". The bug is fixed. Moving the bug to the Verified state.

Cases Verified on 1 x 3 replicate volume:
===========================================
1. Set the heal-timeout to 120. Observed the crawl happening every 2 minutes.

2. Set the heal-timeout to 60s. Observed the crawl happening every 1 minute.

3. Killed a brick process and created a lot of files and directories. Observed the crawl happening every 1 minute. Brought the brick back online, and self-heal was also successful.

Comment 5 errata-xmlrpc 2014-02-25 08:01:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html
