Bug 1297172

Summary: Client self-heals block the FOP that triggered the heals
Product: [Community] GlusterFS
Reporter: Ravishankar N <ravishankar>
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: urgent
Docs Contact:
Priority: urgent
Version: mainline
CC: bhubbard, bugs, ppai
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1300875 1313312
Environment:
Last Closed: 2016-06-16 13:54:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1292314, 1293412, 1300875, 1313312

Description Ravishankar N 2016-01-10 07:18:34 UTC
Description of problem:
If a lookup or a read-transaction FOP triggers an inode refresh and a heal is needed, the FOP does not return until the heal completes. For VM use cases, the VM can appear unresponsive until the heal completes.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Create a 1x2 replica volume, FUSE-mount it and create a file.
2. Disable the self-heal daemon.
3. Kill a brick, then `dd` a few gigabytes into the file.
4. Bring the brick back up, then run hexdump on the file from the mount.
5. hexdump stalls and prints no data until the data heal completes (as seen in the mount log).
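
The steps above can be sketched as a shell script. This is only a sketch, not the reporter's exact procedure: the volume name, host name, brick paths, mount point and write size are assumptions, and the `pkill` pattern assumes the brick path appears on the `glusterfsd` command line. DRY_RUN defaults to 1, so the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the reproduction steps. VOL, HOST, BRICK1, BRICK2 and MNT are
# hypothetical values; adjust them for a disposable test cluster.
VOL=${VOL:-testvol}
HOST=${HOST:-server1}
BRICK1=${BRICK1:-/bricks/b1}
BRICK2=${BRICK2:-/bricks/b2}
MNT=${MNT:-/mnt/glusterfs}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reproduce() {
    # 1. Create a 1x2 replica volume, FUSE-mount it and create a file.
    run gluster volume create "$VOL" replica 2 "$HOST:$BRICK1" "$HOST:$BRICK2" force
    run gluster volume start "$VOL"
    run mount -t glusterfs "$HOST:/$VOL" "$MNT"
    run touch "$MNT/file"
    # 2. Disable the self-heal daemon so that only client-side heals run.
    run gluster volume set "$VOL" cluster.self-heal-daemon off
    # 3. Kill one brick process (assumed to match this pattern), then write ~2 GB.
    run pkill -f "glusterfsd.*$BRICK2"
    run dd if=/dev/urandom of="$MNT/file" bs=1M count=2048
    # 4. Bring the brick back up and read the file from the mount.
    run gluster volume start "$VOL" force
    run hexdump -C "$MNT/file"
}

reproduce
```

To execute the commands rather than print them, run the script with DRY_RUN=0 as root on a test cluster; on an affected build the final hexdump is expected to stall as described in step 5.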

Actual results:
FOP blocks until heal is done.

Expected results:
FOP should not wait for heals; they could be made to happen in the background.

Comment 1 Vijay Bellur 2016-01-10 07:21:14 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 2 Vijay Bellur 2016-01-12 10:42:45 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#2) for review on master by Ravishankar N (ravishankar)

Comment 3 Brad Hubbard 2016-01-22 00:29:38 UTC
Raising severity based on the bugs depending on this one.

Comment 4 Vijay Bellur 2016-02-01 11:52:28 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#3) for review on master by Ravishankar N (ravishankar)

Comment 5 Vijay Bellur 2016-02-03 06:02:15 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#4) for review on master by Ravishankar N (ravishankar)

Comment 6 Vijay Bellur 2016-02-05 12:27:17 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#5) for review on master by Ravishankar N (ravishankar)

Comment 7 Vijay Bellur 2016-02-25 08:22:55 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#6) for review on master by Ravishankar N (ravishankar)

Comment 8 Vijay Bellur 2016-02-26 02:13:01 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#7) for review on master by Ravishankar N (ravishankar)

Comment 9 Vijay Bellur 2016-02-28 04:06:59 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#8) for review on master by Ravishankar N (ravishankar)

Comment 10 Vijay Bellur 2016-02-28 12:14:43 UTC
REVIEW: http://review.gluster.org/13207 (afr: Add throttled background client-side heals) posted (#9) for review on master by Ravishankar N (ravishankar)

Comment 11 Vijay Bellur 2016-03-01 11:23:32 UTC
COMMIT: http://review.gluster.org/13207 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 8210ca1a5c0e78e91c6fab7df7e002e39660b706
Author: Ravishankar N <ravishankar>
Date:   Sun Jan 10 09:19:34 2016 +0530

    afr: Add throttled background client-side heals
    
    If a heal is needed after inode refresh (lookup, read_txn), launch it in
    the background instead of blocking the fop (that triggered refresh) until the
    heal happens.
    
    afr_replies_interpret() is modified such that the heal is
    launched only if at least one sink brick is up.
    
    Max. no of heals that can happen in parallel is configurable via the
    'background-self-heal-count' volume option. Any number greater than that
    is put in a wait queue whose length is configurable via
    'heal-wait-queue-length' volume option. If the wait queue is also full,
    further heals will be ignored.
    
    Default values:  background-self-heal-count=8, heal-wait-queue-length=128
    
    Change-Id: I1d4a52814cdfd43d90591b6d2ad7b6219937ce70
    BUG: 1297172
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/13207
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
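
The admission rule described in the commit message can be illustrated with a small shell function. This is a sketch of the rule as worded in the message, not the AFR implementation; the function name `heal_admission` and its arguments are hypothetical. The `gluster volume set` lines in the trailing comment assume a volume named `testvol` and use the `cluster.`-prefixed option names.

```shell
#!/bin/sh
# Decide what happens to a newly requested client-side heal.
# Arguments: running heals, queued heals, max parallel heals, max queue length.
heal_admission() {
    running=$1
    waiting=$2
    max_running=${3:-8}     # default background-self-heal-count from the commit message
    max_waiting=${4:-128}   # default wait-queue length from the commit message
    if [ "$running" -lt "$max_running" ]; then
        echo run        # a background heal slot is free
    elif [ "$waiting" -lt "$max_waiting" ]; then
        echo queue      # all slots busy, but the wait queue has room
    else
        echo ignore     # wait queue is full, the heal is not launched
    fi
}

heal_admission 3 0      # prints "run"
heal_admission 8 5      # prints "queue"
heal_admission 8 128    # prints "ignore"

# Tuning the limits on a hypothetical volume named testvol:
#   gluster volume set testvol cluster.background-self-heal-count 8
#   gluster volume set testvol cluster.heal-wait-queue-length 128
```

An ignored heal is not lost permanently in this model: a later lookup or read on the same file triggers another inode refresh and another admission attempt.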

Comment 12 Ravishankar N 2016-03-22 12:43:45 UTC
Moving it back to POST for a follow-up patch that adjusts the op-version.

Comment 13 Vijay Bellur 2016-03-22 13:02:41 UTC
REVIEW: http://review.gluster.org/13810 (glusterd/ afr: Fix op-version for background client-side heals) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 14 Vijay Bellur 2016-03-23 02:19:18 UTC
COMMIT: http://review.gluster.org/13810 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit b6edcbd6948f0252785672fde3db37cec6353d11
Author: Ravishankar N <root@ravi2.(none)>
Date:   Tue Mar 22 12:56:41 2016 +0000

    glusterd/ afr: Fix op-version for background client-side heals
    
    http://review.gluster.org/13207 tied cluster.heal-wait-queue-length to
    GD_OP_VERSION_3_7_9 but the patch will be merged in release-3.7 branch
    (http://review.gluster.org/#/c/13564/) only for 3.7.10.
    Hence change it on master also for uniformity.
    
    Change-Id: Id581695e58b0765f5652016cc2045f05e36b768f
    BUG: 1297172
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/13810
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 15 Niels de Vos 2016-06-16 13:54:13 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user