Bug 1177418 - entry self-heal in 3.5 and 3.6 are not compatible
Summary: entry self-heal in 3.5 and 3.6 are not compatible
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1168189
Blocks: glusterfs-3.6.2 1177339 1188522
 
Reported: 2014-12-27 07:30 UTC by Pranith Kumar K
Modified: 2015-02-11 09:10 UTC
CC List: 2 users

Fixed In Version: glusterfs-3.6.2
Doc Type: Bug Fix
Doc Text:
Clone Of: 1168189
Environment:
Last Closed: 2015-02-11 09:10:52 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Pranith Kumar K 2014-12-27 07:30:50 UTC
+++ This bug was initially created as a clone of Bug #1168189 +++

Description of problem:
Entry self-heal in 3.6 and above takes a full lock on the directory only for the duration of figuring out the xattrs of the directories, whereas 3.5 holds the lock throughout the entry self-heal. If the cluster is heterogeneous, there is a chance that a 3.6 self-heal is triggered and then a 3.5 self-heal is also triggered, so the 3.5 and 3.6 self-heal daemons both heal the same directory.
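
Schematically, the race looks like this. The following is an illustrative C sketch, not actual AFR code; all helper names (lock_dir, unlock_dir, inspect_xattrs, heal_all_entries) are hypothetical:

/* Hypothetical helpers for the sketch; not real AFR symbols. */
void lock_dir(const char *basename);    /* NULL basename = lock every name */
void unlock_dir(const char *basename);
void inspect_xattrs(void);
void heal_all_entries(void);

/* 3.5 (afr-v1): the full-directory lock is held for the whole heal. */
void entry_self_heal_v1(void)
{
    lock_dir(NULL);
    inspect_xattrs();
    heal_all_entries();   /* still holding the full lock */
    unlock_dir(NULL);
}

/* 3.6 (afr-v2): the full-directory lock is dropped right after the
 * xattr inspection, so a concurrently triggered v1 heal can acquire
 * it and both daemons end up healing the same directory. */
void entry_self_heal_v2(void)
{
    lock_dir(NULL);
    inspect_xattrs();
    unlock_dir(NULL);     /* <-- window in which a 3.5 heal can lock */
    heal_all_entries();
}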

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a replicate volume consisting of 2 bricks on machines m1 and m2, on version 3.5.
2. Create a directory 'd' inside the mount on m2 and cd into it.
3. While a brick is down, create a lot of files in this directory 'd' from the mount on m2. I created 10000 files.
4. Upgrade m1 to 3.6.
5. Bring the bricks up and initiate self-heal of the directories.
6. The 3.6 self-heal daemon will start healing.
7. Accessing 'd' on the mount on m2 will also sometimes trigger a heal.

Actual results:
Self-heal on directory 'd' is performed by both the 3.5 and the 3.6 self-heal daemons.

Expected results:


Additional info:

--- Additional comment from Pranith Kumar K on 2014-11-26 06:26:03 EST ---

With the patch:
While the 3.5 heal is in progress, the 3.6 heal is prevented:
[root@localhost ~]# grep d0555511ee7f0000 /var/log/glusterfs/bricks/brick.log | grep ENTRYLK
[2014-11-26 11:01:51.165591] I [entrylk.c:244:entrylk_trace_in] 0-r2-locks: [REQUEST] Locker = {Pid=18446744073709551610, lk-owner=d0555511ee7f0000, Client=0x7f51d6d21ac0, Frame=19} Lockee = {gfid=26625058-b5f2-4561-97da-ec9e7268119e, fd=(nil), path=/d} Lock = {lock=ENTRYLK, cmd=LOCK_NB, type=WRITE, basename=(null), domain: r2-replicate-0:self-heal}
[2014-11-26 11:01:51.165633] I [entrylk.c:271:entrylk_trace_out] 0-r2-locks: [GRANTED] Locker = {Pid=18446744073709551610, lk-owner=d0555511ee7f0000, Client=0x7f51d6d21ac0, Frame=19} Lockee = {gfid=26625058-b5f2-4561-97da-ec9e7268119e, fd=(nil), path=/d} Lock = {lock=ENTRYLK, cmd=LOCK_NB, type=WRITE, basename=(null), domain: r2-replicate-0:self-heal}
[2014-11-26 11:01:51.173176] I [entrylk.c:244:entrylk_trace_in] 0-r2-locks: [REQUEST] Locker = {Pid=18446744073709551610, lk-owner=d0555511ee7f0000, Client=0x7f51d6d21ac0, Frame=21} Lockee = {gfid=26625058-b5f2-4561-97da-ec9e7268119e, fd=(nil), path=/d} Lock = {lock=ENTRYLK, cmd=LOCK_NB, type=WRITE, basename=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[2014-11-26 11:01:51.173242] I [entrylk.c:271:entrylk_trace_out] 0-r2-locks: [TRYAGAIN] Locker = {Pid=18446744073709551610, lk-owner=d0555511ee7f0000, Client=0x7f51d6d21ac0, Frame=21} Lockee = {gfid=26625058-b5f2-4561-97da-ec9e7268119e, fd=(nil), path=/d} Lock = {lock=ENTRYLK, cmd=LOCK_NB, type=WRITE, basename=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[2014-11-26 11:01:51.184939] I [entrylk.c:244:entrylk_trace_in] 0-r2-locks: [REQUEST] Locker = {Pid=18446744073709551610, lk-owner=d0555511ee7f0000, Client=0x7f51d6d21ac0, Frame=23} Lockee = {gfid=26625058-b5f2-4561-97da-ec9e7268119e, fd=(nil), path=/d} Lock = {lock=ENTRYLK, cmd=UNLOCK, type=WRITE, basename=(null), domain: r2-replicate-0:self-heal}
[2014-11-26 11:01:51.184989] I [entrylk.c:271:entrylk_trace_out] 0-r2-locks: [GRANTED] Locker = {Pid=18446744073709551610, lk-owner=d0555511ee7f0000, Client=0x7f51d6d21ac0, Frame=23} Lockee = {gfid=26625058-b5f2-4561-97da-ec9e7268119e, fd=(nil), path=/d} Lock = {lock=ENTRYLK, cmd=UNLOCK, type=WRITE, basename=(null), domain: r2-replicate-0:self-heal}

--- Additional comment from Pranith Kumar K on 2014-12-02 01:37:54 EST ---

In this test case, 3.6 does the healing, whereas the 3.5 heal does not get locks:

[root@localhost ~]# egrep "(aaaaaaaaaaa|TRYAGAIN)" /var/log/glusterfs/bricks/brick.log | grep ENTRY
[2014-12-02 06:14:54.502242] I [entrylk.c:244:entrylk_trace_in] 0-r2-locks: [REQUEST] Locker = {Pid=18446744073709551610, lk-owner=a41e327d2e7f0000, Client=0x7f831e1a4ac0, Frame=1064} Lockee = {gfid=fab813d6-2ef2-4885-a293-91476cc5d167, fd=(nil), path=<gfid:fab813d6-2ef2-4885-a293-91476cc5d167>} Lock = {lock=ENTRYLK, cmd=LOCK_NB, type=WRITE, basename=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[2014-12-02 06:14:54.502261] I [entrylk.c:271:entrylk_trace_out] 0-r2-locks: [GRANTED] Locker = {Pid=18446744073709551610, lk-owner=a41e327d2e7f0000, Client=0x7f831e1a4ac0, Frame=1064} Lockee = {gfid=fab813d6-2ef2-4885-a293-91476cc5d167, fd=(nil), path=<gfid:fab813d6-2ef2-4885-a293-91476cc5d167>} Lock = {lock=ENTRYLK, cmd=LOCK_NB, type=WRITE, basename=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[2014-12-02 06:15:01.491651] I [entrylk.c:271:entrylk_trace_out] 0-r2-locks: [TRYAGAIN] Locker = {Pid=18446744073709551615, lk-owner=08b4e7739a7f0000, Client=0x7f831e1e7880, Frame=1621} Lockee = {gfid=fab813d6-2ef2-4885-a293-91476cc5d167, fd=(nil), path=<gfid:fab813d6-2ef2-4885-a293-91476cc5d167>} Lock = {lock=ENTRYLK, cmd=LOCK_NB, type=WRITE, basename=(null), domain: r2-replicate-0}
[2014-12-02 06:18:36.741434] I [entrylk.c:244:entrylk_trace_in] 0-r2-locks: [REQUEST] Locker = {Pid=18446744073709551610, lk-owner=a41e327d2e7f0000, Client=0x7f831e1a4ac0, Frame=91112} Lockee = {gfid=fab813d6-2ef2-4885-a293-91476cc5d167, fd=(nil), path=<gfid:fab813d6-2ef2-4885-a293-91476cc5d167>} Lock = {lock=ENTRYLK, cmd=UNLOCK, type=WRITE, basename=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[2014-12-02 06:18:36.741604] I [entrylk.c:271:entrylk_trace_out] 0-r2-locks: [GRANTED] Locker = {Pid=18446744073709551610, lk-owner=a41e327d2e7f0000, Client=0x7f831e1a4ac0, Frame=91112} Lockee = {gfid=fab813d6-2ef2-4885-a293-91476cc5d167, fd=(nil), path=<gfid:fab813d6-2ef2-4885-a293-91476cc5d167>} Lock = {lock=ENTRYLK, cmd=UNLOCK, type=WRITE, basename=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}

--- Additional comment from Anand Avati on 2014-12-02 01:44:27 EST ---

REVIEW: http://review.gluster.org/9125 (features/locks: Add lk-owner checks in entrylk) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2014-12-02 01:44:32 EST ---

REVIEW: http://review.gluster.org/9227 (cluster/afr: Make entry-self-heal in afr-v2 compatible with afr-v1) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2014-12-26 04:45:53 EST ---

REVIEW: http://review.gluster.org/9227 (cluster/afr: Make entry-self-heal in afr-v2 compatible with afr-v1) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2014-12-26 04:46:01 EST ---

REVIEW: http://review.gluster.org/9125 (features/locks: Add lk-owner checks in entrylk) posted (#3) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2014-12-26 04:46:04 EST ---

REVIEW: http://review.gluster.org/9351 (mgmt/glusterd: Add option to enable lock trace) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2014-12-27 02:20:41 EST ---

COMMIT: http://review.gluster.org/9125 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 02b2172d9bc1557b3459388969077c75b659da82
Author: Pranith Kumar K <pkarampu>
Date:   Fri Nov 14 14:23:31 2014 +0530

    features/locks: Add lk-owner checks in entrylk
    
    For backward compatibility of entry self-heal, we need
    entrylks to be accepted when they come from the same lk-owner
    and the same client. This patch introduces these changes.
    
    Change-Id: I67004cc5e657ba5ac09ceefbea823afdf06929e0
    BUG: 1168189
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/9125
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Tested-by: Gluster Build System <jenkins.com>
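
A minimal C sketch of the rule this commit adds (illustrative only; the struct and function names below are hypothetical, not the actual entrylk.c code): two entrylk requests never conflict when they come from the same client and the same lk-owner, while a NULL basename (a full-directory lock) conflicts with any held name.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of a held or requested entry lock. */
typedef struct {
    const char *basename;  /* NULL means "lock the whole directory" */
    const void *client;    /* connection the request arrived on */
    uint64_t    lk_owner;  /* owner id chosen by the locker */
} entry_lock_t;

/* Conflict rule after the patch: the same lk-owner on the same client
 * never conflicts with itself; otherwise a NULL basename conflicts
 * with everything, and equal basenames conflict with each other. */
static int entrylk_conflicts(const entry_lock_t *held,
                             const entry_lock_t *req)
{
    if (held->client == req->client && held->lk_owner == req->lk_owner)
        return 0;
    if (held->basename == NULL || req->basename == NULL)
        return 1;
    return strcmp(held->basename, req->basename) == 0;
}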

--- Additional comment from Anand Avati on 2014-12-27 02:21:03 EST ---

COMMIT: http://review.gluster.org/9227 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 2947752836bd3ddbc572b59cecd24557050ec2a5
Author: Pranith Kumar K <pkarampu>
Date:   Mon Nov 17 14:27:47 2014 +0530

    cluster/afr: Make entry-self-heal in afr-v2 compatible with afr-v1
    
    Problem:
    Entry self-heal in 3.6 and above takes a full lock on the directory only
    for the duration of figuring out the xattrs of the directories, whereas
    3.5 holds the lock throughout the entry self-heal. If the cluster is
    heterogeneous, there is a chance that a 3.6 self-heal is triggered and
    then a 3.5 self-heal is also triggered, and the 3.5 and 3.6 self-heal
    daemons both perform the heal.
    
    Fix:
    In 3.6.x and above, take an entry lock on a very long name before entry
    self-heal begins, so that the 3.5 entry self-heal cannot get locks until
    the 3.6.x entry self-heal completes.
    
    Change-Id: I71b6958dfe33056ed0a5a237e64e8506c3b0fccc
    BUG: 1168189
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/9227
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Tested-by: Gluster Build System <jenkins.com>
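
The trick can be sketched as follows (illustrative C under assumed APIs; entrylk_trylock, entrylk_unlock, heal_entries and dir_handle_t are hypothetical names, not real GlusterFS symbols, and the 256-byte length is an assumption for "a very long name"). Because a 3.5 heal locks the directory with a NULL basename, which conflicts with any held name, it keeps getting TRYAGAIN, as in the traces above, until the long-name lock is dropped:

#include <errno.h>
#include <string.h>

#define LONG_NAME_LEN 256   /* assumption: "a very long name" */

/* Hypothetical API for the sketch; not real GlusterFS symbols. */
typedef struct dir_handle dir_handle_t;
int  entrylk_trylock(dir_handle_t *dir, const char *basename); /* non-blocking */
void entrylk_unlock(dir_handle_t *dir, const char *basename);
int  heal_entries(dir_handle_t *dir);

/* 3.6-style entry self-heal: hold a lock on a long, improbable
 * basename for the whole heal.  A 3.5 heal takes a NULL-basename
 * (full-directory) lock, which conflicts with ANY held basename,
 * so it backs off with TRYAGAIN until this heal finishes. */
int entry_self_heal_v2_compat(dir_handle_t *dir)
{
    char long_name[LONG_NAME_LEN + 1];
    int  ret;

    memset(long_name, 'a', LONG_NAME_LEN);
    long_name[LONG_NAME_LEN] = '\0';

    if (entrylk_trylock(dir, long_name) != 0)
        return -EAGAIN;   /* someone else (e.g. a 3.5 heal) holds it */

    ret = heal_entries(dir);

    entrylk_unlock(dir, long_name);
    return ret;
}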

Comment 1 Anand Avati 2014-12-27 07:34:24 UTC
REVIEW: http://review.gluster.org/9354 (features/locks: Add lk-owner checks in entrylk) posted (#1) for review on release-3.6 by Pranith Kumar Karampuri (pkarampu)

Comment 2 Anand Avati 2014-12-27 07:34:28 UTC
REVIEW: http://review.gluster.org/9355 (cluster/afr: Make entry-self-heal in afr-v2 compatible with afr-v1) posted (#1) for review on release-3.6 by Pranith Kumar Karampuri (pkarampu)

Comment 3 Anand Avati 2015-01-05 06:59:15 UTC
COMMIT: http://review.gluster.org/9354 committed in release-3.6 by Raghavendra Bhat (raghavendra) 
------
commit f36ea2a4ad60b523aeb0303d1882744280e7056d
Author: Pranith Kumar K <pkarampu>
Date:   Fri Nov 14 14:23:31 2014 +0530

    features/locks: Add lk-owner checks in entrylk
    
            Backport of http://review.gluster.org/9125
    
    For backward compatibility of entry self-heal, we need
    entrylks to be accepted when they come from the same lk-owner
    and the same client. This patch introduces these changes.
    
    BUG: 1177418
    Change-Id: I83a0c1a9b13dce4b57e5bfce6339193a79b15648
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/9354
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Raghavendra Bhat <raghavendra>

Comment 4 Anand Avati 2015-01-05 07:00:09 UTC
COMMIT: http://review.gluster.org/9355 committed in release-3.6 by Raghavendra Bhat (raghavendra) 
------
commit 50998ae08c5a767468ee85cb5c53bb5554ff734a
Author: Pranith Kumar K <pkarampu>
Date:   Mon Nov 17 14:27:47 2014 +0530

    cluster/afr: Make entry-self-heal in afr-v2 compatible with afr-v1
    
            Backport of http://review.gluster.org/9227
    
    Problem:
    Entry self-heal in 3.6 and above takes a full lock on the directory only
    for the duration of figuring out the xattrs of the directories, whereas
    3.5 holds the lock throughout the entry self-heal. If the cluster is
    heterogeneous, there is a chance that a 3.6 self-heal is triggered and
    then a 3.5 self-heal is also triggered, and the 3.5 and 3.6 self-heal
    daemons both perform the heal.
    
    Fix:
    In 3.6.x and above, take an entry lock on a very long name before entry
    self-heal begins, so that the 3.5 entry self-heal cannot get locks until
    the 3.6.x entry self-heal completes.
    
    BUG: 1177418
    Change-Id: Iecf49d794c6b480e38563e39599a40067b3a21cb
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/9355
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra Bhat <raghavendra>

Comment 5 Raghavendra Bhat 2015-02-11 09:10:52 UTC
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.2, please reopen this bug report.

glusterfs-3.6.2 has been announced on the Gluster Developers mailing list [1]. Packages for several distributions should already be available, or will become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.6.2.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137

