Bug 1473141

Summary: cluster/dht: Fix hardlink migration failures
Product: [Community] GlusterFS
Reporter: Susant Kumar Palai <spalai>
Component: distribute
Assignee: Susant Kumar Palai <spalai>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: 3.10
CC: bugs
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.10.5
Clone Of: 1469964
Last Closed: 2017-08-21 13:42:02 UTC
Type: Bug
Bug Depends On: 1469964
Bug Blocks: 1469971

Description Susant Kumar Palai 2017-07-20 06:18:05 UTC
+++ This bug was initially created as a clone of Bug #1469964 +++

Description of problem:
There are a few races in the remove-brick hardlink migration code path, detailed below.

 A brief overview of how hardlink migration works:
     - Different hardlinks (to the same file) may hash to different bricks,
    but their cached subvol will be the same. Rebalance picks up the first hardlink,
    calculates its hash (call the hashed subvolume TARGET), and sets that subvolume
    as an xattr on the data file.
    - All the hardlinks that come after this fetch that xattr and create linkto
    files on TARGET (the linkto files for the hardlinks are hardlinks to each
    other on TARGET).
    - When the number of hardlinks on the source equals the number of hardlinks on
    TARGET, the data migration happens.
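
    To make the flow above concrete, here is a minimal standalone sketch of the
    decision logic (plain C, not GlusterFS internals; the hash function, the
    subvolume count, and all names below are illustrative assumptions only):

#include <stdio.h>

#define NSUBVOLS 4

/* Hypothetical stand-in for DHT's name hash: picks a "hashed subvol". */
static int hash_subvol(const char *name)
{
    unsigned int h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return (int)(h % NSUBVOLS);
}

int main(void)
{
    const char *hardlinks[] = { "dir1/link-a", "dir2/link-b", "dir3/link-c" };
    int nlinks = 3;

    int target = -1;          /* simulated "target subvol" xattr on the data file */
    int linkto_on_target = 0; /* simulated link count of the linkto file on TARGET */

    for (int i = 0; i < nlinks; i++) {
        if (target == -1) {
            /* First hardlink seen: its hashed subvol becomes TARGET and is
             * recorded on the data file (an xattr in the real code). */
            target = hash_subvol(hardlinks[i]);
            printf("%s: set TARGET -> subvol %d\n", hardlinks[i], target);
        }

        /* Every hardlink adds a linkto hardlink on TARGET. */
        linkto_on_target++;
        printf("%s: linkto count on subvol %d is %d/%d\n",
               hardlinks[i], target, linkto_on_target, nlinks);

        if (linkto_on_target == nlinks)
            /* Link counts match: the data file itself is migrated to TARGET. */
            printf("link counts match: migrating data to subvol %d\n", target);
    }
    return 0;
}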
    
    RACE:1
      Since rebalance is multi-threaded, the first lookup (which decides what
      the TARGET subvol should be) can run for two hardlinks in parallel, and
      the two migrations may end up creating linkto files on two different
      TARGET subvols. Hence, the hardlinks won't be migrated.
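
    The fix for this race (per the commit below) is to rely on the xattr
    returned by the lookup inside gf_defrag_handle_hardlink, which runs under a
    synclock, instead of each caller computing the hash independently. A rough
    illustration of that idea, assuming the whole resolution is serialized
    (plain C, not the actual GlusterFS code; resolve_target() and the variables
    are made up):

#include <stdio.h>

/* stored_target models the target-subvol xattr on the data file;
 * -1 means the xattr has not been set yet. In GlusterFS the handler
 * runs under a synclock, so this resolution is serialized. */
static int stored_target = -1;

static int hash_subvol_of(const char *name)
{
    unsigned int h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return (int)(h % 4);
}

/* Serialized decision: trust the xattr from the lookup if it is already
 * set; only the first hardlink computes and records a hash. */
static int resolve_target(const char *hardlink_name)
{
    if (stored_target != -1)                        /* xattr already present */
        return stored_target;
    stored_target = hash_subvol_of(hardlink_name);  /* first hardlink wins */
    return stored_target;
}

int main(void)
{
    /* Both "parallel" migrations now agree on a single TARGET, because the
     * second caller reuses the value the first one stored. */
    printf("link-a -> subvol %d\n", resolve_target("dir1/link-a"));
    printf("link-b -> subvol %d\n", resolve_target("dir2/link-b"));
    return 0;
}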
    
   
    RACE:2
      The linkto files on TARGET can also be created by other clients doing
      lookups on the hardlinks. Consider a scenario with 100 hardlinks. While
      rebalance is migrating the 99th hardlink, continuous lookups from another
      client can make the link count on TARGET equal to the source link count,
      so rebalance migrates the data on the 99th hardlink itself. On the 100th
      hardlink migration, the hardlink will have TARGET as its cached
      subvolume. If its hashed subvolume is also the same, a migration is
      triggered from TARGET to TARGET, leading to data loss.
    

 This is reproducible intermittently. Since it is related to hardlink migration, it happens only with the remove-brick process.

--- Additional comment from Worker Ant on 2017-07-12 12:44:13 MVT ---

REVIEW: https://review.gluster.org/17755 (cluster/rebalance: Fix hardlink migration failures) posted (#1) for review on master by Susant Palai (spalai)

--- Additional comment from Worker Ant on 2017-07-12 13:55:31 MVT ---

REVIEW: https://review.gluster.org/17755 (cluster/rebalance: Fix hardlink migration failures) posted (#2) for review on master by Susant Palai (spalai)

--- Additional comment from Worker Ant on 2017-07-13 10:38:44 MVT ---

COMMIT: https://review.gluster.org/17755 committed in master by Raghavendra G (rgowdapp) 
------
commit 0d75e39834d4880dce0cb3c79bef4b70bb32874d
Author: Susant Palai <spalai>
Date:   Wed Jul 12 12:01:40 2017 +0530

    cluster/rebalance: Fix hardlink migration failures
    
    A brief overview of how hardlink migration works:
      - Different hardlinks (to the same file) may hash to different bricks,
    but their cached subvol will be the same. Rebalance picks up the first hardlink,
    calculates its hash (call the hashed subvolume TARGET), and sets that subvolume as an
    xattr on the data file.
      - All the hardlinks that come after this fetch that xattr and create linkto files on
    TARGET (the linkto files for the hardlinks are hardlinks to each other on TARGET).
      - When the number of hardlinks on the source equals the number of hardlinks on
    TARGET, the data migration happens.
    
    RACE:1
      Since rebalance is multi-threaded, the first lookup (which decides what the TARGET
    subvol should be) can run for two hardlinks in parallel, and the two migrations may end
    up creating linkto files on two different TARGET subvols. Hence, the hardlinks won't be
    migrated.
    
    Fix: Rely on the xattr response of the lookup inside gf_defrag_handle_hardlink, since it
    is executed under a synclock.
    
    RACE:2
      The linkto files on TARGET can also be created by other clients doing lookups on the
    hardlinks. Consider a scenario with 100 hardlinks. While rebalance is migrating the 99th
    hardlink, continuous lookups from another client can make the link count on TARGET equal
    to the source link count, so rebalance migrates the data on the 99th hardlink itself. On
    the 100th hardlink migration, the hardlink will have TARGET as its cached subvolume. If
    its hashed subvolume is also the same, a migration is triggered from TARGET to TARGET,
    leading to data loss.
    
    Fix: Make sure that, before the final data migration, the source is not the same as the
    destination.
    
    RACE:3
      Since a hardlink can be migrated to a non-hashed subvolume, a lookup from another
    client, or even rebalance itself, might delete the linkto file on TARGET, leading to the
    hardlinks never getting migrated.
    
    This will be addressed in a separate patch in the future.
    
    Change-Id: If0f6852f0e662384ee3875a2ac9d19ac4a6cea98
    BUG: 1469964
    Signed-off-by: Susant Palai <spalai>
    Reviewed-on: https://review.gluster.org/17755
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: N Balachandran <nbalacha>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 1 Worker Ant 2017-07-20 06:40:47 UTC
REVIEW: https://review.gluster.org/17838 (cluster/rebalance: Fix hardlink migration failures) posted (#1) for review on release-3.10 by Susant Palai (spalai)

Comment 2 Worker Ant 2017-07-20 10:10:13 UTC
REVIEW: https://review.gluster.org/17838 (cluster/rebalance: Fix hardlink migration failures) posted (#2) for review on release-3.10 by Susant Palai (spalai)

Comment 3 Worker Ant 2017-08-11 19:37:26 UTC
REVIEW: https://review.gluster.org/17838 (cluster/rebalance: Fix hardlink migration failures) posted (#3) for review on release-3.10 by Shyamsundar Ranganathan (srangana)

Comment 4 Worker Ant 2017-08-11 20:03:49 UTC
COMMIT: https://review.gluster.org/17838 committed in release-3.10 by Shyamsundar Ranganathan (srangana) 
------
commit e0cd91f14eebee77c8ed332cedfd25547daa01d7
Author: Susant Palai <spalai>
Date:   Wed Jul 12 12:01:40 2017 +0530

    cluster/rebalance: Fix hardlink migration failures
    
    A brief overview of how hardlink migration works:
      - Different hardlinks (to the same file) may hash to different bricks,
    but their cached subvol will be the same. Rebalance picks up the first hardlink,
    calculates its hash (call the hashed subvolume TARGET), and sets that subvolume as an
    xattr on the data file.
      - All the hardlinks that come after this fetch that xattr and create linkto files on
    TARGET (the linkto files for the hardlinks are hardlinks to each other on TARGET).
      - When the number of hardlinks on the source equals the number of hardlinks on
    TARGET, the data migration happens.
    
    RACE:1
      Since rebalance is multi-threaded, the first lookup (which decides what the TARGET
    subvol should be) can run for two hardlinks in parallel, and the two migrations may end
    up creating linkto files on two different TARGET subvols. Hence, the hardlinks won't be
    migrated.
    
    Fix: Rely on the xattr response of the lookup inside gf_defrag_handle_hardlink, since it
    is executed under a synclock.
    
    RACE:2
      The linkto files on TARGET can also be created by other clients doing lookups on the
    hardlinks. Consider a scenario with 100 hardlinks. While rebalance is migrating the 99th
    hardlink, continuous lookups from another client can make the link count on TARGET equal
    to the source link count, so rebalance migrates the data on the 99th hardlink itself. On
    the 100th hardlink migration, the hardlink will have TARGET as its cached subvolume. If
    its hashed subvolume is also the same, a migration is triggered from TARGET to TARGET,
    leading to data loss.
    
    Fix: Make sure that, before the final data migration, the source is not the same as the
    destination.
    
    RACE:3
      Since a hardlink can be migrated to a non-hashed subvolume, a lookup from another
    client, or even rebalance itself, might delete the linkto file on TARGET, leading to the
    hardlinks never getting migrated.
    
    This will be addressed in a separate patch in the future.
    
    > Change-Id: If0f6852f0e662384ee3875a2ac9d19ac4a6cea98
    > BUG: 1469964
    > Signed-off-by: Susant Palai <spalai>
    > Reviewed-on: https://review.gluster.org/17755
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: N Balachandran <nbalacha>
    > Reviewed-by: Raghavendra G <rgowdapp>
    > Signed-off-by: Susant Palai <spalai>
    
    Change-Id: If0f6852f0e662384ee3875a2ac9d19ac4a6cea98
    BUG: 1473141
    Signed-off-by: Susant Palai <spalai>
    Reviewed-on: https://review.gluster.org/17838
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>

Comment 5 Shyamsundar 2017-08-21 13:42:02 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.10.5, please open a new bug report.

glusterfs-3.10.5 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-August/000079.html
[2] https://www.gluster.org/pipermail/gluster-users/