Bug 1415761 - [Remove-brick] Hardlink migration fails with "lookup failed (No such file or directory)" error messages in rebalance logs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1409474 1419855 1420184 1420215
Reported: 2017-01-23 16:35 UTC by Nithya Balachandran
Modified: 2017-05-30 18:39 UTC (History)

Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1409474
: 1419855 1420184 1420215 (view as bug list)
Environment:
Last Closed: 2017-05-30 18:39:31 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Nithya Balachandran 2017-01-23 16:35:00 UTC
+++ This bug was initially created as a clone of Bug #1409474 +++

Description of problem:
=======================
If the dataset contains hardlinks, a remove-brick operation fails to migrate a few of them during rebalance. The rebalance logs show the following lookup failures:

[2017-01-02 06:41:06.277232] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4013: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.510761] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4027: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.541836] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4028: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.947640] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4037: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:07.360477] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4047: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:44.231718] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl3284: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:49.990234] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1578: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:50.217159] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1590: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.594092] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1595: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.873224] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1598: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:58.151533] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1586: lookup failed on distrep-replicate-2 (No such file or directory)
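
To see which subvolumes the failed lookups were sent to, the errors can be tallied per subvolume. The following is a minimal sketch, not part of the original report: the function name summarize_lookup_failures is hypothetical, and it assumes the log lines have exactly the format shown above.

```shell
#!/bin/sh
# Hypothetical helper (not a GlusterFS tool): count the
# "Migrate file failed ... lookup failed on <subvol>" errors per subvolume.
# Usage: summarize_lookup_failures /path/to/rebalance.log
summarize_lookup_failures() {
    # Keep only migration failures, extract the subvolume name that follows
    # "lookup failed on", then print "<subvol> <count>" pairs.
    grep 'Migrate file failed' "$1" \
        | sed -n 's/.*lookup failed on \([^ ]*\) (No such file or directory).*/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Run against the eleven lines above, this would report 4 failures on distrep-replicate-0 and 7 on distrep-replicate-2.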


How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and start it.
2) FUSE mount the volume and create a dataset with a large number of hardlinks, for example:
for i in {1..20000};do touch f$i;done
for i in {1..20000};do ln f$i fl$i;done
3) Start remove-brick operation to trigger rebalance.

For a few of the hardlinks, rebalance fails because of lookup failures.
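
The dataset from step 2 can also be generated and checked with a small script. This is a sketch under stated assumptions: the function names make_hardlink_dataset and count_hardlinked are hypothetical, the directory argument is expected to be the FUSE mount point (any local directory works for a dry run), and find is assumed to support -maxdepth.

```shell
#!/bin/sh
# Hypothetical helpers for building the test dataset; not part of GlusterFS.

# make_hardlink_dataset DIR COUNT
# Create f1..fCOUNT in DIR and one hard link fl<i> for each f<i>.
make_hardlink_dataset() {
    dir=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        touch "$dir/f$i" && ln "$dir/f$i" "$dir/fl$i" || return 1
        i=$((i + 1))
    done
}

# count_hardlinked DIR
# Print how many regular files directly under DIR have more than one link.
count_hardlinked() {
    find "$1" -maxdepth 1 -type f -links +1 | wc -l | tr -d ' '
}
```

For the dataset in the report, make_hardlink_dataset would be called with a count of 20000, after which count_hardlinked should print 40000 (every f<i> and fl<i> name).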

Actual results:
===============
Hardlink migration fails during the remove-brick operation.

Expected results:
=================
Hardlinks should be migrated without any errors/issues during remove-brick


Additional info:
================
After the rebalance failures, a few original files and their hardlinks are still present on the decommissioned bricks, so committing the remove-brick would result in the loss of those files.
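
Before committing a remove-brick, the decommissioned brick can be checked for data files that were left behind. The sketch below is an illustration, not an official procedure: the function name list_leftover_files is hypothetical, and it assumes that DHT linkto files on the brick carry only the sticky bit (mode ---------T) and that the brick's internal gfid links live under .glusterfs.

```shell
#!/bin/sh
# Hypothetical helper: list regular files still stored on a brick directory.
# Usage: list_leftover_files /bricks/brick2/b2
list_leftover_files() {
    # Skip the internal .glusterfs tree and, by assumption, DHT linkto files
    # (mode exactly 1000, shown by ls as ---------T).
    find "$1" -name .glusterfs -prune -o -type f ! -perm 1000 -print | sort
}
```

Any path printed for a brick that is about to be removed is a file that has not been migrated and would be lost on commit.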




--- Additional comment from Prasad Desala on 2017-01-02 06:07:16 EST ---

The above snippets of lookup errors from the rebalance logs and the ll output from the decommissioned bricks were taken from different nodes.

Outputs from node 10.70.43.141:
===============================

[2017-01-02 06:41:06.277232] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4013: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.510761] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4027: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.541836] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4028: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.947640] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4037: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:07.360477] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4047: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:44.231718] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl3284: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:49.990234] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1578: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:50.217159] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1590: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.594092] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1595: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.873224] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1598: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:58.151533] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1586: lookup failed on distrep-replicate-2 (No such file or directory)

[root@node1 ~]# ll /bricks/brick2/b2/* | grep -i rw
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4013
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4027
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4028
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4037
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4038
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4047
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5746
-rw-r--r--. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5759
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5828
-rw-r--r--. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5839
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5841
-rw-r--r--. 3 root root 0 Jan  2 11:17 /bricks/brick2/b2/f8016
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4013
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4027
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4028
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4037
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4038
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4047
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5746
-rw-r--r--. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5759
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5828
-rw-r--r--. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5839
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5841
-rw-r--r--. 3 root root 0 Jan  2 11:17 /bricks/brick2/b2/fl8016


Rebalance log errors:
=====================
[2017-01-03 06:40:15.885029] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl5769: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:16.047939] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl5770: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:16.178511] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl5776: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:17.786372] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl6450: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:40:18.483995] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl6466: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:41:19.202179] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl7536: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:41:19.690604] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl7551: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:06.334415] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9913: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:06.452281] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9920: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:42:06.472840] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9922: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:06.781910] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9938: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:42:06.800052] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9940: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:37.065830] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9563: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.321748] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9564: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.350976] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9566: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.372147] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9567: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:56.941938] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11382: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:57.075788] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11383: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:43:41.016772] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl12808: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:43:52.374158] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11814: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:52.860047] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11820: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:52.963148] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11821: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:53.189461] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11836: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:44:49.132674] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl13827: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:44:49.141978] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl13834: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.011654] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl15846: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.450021] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl15860: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.458259] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl15872: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.610044] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl15875: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.056754] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl17948: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.240254] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl17960: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.249345] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl17966: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.561897] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl17977: lookup failed on newdr-replicate-2 (No such file or directory)

Comment 1 Nithya Balachandran 2017-01-23 16:45:32 UTC
RCA:

The remove-brick operation will migrate files with hardlinks (unlike a regular rebalance). The following steps are performed:

1. dht_setxattr (key = GF_XATTR_FILE_MIGRATE_KEY) sets the target/hashed subvolume for a migrate file operation in local->rebalance.target_node.

2. For a hardlink, dht_migrate_file () will use the hashed subvol of the first link to be migrated as the hashed subvolume. This might not match the value in local->rebalance.target_node for the other links.

3. dht_migrate_file returns 0 if __is_file_migratable () / __check_file_has_hardlink returns -2 (indicating that the file is a hardlink).

4. rebalance_task_completion updates the cached subvol in inode_ctx with the value of local->rebalance.target_node. This is incorrect: the file does not exist on that subvol, so subsequent lookups of the remaining hardlinks fail.


Solution:

Do not call dht_layout_preset in rebalance_task_completion as it will be done as part of the syncop_lookup called after a successful file migration in dht_migrate_file.
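
The effect of the stale cached subvol can be illustrated with a toy model. This is not GlusterFS code and greatly simplifies the real flow: the function simulate_hardlink_step and its arguments are hypothetical, and it only models the point that a hardlink is reported as handled (return 0) without its data being moved to the target node.

```shell
#!/bin/sh
# Toy model of the RCA above; all names here are illustrative only.
# simulate_hardlink_step ACTUAL_SUBVOL TARGET_NODE PRESET_LAYOUT
#   ACTUAL_SUBVOL  subvol that really holds the file
#   TARGET_NODE    stand-in for local->rebalance.target_node
#   PRESET_LAYOUT  "yes" = behaviour before the fix, "no" = after the fix
# Prints the outcome of the next lookup on the same inode.
simulate_hardlink_step() {
    actual=$1
    target_node=$2
    preset_layout=$3
    cached=$actual   # cached subvol as recorded in the inode ctx

    # The migrate call returns 0 for a hardlink without moving the data,
    # so $actual is unchanged at this point.
    if [ "$preset_layout" = yes ]; then
        # Old behaviour: the completion step overwrites the cached subvol.
        cached=$target_node
    fi

    if [ "$cached" = "$actual" ]; then
        echo "lookup ok on $cached"
    else
        echo "lookup failed on $cached (No such file or directory)"
    fi
}
```

With the layout preset enabled and a target node that differs from the file's actual subvol, the model prints the same kind of lookup failure seen in the rebalance logs; with the preset skipped, the cached subvol stays correct and the lookup succeeds.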

Comment 2 Nithya Balachandran 2017-01-23 16:57:09 UTC
Upstream patch: 
https://review.gluster.org/#/c/16457/1

Comment 3 Worker Ant 2017-01-30 06:18:19 UTC
REVIEW: https://review.gluster.org/16457 (cluster/dht: Don't update layout in rebalance_task_completion) posted (#3) for review on master by N Balachandran (nbalacha@redhat.com)

Comment 4 Worker Ant 2017-02-06 07:24:27 UTC
COMMIT: https://review.gluster.org/16457 committed in master by Raghavendra G (rgowdapp@redhat.com) 
------
commit ddf05f3d1e39cc920251c809e9ba42fe42b2c5f2
Author: N Balachandran <nbalacha@redhat.com>
Date:   Mon Jan 23 22:19:01 2017 +0530

    cluster/dht: Don't update layout in rebalance_task_completion
    
    Updating the layout in the dht inode_ctx in
    rebalance_task_completion after the file is migrated
    is erroneous in case of files with hardlinks.
    This step can be skipped as the layout will be set
    in the syncop_lookup call post the migration in
    dht_migrate_file.
    
    Change-Id: I24ac798a919585d91a117d6a207e6a31b88486c6
    BUG: 1415761
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
    Reviewed-on: https://review.gluster.org/16457
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
    Reviewed-by: Susant Palai <spalai@redhat.com>

Comment 5 Shyamsundar 2017-05-30 18:39:31 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

