+++ This bug was initially created as a clone of Bug #1415761 +++
+++ This bug was initially created as a clone of Bug #1409474 +++

Description of problem:
=======================
If the dataset contains hardlinks and a remove-brick operation is performed, rebalance fails to migrate a few of the hardlinks. The rebalance logs show the following lookup failures:

[2017-01-02 06:41:06.277232] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4013: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.510761] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4027: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.541836] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4028: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.947640] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4037: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:07.360477] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4047: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:44.231718] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl3284: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:49.990234] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1578: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:50.217159] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1590: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.594092] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1595: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.873224] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1598: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:58.151533] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1586: lookup failed on distrep-replicate-2 (No such file or directory)

How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and start it.
2) FUSE mount the volume and create a dataset with a lot of hardlinks, for example:
for i in {1..20000};do touch f$i;done
for i in {1..20000};do ln f$i fl$i;done
3) Start a remove-brick operation to trigger rebalance.

For a few of the hardlinks, rebalance fails because of lookup failures.

Actual results:
===============
Hardlink migration fails during the remove-brick operation.

Expected results:
=================
Hardlinks should be migrated without any errors/issues during remove-brick.

Additional info:
================
After the rebalance failures, a few original files and hardlinks are still present on the decommissioned bricks, so a commit will result in loss of the files.

--- Additional comment from Prasad Desala on 2017-01-02 06:07:16 EST ---

The above output snippets of lookup errors in rebalance logs and ll from decommissioned bricks were taken from different nodes.
Outputs from node server1
===============================
[2017-01-02 06:41:06.277232] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4013: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.510761] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4027: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.541836] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4028: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.947640] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4037: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:07.360477] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4047: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:44.231718] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl3284: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:49.990234] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1578: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:50.217159] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1590: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.594092] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1595: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.873224] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1598: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:58.151533] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1586: lookup failed on distrep-replicate-2 (No such file or directory)

[root@node1 ~]# ll /bricks/brick2/b2/* | grep -i rw
-rw-r--r--. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/f4013
-rw-rw----. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/f4027
-rw-r--r--. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/f4028
-rw-rw----. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/f4037
-rw-r--r--. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/f4038
-rw-rw----. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/f4047
-rw-rw----. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/f5746
-rw-r--r--. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/f5759
-rw-rw----. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/f5828
-rw-r--r--. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/f5839
-rw-rw----. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/f5841
-rw-r--r--. 3 root root 0 Jan 2 11:17 /bricks/brick2/b2/f8016
-rw-r--r--. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/fl4013
-rw-rw----. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/fl4027
-rw-r--r--. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/fl4028
-rw-rw----. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/fl4037
-rw-r--r--. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/fl4038
-rw-rw----. 3 root root 0 Jan 2 11:13 /bricks/brick2/b2/fl4047
-rw-rw----. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/fl5746
-rw-r--r--. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/fl5759
-rw-rw----. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/fl5828
-rw-r--r--. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/fl5839
-rw-rw----. 3 root root 0 Jan 2 11:15 /bricks/brick2/b2/fl5841
-rw-r--r--. 3 root root 0 Jan 2 11:17 /bricks/brick2/b2/fl8016

Rebalance logs Errors:
======================
[2017-01-03 06:40:15.885029] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl5769: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:16.047939] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl5770: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:16.178511] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl5776: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:17.786372] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl6450: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:40:18.483995] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl6466: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:41:19.202179] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl7536: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:41:19.690604] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl7551: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:06.334415] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9913: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:06.452281] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9920: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:42:06.472840] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9922: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:06.781910] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9938: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:42:06.800052] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9940: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:37.065830] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9563: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.321748] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9564: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.350976] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9566: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.372147] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl9567: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:56.941938] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11382: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:57.075788] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11383: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:43:41.016772] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl12808: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:43:52.374158] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11814: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:52.860047] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11820: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:52.963148] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11821: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:53.189461] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl11836: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:44:49.132674] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl13827: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:44:49.141978] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl13834: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.011654] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl15846: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.450021] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl15860: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.458259] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl15872: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.610044] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl15875: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.056754] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl17948: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.240254] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl17960: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.249345] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl17966: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.561897] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file failed:/fl17977: lookup failed on newdr-replicate-2 (No such file or directory)

--- Additional comment from Nithya Balachandran on 2017-01-23 11:45:32 EST ---

RCA:
The remove-brick operation will migrate files with hardlinks (unlike a regular rebalance). The following steps are performed:

1. dht_setxattr (key = GF_XATTR_FILE_MIGRATE_KEY) sets the target/hashed subvolume for a migrate file operation in local->rebalance.target_node.
2. For a hardlink, dht_migrate_file () will use the hashed subvol of the first link to be migrated as the hashed subvolume. This might not match the value in local->rebalance.target_node for the other links.
3. dht_migrate_file returns 0 if __is_file_migratable () / __check_file_has_hardlink returns -2 (indicating that the file is a hardlink).
4. rebalance_task_completion updates the cached subvol in inode_ctx with the value of local->rebalance.target_node. This is incorrect and causes the lookup failures for successive hardlink lookups as the file does not exist on that subvol.

Solution:
Do not call dht_layout_preset in rebalance_task_completion as it will be done as part of the syncop_lookup called after a successful file migration in dht_migrate_file.
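
The effect described in step 4 of the RCA can be illustrated with a small toy model. This is only a sketch and not GlusterFS code: the `Inode` class, the `lookup`, `migrate_link` and `run` functions, the file names and the subvolume names are all hypothetical, and the post-migration `syncop_lookup` refresh is not modelled. It assumes only what the RCA states: the migrate request for a link of a multi-link file returns 0 without the file being on the per-link target, and the completion handler then overwrites the cached subvol shared by all links.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Toy inode shared by every hard link to one file (hypothetical type)."""
    actual_subvol: str   # subvolume that really holds the file
    cached_subvol: str   # subvolume recorded in the client-side inode context


def lookup(inode: Inode) -> bool:
    """A lookup goes to the cached subvol; it fails if the file is not there."""
    return inode.cached_subvol == inode.actual_subvol


def migrate_link(inode: Inode, target_node: str, update_ctx: bool) -> int:
    """Model one migrate request for a single link of a multi-link file.

    As in step 3 of the RCA, the request returns 0 without moving the data,
    so actual_subvol is left unchanged.
    """
    ret = 0
    if ret == 0 and update_ctx:
        # Step 4 (pre-fix): the completion handler overwrites the cached
        # subvol with the per-link target, which may not hold the file.
        inode.cached_subvol = target_node
    return ret


def run(update_ctx: bool) -> dict:
    """Process two links of one inode; record whether each lookup succeeded."""
    inode = Inode(actual_subvol="replicate-2", cached_subvol="replicate-2")
    # Made-up per-link targets: the two names hash to different subvolumes.
    targets = {"f1": "replicate-0", "fl1": "replicate-1"}
    results = {}
    for name, target in targets.items():
        results[name] = lookup(inode)
        migrate_link(inode, target, update_ctx)
    return results


if __name__ == "__main__":
    print("update in completion handler:", run(update_ctx=True))
    print("no update (as in the fix):   ", run(update_ctx=False))
```

In this model, the lookup for the second link fails only when the completion handler updates the cached subvol, which mirrors the "lookup failed ... (No such file or directory)" pattern seen for the fl* names.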
--- Additional comment from Nithya Balachandran on 2017-01-23 11:57:09 EST ---

Upstream patch:
https://review.gluster.org/#/c/16457/1

--- Additional comment from Worker Ant on 2017-01-30 01:18:19 EST ---

REVIEW: https://review.gluster.org/16457 (cluster/dht: Don't update layout in rebalance_task_completion) posted (#3) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-02-06 02:24:27 EST ---

COMMIT: https://review.gluster.org/16457 committed in master by Raghavendra G (rgowdapp)
------
commit ddf05f3d1e39cc920251c809e9ba42fe42b2c5f2
Author: N Balachandran <nbalacha>
Date:   Mon Jan 23 22:19:01 2017 +0530

    cluster/dht: Don't update layout in rebalance_task_completion

    Updating the layout in the dht inode_ctx in rebalance_task_completion
    after the file is migrated is erroneous in case of files with hardlinks.
    This step can be skipped as the layout will be set in the syncop_lookup
    call post the migration in dht_migrate_file.

    Change-Id: I24ac798a919585d91a117d6a207e6a31b88486c6
    BUG: 1415761
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/16457
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: Susant Palai <spalai>
REVIEW: https://review.gluster.org/16554 (cluster/dht: Don't update layout in rebalance_task_completion) posted (#1) for review on release-3.10 by N Balachandran (nbalacha)
COMMIT: https://review.gluster.org/16554 committed in release-3.10 by Shyamsundar Ranganathan (srangana)
------
commit b1d35c6fbaf5e0e958c69ec9c99a5d87649e52bb
Author: N Balachandran <nbalacha>
Date:   Mon Jan 23 22:19:01 2017 +0530

    cluster/dht: Don't update layout in rebalance_task_completion

    Updating the layout in the dht inode_ctx in rebalance_task_completion
    after the file is migrated is erroneous in case of files with hardlinks.
    This step can be skipped as the layout will be set in the syncop_lookup
    call post the migration in dht_migrate_file.

    > Change-Id: I24ac798a919585d91a117d6a207e6a31b88486c6
    > BUG: 1415761
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: https://review.gluster.org/16457
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Raghavendra G <rgowdapp>
    > Reviewed-by: Susant Palai <spalai>
    (cherry picked from commit ddf05f3d1e39cc920251c809e9ba42fe42b2c5f2)

    Signed-off-by: N Balachandran <nbalacha>
    Change-Id: I46176f8605cfb782aa17c2bc9ffd7a5f0d07dd88
    BUG: 1419855
    Reviewed-on: https://review.gluster.org/16554
    Tested-by: N Balachandran <nbalacha>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/