A bug in the remove-brick code can cause file migration on some files with multiple hardlinks to fail. Files may be left behind on the removed brick. These will not be available on the gluster volume once the remove-brick operation is committed.
Workaround:
Once the remove-brick operation is complete, check for any files left behind on the removed bricks and copy them to the volume via a mount point.
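The check-and-copy step of the workaround could be sketched in shell roughly as follows. This is a minimal sketch, not an official tool: the function names `list_leftover_files` and `copy_leftovers_to_volume` are hypothetical, and it assumes that the brick's internal `.glusterfs` directory and DHT link-to files (taken here to be files whose mode is only the sticky bit) should be skipped. Each name is copied as an independent file, so hardlink relationships are not recreated on the volume.

```shell
# List files still present on a removed brick, relative to the brick root.
# Assumption: .glusterfs is internal metadata, and files with mode 1000
# (---------T) are DHT link-to files rather than user data.
list_leftover_files() {
    brick_dir=$1
    (
        cd "$brick_dir" || exit 1
        find . -path ./.glusterfs -prune -o -type f ! -perm 1000 -print |
            sed 's|^\./||' | sort
    )
}

# Copy every leftover file to the volume through a client mount point.
# Files that already exist on the volume are left untouched.
copy_leftovers_to_volume() {
    brick_dir=$1
    mount_dir=$2
    list_leftover_files "$brick_dir" | while IFS= read -r rel; do
        if [ -e "$mount_dir/$rel" ]; then
            continue
        fi
        mkdir -p "$mount_dir/$(dirname "$rel")"
        cp -p "$brick_dir/$rel" "$mount_dir/$rel"
    done
}
```

A call might look like `copy_leftovers_to_volume /bricks/brick3 /mnt/distrep`, where both paths are placeholders for the removed brick directory and the FUSE mount point. Running `list_leftover_files` on its own first gives a chance to review what would be copied.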
Description of problem:
=======================
If the dataset contains hardlinks, rebalance fails to migrate some of them during a remove-brick operation. The rebalance logs show the following lookup failure errors:
[2017-01-02 06:41:06.277232] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4013: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.510761] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4027: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.541836] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4028: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.947640] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4037: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:07.360477] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4047: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:44.231718] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl3284: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:49.990234] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1578: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:50.217159] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1590: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.594092] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1595: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.873224] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1598: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:58.151533] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1586: lookup failed on distrep-replicate-2 (No such file or directory)
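To turn log lines like the ones above into a list of affected files, something along these lines may help. It is a sketch only: the function name `failed_migrations` is hypothetical, and the pattern assumes the exact message wording shown in this report ("Migrate file failed:<path>: lookup failed on <subvolume>").

```shell
# Print "<subvolume> <path>" for every failed migration caused by a
# lookup failure in the given rebalance log file.
failed_migrations() {
    sed -n 's/.*Migrate file failed:\(.*\): lookup failed on \([^ ]*\) .*/\2 \1/p' "$1"
}
```

For the volume in this report the log would typically be something like `/var/log/glusterfs/distrep-rebalance.log`; that path is an assumption based on the default log location and may differ on a given installation.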
Version-Release number of selected component (if applicable):
3.8.4-10.el7rhgs.x86_64
How reproducible:
=================
Always
Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and start it.
2) FUSE mount the volume and create a dataset with a large number of hardlinks, for example:
for i in {1..20000};do touch f$i;done
for i in {1..20000};do ln f$i fl$i;done
3) Start remove-brick operation to trigger rebalance.
For some of the hardlinks, rebalance fails to migrate the file because of lookup failures.
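The steps above can be collected into a small script, sketched below. The helper names `make_hardlink_dataset` and `start_remove_brick` are hypothetical, and the volume name and brick paths passed to them are placeholders. For a Distributed-Replicate volume, all bricks of one replica set have to be passed to remove-brick together.

```shell
# Create COUNT empty files f1..fCOUNT in DIR, each with one hardlink flN.
make_hardlink_dataset() {
    dir=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        touch "$dir/f$i"
        ln "$dir/f$i" "$dir/fl$i"
        i=$((i + 1))
    done
}

# Start a remove-brick operation (which triggers rebalance) and show
# its status. Usage: start_remove_brick VOLUME BRICK...
start_remove_brick() {
    volume=$1
    shift
    gluster volume remove-brick "$volume" "$@" start
    gluster volume remove-brick "$volume" "$@" status
}
```

On the FUSE mount this might be run as `make_hardlink_dataset /mnt/distrep 20000`, followed by `start_remove_brick distrep server1:/bricks/b3 server2:/bricks/b3` on a server node; the mount point, host names, and brick paths here are illustrative only.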
Actual results:
===============
Hardlink migration is failing during remove-brick operation
Expected results:
=================
Hardlinks should be migrated without any errors/issues during remove-brick
(In reply to Nithya Balachandran from comment #4)
> Unrelated, but why are
>
> disperse.shd-max-threads: 1
> disperse.shd-wait-qlength: 1024
>
>
> visible for a non-disperse volume?
This looks like a bug. I will file a new BZ for this issue.
Comment 14 Nithya Balachandran
2017-01-24 07:06:37 UTC
Verified this BZ on glusterfs version 3.8.4-28.el7rhgs.x86_64.
Followed the same steps as in the description. During the remove-brick operation, hardlinks are migrated without any failures, and the errors reported in this BZ no longer appear in the rebalance logs.
Moving this BZ to Verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:2774