Description of problem:
=======================
When the dataset contains hardlinks, the rebalance triggered by a remove-brick operation fails to migrate some of them. The rebalance logs show lookup failures such as the following:

[2017-01-02 06:41:06.277232] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4013: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.510761] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4027: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.541836] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4028: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:06.947640] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4037: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:07.360477] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl4047: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:44.231718] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl3284: lookup failed on distrep-replicate-2 (No such file or directory)
[2017-01-02 06:41:49.990234] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1578: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:50.217159] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1590: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.594092] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1595: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:51.873224] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1598: lookup failed on distrep-replicate-0 (No such file or directory)
[2017-01-02 06:41:58.151533] E [MSGID: 109023] [dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file failed:/fl1586: lookup failed on distrep-replicate-2 (No such file or directory)

Version-Release number of selected component (if applicable):
3.8.4-10.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and start it.
2) FUSE mount the volume and create a dataset with a large number of hardlinks, for example:
   for i in {1..20000};do touch f$i;done
   for i in {1..20000};do ln f$i fl$i;done
3) Start a remove-brick operation to trigger rebalance.

For some of the hardlinks, rebalance fails with lookup failures.

Actual results:
===============
Hardlink migration fails during the remove-brick operation.

Expected results:
=================
Hardlinks should be migrated without any errors or issues during remove-brick.
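To see how the failures are spread across subvolumes, the log entries quoted in the description can be tallied with a small helper. This is only a sketch: the function name summarize_lookup_failures is made up for illustration, and it assumes the failure lines keep the exact "Migrate file failed:<path>: lookup failed on <subvolume>" wording shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of glusterfs): count
# "Migrate file failed ... lookup failed on <subvolume>" entries per subvolume.
# Usage: summarize_lookup_failures <rebalance-log>
summarize_lookup_failures() {
    # Keep only the subvolume name from each matching line,
    # then count occurrences and print "<subvolume> <count>".
    sed -n 's/.*Migrate file failed:.*: lookup failed on \([^ ]*\) .*/\1/p' "$1" |
        sort |
        uniq -c |
        awk '{ print $2, $1 }'
}
```

It takes the path of the rebalance log as its only argument. The log location depends on the installation; on many setups it is under /var/log/glusterfs/, but that path is an assumption here and is not stated in this report.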
Isn't this similar to https://bugzilla.redhat.com/show_bug.cgi?id=1399513 ?
Unrelated, but why are

disperse.shd-max-threads: 1
disperse.shd-wait-qlength: 1024

visible for a non-disperse volume?
(In reply to Nithya Balachandran from comment #4)
> Unrelated, but why are
>
> disperse.shd-max-threads: 1
> disperse.shd-wait-qlength: 1024
>
> visible for a non-disperse volume?

Looks like a bug. I will file a new BZ for this issue.
Upstream patch: https://review.gluster.org/#/c/16457
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/101306/
Verified this BZ on glusterfs version 3.8.4-28.el7rhgs.x86_64. Following the same steps as in the description, hardlinks are migrated during the remove-brick operation without any failures, and the errors reported in this BZ no longer appear in the rebalance logs. Moving this BZ to Verified.
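Besides checking the rebalance logs, the dataset itself can be checked from the mount point after the remove-brick completes. The sketch below is not part of the verification that was actually run; the function name count_broken_hardlinks is hypothetical, and it assumes the f$i / fl$i naming from the reproduction steps and that each pair still reports the same device and inode on the mount when the link is intact.

```shell
#!/bin/sh
# Hypothetical check (illustrative only): count f<i>/fl<i> pairs that no
# longer refer to the same file, or where either name is missing.
# Usage: count_broken_hardlinks <directory> <number-of-pairs>
count_broken_hardlinks() {
    dir=$1
    count=$2
    broken=0
    i=1
    while [ "$i" -le "$count" ]; do
        # -ef is true only if both names exist and refer to the same file.
        if ! [ "$dir/f$i" -ef "$dir/fl$i" ]; then
            broken=$((broken + 1))
        fi
        i=$((i + 1))
    done
    echo "$broken"
}
```

Run against the mount directory with the number of pairs created (20000 in the steps above), a result of 0 would mean every hardlink pair is still intact.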
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774