Bug 1463248 - [Remove-brick] Hardlink migration fails with "migrate-data failed for $file [Unknown error 109023]" errors in rebalance logs
Summary: [Remove-brick] Hardlink migration fails with "migrate-data failed for $file [Unknown error 109023]" errors in rebalance logs
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Susant Kumar Palai
QA Contact: Prasad Desala
URL:
Whiteboard: dht-data-loss
Depends On: 1464495
Blocks:
 
Reported: 2017-06-20 12:35 UTC by Prasad Desala
Modified: 2020-06-10 12:12 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1464495
Environment:
Last Closed: 2020-02-18 06:06:17 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1469971 0 unspecified CLOSED cluster/dht: Fix hardlink migration failures 2021-02-22 00:41:40 UTC

Internal Links: 1469971

Description Prasad Desala 2017-06-20 12:35:57 UTC
Description of problem:
=======================
With a large dataset of files and hardlinks, remove-brick migration fails for a few files, throwing the errors below in the rebalance logs. Because of the migration failures, a few files are left behind on the decommissioned bricks, so we will end up losing those files on the mountpoint if we commit the remove-brick.
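
Before a remove-brick commit, any leftover files can be spotted directly on the decommissioned bricks. A minimal shell sketch, assuming a hypothetical brick path: anything it prints is real data that a commit would drop from the volume (dht linkto files, identified by the trusted.glusterfs.dht.linkto xattr, are only pointers and are expected to remain):

BRICK=/bricks/brick1/distrep   # hypothetical decommissioned brick path
find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type f -print |
while read -r f; do
    # Files without the linkto xattr are real data, not dht pointers.
    getfattr -n trusted.glusterfs.dht.linkto --absolute-names "$f" >/dev/null 2>&1 \
        || echo "data file left behind: $f"
done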

[2017-06-20 10:55:29.207269] I [MSGID: 109045] [dht-common.c:2015:dht_lookup_everywhere_cbk] 0-distrep-dht: attempting deletion of stale linkfile /fl13739 on distrep-readdir-ahead-1 (hashed subvol is distrep-readdir-ahead-3)
[2017-06-20 10:55:29.215749] I [MSGID: 109069] [dht-common.c:1327:dht_lookup_unlink_cbk] 0-distrep-dht: lookup_unlink returned with op_ret -> 0 and op-errno -> 0 for /fl13739
[2017-06-20 10:55:29.235774] I [dht-rebalance.c:1514:dht_migrate_file] 0-distrep-dht: /fl13739: attempting to move from distrep-readdir-ahead-0 to distrep-readdir-ahead-3
[2017-06-20 10:55:29.243632] I [dht-rebalance.c:403:gf_defrag_handle_hardlink] 0-distrep-dht: Attempting to migrate hardlink fl13739 with gfid be689428-c8e4-45f4-9871-95ddf9e31719 from distrep-readdir-ahead-0 -> distrep-readdir-ahead-3
[2017-06-20 10:55:29.255513] W [MSGID: 114031] [client-rpc-fops.c:2777:client3_3_link_cbk] 0-distrep-client-2: remote operation failed: (/fl13739 -> /fl13739) [Stale file handle]
[2017-06-20 10:55:29.262225] W [MSGID: 114031] [client-rpc-fops.c:2777:client3_3_link_cbk] 0-distrep-client-3: remote operation failed: (/fl13739 -> /fl13739) [Stale file handle]
[2017-06-20 10:55:29.266772] E [MSGID: 109084] [dht-rebalance.c:459:gf_defrag_handle_hardlink] 0-distrep-dht: link of fl13739 -> be689428-c8e4-45f4-9871-95ddf9e31719 failed on  subvol distrep-readdir-ahead-1 [Stale file handle]
[2017-06-20 10:55:29.266967] W [MSGID: 109023] [dht-rebalance.c:522:__check_file_has_hardlink] 0-distrep-dht: Migrate file failed:/fl13739: failed to migrate file with link
[2017-06-20 10:55:29.272951] E [MSGID: 116] [dht-rebalance.c:2667:gf_defrag_migrate_single_file] 0-distrep-dht: migrate-data failed for /fl13739 [Unknown error 109023]
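
Two notes on the log excerpt, as far as can be inferred from it: the "link of fl13739 -> be689428-..." line is the hardlink-migration step in gf_defrag_handle_hardlink, which recreates the hard link on the destination subvolume against the file's gfid and here fails with ESTALE ("Stale file handle"); and "[Unknown error 109023]" is not a real errno: 109023 matches the MSGID of the preceding warning, so the message ID appears to be propagated as op_errno, a value strerror() has no name for. A quick way to confirm the strerror() fallback from a shell:

python -c 'import os; print(os.strerror(109023))'   # prints: Unknown error 109023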

Version-Release number of selected component (if applicable):
3.8.4-28.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
===================
1) Create a distributed-replicate volume and start it.
2) Enable brick multiplexing ("cluster.brick-multiplex") and set the below options to enable parallel-readdir:
gluster volume set <VOLNAME> performance.parallel-readdir on
gluster volume set <VOLNAME> rda-cache-limit 10MB
3) CIFS-mount the volume on multiple clients.
4) Perform the below tasks simultaneously from multiple clients:
     a) From client-1, create files:  for i in {1..20000};do touch f$i;done
     b) From client-2, create hard links for the created files:  for i in {1..20000};do ln f$i fl$i;done
     c) From client-3, change the permissions of the created files:  for i in {1..20000};do chmod 660 f$i;done
     d) From client-4, run continuous lookups from two terminals.
5) While the tasks in step-4 are in progress, add a few bricks to the volume and start rebalance.
6) Wait till step-4 and step-5 complete.
7) Now remove the bricks added in step-5, with continuous lookups still running from multiple clients (see the CLI sketch after this list).
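
For reference, a minimal CLI sketch of steps 1-3 and 5-7, assuming bash, a 2x2 distributed-replicate volume named "distrep", and hypothetical server names, brick paths, and CIFS share name; the option names and the rebalance/remove-brick syntax are standard gluster CLI:

# Steps 1-2: create and start the volume, then set the options.
gluster volume create distrep replica 2 server{1..4}:/bricks/brick0/distrep
gluster volume start distrep
gluster volume set all cluster.brick-multiplex on
gluster volume set distrep performance.parallel-readdir on
gluster volume set distrep rda-cache-limit 10MB

# Step 3: CIFS mount from a client (assumes the volume is exported via Samba).
mount -t cifs -o username=root //server1/gluster-distrep /mnt/distrep

# Steps 5 and 7: expand and rebalance, then decommission the same bricks.
gluster volume add-brick distrep server{5..6}:/bricks/brick0/distrep
gluster volume rebalance distrep start
gluster volume rebalance distrep status
gluster volume remove-brick distrep server{5..6}:/bricks/brick0/distrep start
gluster volume remove-brick distrep server{5..6}:/bricks/brick0/distrep status
# Do not "remove-brick ... commit" while status reports failures.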

Remove-brick completed with many failures, and a few files are left on the decommissioned bricks.

Actual results:
===============
Remove-brick fails to migrate a few files.

Expected results:
=================
All files should be migrated without errors during remove-brick.

Comment 11 Atin Mukherjee 2017-06-27 06:07:20 UTC
upstream patch : https://review.gluster.org/17619

