Bug 1099189 - [DHT:REBALANCE] Migration failures during rebalance with "failed to get trusted.distribute.linkinfo key - Stale file handle" message
Summary: [DHT:REBALANCE] Migration failures during rebalance with "failed to get trusted.distribute.linkinfo key - Stale file handle" message
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.0
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.4
Assignee: Nithya Balachandran
QA Contact: Amit Chaurasia
URL:
Whiteboard:
Depends On: 1162306
Blocks: 1171330
 
Reported: 2014-05-19 17:28 UTC by shylesh
Modified: 2015-10-28 00:10 UTC
CC: 7 users

Fixed In Version: glusterfs-3.6.0.50-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-30 15:43:14 UTC
Embargoed:



Description shylesh 2014-05-19 17:28:21 UTC
Description of problem:
Migration of some files fails with error "failed to get trusted.distribute.linkinfo key - Stale file handle"

Version-Release number of selected component (if applicable):
[root@rhs-client4 d0]# rpm -qa| grep glus
gluster-swift-object-1.10.0-2.el6rhs.noarch
glusterfs-libs-3.6.0.3-1.el6rhs.x86_64
glusterfs-cli-3.6.0.3-1.el6rhs.x86_64
gluster-swift-1.10.0-2.el6rhs.noarch
gluster-nagios-common-0.1.0-26.git2b35b66.el6rhs.x86_64
gluster-swift-proxy-1.10.0-2.el6rhs.noarch
gluster-swift-account-1.10.0-2.el6rhs.noarch
glusterfs-3.6.0.3-1.el6rhs.x86_64
glusterfs-api-3.6.0.3-1.el6rhs.x86_64
glusterfs-server-3.6.0.3-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.3-1.el6rhs.x86_64
gluster-swift-container-1.10.0-2.el6rhs.noarch
gluster-swift-plugin-1.10.0-5.el6rhs.noarch
gluster-nagios-addons-0.1.0-57.git9d252a3.el6rhs.x86_64
samba-glusterfs-3.6.9-168.1.el6rhs.x86_64
vdsm-gluster-4.14.5-21.git7a3d0f0.el6rhs.noarch
glusterfs-fuse-3.6.0.3-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.3-1.el6rhs.x86_64


How reproducible:


Steps to Reproduce:
1. Create a distribute volume with 2 bricks
2. Populate the mount point with directories of reasonable depth and many files
3. Add a brick and start rebalance
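
Roughly, the command sequence would look like the following sketch (host and brick names are taken from the "gluster v info dist" output below; the /mnt mount point and the directory depth are assumptions, and "volume create" may need "force" depending on the brick filesystem):

# sketch of the reproduction steps; hosts/brick paths match the volume info below
gluster volume create dist rhs-client4.lab.eng.blr.redhat.com:/home/d0 \
    rhs-client39.lab.eng.blr.redhat.com:/home/d1
gluster volume start dist
mount -t glusterfs rhs-client4.lab.eng.blr.redhat.com:/dist /mnt

# populate deep directory trees with many files (depth and file count are arbitrary here)
mkdir -p /mnt/dir1/dir2/dir3
cp -r /etc /mnt/dir1/dir2/dir3/

# add a third brick and start rebalance
gluster volume add-brick dist rhs-client4.lab.eng.blr.redhat.com:/home/d2
gluster volume rebalance dist start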

Actual results:
Migration fails for some of the files.

[root@rhs-client4 mnt]# gluster v rebalance dist status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            18501         1.5GB        113426             9             0            completed            1722.00
     rhs-client39.lab.eng.blr.redhat.com            15116       432.1MB        114205             7             0            completed            1722.00
volume rebalance: dist: success:
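
The per-file reason for each of those failures is recorded in the rebalance log on the corresponding node; assuming the default log location (/var/log/glusterfs/<volname>-rebalance.log), the affected paths can be listed with something like:

# on each node, list the files whose migration failed during this rebalance run
grep -E "failed to get (node-uuid|trusted.distribute.linkinfo)" \
    /var/log/glusterfs/dist-rebalance.log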



[root@rhs-client4 mnt]# gluster v info dist

Volume Name: dist
Type: Distribute
Volume ID: 83ee6103-00e8-48de-9d62-f4277becb93b
Status: Started
Snap Volume: no
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/d0
Brick2: rhs-client39.lab.eng.blr.redhat.com:/home/d1
Brick3: rhs-client4.lab.eng.blr.redhat.com:/home/d2



Error messages from the rebalance log
-------------------------------------
[2014-05-19 11:03:09.206607] E [dht-common.c:1939:dht_vgetxattr_cbk] 0-dist-dht: Subvolume dist-client-1 returned -1 (Stale file handle)
[2014-05-19 11:03:09.206643] E [dht-rebalance.c:1323:gf_defrag_migrate_data] 0-dist-dht: Failed to get node-uuid for /dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/


[2014-05-19 11:21:44.475811] W [client-rpc-fops.c:1087:client3_3_getxattr_cbk] 0-dist-client-2: remote operation failed: Stale file handle. Path: /dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/autofs (920af359-1122-4c5a-b208-99351f15525c). Key: trusted.glusterfs.pathinfo

[2014-05-19 11:21:44.475973] E [dht-rebalance.c:1376:gf_defrag_migrate_data] 0-dist-dht: /dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/autofs: failed to get trusted.distribute.linkinfo key - Stale file handle
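
To check whether the stale handle is transient, the same virtual-xattr lookup can be retried by hand from the mount point once rebalance finishes (a sketch; <path> stands for the full .../rc.d/init.d/autofs path from the log above):

# re-issue the getxattr that the rebalance process makes
getfattr -n trusted.glusterfs.pathinfo -e text /mnt/<path>
# a successful reply names the brick currently holding the file; ESTALE here
# would mean the stale handle is still reproducible outside of rebalance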




xattrs from the bricks
---------------------

BRICK0
-------
[root@rhs-client4 d0]# getfattr -d -m . -e hex ../d*/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
# file: ../d0/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
trusted.gfid=0x269b8041f7914c7c8ad86ed616df04b8
trusted.glusterfs.dht=0x00000001000000000000000055555554

BRICK2
----------
# file: ../d2/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
trusted.gfid=0x269b8041f7914c7c8ad86ed616df04b8
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9

BRICK1
------------
[root@rhs-client39 d1]# getfattr -d -m . -e hex dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
# file: dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
trusted.gfid=0x269b8041f7914c7c8ad86ed616df04b8
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
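
For reference, the last two 32-bit words of trusted.glusterfs.dht are the start and end of the hash range the layout assigns to that brick (an interpretation of the on-disk layout encoding, not something stated in this report), so the three directories above cover 0x00000000-0x55555554, 0x55555555-0xaaaaaaa9 and 0xaaaaaaaa-0xffffffff with no gap or overlap. A quick way to pull out just those two words, using a shortened placeholder path:

# dump the DHT layout xattr on a brick-side directory and print the last two
# 32-bit words (assumed to be hash range start/end); the path is a placeholder
getfattr -n trusted.glusterfs.dht -e hex /home/d0/dir1 2>/dev/null | \
    awk -F= '/trusted.glusterfs.dht/ {print "start=0x" substr($2,19,8), "end=0x" substr($2,27,8)}'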




Attached the sosreports.

Comment 4 senaik 2014-09-10 11:39:49 UTC
Faced a similar issue while testing on glusterfs 3.6.0.28.

24-node cluster
12x2 distributed-replicate volume

To an existing 12-node cluster, added 12 more bricks and started rebalance while I/O was going on simultaneously; rebalance failures were reported on a few nodes.
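
Roughly, the sequence was the following (a sketch; brick host names and paths here are placeholders, not taken from this report):

# add 12 bricks (6 new replica pairs for a replica-2 volume) and rebalance
# while client I/O continues; host/path names below are placeholders
gluster volume add-brick Volume0_Scale \
    newhost1:/bricks/b1 newhost2:/bricks/b2   # ...repeat for the remaining pairs
gluster volume rebalance Volume0_Scale start
gluster volume rebalance Volume0_Scale status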

[root@dhcp-8-29-222 ~]# gluster v rebalance Volume0_Scale status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             2533        59.0GB         22855             1             0            completed            7900.00
                             10.8.30.229             2403        40.0GB         23018             2             0            completed            7899.00
                              10.8.30.28                0        0Bytes         22746             0             0            completed            7899.00
                             10.8.29.220               24        24.0GB           175             0             0          in progress            9347.00
                             10.8.30.126                0        0Bytes         22743             0             0            completed            7899.00
                             10.8.29.209                0        0Bytes         22586             0             0            completed            7899.00
                              10.8.30.89               17      302Bytes         22773             0             0            completed            7899.00
                              10.8.30.26                0        0Bytes         22743             0             0            completed            7897.00
                              10.8.30.87                0        0Bytes         22838             0             0            completed            7899.00
                             10.8.29.203                0        0Bytes         22818             0             0            completed            7897.00
                              10.8.29.36             2439        30.0GB         23214             0             0            completed            7899.00
                             10.8.29.193                0        0Bytes         22745             0             0            completed            7899.00
                              10.8.30.29               34        30.4KB         22633             0             0            completed            7901.00
                             10.8.29.212             2482        51.0GB         22579             2             0            completed            7900.00
                              10.8.30.27                0        0Bytes         22692             0             0            completed            7899.00
                              10.8.30.30                0        0Bytes         22733             0             0            completed            7898.00
                             10.8.29.179             2458        49.0GB         22855             0             0            completed            7902.00
                             10.8.29.197                0        0Bytes         22833             0             0            completed            7899.00
                             10.8.29.116                0        0Bytes         22770             0             0            completed            7898.00
                             10.8.29.210                0        0Bytes         22541             0             0            completed            7899.00
                             10.8.29.190                0        0Bytes         22631             0             0            completed            7898.00
                              10.8.30.88               39        27.7KB         22812             0             0            completed            7897.00
                              10.8.30.31               39       198.6KB         22221             0             0            completed            7898.00
                             10.8.29.196               20      491Bytes         22776             0             0            completed            7897.00
volume rebalance: Volume0_Scale: success: 



Log messages:
=============
>/usr/lib64/glusterfs/3.6.0.28/xlator/cluster/distribute.so(dht_migrate_file+0x43f) [0x7f64d93f2f2f]))) 0-dict: dict is NULL
[2014-09-10 11:03:01.093454] E [dht-rebalance.c:1574:gf_defrag_migrate_data] 0-Volume0_Scale-dht: /fuse_etc.18/rc.d/init.d/ipmievd: failed to get trusted.distribute.linkinfo key - Stale file handle
[2014-09-10 11:03:01.104883] I [dht-rebalance.c:878:dht_migrate_file] 0-Volume0_Scale-dht: /fuse_etc.18/rc.d/init.d/killall: attempting to move from Volume0_Scale-replicate-0 to Volume0_Scale-replicate-2
[2014-09-10 11:03:01.105958] W [dict.c:480:dict_unref] (-->/lib64/libc.so.6() [0x39a3c43bf0] (-->/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12) [0x31fd45ba62] (-->/usr/lib64/glusterfs/3.6.0.28/xlator/cluster/distribute.so(dht_migrate_file+0x2a6) [0x7f64d93f2d96]))) 0-dict: dict is NULL

Comment 5 Raghavendra G 2014-10-17 10:30:14 UTC
What operations were run on the mount point while rebalance was running? If it was unlink or rmdir, it's highly likely that this and bz 1151384 are duplicates of each other.

Comment 6 shylesh 2014-10-27 11:09:50 UTC
(In reply to Raghavendra G from comment #5)
> What operations were run on the mount point while rebalance was running? If
> it was unlink or rmdir, it's highly likely that this and bz 1151384 are
> duplicates of each other.

AFAIK, no I/O was done during rebalance; data creation was done before starting rebalance.

Comment 9 Amit Chaurasia 2015-03-19 10:15:20 UTC
Did not see the error messages in the latest build on any nodes.
Marking the bug verified.

Comment 10 Amit Chaurasia 2015-03-19 10:17:44 UTC
Did not see the error messages in the latest build on any nodes.
Marking the bug verified.

