Bug 1099189
| Summary: | [DHT:REBALANCE] Migration failures during rebalance with "failed to get trusted.distribute.linkinfo key - Stale file handle" message | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | shylesh <shmohan> |
| Component: | distribute | Assignee: | Nithya Balachandran <nbalacha> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Amit Chaurasia <achauras> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | rhgs-3.0 | CC: | achauras, annair, mzywusko, senaik, shmohan, ssamanta, vagarwal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.0.4 | ||
| Hardware: | x86_64 | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | glusterfs-3.6.0.50-1 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-03-30 15:43:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1162306 | ||
| Bug Blocks: | 1171330 | ||
Faced a similar issue while testing on glusterfs 3.6.0.28:
24-node cluster
12x2 distributed-replicate volume
To the existing 12-node cluster, added 12 more bricks and started rebalance while I/O was going on simultaneously; rebalance failures were reported on a few nodes.
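For reference, a minimal sketch of the expansion step described above, assuming placeholder hostnames and brick paths (the exact bricks used in this test are not recorded here):

```sh
# Add new bricks to the existing replica-2 volume; hostnames and brick paths
# are placeholders, and bricks must be added in replica-size (pairs) multiples.
gluster volume add-brick Volume0_Scale server13:/bricks/b1 server14:/bricks/b1

# Start rebalance so existing data is spread onto the new bricks
gluster volume rebalance Volume0_Scale start
```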
[root@dhcp-8-29-222 ~]# gluster v rebalance Volume0_Scale status
| Node | Rebalanced-files | size | scanned | failures | skipped | status | run time in secs |
|---|---|---|---|---|---|---|---|
| localhost | 2533 | 59.0GB | 22855 | 1 | 0 | completed | 7900.00 |
| 10.8.30.229 | 2403 | 40.0GB | 23018 | 2 | 0 | completed | 7899.00 |
| 10.8.30.28 | 0 | 0Bytes | 22746 | 0 | 0 | completed | 7899.00 |
| 10.8.29.220 | 24 | 24.0GB | 175 | 0 | 0 | in progress | 9347.00 |
| 10.8.30.126 | 0 | 0Bytes | 22743 | 0 | 0 | completed | 7899.00 |
| 10.8.29.209 | 0 | 0Bytes | 22586 | 0 | 0 | completed | 7899.00 |
| 10.8.30.89 | 17 | 302Bytes | 22773 | 0 | 0 | completed | 7899.00 |
| 10.8.30.26 | 0 | 0Bytes | 22743 | 0 | 0 | completed | 7897.00 |
| 10.8.30.87 | 0 | 0Bytes | 22838 | 0 | 0 | completed | 7899.00 |
| 10.8.29.203 | 0 | 0Bytes | 22818 | 0 | 0 | completed | 7897.00 |
| 10.8.29.36 | 2439 | 30.0GB | 23214 | 0 | 0 | completed | 7899.00 |
| 10.8.29.193 | 0 | 0Bytes | 22745 | 0 | 0 | completed | 7899.00 |
| 10.8.30.29 | 34 | 30.4KB | 22633 | 0 | 0 | completed | 7901.00 |
| 10.8.29.212 | 2482 | 51.0GB | 22579 | 2 | 0 | completed | 7900.00 |
| 10.8.30.27 | 0 | 0Bytes | 22692 | 0 | 0 | completed | 7899.00 |
| 10.8.30.30 | 0 | 0Bytes | 22733 | 0 | 0 | completed | 7898.00 |
| 10.8.29.179 | 2458 | 49.0GB | 22855 | 0 | 0 | completed | 7902.00 |
| 10.8.29.197 | 0 | 0Bytes | 22833 | 0 | 0 | completed | 7899.00 |
| 10.8.29.116 | 0 | 0Bytes | 22770 | 0 | 0 | completed | 7898.00 |
| 10.8.29.210 | 0 | 0Bytes | 22541 | 0 | 0 | completed | 7899.00 |
| 10.8.29.190 | 0 | 0Bytes | 22631 | 0 | 0 | completed | 7898.00 |
| 10.8.30.88 | 39 | 27.7KB | 22812 | 0 | 0 | completed | 7897.00 |
| 10.8.30.31 | 39 | 198.6KB | 22221 | 0 | 0 | completed | 7898.00 |
| 10.8.29.196 | 20 | 491Bytes | 22776 | 0 | 0 | completed | 7897.00 |
volume rebalance: Volume0_Scale: success:
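To see which files hit the failures on the nodes reporting them, the rebalance log on those nodes can be searched for the error; a sketch, assuming the default log location:

```sh
# Run on each node that reported failures; the path assumes the default
# /var/log/glusterfs/<volname>-rebalance.log location.
grep "failed to get trusted.distribute.linkinfo" /var/log/glusterfs/Volume0_Scale-rebalance.log
```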
Log messages :
=============
>/usr/lib64/glusterfs/3.6.0.28/xlator/cluster/distribute.so(dht_migrate_file+0x43f) [0x7f64d93f2f2f]))) 0-dict: dict is NULL
[2014-09-10 11:03:01.093454] E [dht-rebalance.c:1574:gf_defrag_migrate_data] 0-Volume0_Scale-dht: /fuse_etc.18/rc.d/init.d/ipmievd: failed to get trusted.distribute.linkinfo key - Stale file handle
[2014-09-10 11:03:01.104883] I [dht-rebalance.c:878:dht_migrate_file] 0-Volume0_Scale-dht: /fuse_etc.18/rc.d/init.d/killall: attempting to move from Volume0_Scale-replicate-0 to Volume0_Scale-replicate-2
[2014-09-10 11:03:01.105958] W [dict.c:480:dict_unref] (-->/lib64/libc.so.6() [0x39a3c43bf0] (-->/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12) [0x31fd45ba62] (-->/usr/lib64/glusterfs/3.6.0.28/xlator/cluster/distribute.so(dht_migrate_file+0x2a6) [0x7f64d93f2d96]))) 0-dict: dict is NULL
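A related virtual xattr, trusted.glusterfs.pathinfo (which also shows up in the later log excerpt), can be queried from the FUSE mount to see which brick currently serves a given file; a minimal sketch, with the mount point path being a placeholder:

```sh
# Ask the client which brick(s) currently hold the file;
# /mnt/Volume0_Scale is a placeholder for the actual FUSE mount point.
getfattr -n trusted.glusterfs.pathinfo -e text /mnt/Volume0_Scale/fuse_etc.18/rc.d/init.d/ipmievd
```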
What were the operations that were run on the mount point while rebalance was running? If it is unlink or rmdir, it is highly likely that this and bz 1151384 are duplicates of each other.

(In reply to Raghavendra G from comment #5)
> What were the operations that were run on mount-point while rebalance was
> running? If it is unlink or rmdir, its highly likely that this and bz
> 1151384 are duplicates of each other.

AFAIK no I/O was done during rebalance; data creation was done before starting rebalance.

Did not see the error messages in the latest build on any nodes. Marking the bug verified.
Description of problem:
Migration of some files fails with the error "failed to get trusted.distribute.linkinfo key - Stale file handle".

Version-Release number of selected component (if applicable):
[root@rhs-client4 d0]# rpm -qa| grep glus
gluster-swift-object-1.10.0-2.el6rhs.noarch
glusterfs-libs-3.6.0.3-1.el6rhs.x86_64
glusterfs-cli-3.6.0.3-1.el6rhs.x86_64
gluster-swift-1.10.0-2.el6rhs.noarch
gluster-nagios-common-0.1.0-26.git2b35b66.el6rhs.x86_64
gluster-swift-proxy-1.10.0-2.el6rhs.noarch
gluster-swift-account-1.10.0-2.el6rhs.noarch
glusterfs-3.6.0.3-1.el6rhs.x86_64
glusterfs-api-3.6.0.3-1.el6rhs.x86_64
glusterfs-server-3.6.0.3-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.3-1.el6rhs.x86_64
gluster-swift-container-1.10.0-2.el6rhs.noarch
gluster-swift-plugin-1.10.0-5.el6rhs.noarch
gluster-nagios-addons-0.1.0-57.git9d252a3.el6rhs.x86_64
samba-glusterfs-3.6.9-168.1.el6rhs.x86_64
vdsm-gluster-4.14.5-21.git7a3d0f0.el6rhs.noarch
glusterfs-fuse-3.6.0.3-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.3-1.el6rhs.x86_64

How reproducible:

Steps to Reproduce:
1. Create a distribute volume of 2 bricks
2. Populate the mount point with decent directory depths and many files
3. Add a brick and start rebalance

Actual results:
Migration fails for some of the files.

[root@rhs-client4 mnt]# gluster v rebalance dist status

| Node | Rebalanced-files | size | scanned | failures | skipped | status | run time in secs |
|---|---|---|---|---|---|---|---|
| localhost | 18501 | 1.5GB | 113426 | 9 | 0 | completed | 1722.00 |
| rhs-client39.lab.eng.blr.redhat.com | 15116 | 432.1MB | 114205 | 7 | 0 | completed | 1722.00 |

volume rebalance: dist: success:

[root@rhs-client4 mnt]# gluster v info dist
Volume Name: dist
Type: Distribute
Volume ID: 83ee6103-00e8-48de-9d62-f4277becb93b
Status: Started
Snap Volume: no
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/d0
Brick2: rhs-client39.lab.eng.blr.redhat.com:/home/d1
Brick3: rhs-client4.lab.eng.blr.redhat.com:/home/d2

Error message from rebalance log
-------------------------------
[2014-05-19 11:03:09.206607] E [dht-common.c:1939:dht_vgetxattr_cbk] 0-dist-dht: Subvolume dist-client-1 returned -1 (Stale file handle)
[2014-05-19 11:03:09.206643] E [dht-rebalance.c:1323:gf_defrag_migrate_data] 0-dist-dht: Failed to get node-uuid for /dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/
[2014-05-19 11:21:44.475811] W [client-rpc-fops.c:1087:client3_3_getxattr_cbk] 0-dist-client-2: remote operation failed: Stale file handle. Path: /dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/autofs (920af359-1122-4c5a-b208-99351f15525c). Key: trusted.glusterfs.pathinfo
[2014-05-19 11:21:44.475973] E [dht-rebalance.c:1376:gf_defrag_migrate_data] 0-dist-dht: /dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/autofs: failed to get trusted.distribute.linkinfo key - Stale file handle

xattrs from the bricks
---------------------

BRICK0
-------
[root@rhs-client4 d0]# getfattr -d -m . -e hex ../d*/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
# file: ../d0/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
trusted.gfid=0x269b8041f7914c7c8ad86ed616df04b8
trusted.glusterfs.dht=0x00000001000000000000000055555554

BRICK2
----------
# file: ../d2/dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
trusted.gfid=0x269b8041f7914c7c8ad86ed616df04b8
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9

BRICK1
------------
[root@rhs-client39 d1]# getfattr -d -m . -e hex dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
# file: dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/dir9/dir10/dir11/dir12/dir13/dir14/dir15/dir16/dir17/dir18/dir19/dir20/dir21/dir22/dir23/dir24/dir25/dir26/dir27/dir28/dir29/dir30/dir31/dir32/dir33/dir34/dir35/dir36/dir37/dir38/dir39/dir40/dir41/dir42/dir43/dir44/dir45/dir46/dir47/dir48/dir49/dir50/rc.d/init.d/
trusted.gfid=0x269b8041f7914c7c8ad86ed616df04b8
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff

attached the sosreports
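For context, the trusted.glusterfs.dht values above are the per-brick hash ranges for the directory. Assuming the usual encoding of four big-endian 32-bit fields (count, hash type, range start, range end), a small sketch to break one value apart:

```sh
# Split a trusted.glusterfs.dht value into its four 32-bit fields.
# Field order (count, hash type, range start, range end) is an assumption
# about the common on-disk format, not taken from this bug report.
layout=00000001000000000000000055555554   # value from BRICK0, without the 0x prefix
echo "count=0x${layout:0:8} type=0x${layout:8:8} start=0x${layout:16:8} end=0x${layout:24:8}"
# Read this way, the three bricks cover 0x00000000-0x55555554,
# 0x55555555-0xaaaaaaa9 and 0xaaaaaaaa-0xffffffff, i.e. a complete layout.
```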