Bug 1237059
| Summary: | DHT-rebalance: rebalance status shows failed when replica pair bricks are brought down in distrep volume while re-name of files going on | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Triveni Rao <trao> | |
| Component: | distribute | Assignee: | Susant Kumar Palai <spalai> | |
| Status: | CLOSED ERRATA | QA Contact: | RajeshReddy <rmekala> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | rhgs-3.1 | CC: | annair, asriram, asrivast, bmohanra, byarlaga, mzywusko, nbalacha, rgowdapp, rhs-bugs, rmekala, sankarshan, sashinde, shmohan, smohan, spalai | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | RHGS 3.1.2 | Flags: | bmohanra: needinfo+ | |
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.7.5-5 | Doc Type: | Bug Fix | |
| Doc Text: | Previously, the rebalance status showed failed when the replica pair bricks were brought down in a distributed replicated volume. With this fix, rebalance will skip the error affected directory and continue with other directories. The rebalance process on a distributed-replicated volume will not be aborted even if a brick from a replica pair goes down. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1257076 1319592 (view as bug list) | Environment: | ||
| Last Closed: | 2016-03-01 05:27:38 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1216951, 1243815, 1257076, 1260783, 1318196 | |||
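The Doc Text in the table above describes the fixed behavior as skipping the error-affected directory and continuing with the others. The following is a minimal Python sketch of that skip-and-continue pattern. It is an illustration only and is not the DHT C code from the upstream patch: `DirectoryError`, `RebalanceResult`, `rebalance`, and `demo_migrate` are hypothetical names, and the directory names are borrowed from the logs in this report.

```python
from dataclasses import dataclass, field


class DirectoryError(Exception):
    """Hypothetical stand-in for a per-directory failure (e.g. EBADFD on readdir)."""


@dataclass
class RebalanceResult:
    processed: list = field(default_factory=list)
    skipped: list = field(default_factory=list)


def rebalance(directories, migrate_dir):
    """Process every directory; record a failing one and move on instead of aborting."""
    result = RebalanceResult()
    for directory in directories:
        try:
            migrate_dir(directory)
        except DirectoryError:
            # Skip only the affected directory; the loop continues with the rest.
            result.skipped.append(directory)
            continue
        result.processed.append(directory)
    return result


def demo_migrate(path):
    # Assumption for the demo: /x11 is the directory whose readdir fails.
    if path == "/x11":
        raise DirectoryError(path)


result = rebalance(["/a9", "/x11", "/a14"], demo_migrate)
print("processed:", result.processed)
print("skipped:", result.skipped)
```

With the pre-fix behavior reported here, the equivalent of the `except` branch would end the whole run and mark it failed.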
I tried the scenario with a single brick down and could reproduce the problem on the volume.
Volume Name: kit
Type: Distributed-Replicate
Volume ID: ae805fc4-45c2-4d80-94e8-ce50336bc3c4
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.57:/rhs/brick1/s0
Brick2: 10.70.35.136:/rhs/brick1/s0
Brick3: 10.70.35.57:/rhs/brick2/s0
Brick4: 10.70.35.136:/rhs/brick2/s0
Options Reconfigured:
performance.readdir-ahead: on
[root@casino-vm1 ~]#
[root@casino-vm1 ~]#
[root@casino-vm1 ~]# gluster v status kit
Status of volume: kit
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/s0 49218 0 Y 26079
Brick 10.70.35.136:/rhs/brick1/s0 49196 0 Y 15477
Brick 10.70.35.57:/rhs/brick2/s0 49219 0 Y 26097
Brick 10.70.35.136:/rhs/brick2/s0 49197 0 Y 15495
NFS Server on localhost 2049 0 Y 26116
Self-heal Daemon on localhost N/A N/A Y 26124
NFS Server on 10.70.35.136 2049 0 Y 15514
Self-heal Daemon on 10.70.35.136 N/A N/A Y 15522
Task Status of Volume kit
------------------------------------------------------------------------------
There are no active volume tasks
[root@casino-vm1 ~]# gluster v add-brick kit 10.70.35.57:/rhs/brick4/s0 10.70.35.136:/rhs/brick4/s0
volume add-brick: success
[root@casino-vm1 ~]# gluster v rebalance kit start force
volume rebalance: kit: success: Rebalance on kit has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 6062fa7e-094b-48b9-9885-9a0ccb75e668
[root@casino-vm1 ~]# gluster v rebalance kit status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 116 0Bytes 762 0 0 in progress 4.00
10.70.35.136 0 0Bytes 0 0 0 in progress 4.00
volume rebalance: kit: success:
[root@casino-vm1 ~]# kill -9 26079
[root@casino-vm1 ~]# gluster v status kit
Status of volume: kit
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/s0 N/A N/A N N/A
Brick 10.70.35.136:/rhs/brick1/s0 49196 0 Y 15477
Brick 10.70.35.57:/rhs/brick2/s0 49219 0 Y 26097
Brick 10.70.35.136:/rhs/brick2/s0 49197 0 Y 15495
Brick 10.70.35.57:/rhs/brick4/s0 49220 0 Y 26498
Brick 10.70.35.136:/rhs/brick4/s0 49198 0 Y 15878
NFS Server on localhost 2049 0 Y 26517
Self-heal Daemon on localhost N/A N/A Y 26525
NFS Server on 10.70.35.136 2049 0 Y 15897
Self-heal Daemon on 10.70.35.136 N/A N/A Y 15905
Task Status of Volume kit
------------------------------------------------------------------------------
Task : Rebalance
ID : 6062fa7e-094b-48b9-9885-9a0ccb75e668
Status : failed
[root@casino-vm1 ~]# gluster v rebalance kit status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 649 0Bytes 2222 2 2 failed 17.00
10.70.35.136 0 0Bytes 0 0 0 completed 7.00
volume rebalance: kit: success:
[root@casino-vm1 ~]#
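When reproducing this, it can help to check the per-node status programmatically rather than by eye. The sketch below parses the rebalance status table shown above and lists the nodes reporting `failed`. It assumes the whitespace-separated layout seen in this transcript (real CLI output is column-aligned and may differ between versions); `NodeStatus` and `parse_rebalance_status` are hypothetical helper names.

```python
from typing import NamedTuple


class NodeStatus(NamedTuple):
    node: str
    rebalanced_files: int
    size: str
    scanned: int
    failures: int
    skipped: int
    status: str
    run_time_secs: float


def parse_rebalance_status(text):
    """Return one NodeStatus per data row; header and footer lines are ignored."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows have at least 8 tokens and a numeric second column.
        if len(tokens) < 8 or not tokens[1].isdigit():
            continue
        rows.append(NodeStatus(
            node=tokens[0],
            rebalanced_files=int(tokens[1]),
            size=tokens[2],
            scanned=int(tokens[3]),
            failures=int(tokens[4]),
            skipped=int(tokens[5]),
            # The status may be more than one word, e.g. "in progress".
            status=" ".join(tokens[6:-1]),
            run_time_secs=float(tokens[-1]),
        ))
    return rows


SAMPLE = """\
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 649 0Bytes 2222 2 2 failed 17.00
10.70.35.136 0 0Bytes 0 0 0 completed 7.00
volume rebalance: kit: success:
"""

rows = parse_rebalance_status(SAMPLE)
failed_nodes = [row.node for row in rows if row.status == "failed"]
print(failed_nodes)
```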
Log messages:
[root@casino-vm1 ~]# less /var/log/glusterfs/kit-rebalance.log
[2015-06-30 03:52:02.022740] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.1 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/kit --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=38c8a3d7-e190-4d7b-9a79-f7bbac5e146b --xlator-option *dht.commit-hash=2895155977 --socket-file /var/run/gluster/gluster-rebalance-ae805fc4-45c2-4d80-94e8-ce50336bc3c4.sock --pid-file /var/lib/glusterd/vols/kit/rebalance/38c8a3d7-e190-4d7b-9a79-f7bbac5e146b.pid -l /var/log/glusterfs/kit-rebalance.log)
[2015-06-30 03:52:02.038890] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-06-30 03:52:07.042428] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'commit-hash' for volume 'kit-dht' with value '2895155977'
[2015-06-30 03:52:07.042448] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'node-uuid' for volume 'kit-dht' with value '38c8a3d7-e190-4d7b-9a79-f7bbac5e146b'
[2015-06-30 03:52:07.042458] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'rebalance-cmd' for volume 'kit-dht' with value '5'
[2015-06-30 03:52:07.042467] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'readdir-optimize' for volume 'kit-dht' with value 'on'
[2015-06-30 03:52:07.042477] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'assert-no-child-down' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042486] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'lookup-unhashed' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042495] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'use-readdirp' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042505] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'readdir-failover' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042515] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'entry-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042524] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'metadata-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042533] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'data-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042543] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-1: adding option 'readdir-failover' for volume 'ki...skipping...
[2015-06-30 03:52:29.762368] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b36 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.766353] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b37: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.808333] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b10 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.813728] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b11: attempting to move from kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.819245] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b37 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.826413] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b51: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.874437] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b11 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.878284] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b13: attempting to move from kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.883269] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b51 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.890494] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b56: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.908997] W [MSGID: 114061] [client-rpc-fops.c:5987:client3_3_readdirp] 0-kit-client-0: (0cb784e4-4545-4f7c-b2ad-c05cfb8b072e) remote_fd is -1. EBADFD [File descriptor in bad state]
[2015-06-30 03:52:29.909498] E [MSGID: 109021] [dht-rebalance.c:1872:gf_defrag_get_entry] 0-kit-dht: /x11: Migrate data failed: Readdir returned File descriptor in bad state. Aborting migrate-data
[2015-06-30 03:52:29.909513] I [dht-rebalance.c:2289:gf_defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2015-06-30 03:52:29.909681] E [MSGID: 109016] [dht-rebalance.c:2550:gf_defrag_fix_layout] 0-kit-dht: Fix layout failed for /x11
[2015-06-30 03:52:29.909912] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2015-06-30 03:52:29.910002] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2015-06-30 03:52:29.930303] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b13 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.933258] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b56 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.934002] I [MSGID: 109028] [dht-rebalance.c:3029:gf_defrag_status_get] 0-kit-dht: Rebalance is failed. Time taken is 17.00 secs
[2015-06-30 03:52:29.934024] I [MSGID: 109028] [dht-rebalance.c:3033:gf_defrag_status_get] 0-kit-dht: Files migrated: 649, size: 0, lookups: 2222, failures: 2, skipped: 2
[2015-06-30 03:52:29.934153] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down
(END)
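The failure is easiest to spot by filtering the rebalance log for `E` (error) entries. The sketch below does that with a regular expression built from the line layout visible in the log above (timestamp, level, optional MSGID, source location, translator name, message). The pattern is an assumption derived from these sample lines, not an official log grammar, and `LOG_LINE` and `parse_errors` are hypothetical names.

```python
import re

# Layout assumed from the sample lines in this report.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<source>[^\]]+)\] "
    r"(?P<xlator>\S+): "
    r"(?P<message>.*)$"
)


def parse_errors(text):
    """Return a dict for every log line whose level is E."""
    errors = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "E":
            errors.append(match.groupdict())
    return errors


SAMPLE = """\
[2015-06-30 03:52:29.890494] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b56: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.908997] W [MSGID: 114061] [client-rpc-fops.c:5987:client3_3_readdirp] 0-kit-client-0: (0cb784e4-4545-4f7c-b2ad-c05cfb8b072e) remote_fd is -1. EBADFD [File descriptor in bad state]
[2015-06-30 03:52:29.909498] E [MSGID: 109021] [dht-rebalance.c:1872:gf_defrag_get_entry] 0-kit-dht: /x11: Migrate data failed: Readdir returned File descriptor in bad state. Aborting migrate-data
[2015-06-30 03:52:29.909681] E [MSGID: 109016] [dht-rebalance.c:2550:gf_defrag_fix_layout] 0-kit-dht: Fix layout failed for /x11
"""

errors = parse_errors(SAMPLE)
for entry in errors:
    print(entry["timestamp"], entry["msgid"], entry["message"])
```

On the excerpt above this isolates the two entries for `/x11`: the readdir failure in `gf_defrag_get_entry` and the resulting fix-layout failure.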
Doc text is edited. Please sign off to be included in Known Issues.

Sorry, I am providing qe-ack again.

Upstream patch: http://review.gluster.org/#/c/12013/

As this is yet to be merged upstream, moving this bug back to POST.

Tested with build glusterfs-server-3.7.5-5. After killing the brick process of one brick of a replica pair, rebalance does not continue from the other replica:
[root@rhs-client19 glusterfs]# gluster vol rebalance dht1x2 status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 0 0Bytes 0 0 0 completed 1.00
rhs-client18.lab.eng.blr.redhat.com 625 2.7KB 1519 1 0 completed 25.00
volume rebalance: dht1x2: success:
[root@rhs-client19 glusterfs]#
Out of 15k files, it scanned only 1519. With all brick processes up and running, I re-initiated the rebalance, and this time it scanned 15k files:
[root@rhs-client19 glusterfs]# gluster vol rebalance dht1x2 status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 2342 12.3KB 6421 0 0 completed 189.00
rhs-client18.lab.eng.blr.redhat.com 3028 15.9KB 8990 0 0 completed 217.00
volume rebalance: dht1x2: success:
Tested with multiple directories. If one brick process of any replica pair is down, rebalance of the other directories proceeds fine. If the volume has only one directory, rebalance cannot continue with the other replica, which is expected, so marking this as verified.

*** Bug 1064481 has been marked as a duplicate of this bug. ***

Hi Susant, the doc text is edited. Please sign off on it if it looks OK.

Is there a way I can see the doc text I submitted? I just want to verify, because if I remember correctly there was no reference to rename in it.

Updated the doc text based on my discussion with Susant.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html
Document URL: DHT-rebalance: rebalance status shows failed when replica pair bricks are brought down in distrep volume while re-name of files going on

Description: DHT-rebalance: rebalance status shows failed when all the replica pair bricks are brought down in a distrep volume while renaming of files is going on.

Steps:
==========
1. I have a distrep 6x2 volume, FUSE mounted. After adding bricks to the volume, I kicked off the rebalance and also started renaming files on the mount point.
2. While the rebalance was running, I brought down one brick of each of the 6 replica pairs on the volume.
3. Rebalance status showed failed on node1 and completed on node2, even though the rename was still going on.

Expected result:
================
Rebalance status should not show failed while the replica pair is still available to serve the data.

Actual result:
=================
Rebalance status failed.

Output:
========
[root@casino-vm1 ~]# gluster v info tester
Volume Name: tester
Type: Distributed-Replicate
Volume ID: 251d0979-9e43-42ff-82bd-ca9d5a77aef6
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: 10.70.35.57:/rhs/brick1/p0
Brick2: 10.70.35.136:/rhs/brick1/p0
Brick3: 10.70.35.57:/rhs/brick2/p0
Brick4: 10.70.35.136:/rhs/brick2/p0
Brick5: 10.70.35.57:/rhs/brick3/p0
Brick6: 10.70.35.136:/rhs/brick3/p0
Brick7: 10.70.35.57:/rhs/brick3/p1
Brick8: 10.70.35.136:/rhs/brick3/p1
Brick9: 10.70.35.57:/rhs/brick3/p2
Brick10: 10.70.35.136:/rhs/brick3/p2
Options Reconfigured:
performance.readdir-ahead: on
[root@casino-vm1 ~]# gluster v add-brick tester 10.70.35.57:/rhs/brick4/p2 10.70.35.136:/rhs/brick4/p2
volume add-brick: success
[root@casino-vm1 ~]# gluster v rebalance tester start force
volume rebalance: tester: success: Rebalance on tester has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 8d196476-112e-4f34-af0e-8233d106602e
[root@casino-vm1 ~]# gluster v rebalance tester status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 288 0Bytes 1727 0 0 in progress 10.00
10.70.35.136 0 0Bytes 0 0 0 in progress 10.00
volume rebalance: tester: success:
[root@casino-vm1 ~]#
[root@casino-vm1 ~]# gluster v status tester
Status of volume: tester
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/p0 49212 0 Y 20970
Brick 10.70.35.136:/rhs/brick1/p0 49190 0 Y 11418
Brick 10.70.35.57:/rhs/brick2/p0 49213 0 Y 20988
Brick 10.70.35.136:/rhs/brick2/p0 49191 0 Y 11436
Brick 10.70.35.57:/rhs/brick3/p0 49214 0 Y 21348
Brick 10.70.35.136:/rhs/brick3/p0 49192 0 Y 11760
Brick 10.70.35.57:/rhs/brick3/p1 49215 0 Y 22923
Brick 10.70.35.136:/rhs/brick3/p1 49193 0 Y 12958
Brick 10.70.35.57:/rhs/brick3/p2 49216 0 Y 23425
Brick 10.70.35.136:/rhs/brick3/p2 49194 0 Y 13554
Brick 10.70.35.57:/rhs/brick4/p2 49217 0 Y 24200
Brick 10.70.35.136:/rhs/brick4/p2 49195 0 Y 13843
NFS Server on localhost 2049 0 Y 24219
Self-heal Daemon on localhost N/A N/A Y 24227
NFS Server on 10.70.35.136 2049 0 Y 13863
Self-heal Daemon on 10.70.35.136 N/A N/A Y 13871
Task Status of Volume tester
------------------------------------------------------------------------------
Task : Rebalance
ID : 8d196476-112e-4f34-af0e-8233d106602e
Status : in progress
[root@casino-vm1 ~]# kill -9 20970
[root@casino-vm1 ~]# kill -9 20988
[root@casino-vm1 ~]# kill -9 21348
[root@casino-vm1 ~]# kill -9 22923
[root@casino-vm1 ~]# kill -9 23425
[root@casino-vm1 ~]# kill -9 24200
[root@casino-vm1 ~]#
[root@casino-vm1 ~]# gluster v status tester
Status of volume: tester
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/p0 N/A N/A N N/A
Brick 10.70.35.136:/rhs/brick1/p0 49190 0 Y 11418
Brick 10.70.35.57:/rhs/brick2/p0 N/A N/A N N/A
Brick 10.70.35.136:/rhs/brick2/p0 49191 0 Y 11436
Brick 10.70.35.57:/rhs/brick3/p0 N/A N/A N N/A
Brick 10.70.35.136:/rhs/brick3/p0 49192 0 Y 11760
Brick 10.70.35.57:/rhs/brick3/p1 N/A N/A N N/A
Brick 10.70.35.136:/rhs/brick3/p1 49193 0 Y 12958
Brick 10.70.35.57:/rhs/brick3/p2 N/A N/A N N/A
Brick 10.70.35.136:/rhs/brick3/p2 49194 0 Y 13554
Brick 10.70.35.57:/rhs/brick4/p2 N/A N/A N N/A
Brick 10.70.35.136:/rhs/brick4/p2 49195 0 Y 13843
NFS Server on localhost 2049 0 Y 24219
Self-heal Daemon on localhost N/A N/A Y 24227
NFS Server on 10.70.35.136 2049 0 Y 13863
Self-heal Daemon on 10.70.35.136 N/A N/A Y 13871
Task Status of Volume tester
------------------------------------------------------------------------------
Task : Rebalance
ID : 8d196476-112e-4f34-af0e-8233d106602e
Status : failed
[root@casino-vm1 ~]#
[root@casino-vm1 ~]#
[root@casino-vm1 ~]# gluster v rebalance tester status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 1122 0Bytes 3081 2 2 failed 32.00
10.70.35.136 0 0Bytes 0 0 0 completed 11.00
volume rebalance: tester: success:
[root@casino-vm1 ~]# gluster v rebalance tester status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 1122 0Bytes 3081 2 2 failed 32.00
10.70.35.136 0 0Bytes 0 0 0 completed 11.00
volume rebalance: tester: success:
[root@casino-vm1 ~]# gluster v info tester
Volume Name: tester
Type: Distributed-Replicate
Volume ID: 251d0979-9e43-42ff-82bd-ca9d5a77aef6
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.57:/rhs/brick1/p0
Brick2: 10.70.35.136:/rhs/brick1/p0
Brick3: 10.70.35.57:/rhs/brick2/p0
Brick4: 10.70.35.136:/rhs/brick2/p0
Brick5: 10.70.35.57:/rhs/brick3/p0
Brick6: 10.70.35.136:/rhs/brick3/p0
Brick7: 10.70.35.57:/rhs/brick3/p1
Brick8: 10.70.35.136:/rhs/brick3/p1
Brick9: 10.70.35.57:/rhs/brick3/p2
Brick10: 10.70.35.136:/rhs/brick3/p2
Brick11: 10.70.35.57:/rhs/brick4/p2
Brick12: 10.70.35.136:/rhs/brick4/p2
Options Reconfigured:
performance.readdir-ahead: on
[root@casino-vm1 ~]#
[root@casino-vm1 ~]# less /var/log/glusterfs/tester-rebalance.log
[2015-06-30 00:30:51.556515] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.1 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/tester --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=38c8a3d7-e190-4d7b-9a79-f7bbac5e146b --xlator-option *dht.commit-hash=2895059420 --socket-file /var/run/gluster/gluster-rebalance-251d0979-9e43-42ff-82bd-ca9d5a77aef6.sock --pid-file /var/lib/glusterd/vols/tester/rebalance/38c8a3d7-e190-4d7b-9a79-f7bbac5e146b.pid -l /var/log/glusterfs/tester-rebalance.log)
[2015-06-30 00:30:51.570006] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-06-30 00:30:56.561424] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'commit-hash' for volume 'tester-dht' with value '2895059420'
[2015-06-30 00:30:56.561454] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'node-uuid' for volume 'tester-dht' with value '38c8a3d7-e190-4d7b-9a79-f7bbac5e146b'
[2015-06-30 00:30:56.561472] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'rebalance-cmd' for volume 'tester-dht' with value '5'
[2015-06-30 00:30:56.561488] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'readdir-optimize' for volume 'tester-dht' with value 'on'
[2015-06-30 00:30:56.561507] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'assert-no-child-down' for volume 'tester-dht' with value 'yes'
[2015-06-30 00:30:56.561524] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'lookup-unhashed' for volume 'tester-dht' with value 'yes'
[2015-06-30 00:30:56.561541] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'use-readdirp' for volume 'tester-dht' with value 'yes'
[2015-06-30 00:30:56.561560] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-2: adding option 'readdir-failover' for volume 'tester-replicate-2' with value 'off'
[2015-06-30 00:30:56.561577] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-2: adding option 'entry-self-heal' for volume 'tester-replicate-2' with value 'off'
[2015-06-30 00:30:56.561594] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-2: adding option 'metadata-self-heal' for volume 'tester-replicate-2' with value 'off'
[2015-06-30 00:30:56.561615] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-2: adding option 'data-self-heal' for volume 'tester-replicate-2' with value 'off'
[2015-06-30 00:30:56.561635] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-1: adding option 'readdir-failover' for volume
...skipping...
ster-replicate-3 to tester-replicate-5
[2015-06-30 01:59:44.288123] I [dht-rebalance.c:1002:dht_migrate_file] 0-tester-dht: /a14/b1: attempting to move from tester-replicate-2 to tester-replicate-0
[2015-06-30 01:59:44.313758] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /x13/y94 from subvolume tester-replicate-3 to tester-replicate-5
[2015-06-30 01:59:44.317638] I [dht-rebalance.c:1002:dht_migrate_file] 0-tester-dht: /a14/b18: attempting to move from tester-replicate-3 to tester-replicate-2
[2015-06-30 01:59:44.333657] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /a14/b1 from subvolume tester-replicate-2 to tester-replicate-0
[2015-06-30 01:59:44.337874] I [dht-rebalance.c:1002:dht_migrate_file] 0-tester-dht: /a14/b8: attempting to move from tester-replicate-0 to tester-replicate-5
[2015-06-30 01:59:44.355379] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /a14/b18 from subvolume tester-replicate-3 to tester-replicate-2
[2015-06-30 01:59:44.361126] I [dht-rebalance.c:1002:dht_migrate_file] 0-tester-dht: /a14/b12: attempting to move from tester-replicate-4 to tester-replicate-2
[2015-06-30 01:59:44.363052] W [MSGID: 114061] [client-rpc-fops.c:5987:client3_3_readdirp] 0-tester-client-0: (56659ea2-b5ad-4eb2-8712-08658a9ed3ed) remote_fd is -1. EBADFD [File descriptor in bad state]
[2015-06-30 01:59:44.363522] E [MSGID: 109021] [dht-rebalance.c:1872:gf_defrag_get_entry] 0-tester-dht: /x15: Migrate data failed: Readdir returned File descriptor in bad state. Aborting migrate-data
[2015-06-30 01:59:44.363538] I [dht-rebalance.c:2289:gf_defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2015-06-30 01:59:44.363819] E [MSGID: 109016] [dht-rebalance.c:2550:gf_defrag_fix_layout] 0-tester-dht: Fix layout failed for /x15
[2015-06-30 01:59:44.364234] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2015-06-30 01:59:44.364509] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2015-06-30 01:59:44.385508] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /a14/b8 from subvolume tester-replicate-0 to tester-replicate-5
[2015-06-30 01:59:44.391952] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /a14/b12 from subvolume tester-replicate-4 to tester-replicate-2
[2015-06-30 01:59:44.393227] I [MSGID: 109028] [dht-rebalance.c:3029:gf_defrag_status_get] 0-tester-dht: Rebalance is failed. Time taken is 32.00 secs
[2015-06-30 01:59:44.393260] I [MSGID: 109028] [dht-rebalance.c:3033:gf_defrag_status_get] 0-tester-dht: Files migrated: 1122, size: 0, lookups: 3081, failures: 2, skipped: 2
[2015-06-30 01:59:44.393492] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down