DHT-rebalance: rebalance status shows failed when replica-pair bricks are brought down in a distrep volume while renames of files are going on

Description:
Rebalance status shows "failed" when one brick of each replica pair is brought down in a distributed-replicate volume while renames of files are in progress.

Steps:
==========
1. Created a distributed-replicate 5x2 volume and FUSE mounted it. After adding a brick pair to the volume (making it 6x2), kicked off rebalance and also started renaming files on the mount point.
2. While rebalance was in progress, brought down one brick from each of the 6 replica pairs (all the bricks on node1).
3. Rebalance status showed "failed" on node1 and "completed" on node2, even though the renames were still going on.

Expected result:
================
Rebalance status should not show "failed", since one brick of each replica pair is still available to serve the data.

Actual result:
=================
Rebalance status shows "failed".

Output:
========
[root@casino-vm1 ~]# gluster v info tester

Volume Name: tester
Type: Distributed-Replicate
Volume ID: 251d0979-9e43-42ff-82bd-ca9d5a77aef6
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: 10.70.35.57:/rhs/brick1/p0
Brick2: 10.70.35.136:/rhs/brick1/p0
Brick3: 10.70.35.57:/rhs/brick2/p0
Brick4: 10.70.35.136:/rhs/brick2/p0
Brick5: 10.70.35.57:/rhs/brick3/p0
Brick6: 10.70.35.136:/rhs/brick3/p0
Brick7: 10.70.35.57:/rhs/brick3/p1
Brick8: 10.70.35.136:/rhs/brick3/p1
Brick9: 10.70.35.57:/rhs/brick3/p2
Brick10: 10.70.35.136:/rhs/brick3/p2
Options Reconfigured:
performance.readdir-ahead: on

[root@casino-vm1 ~]# gluster v add-brick tester 10.70.35.57:/rhs/brick4/p2 10.70.35.136:/rhs/brick4/p2
volume add-brick: success

[root@casino-vm1 ~]# gluster v rebalance tester start force
volume rebalance: tester: success: Rebalance on tester has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 8d196476-112e-4f34-af0e-8233d106602e

[root@casino-vm1 ~]# gluster v rebalance tester status
Node           Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------      -----------        ------   -------   --------   -------   -----------   ----------------
localhost      288                0Bytes   1727      0          0         in progress   10.00
10.70.35.136   0                  0Bytes   0         0          0         in progress   10.00
volume rebalance: tester: success:
[root@casino-vm1 ~]#

[root@casino-vm1 ~]# gluster v status tester
Status of volume: tester
Gluster process                             TCP Port   RDMA Port   Online   Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/p0            49212      0           Y        20970
Brick 10.70.35.136:/rhs/brick1/p0           49190      0           Y        11418
Brick 10.70.35.57:/rhs/brick2/p0            49213      0           Y        20988
Brick 10.70.35.136:/rhs/brick2/p0           49191      0           Y        11436
Brick 10.70.35.57:/rhs/brick3/p0            49214      0           Y        21348
Brick 10.70.35.136:/rhs/brick3/p0           49192      0           Y        11760
Brick 10.70.35.57:/rhs/brick3/p1            49215      0           Y        22923
Brick 10.70.35.136:/rhs/brick3/p1           49193      0           Y        12958
Brick 10.70.35.57:/rhs/brick3/p2            49216      0           Y        23425
Brick 10.70.35.136:/rhs/brick3/p2           49194      0           Y        13554
Brick 10.70.35.57:/rhs/brick4/p2            49217      0           Y        24200
Brick 10.70.35.136:/rhs/brick4/p2           49195      0           Y        13843
NFS Server on localhost                     2049       0           Y        24219
Self-heal Daemon on localhost               N/A        N/A         Y        24227
NFS Server on 10.70.35.136                  2049       0           Y        13863
Self-heal Daemon on 10.70.35.136            N/A        N/A         Y        13871

Task Status of Volume tester
------------------------------------------------------------------------------
Task   : Rebalance
ID     : 8d196476-112e-4f34-af0e-8233d106602e
Status : in progress

[root@casino-vm1 ~]# kill -9 20970
[root@casino-vm1 ~]# kill -9 20988
[root@casino-vm1 ~]# kill -9 21348
[root@casino-vm1 ~]# kill -9 22923
[root@casino-vm1 ~]# kill -9 23425
[root@casino-vm1 ~]# kill -9 24200
[root@casino-vm1 ~]#

[root@casino-vm1 ~]# gluster v status tester
Status of volume: tester
Gluster process                             TCP Port   RDMA Port   Online   Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/p0            N/A        N/A         N        N/A
Brick 10.70.35.136:/rhs/brick1/p0           49190      0           Y        11418
Brick 10.70.35.57:/rhs/brick2/p0            N/A        N/A         N        N/A
Brick 10.70.35.136:/rhs/brick2/p0           49191      0           Y        11436
Brick 10.70.35.57:/rhs/brick3/p0            N/A        N/A         N        N/A
Brick 10.70.35.136:/rhs/brick3/p0           49192      0           Y        11760
Brick 10.70.35.57:/rhs/brick3/p1            N/A        N/A         N        N/A
Brick 10.70.35.136:/rhs/brick3/p1           49193      0           Y        12958
Brick 10.70.35.57:/rhs/brick3/p2            N/A        N/A         N        N/A
Brick 10.70.35.136:/rhs/brick3/p2           49194      0           Y        13554
Brick 10.70.35.57:/rhs/brick4/p2            N/A        N/A         N        N/A
Brick 10.70.35.136:/rhs/brick4/p2           49195      0           Y        13843
NFS Server on localhost                     2049       0           Y        24219
Self-heal Daemon on localhost               N/A        N/A         Y        24227
NFS Server on 10.70.35.136                  2049       0           Y        13863
Self-heal Daemon on 10.70.35.136            N/A        N/A         Y        13871

Task Status of Volume tester
------------------------------------------------------------------------------
Task   : Rebalance
ID     : 8d196476-112e-4f34-af0e-8233d106602e
Status : failed

[root@casino-vm1 ~]#
[root@casino-vm1 ~]# gluster v rebalance tester status
Node           Rebalanced-files   size     scanned   failures   skipped   status      run time in secs
---------      -----------        ------   -------   --------   -------   ---------   ----------------
localhost      1122               0Bytes   3081      2          2         failed      32.00
10.70.35.136   0                  0Bytes   0         0          0         completed   11.00
volume rebalance: tester: success:

[root@casino-vm1 ~]# gluster v rebalance tester status
Node           Rebalanced-files   size     scanned   failures   skipped   status      run time in secs
---------      -----------        ------   -------   --------   -------   ---------   ----------------
localhost      1122               0Bytes   3081      2          2         failed      32.00
10.70.35.136   0                  0Bytes   0         0          0         completed   11.00
volume rebalance: tester: success:

[root@casino-vm1 ~]# gluster v info tester

Volume Name: tester
Type: Distributed-Replicate
Volume ID: 251d0979-9e43-42ff-82bd-ca9d5a77aef6
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.57:/rhs/brick1/p0
Brick2: 10.70.35.136:/rhs/brick1/p0
Brick3:
10.70.35.57:/rhs/brick2/p0
Brick4: 10.70.35.136:/rhs/brick2/p0
Brick5: 10.70.35.57:/rhs/brick3/p0
Brick6: 10.70.35.136:/rhs/brick3/p0
Brick7: 10.70.35.57:/rhs/brick3/p1
Brick8: 10.70.35.136:/rhs/brick3/p1
Brick9: 10.70.35.57:/rhs/brick3/p2
Brick10: 10.70.35.136:/rhs/brick3/p2
Brick11: 10.70.35.57:/rhs/brick4/p2
Brick12: 10.70.35.136:/rhs/brick4/p2
Options Reconfigured:
performance.readdir-ahead: on

[root@casino-vm1 ~]#
[root@casino-vm1 ~]# less /var/log/glusterfs/tester-rebalance.log
[2015-06-30 00:30:51.556515] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.1 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/tester --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=38c8a3d7-e190-4d7b-9a79-f7bbac5e146b --xlator-option *dht.commit-hash=2895059420 --socket-file /var/run/gluster/gluster-rebalance-251d0979-9e43-42ff-82bd-ca9d5a77aef6.sock --pid-file /var/lib/glusterd/vols/tester/rebalance/38c8a3d7-e190-4d7b-9a79-f7bbac5e146b.pid -l /var/log/glusterfs/tester-rebalance.log)
[2015-06-30 00:30:51.570006] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-06-30 00:30:56.561424] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'commit-hash' for volume 'tester-dht' with value '2895059420'
[2015-06-30 00:30:56.561454] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'node-uuid' for volume 'tester-dht' with value '38c8a3d7-e190-4d7b-9a79-f7bbac5e146b'
[2015-06-30 00:30:56.561472] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'rebalance-cmd' for volume 'tester-dht' with value '5'
[2015-06-30 00:30:56.561488] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'readdir-optimize' for volume 'tester-dht' with value 'on'
[2015-06-30 00:30:56.561507] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'assert-no-child-down' for volume 'tester-dht' with value 'yes'
[2015-06-30 00:30:56.561524] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'lookup-unhashed' for volume 'tester-dht' with value 'yes'
[2015-06-30 00:30:56.561541] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-dht: adding option 'use-readdirp' for volume 'tester-dht' with value 'yes'
[2015-06-30 00:30:56.561560] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-2: adding option 'readdir-failover' for volume 'tester-replicate-2' with value 'off'
[2015-06-30 00:30:56.561577] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-2: adding option 'entry-self-heal' for volume 'tester-replicate-2' with value 'off'
[2015-06-30 00:30:56.561594] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-2: adding option 'metadata-self-heal' for volume 'tester-replicate-2' with value 'off'
[2015-06-30 00:30:56.561615] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-2: adding option 'data-self-heal' for volume 'tester-replicate-2' with value 'off'
[2015-06-30 00:30:56.561635] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-tester-replicate-1: adding option 'readdir-failover' for volume
...skipping...
ster-replicate-3 to tester-replicate-5
[2015-06-30 01:59:44.288123] I [dht-rebalance.c:1002:dht_migrate_file] 0-tester-dht: /a14/b1: attempting to move from tester-replicate-2 to tester-replicate-0
[2015-06-30 01:59:44.313758] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /x13/y94 from subvolume tester-replicate-3 to tester-replicate-5
[2015-06-30 01:59:44.317638] I [dht-rebalance.c:1002:dht_migrate_file] 0-tester-dht: /a14/b18: attempting to move from tester-replicate-3 to tester-replicate-2
[2015-06-30 01:59:44.333657] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /a14/b1 from subvolume tester-replicate-2 to tester-replicate-0
[2015-06-30 01:59:44.337874] I [dht-rebalance.c:1002:dht_migrate_file] 0-tester-dht: /a14/b8: attempting to move from tester-replicate-0 to tester-replicate-5
[2015-06-30 01:59:44.355379] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /a14/b18 from subvolume tester-replicate-3 to tester-replicate-2
[2015-06-30 01:59:44.361126] I [dht-rebalance.c:1002:dht_migrate_file] 0-tester-dht: /a14/b12: attempting to move from tester-replicate-4 to tester-replicate-2
[2015-06-30 01:59:44.363052] W [MSGID: 114061] [client-rpc-fops.c:5987:client3_3_readdirp] 0-tester-client-0: (56659ea2-b5ad-4eb2-8712-08658a9ed3ed) remote_fd is -1. EBADFD [File descriptor in bad state]
[2015-06-30 01:59:44.363522] E [MSGID: 109021] [dht-rebalance.c:1872:gf_defrag_get_entry] 0-tester-dht: /x15: Migrate data failed: Readdir returned File descriptor in bad state. Aborting migrate-data
[2015-06-30 01:59:44.363538] I [dht-rebalance.c:2289:gf_defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2015-06-30 01:59:44.363819] E [MSGID: 109016] [dht-rebalance.c:2550:gf_defrag_fix_layout] 0-tester-dht: Fix layout failed for /x15
[2015-06-30 01:59:44.364234] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2015-06-30 01:59:44.364509] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2015-06-30 01:59:44.385508] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /a14/b8 from subvolume tester-replicate-0 to tester-replicate-5
[2015-06-30 01:59:44.391952] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-tester-dht: completed migration of /a14/b12 from subvolume tester-replicate-4 to tester-replicate-2
[2015-06-30 01:59:44.393227] I [MSGID: 109028] [dht-rebalance.c:3029:gf_defrag_status_get] 0-tester-dht: Rebalance is failed. Time taken is 32.00 secs
[2015-06-30 01:59:44.393260] I [MSGID: 109028] [dht-rebalance.c:3033:gf_defrag_status_get] 0-tester-dht: Files migrated: 1122, size: 0, lookups: 3081, failures: 2, skipped: 2
[2015-06-30 01:59:44.393492] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down
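In the reproduction above, the PIDs of the node1 bricks were read off the `gluster v status` output by hand before being killed. That step can be sketched as a small helper; this is a hypothetical aid, with the column layout (last field = Pid, "Y" in the Online column) assumed from the status tables shown in this report:

```shell
#!/bin/sh
# Hypothetical helper: given `gluster v status <vol>` output on stdin, print
# the PIDs of online bricks hosted on a given address, so they can be killed
# to simulate the brick failure described above.
brick_pids() {
  host="$1"
  # Match brick rows for this host whose Online column is Y; Pid is the last field.
  awk -v h="Brick $host:" '$0 ~ ("^" h ".*Y") { print $NF }'
}

# Abridged sample rows, taken from the first status output in this report:
status='Brick 10.70.35.57:/rhs/brick1/p0            49212      0           Y        20970
Brick 10.70.35.136:/rhs/brick1/p0           49190      0           Y        11418
Brick 10.70.35.57:/rhs/brick2/p0            49213      0           Y        20988'

printf '%s\n' "$status" | brick_pids 10.70.35.57
```

The printed PIDs (here 20970 and 20988) could then be fed to `kill -9`, matching the manual steps in the transcript above.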
I just tried the single-brick-down scenario and could reproduce the problem on the following volume:

Volume Name: kit
Type: Distributed-Replicate
Volume ID: ae805fc4-45c2-4d80-94e8-ce50336bc3c4
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.57:/rhs/brick1/s0
Brick2: 10.70.35.136:/rhs/brick1/s0
Brick3: 10.70.35.57:/rhs/brick2/s0
Brick4: 10.70.35.136:/rhs/brick2/s0
Options Reconfigured:
performance.readdir-ahead: on

[root@casino-vm1 ~]#
[root@casino-vm1 ~]# gluster v status kit
Status of volume: kit
Gluster process                             TCP Port   RDMA Port   Online   Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/s0            49218      0           Y        26079
Brick 10.70.35.136:/rhs/brick1/s0           49196      0           Y        15477
Brick 10.70.35.57:/rhs/brick2/s0            49219      0           Y        26097
Brick 10.70.35.136:/rhs/brick2/s0           49197      0           Y        15495
NFS Server on localhost                     2049       0           Y        26116
Self-heal Daemon on localhost               N/A        N/A         Y        26124
NFS Server on 10.70.35.136                  2049       0           Y        15514
Self-heal Daemon on 10.70.35.136            N/A        N/A         Y        15522

Task Status of Volume kit
------------------------------------------------------------------------------
There are no active volume tasks

[root@casino-vm1 ~]# gluster v add-brick kit 10.70.35.57:/rhs/brick4/s0 10.70.35.136:/rhs/brick4/s0
volume add-brick: success
[root@casino-vm1 ~]# gluster v rebalance kit start force
volume rebalance: kit: success: Rebalance on kit has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 6062fa7e-094b-48b9-9885-9a0ccb75e668

[root@casino-vm1 ~]# gluster v rebalance kit status
Node           Rebalanced-files   size     scanned   failures   skipped   status        run time in secs
---------      -----------        ------   -------   --------   -------   -----------   ----------------
localhost      116                0Bytes   762       0          0         in progress   4.00
10.70.35.136   0                  0Bytes   0         0          0         in progress   4.00
volume rebalance: kit: success:

[root@casino-vm1 ~]# kill -9 26079

[root@casino-vm1 ~]# gluster v status kit
Status of volume: kit
Gluster process                             TCP Port   RDMA Port   Online   Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/s0            N/A        N/A         N        N/A
Brick 10.70.35.136:/rhs/brick1/s0           49196      0           Y        15477
Brick 10.70.35.57:/rhs/brick2/s0            49219      0           Y        26097
Brick 10.70.35.136:/rhs/brick2/s0           49197      0           Y        15495
Brick 10.70.35.57:/rhs/brick4/s0            49220      0           Y        26498
Brick 10.70.35.136:/rhs/brick4/s0           49198      0           Y        15878
NFS Server on localhost                     2049       0           Y        26517
Self-heal Daemon on localhost               N/A        N/A         Y        26525
NFS Server on 10.70.35.136                  2049       0           Y        15897
Self-heal Daemon on 10.70.35.136            N/A        N/A         Y        15905

Task Status of Volume kit
------------------------------------------------------------------------------
Task   : Rebalance
ID     : 6062fa7e-094b-48b9-9885-9a0ccb75e668
Status : failed

[root@casino-vm1 ~]# gluster v rebalance kit status
Node           Rebalanced-files   size     scanned   failures   skipped   status      run time in secs
---------      -----------        ------   -------   --------   -------   ---------   ----------------
localhost      649                0Bytes   2222      2          2         failed      17.00
10.70.35.136   0                  0Bytes   0         0          0         completed   7.00
volume rebalance: kit: success:
[root@casino-vm1 ~]#

Log messages:
[root@casino-vm1 ~]# less /var/log/glusterfs/kit-rebalance.log
[2015-06-30 03:52:02.022740] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.1 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/kit --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=38c8a3d7-e190-4d7b-9a79-f7bbac5e146b --xlator-option *dht.commit-hash=2895155977 --socket-file /var/run/gluster/gluster-rebalance-ae805fc4-45c2-4d80-94e8-ce50336bc3c4.sock --pid-file /var/lib/glusterd/vols/kit/rebalance/38c8a3d7-e190-4d7b-9a79-f7bbac5e146b.pid -l /var/log/glusterfs/kit-rebalance.log)
[2015-06-30 03:52:02.038890] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-06-30 03:52:07.042428] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'commit-hash' for volume 'kit-dht' with value '2895155977'
[2015-06-30 03:52:07.042448] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'node-uuid' for volume 'kit-dht' with value '38c8a3d7-e190-4d7b-9a79-f7bbac5e146b'
[2015-06-30 03:52:07.042458] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'rebalance-cmd' for volume 'kit-dht' with value '5'
[2015-06-30 03:52:07.042467] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'readdir-optimize' for volume 'kit-dht' with value 'on'
[2015-06-30 03:52:07.042477] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'assert-no-child-down' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042486] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'lookup-unhashed' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042495] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'use-readdirp' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042505] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'readdir-failover' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042515] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'entry-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042524] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'metadata-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042533] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'data-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042543] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-1: adding option 'readdir-failover' for volume 'ki
...skipping...
[2015-06-30 03:52:29.762368] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b36 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.766353] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b37: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.808333] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b10 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.813728] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b11: attempting to move from kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.819245] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b37 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.826413] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b51: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.874437] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b11 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.878284] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b13: attempting to move from kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.883269] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b51 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.890494] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b56: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.908997] W [MSGID: 114061] [client-rpc-fops.c:5987:client3_3_readdirp] 0-kit-client-0: (0cb784e4-4545-4f7c-b2ad-c05cfb8b072e) remote_fd is -1. EBADFD [File descriptor in bad state]
[2015-06-30 03:52:29.909498] E [MSGID: 109021] [dht-rebalance.c:1872:gf_defrag_get_entry] 0-kit-dht: /x11: Migrate data failed: Readdir returned File descriptor in bad state. Aborting migrate-data
[2015-06-30 03:52:29.909513] I [dht-rebalance.c:2289:gf_defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2015-06-30 03:52:29.909681] E [MSGID: 109016] [dht-rebalance.c:2550:gf_defrag_fix_layout] 0-kit-dht: Fix layout failed for /x11
[2015-06-30 03:52:29.909912] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2015-06-30 03:52:29.910002] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2015-06-30 03:52:29.930303] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b13 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.933258] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b56 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.934002] I [MSGID: 109028] [dht-rebalance.c:3029:gf_defrag_status_get] 0-kit-dht: Rebalance is failed. Time taken is 17.00 secs
[2015-06-30 03:52:29.934024] I [MSGID: 109028] [dht-rebalance.c:3033:gf_defrag_status_get] 0-kit-dht: Files migrated: 649, size: 0, lookups: 2222, failures: 2, skipped: 2
[2015-06-30 03:52:29.934153] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down
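Both logs show the same failure signature: a readdirp warning with EBADFD on the killed brick, followed by "Migrate data failed ... Aborting migrate-data" for the directory being crawled. A hedged triage sketch for pulling the affected directory out of a rebalance log (the helper and its sed pattern are assumptions based on the message format seen in these excerpts, not a gluster-supplied tool):

```shell
#!/bin/sh
# Hypothetical triage helper: list the directories for which the rebalance
# crawl aborted, by matching the "Migrate data failed" error lines on stdin.
failed_dirs() {
  # Capture the path between "<volname>-dht: " and ": Migrate data failed".
  sed -n 's/.*-dht: \([^:]*\): Migrate data failed.*/\1/p'
}

# Sample line taken verbatim from the kit-rebalance.log excerpt above:
line='[2015-06-30 03:52:29.909498] E [MSGID: 109021] [dht-rebalance.c:1872:gf_defrag_get_entry] 0-kit-dht: /x11: Migrate data failed: Readdir returned File descriptor in bad state. Aborting migrate-data'

printf '%s\n' "$line" | failed_dirs
```

Run against the full log (`failed_dirs < /var/log/glusterfs/kit-rebalance.log`, path as in this report), this would print `/x11`, the directory whose readdir hit the bad file descriptor.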
Doc text is edited. Please sign off to be included in Known Issues.
Sorry, I am providing qe-ack again.
Upstream patch: http://review.gluster.org/#/c/12013/

As this is yet to be merged upstream, moving this bug back to POST.
Tested with build glusterfs-server-3.7.5-5. After killing the brick process of one brick in a replica pair, rebalance does not continue from the other brick of that pair:

[root@rhs-client19 glusterfs]# gluster vol rebalance dht1x2 status
Node                                  Rebalanced-files   size     scanned   failures   skipped   status      run time in secs
---------                             -----------        ------   -------   --------   -------   ---------   ----------------
localhost                             0                  0Bytes   0         0          0         completed   1.00
rhs-client18.lab.eng.blr.redhat.com   625                2.7KB    1519      1          0         completed   25.00
volume rebalance: dht1x2: success:
[root@rhs-client19 glusterfs]#

Out of 15k files it scanned only 1519. When all brick processes were up and running, I re-initiated the rebalance; this time it scanned the full 15k files:

[root@rhs-client19 glusterfs]# gluster vol rebalance dht1x2 status
Node                                  Rebalanced-files   size     scanned   failures   skipped   status      run time in secs
---------                             -----------        ------   -------   --------   -------   ---------   ----------------
localhost                             2342               12.3KB   6421      0          0         completed   189.00
rhs-client18.lab.eng.blr.redhat.com   3028               15.9KB   8990      0          0         completed   217.00
volume rebalance: dht1x2: success:
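The "1519 scanned vs 15k expected" comparison above was done by eye across the per-node rows. A small sketch of that check (hypothetical helper; the column positions, with field 2 = files rebalanced and field 4 = files scanned, are assumed from the status table layout in this report):

```shell
#!/bin/sh
# Hypothetical verification helper: sum the "scanned" column of a
# `gluster vol rebalance <vol> status` table across all nodes, so the total
# can be compared with the known file count on the volume.
total_scanned() {
  # Data rows have a numeric second field; header and footer lines do not.
  awk '$2 ~ /^[0-9]+$/ { sum += $4 } END { print sum + 0 }'
}

# Per-node rows from the second (re-run) status output above:
rows='localhost                             2342   12.3KB   6421   0   0   completed   189.00
rhs-client18.lab.eng.blr.redhat.com   3028   15.9KB   8990   0   0   completed   217.00'

printf '%s\n' "$rows" | total_scanned
```

For the re-run this totals 6421 + 8990 = 15411 scanned entries, consistent with the roughly 15k files mentioned above; the same check on the first run would total only 1519.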
Tested with multiple directories: if one brick process of any replica pair is down, rebalance of the other directories proceeds fine. If the volume has only one directory, rebalance cannot continue with the other replica, and that is expected. Marking this as verified.
*** Bug 1064481 has been marked as a duplicate of this bug. ***
Hi Susant, the doc text is edited. Please sign off on it if it looks OK.
Is there a way I can see the doc text I submitted? I just want to verify, as if I remember correctly there was no reference to rename in it.
Updated the doc text based on my discussion with Susant.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html