Bug 1237059
Summary: | DHT-rebalance: rebalance status shows failed when replica pair bricks are brought down in distrep volume while rename of files is in progress | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Triveni Rao <trao> | |
Component: | distribute | Assignee: | Susant Kumar Palai <spalai> | |
Status: | CLOSED ERRATA | QA Contact: | RajeshReddy <rmekala> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | rhgs-3.1 | CC: | annair, asriram, asrivast, bmohanra, byarlaga, mzywusko, nbalacha, rgowdapp, rhs-bugs, rmekala, sankarshan, sashinde, shmohan, smohan, spalai | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | RHGS 3.1.2 | Flags: | bmohanra: needinfo+ |
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | glusterfs-3.7.5-5 | Doc Type: | Bug Fix | |
Doc Text: |
Previously, the rebalance status was reported as failed when bricks of a replica pair were brought down in a distributed-replicated volume. With this fix, rebalance skips the error-affected directory and continues with the other directories, so the rebalance process on a distributed-replicated volume is not aborted even if a brick from a replica pair goes down.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1257076 1319592 | Environment: ||
Last Closed: | 2016-03-01 05:27:38 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1216951, 1243815, 1257076, 1260783, 1318196 |
Description
Triveni Rao
2015-06-30 09:58:08 UTC
I just tried the single-brick-down scenario and could reproduce the problem on the volume.

Volume Name: kit
Type: Distributed-Replicate
Volume ID: ae805fc4-45c2-4d80-94e8-ce50336bc3c4
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.57:/rhs/brick1/s0
Brick2: 10.70.35.136:/rhs/brick1/s0
Brick3: 10.70.35.57:/rhs/brick2/s0
Brick4: 10.70.35.136:/rhs/brick2/s0
Options Reconfigured:
performance.readdir-ahead: on

[root@casino-vm1 ~]# gluster v status kit
Status of volume: kit
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/s0           49218     0          Y       26079
Brick 10.70.35.136:/rhs/brick1/s0          49196     0          Y       15477
Brick 10.70.35.57:/rhs/brick2/s0           49219     0          Y       26097
Brick 10.70.35.136:/rhs/brick2/s0          49197     0          Y       15495
NFS Server on localhost                    2049      0          Y       26116
Self-heal Daemon on localhost              N/A       N/A        Y       26124
NFS Server on 10.70.35.136                 2049      0          Y       15514
Self-heal Daemon on 10.70.35.136           N/A       N/A        Y       15522

Task Status of Volume kit
------------------------------------------------------------------------------
There are no active volume tasks

[root@casino-vm1 ~]# gluster v add-brick kit 10.70.35.57:/rhs/brick4/s0 10.70.35.136:/rhs/brick4/s0
volume add-brick: success

[root@casino-vm1 ~]# gluster v rebalance kit start force
volume rebalance: kit: success: Rebalance on kit has been started successfully. Use rebalance status command to check status of the rebalance process. ID: 6062fa7e-094b-48b9-9885-9a0ccb75e668

[root@casino-vm1 ~]# gluster v rebalance kit status
Node          Rebalanced-files  size    scanned  failures  skipped  status       run time in secs
---------     ----------------  ------  -------  --------  -------  -----------  ----------------
localhost     116               0Bytes  762      0         0        in progress  4.00
10.70.35.136  0                 0Bytes  0        0         0        in progress  4.00
volume rebalance: kit: success:

[root@casino-vm1 ~]# kill -9 26079

[root@casino-vm1 ~]# gluster v status kit
Status of volume: kit
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/s0           N/A       N/A        N       N/A
Brick 10.70.35.136:/rhs/brick1/s0          49196     0          Y       15477
Brick 10.70.35.57:/rhs/brick2/s0           49219     0          Y       26097
Brick 10.70.35.136:/rhs/brick2/s0          49197     0          Y       15495
Brick 10.70.35.57:/rhs/brick4/s0           49220     0          Y       26498
Brick 10.70.35.136:/rhs/brick4/s0          49198     0          Y       15878
NFS Server on localhost                    2049      0          Y       26517
Self-heal Daemon on localhost              N/A       N/A        Y       26525
NFS Server on 10.70.35.136                 2049      0          Y       15897
Self-heal Daemon on 10.70.35.136           N/A       N/A        Y       15905

Task Status of Volume kit
------------------------------------------------------------------------------
Task   : Rebalance
ID     : 6062fa7e-094b-48b9-9885-9a0ccb75e668
Status : failed

[root@casino-vm1 ~]# gluster v rebalance kit status
Node          Rebalanced-files  size    scanned  failures  skipped  status     run time in secs
---------     ----------------  ------  -------  --------  -------  ---------  ----------------
localhost     649               0Bytes  2222     2         2        failed     17.00
10.70.35.136  0                 0Bytes  0        0         0        completed  7.00
volume rebalance: kit: success:

Log messages:

[root@casino-vm1 ~]# less /var/log/glusterfs/kit-rebalance.log
[2015-06-30 03:52:02.022740] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.1 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/kit
--xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=38c8a3d7-e190-4d7b-9a79-f7bbac5e146b --xlator-option *dht.commit-hash=2895155977 --socket-file /var/run/gluster/gluster-rebalance-ae805fc4-45c2-4d80-94e8-ce50336bc3c4.sock --pid-file /var/lib/glusterd/vols/kit/rebalance/38c8a3d7-e190-4d7b-9a79-f7bbac5e146b.pid -l /var/log/glusterfs/kit-rebalance.log)
[2015-06-30 03:52:02.038890] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-06-30 03:52:07.042428] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'commit-hash' for volume 'kit-dht' with value '2895155977'
[2015-06-30 03:52:07.042448] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'node-uuid' for volume 'kit-dht' with value '38c8a3d7-e190-4d7b-9a79-f7bbac5e146b'
[2015-06-30 03:52:07.042458] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'rebalance-cmd' for volume 'kit-dht' with value '5'
[2015-06-30 03:52:07.042467] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'readdir-optimize' for volume 'kit-dht' with value 'on'
[2015-06-30 03:52:07.042477] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'assert-no-child-down' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042486] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'lookup-unhashed' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042495] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'use-readdirp' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042505] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'readdir-failover' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042515] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'entry-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042524] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'metadata-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042533] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'data-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042543] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-1: adding option 'readdir-failover' for volume 'ki...skipping...
[2015-06-30 03:52:29.762368] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b36 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.766353] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b37: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.808333] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b10 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.813728] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b11: attempting to move from kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.819245] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b37 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.826413] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b51: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.874437] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b11 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.878284] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b13: attempting to move from kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.883269] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b51 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.890494] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b56: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.908997] W [MSGID: 114061] [client-rpc-fops.c:5987:client3_3_readdirp] 0-kit-client-0: (0cb784e4-4545-4f7c-b2ad-c05cfb8b072e) remote_fd is -1. EBADFD [File descriptor in bad state]
[2015-06-30 03:52:29.909498] E [MSGID: 109021] [dht-rebalance.c:1872:gf_defrag_get_entry] 0-kit-dht: /x11: Migrate data failed: Readdir returned File descriptor in bad state. Aborting migrate-data
[2015-06-30 03:52:29.909513] I [dht-rebalance.c:2289:gf_defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2015-06-30 03:52:29.909681] E [MSGID: 109016] [dht-rebalance.c:2550:gf_defrag_fix_layout] 0-kit-dht: Fix layout failed for /x11
[2015-06-30 03:52:29.909912] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2015-06-30 03:52:29.910002] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2015-06-30 03:52:29.930303] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b13 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.933258] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b56 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.934002] I [MSGID: 109028] [dht-rebalance.c:3029:gf_defrag_status_get] 0-kit-dht: Rebalance is failed. Time taken is 17.00 secs
[2015-06-30 03:52:29.934024] I [MSGID: 109028] [dht-rebalance.c:3033:gf_defrag_status_get] 0-kit-dht: Files migrated: 649, size: 0, lookups: 2222, failures: 2, skipped: 2
[2015-06-30 03:52:29.934153] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down

Doc text is edited. Please sign off to be included in Known Issues.

Doc text is edited. Please sign off to be included in Known Issues.

Sorry, I am providing qe-ack again.
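For anyone triaging a similar failure, a quick way to pull the critical error lines out of a rebalance log such as the excerpt above. The log path follows the <volname>-rebalance.log pattern seen in this bug, and the message strings are the ones visible in the excerpt; exact wording may differ between glusterfs versions, so treat this as a sketch rather than an official diagnostic:

```sh
# Illustrative only: scan the rebalance log of volume "kit" for the errors
# seen in this bug (readdirp failure, aborted migrate-data, failed fix-layout).
grep -E "Migrate data failed|Fix layout failed|remote_fd is -1" \
    /var/log/glusterfs/kit-rebalance.log
```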
Upstream patch: http://review.gluster.org/#/c/12013/

As this is yet to be merged upstream, moving this bug back to POST.

Tested with build glusterfs-server-3.7.5-5. After killing the brick process of one of the replica pairs, rebalance is not continuing from the other replica pair:

[root@rhs-client19 glusterfs]# gluster vol rebalance dht1x2 status
Node                                  Rebalanced-files  size    scanned  failures  skipped  status     run time in secs
---------                             ----------------  ------  -------  --------  -------  ---------  ----------------
localhost                             0                 0Bytes  0        0         0        completed  1.00
rhs-client18.lab.eng.blr.redhat.com   625               2.7KB   1519     1         0        completed  25.00
volume rebalance: dht1x2: success:
[root@rhs-client19 glusterfs]#

Out of 15k files it scanned only 1519. When all brick processes were up and running, I re-initiated the rebalance, and this time it scanned 15k files:

[root@rhs-client19 glusterfs]# gluster vol rebalance dht1x2 status
Node                                  Rebalanced-files  size    scanned  failures  skipped  status     run time in secs
---------                             ----------------  ------  -------  --------  -------  ---------  ----------------
localhost                             2342              12.3KB  6421     0         0        completed  189.00
rhs-client18.lab.eng.blr.redhat.com   3028              15.9KB  8990     0         0        completed  217.00
volume rebalance: dht1x2: success:

Tested with multiple directories: if one of the brick processes of any replica pair is down, rebalance of the other directories proceeds fine, and if the volume has only one directory then rebalance cannot continue with the other replica, which is expected, so marking this as verified.

*** Bug 1064481 has been marked as a duplicate of this bug. ***

Hi Susant, the doc text is edited. Do sign off on the same if it looks OK.

Is there a way I can see the doc text I submitted? I just want to verify, as if I remember correctly there was no reference to rename in it.

Updated the doc text based on my discussion with Susant.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html
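For reference, a condensed sketch of the verification flow used in this bug, for re-checking the fixed behaviour on glusterfs-3.7.5-5 or later. The volume name and brick path are placeholders taken from the reproduction in the Description, and the PID extraction is an assumption based on the "gluster volume status" output format shown above; the expectation after the fix is that the status does not flip to "failed" when one brick of a replica pair goes down:

```sh
# Illustrative verification sketch, not an official test case.
VOL=kit    # placeholder volume name from the Description

# Start a forced rebalance on the volume.
gluster volume rebalance "$VOL" start force

# Kill one brick of a replica pair while rebalance is running.
# Assumes the brick PID is the last column of "gluster volume status",
# as in the status listings earlier in this bug.
BRICK_PID=$(gluster volume status "$VOL" | awk '/brick1\/s0/ {print $NF; exit}')
kill -9 "$BRICK_PID"

# Before the fix the status flipped to "failed"; with the fix, rebalance
# skips the error-affected directory and continues with the others.
gluster volume rebalance "$VOL" status
```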