Description of problem:

On a 6x2 distributed-replicate volume, an add-brick operation was performed, followed by a rebalance. The volume already had around 100 GB of data inside a directory. There was no I/O happening, only the rebalance. Still, rebalance status reports "stopped" for one node, and the rebalance logs on that node show:

[2013-10-29 14:36:51.255142] W [dht-common.c:5131:dht_notify] 0-dist-rep11-dht: Received CHILD_DOWN. Exiting
[2013-10-29 14:36:51.255150] I [dht-rebalance.c:1788:gf_defrag_stop] 0-: Received stop command on rebalance

However, "gluster volume status" shows that all the bricks are up. The above messages are accompanied by this "critical" message in rebalance.log:

C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-dist-rep11-client-6: cannot lookup the saved frame corresponding to xid (849056)

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.36rhs

How reproducible:
Seen on one setup.

Steps to Reproduce:
1. Create a volume with these settings:

Volume Name: dist-rep11
Type: Distributed-Replicate
Volume ID: 57a3fe8c-974c-4a8c-a5ec-6112e00ae9de
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.58:/rhs/bricks/d1r1-n11
Brick2: 10.70.37.196:/rhs/bricks/d1r2-n11
Brick3: 10.70.37.138:/rhs/bricks/d2r1-n11
Brick4: 10.70.37.186:/rhs/bricks/d2r2-n11
Brick5: 10.70.37.58:/rhs/bricks/d3r1-n11
Brick6: 10.70.37.196:/rhs/bricks/d3r2-n11
Brick7: 10.70.37.138:/rhs/bricks/d4r1-n11
Brick8: 10.70.37.186:/rhs/bricks/d4r2-n11
Brick9: 10.70.37.58:/rhs/bricks/d5r1-n11
Brick10: 10.70.37.196:/rhs/bricks/d5r2-n11
Brick11: 10.70.37.138:/rhs/bricks/d6r1-n11
Brick12: 10.70.37.186:/rhs/bricks/d6r2-n11
Brick13: 10.70.37.58:/rhs/bricks/d1r1-n11-add
Brick14: 10.70.37.196:/rhs/bricks/d1r2-n11-add
Options Reconfigured:
features.quota-deem-statfs: on
features.quota: on

[root@nfs1 ~]# gluster volume quota dist-rep11 list
                  Path                 Hard-limit  Soft-limit     Used  Available
--------------------------------------------------------------------------------
/                                         300.0GB         80%  106.6GB  193.4GB
/dir1                                     100.0GB         80%  100.0GB   0Bytes

2. Run add-brick and rebalance.

Actual results:

Present status of the rebalance:

[root@nfs1 ~]# gluster volume rebalance dist-rep11 status
        Node  Rebalanced-files     size  scanned  failures  skipped       status  run time in secs
   ---------       -----------  -------  -------  --------  -------  -----------  ----------------
   localhost             52501  603.2MB   324071         0    59384  in progress          64108.00
10.70.37.186                96  742.3KB   459200         0       33  in progress          64107.00
10.70.37.138             46183  517.3MB   216020         0    26566  in progress          64107.00
10.70.37.196                 0   0Bytes   302926         5        0      stopped          11098.00
volume rebalance: dist-rep11: success:

From 10.70.37.196, rebalance logs:

[2013-10-29 14:36:51.255045] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-dist-rep11-client-6: cannot lookup the saved frame corresponding to xid (849056)
[2013-10-29 14:36:51.255085] W [socket.c:1804:__socket_read_reply] 0-dist-rep11-client-6: notify for event MAP_XID failed
[2013-10-29 14:36:51.255121] I [client.c:2103:client_rpc_notify] 0-dist-rep11-client-6: disconnected from 10.70.37.138:49159. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-29 14:36:51.255134] E [afr-common.c:3919:afr_notify] 0-dist-rep11-replicate-3: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-10-29 14:36:51.255142] W [dht-common.c:5131:dht_notify] 0-dist-rep11-dht: Received CHILD_DOWN. Exiting
[2013-10-29 14:36:51.255150] I [dht-rebalance.c:1788:gf_defrag_stop] 0-: Received stop command on rebalance
[2013-10-29 14:36:51.315190] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-dist-rep11-client-10: cannot lookup the saved frame corresponding to xid (843570)
[2013-10-29 14:36:51.315233] W [socket.c:1804:__socket_read_reply] 0-dist-rep11-client-10: notify for event MAP_XID failed
[2013-10-29 14:36:51.315298] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x3720e0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3720e0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3720e0db4e]))) 0-dist-rep11-client-10: forced unwinding frame type(GlusterFS 3.3) op(GETXATTR(18)) called at 2013-10-29 14:34:53.989042 (xid=0x843658x)
[2013-10-29 14:36:51.315314] W [client-rpc-fops.c:1103:client3_3_getxattr_cbk] 0-dist-rep11-client-10: remote operation failed: Transport endpoint is not connected. Path: /linux_untar1382971802/linux-2.6.31.1/arch/sparc/mm/tlb.c (b8884399-daf9-422e-a462-ae88469ba227). Key: trusted.glusterfs.node-uuid
[2013-10-29 14:36:51.315339] W [afr-inode-read.c:843:afr_getxattr_node_uuid_cbk] 0-dist-rep11-replicate-5: op_ret (-1): Re-querying afr-child (1/2)
[2013-10-29 14:36:51.315425] I [client.c:2103:client_rpc_notify] 0-dist-rep11-client-10: disconnected from 10.70.37.138:49160. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-29 14:36:51.316421] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802/linux-2.6.31.1/arch/sparc/mm
[2013-10-29 14:36:51.316636] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802/linux-2.6.31.1/arch/sparc
[2013-10-29 14:36:51.316867] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802/linux-2.6.31.1/arch
[2013-10-29 14:36:51.317071] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802/linux-2.6.31.1
[2013-10-29 14:36:51.317263] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802
[2013-10-29 14:36:51.317524] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 11098.00 secs
[2013-10-29 14:36:51.317538] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 302926, failures: 5, skipped: 0
[2013-10-29 14:36:51.317967] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3a2f6e894d] (-->/lib64/libpthread.so.0() [0x3a2fe07851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down

Expected results:

It is unclear why the rebalance is displayed as "stopped".
If it has finished, it should say "completed".

Additional info:

Status of volume: dist-rep11
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.58:/rhs/bricks/d1r1-n11                  49155   Y       15848
Brick 10.70.37.196:/rhs/bricks/d1r2-n11                 49155   Y       9080
Brick 10.70.37.138:/rhs/bricks/d2r1-n11                 49158   Y       2589
Brick 10.70.37.186:/rhs/bricks/d2r2-n11                 49155   Y       9085
Brick 10.70.37.58:/rhs/bricks/d3r1-n11                  49156   Y       15859
Brick 10.70.37.196:/rhs/bricks/d3r2-n11                 49156   Y       9091
Brick 10.70.37.138:/rhs/bricks/d4r1-n11                 49159   Y       2596
Brick 10.70.37.186:/rhs/bricks/d4r2-n11                 49156   Y       9096
Brick 10.70.37.58:/rhs/bricks/d5r1-n11                  49157   Y       15870
Brick 10.70.37.196:/rhs/bricks/d5r2-n11                 49157   Y       9102
Brick 10.70.37.138:/rhs/bricks/d6r1-n11                 49160   Y       2601
Brick 10.70.37.186:/rhs/bricks/d6r2-n11                 49157   Y       9107
Brick 10.70.37.58:/rhs/bricks/d1r1-n11-add              49158   Y       2238
Brick 10.70.37.196:/rhs/bricks/d1r2-n11-add             49158   Y       27544
NFS Server on localhost                                 2049    Y       20286
Self-heal Daemon on localhost                           N/A     Y       20293
Quota Daemon on localhost                               N/A     Y       20300
NFS Server on 10.70.37.58                               2049    Y       2250
Self-heal Daemon on 10.70.37.58                         N/A     Y       2257
Quota Daemon on 10.70.37.58                             N/A     Y       2277
NFS Server on 10.70.37.196                              2049    Y       27557
Self-heal Daemon on 10.70.37.196                        N/A     Y       27564
Quota Daemon on 10.70.37.196                            N/A     Y       27579
NFS Server on 10.70.37.186                              2049    Y       27304
Self-heal Daemon on 10.70.37.186                        N/A     Y       27311
Quota Daemon on 10.70.37.186                            N/A     Y       27318

Task                 ID                                    Status
----                 --                                    ------
Rebalance            08fcf21d-e0e2-4bc5-a511-f7e1601c9b90  1
Seems to be an issue in rebalance. Changing the component to DHT.
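For reference, the reproduction steps above can be sketched as the command sequence below. The brick paths, volume options, and quota limits are taken from the volume info and quota listing in this report; the exact create-time brick ordering and the "enable quota before loading data" sequencing are assumptions, not a verified transcript.

```shell
# Sketch of the reproduction, assuming the brick layout shown in the
# volume info above. Run from any node in the trusted storage pool.

# 1. Create and start the original 6x2 distributed-replicate volume.
gluster volume create dist-rep11 replica 2 \
    10.70.37.58:/rhs/bricks/d1r1-n11  10.70.37.196:/rhs/bricks/d1r2-n11 \
    10.70.37.138:/rhs/bricks/d2r1-n11 10.70.37.186:/rhs/bricks/d2r2-n11 \
    10.70.37.58:/rhs/bricks/d3r1-n11  10.70.37.196:/rhs/bricks/d3r2-n11 \
    10.70.37.138:/rhs/bricks/d4r1-n11 10.70.37.186:/rhs/bricks/d4r2-n11 \
    10.70.37.58:/rhs/bricks/d5r1-n11  10.70.37.196:/rhs/bricks/d5r2-n11 \
    10.70.37.138:/rhs/bricks/d6r1-n11 10.70.37.186:/rhs/bricks/d6r2-n11
gluster volume start dist-rep11

# Match the "Options Reconfigured" and quota limits from the report.
gluster volume quota dist-rep11 enable
gluster volume quota dist-rep11 limit-usage / 300GB
gluster volume quota dist-rep11 limit-usage /dir1 100GB
gluster volume set dist-rep11 features.quota-deem-statfs on

# (Populate ~100 GB of data under a directory on the mounted volume.)

# 2. Add the new replica pair and start the rebalance.
gluster volume add-brick dist-rep11 \
    10.70.37.58:/rhs/bricks/d1r1-n11-add 10.70.37.196:/rhs/bricks/d1r2-n11-add
gluster volume rebalance dist-rep11 start

# Observe: one node eventually shows "stopped" here, even though
# "gluster volume status" reports all bricks online.
gluster volume rebalance dist-rep11 status
```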