Description of problem:
=======================
rebalance :- The rebalance process was not terminated on 2 servers after the rebalance stop command was executed, even though file migration was not in progress on any of the servers.

Version-Release number of selected component (if applicable):
=============================================================
3.4.0.59rhs-1.2.toyota.hotfix.el6rhs.x86_64

How reproducible:
=================
Seen twice.

Steps to Reproduce:
===================
1. Start with 30 bricks on 4 servers. Add 16 bricks, start rebalance, and start I/O from multiple mount points.
2. After a while, stop the rebalance process. Status says completed/stopped.
3. On 2 servers the rebalance process was still running after 20 minutes and was not migrating any files. As a result, rebalance could not be started again.

[root@rhs-client4 ~]# ps auxwww | grep reb
root 7531 0.7 1.4 622400 231664 ? Ssl 16:15 0:20 /usr/sbin/glusterfs -s localhost --volfile-id sat --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=0772d1e1-8317-44a1-95a1-2dc8b6d95d35 --socket-file /var/lib/glusterd/vols/sat/rebalance/0772d1e1-8317-44a1-95a1-2dc8b6d95d35.sock --pid-file /var/lib/glusterd/vols/sat/rebalance/0772d1e1-8317-44a1-95a1-2dc8b6d95d35.pid -l /var/log/glusterfs/sat-rebalance.log
root 8093 0.0 0.0

log snippet :-
[2014-08-09 11:42:05.778604] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x315800f524] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x315800f063] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x315800ef7e]))) 0-sat-client-13: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2014-08-09 11:40:23.067342 (xid=0x9x)
[2014-08-09 11:42:05.778636] W [client-handshake.c:1882:client_dump_version_cbk] 0-sat-client-13: received RPC status error
[2014-08-09 11:42:05.778668] I [client.c:2103:client_rpc_notify] 0-sat-client-13: disconnected from 10.70.36.63:49163. Client process will keep trying to connect to glusterd until brick's port is available.
[2014-08-09 11:42:06.076095] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-sat-client-13: changing port to 49163 (from 0)
[2014-08-09 11:43:48.793166] W [socket.c:522:__socket_rwv] 0-sat-client-13: readv on 10.70.36.63:49163 failed (Connection reset by peer)
[2014-08-09 11:43:48.793341] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x315800f524] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x315800f063] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x315800ef7e]))) 0-sat-client-13: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2014-08-09 11:42:06.082466 (xid=0x12x)
[2014-08-09 11:43:48.793363] W [client-handshake.c:1882:client_dump_version_cbk] 0-sat-client-13: received RPC status error
[2014-08-09 11:43:48.793392] I [client.c:2103:client_rpc_notify] 0-sat-client-13: disconnected from 10.70.36.63:49163. Client process will keep trying to connect to glusterd until brick's port is available.
[2014-08-09 11:43:49.092561] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-sat-client-13: changing port to 49163 (from 0)
[2014-08-09 11:45:31.810479] W [socket.c:522:__socket_rwv] 0-sat-client-13: readv on 10.70.36.63:49163 failed (Connection reset by peer)

Actual results:
===============
- No file is being migrated, yet the rebalance process is not terminated.

Expected results:
================
If the stop command is executed, the rebalance process should terminate once migration of the current file is completed.

Additional info:
================
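The check in step 3 (looking for a leftover rebalance daemon in the ps output) can be scripted. The following is a minimal sketch, not part of the original report: the function name find_rebalance_processes is hypothetical, and it assumes a rebalance daemon can be recognised by the "*dht.rebalance-cmd=" xlator option visible in the command line above.

```python
import re

# Assumption: a rebalance daemon is a glusterfs process whose command line
# carries a "*dht.rebalance-cmd=<n>" xlator option, as in the ps output above.
REBALANCE_MARKER = re.compile(r"\*dht\.rebalance-cmd=\d+")
VOLFILE_ID = re.compile(r"--volfile-id\s+(\S+)")


def find_rebalance_processes(ps_output):
    """Return (pid, volume) pairs for rebalance daemons found in `ps auxwww` text."""
    found = []
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or truncated entry
        command = fields[10]
        if "glusterfs" not in command or not REBALANCE_MARKER.search(command):
            continue  # not a rebalance daemon (e.g. the grep itself)
        match = VOLFILE_ID.search(command)
        found.append((int(fields[1]), match.group(1) if match else None))
    return found
```

Feeding it the captured output of ps auxwww on each server, taken some time after the stop command, gives a list that should be empty once every rebalance daemon has exited.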
From the logs, I see that the rebalance process had already completed on 2 of the nodes and was still running on the other 2 nodes when the stop command was issued. After the stop was issued, rebalance continued on those 2 nodes. The rebalance logs show that they received the stop request.

Since the rebalance processes received the stop request, the only reason for them to continue running would have been that a file was still under migration. Rachana also assumed this and waited for some time for something to happen. Later, the rebalance process was straced to see if a file was being migrated. It was observed that a lot of readv() calls were being made on a file ('data16893') but nothing was being written.

Getfattr on this file shows that it was supposed to be migrating,
'''
[root@rhs-gp-srv16 ~]# getfattr -d -m . /home/sat*/data16*
getfattr: Removing leading '/' from absolute path names
# file: home/sat11/data16893
trusted.afr.sat-client-10=0sAAAAAAAAAAAAAAAA
trusted.afr.sat-client-11=0sAAAAAAAAAAAAAAAA
trusted.gfid=0srfd62t6vQC23vaNdrm1iRw==
trusted.glusterfs.dht.linkto="sat-replicate-12"

# file: home/sat19/data16893
trusted.gfid=0srfd62t6vQC23vaNdrm1iRw==
trusted.glusterfs.dht.linkto="sat-replicate-5"
'''

But it was observed that there was no size change on the destination (sat19).

Since the problem appears to be a rebalance/dht issue, assigning this to the dht team.

@Rachana, is it possible to get the strace logs that you obtained?
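For anyone repeating the getfattr check across many bricks, here is a rough sketch that pulls the trusted.glusterfs.dht.linkto values out of a getfattr -d dump like the one above. The helper names parse_getfattr_dump and linkto_targets are hypothetical, and the parser only assumes the "# file:" / name=value layout seen in this comment; it does not interpret what the values mean for the migration state.

```python
LINKTO = "trusted.glusterfs.dht.linkto"


def parse_getfattr_dump(text):
    """Map each '# file:' path in a getfattr -d dump to its {xattr: value} dict."""
    files = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# file: "):
            current = files.setdefault(line[len("# file: "):], {})
        elif current is not None and "=" in line:
            # Split on the first '=' only; base64 values may end in '='.
            name, _, value = line.partition("=")
            current[name] = value.strip('"')
    return files


def linkto_targets(text):
    """Return {path: linkto subvolume} for every file carrying the linkto xattr."""
    return {
        path: attrs[LINKTO]
        for path, attrs in parse_getfattr_dump(text).items()
        if LINKTO in attrs
    }
```

Run against the dump in this comment, it would report both copies of data16893 together with the replicate subvolume each one points to.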
Cloning this to 3.1. To be fixed in a future release.