Bug 1128428 - rebalance: rebalance process was not terminated on 2 servers after rebalance stop
Summary: rebalance: rebalance process was not terminated on 2 servers after rebalance stop
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: Matt Zywusko
URL:
Whiteboard:
Depends On:
Blocks: 1286121
 
Reported: 2014-08-10 12:01 UTC by Rachana Patel
Modified: 2018-01-16 06:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1286121
Environment:
Last Closed: 2015-11-27 11:33:16 UTC
Embargoed:



Description Rachana Patel 2014-08-10 12:01:43 UTC
Description of problem:
=======================
rebalance: the rebalance process was not terminated on 2 servers after the rebalance stop command was executed, even though file migration was not in progress on any of the servers


Version-Release number of selected component (if applicable):
=============================================================
3.4.0.59rhs-1.2.toyota.hotfix.el6rhs.x86_64


How reproducible:
=================
Observed twice.

Steps to Reproduce:
===================
1. 30 bricks on 4 servers. Add 16 bricks, start rebalance, and start I/O from multiple mount points (see the CLI sketch below).
2. After a while, stop the rebalance process. Status says completed/stopped.
3. Found that on 2 servers the rebalance process was still running after 20 minutes, and it was not migrating any files.

As a result, rebalance could not be started again.
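
A rough CLI sketch of the reproduction flow (the volume name 'sat' is taken from the rebalance command line below; server names and brick paths are placeholders, not the actual ones used):

# volume 'sat' already has 30 bricks across 4 servers; add 16 more (paths are illustrative)
gluster volume add-brick sat server1:/bricks/new1 server2:/bricks/new2   # ... up to 16 new bricks
gluster volume rebalance sat start
# run I/O from multiple mount points, then after a while:
gluster volume rebalance sat stop
gluster volume rebalance sat status          # reports stopped/completed
# on each server, check whether the rebalance daemon really exited:
ps auxwww | grep rebalance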


[root@rhs-client4 ~]# ps auxwww | grep reb
root      7531  0.7  1.4 622400 231664 ?       Ssl  16:15   0:20 /usr/sbin/glusterfs -s localhost --volfile-id sat --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=0772d1e1-8317-44a1-95a1-2dc8b6d95d35 --socket-file /var/lib/glusterd/vols/sat/rebalance/0772d1e1-8317-44a1-95a1-2dc8b6d95d35.sock --pid-file /var/lib/glusterd/vols/sat/rebalance/0772d1e1-8317-44a1-95a1-2dc8b6d95d35.pid -l /var/log/glusterfs/sat-rebalance.log
root      8093  0.0  0.0
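
A quick cross-check that this is the rebalance daemon glusterd started for 'sat' (paths taken verbatim from the command line above):

# pid file written by glusterd for this node's rebalance process
cat /var/lib/glusterd/vols/sat/rebalance/0772d1e1-8317-44a1-95a1-2dc8b6d95d35.pid
# signal 0 only tests whether the pid is still alive
kill -0 $(cat /var/lib/glusterd/vols/sat/rebalance/0772d1e1-8317-44a1-95a1-2dc8b6d95d35.pid) && echo "rebalance daemon still running"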

log snippet :-
[2014-08-09 11:42:05.778604] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x315800f524] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x315800f063] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x315800ef7e]))) 0-sat-client-13: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2014-08-09 11:40:23.067342 (xid=0x9x)
[2014-08-09 11:42:05.778636] W [client-handshake.c:1882:client_dump_version_cbk] 0-sat-client-13: received RPC status error
[2014-08-09 11:42:05.778668] I [client.c:2103:client_rpc_notify] 0-sat-client-13: disconnected from 10.70.36.63:49163. Client process will keep trying to connect to glusterd until brick's port is available. 
[2014-08-09 11:42:06.076095] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-sat-client-13: changing port to 49163 (from 0)
[2014-08-09 11:43:48.793166] W [socket.c:522:__socket_rwv] 0-sat-client-13: readv on 10.70.36.63:49163 failed (Connection reset by peer)
[2014-08-09 11:43:48.793341] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x315800f524] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x315800f063] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x315800ef7e]))) 0-sat-client-13: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2014-08-09 11:42:06.082466 (xid=0x12x)
[2014-08-09 11:43:48.793363] W [client-handshake.c:1882:client_dump_version_cbk] 0-sat-client-13: received RPC status error
[2014-08-09 11:43:48.793392] I [client.c:2103:client_rpc_notify] 0-sat-client-13: disconnected from 10.70.36.63:49163. Client process will keep trying to connect to glusterd until brick's port is available. 
[2014-08-09 11:43:49.092561] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-sat-client-13: changing port to 49163 (from 0)
[2014-08-09 11:45:31.810479] W [socket.c:522:__socket_rwv] 0-sat-client-13: readv on 10.70.36.63:49163 failed (Connection reset by peer)


Actual results:
===============
- No file is being migrated, yet the rebalance process has not terminated.


Expected results:
================
Once the stop command is executed, the rebalance process should terminate as soon as migration of the current file is complete.
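
In other words, a per-node check along these lines should stop reporting the daemon shortly after the in-flight file finishes (a sketch; the pgrep pattern matches the rebalance command line shown above):

gluster volume rebalance sat stop
# poll until the rebalance daemon for volume 'sat' exits
while pgrep -f 'glusterfs.*volfile-id sat.*rebalance-cmd' > /dev/null; do
    echo "rebalance daemon still running on $(hostname)"
    sleep 10
done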

Additional info:
================

Comment 3 Kaushal 2014-08-11 10:49:00 UTC
From the logs, I see that the rebalance process was already completed on 2 of the nodes and was still running on 2 nodes when the stop command was issued. After the stop was issued, rebalance continued on those 2 nodes.

The rebalance logs show that they received the stop request. Since the rebalance processes received the stop request, the only reason for them to continue running would have been because a file was still under migration.

Rachana also assumed this and waited for some time for something to happen. Later, the rebalance process was straced to see if a file was being migrated. It was observed that a lot of readv() calls were being made on a file ('data16893') but nothing was being written.
getfattr on this file shows that it was supposed to be migrating:
'''
[root@rhs-gp-srv16 ~]# getfattr -d -m . /home/sat*/data16*
getfattr: Removing leading '/' from absolute path names
# file: home/sat11/data16893
trusted.afr.sat-client-10=0sAAAAAAAAAAAAAAAA
trusted.afr.sat-client-11=0sAAAAAAAAAAAAAAAA
trusted.gfid=0srfd62t6vQC23vaNdrm1iRw==
trusted.glusterfs.dht.linkto="sat-replicate-12"
# file: home/sat19/data16893
trusted.gfid=0srfd62t6vQC23vaNdrm1iRw==
trusted.glusterfs.dht.linkto="sat-replicate-5"
'''

But it was observed that there was no size change on the destination (sat19).
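
For reference, roughly the kind of checks used here (the pid is whatever ps reports for the rebalance daemon on that node; brick paths are the ones from the getfattr output above):

# watch what the rebalance daemon is doing with its file descriptors
strace -f -p <rebalance-pid> -e trace=read,readv,write,writev
# xattrs on the source and destination copies, as above
getfattr -d -m . /home/sat*/data16893
# the destination file size should keep growing while a migration is in progress
watch -n 5 'stat -c "%n %s" /home/sat19/data16893'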

Since the problem appears to be a rebalance/dht issue, assigning this to the dht team.

@Rachana, is it possible to get the strace logs that you obtained?

Comment 6 Susant Kumar Palai 2015-11-27 11:33:16 UTC
Cloning this to 3.1. To be fixed in a future release.

