Created attachment 584921 [details]
sos report

Description of problem:
While rebalance is running, bring down the network; the rebalance process crashes.

Version-Release number of selected component (if applicable):
3.3.0qa41

How reproducible:

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume (4 node cluster).
2. Fill it with enough data that rebalance takes a while to finish.
3. Add-brick and start rebalance.
4. Run "service network stop" on one of the nodes.

Actual results:
[root@gqac022 mnt]# gluster volume rebalance giga status
        Node     Rebalanced-files        size     scanned    failures      status
   ---------          -----------  ----------  ----------  ----------  ----------
   localhost                   25    25000000        3837           0   completed
10.16.157.66                   30    30000000        1167           0      failed
10.16.157.72                   29    29000000        3087           0   completed
10.16.157.69                   16    16000000        3242           0   completed

The status says "failed" for that particular node after the node comes back; the rebalance process had crashed on that machine.

Additional info:
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0, proghdr=0x7f680c000070,
    proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30, frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0,
    rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1533
#2  0x0000000000407ff8 in mgmt_submit_request ()
#3  0x000000000040cca8 in glusterfs_rebalance_event_notify ()
#4  0x00007f6873812964 in gf_defrag_start_crawl (data=<value optimized out>) at dht-rebalance.c:1486
#5  0x0000003ee7a4b322 in synctask_wrap (old_task=<value optimized out>) at syncop.c:120
#6  0x000000358ea43610 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) p rsp_iobref
$1 = (struct iobref *) 0x0

Attached the sosreport.
Volume name: giga
Log path: var/log/glusterfs/giga-rebalance.log
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0, proghdr=0x7f680c000070,
    proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30, frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0,
    rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1533

(gdb) p cbkfn
$3 = (fop_cbk_fn_t) 0
After rebalance runs to completion, it sends its status across to the local glusterd. The crash happens because there is a network disconnect at that point and the implementation does not register a cbk_fn for the request, so rpc_clnt_submit ends up invoking a NULL callback (frame #0 at 0x0000000000000000, cbkfn=0 above). Setting the priority/severity to medium, as the crash only happens after rebalance has completed, and only in the case of the network being brought down.
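For illustration, here is a minimal standalone C sketch of the failure mode (this is not the GlusterFS source; the types and names are invented for the example): a request is submitted with a NULL completion callback, and when the transport later reports a disconnect, the error path invokes that callback without a NULL check, jumping to address 0x0 exactly as frame #0 of the backtrace shows.

/* Toy model of the crash, assuming invented types (not GlusterFS code). */
#include <stddef.h>

typedef int (*cbk_fn_t)(int status, void *frame);

struct pending_req {
        cbk_fn_t cbk;    /* left NULL by the caller, mirroring cbkfn=0 in gdb */
        void    *frame;
};

/* Called when the connection drops: every pending request is "completed"
 * with an error status by calling its callback. */
static void on_disconnect(struct pending_req *req)
{
        /* Missing 'if (req->cbk)' guard: with cbk == NULL this is a call
         * through a null pointer, i.e. a jump to 0x0000000000000000. */
        req->cbk(-1, req->frame);
}

int main(void)
{
        struct pending_req req = { .cbk = NULL, .frame = NULL };
        on_disconnect(&req);   /* SIGSEGV with pc == 0x0 */
        return 0;
}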
CHANGE: http://review.gluster.com/3359 (glusterfs/rebalance: Register cbk for glusterfs_rebalance_event_notify) merged in master by Anand Avati (avati)
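A counterpart sketch of the idea behind the fix, in the same toy framework as above (again an assumption-laden illustration, not the merged patch; the function names are invented): register a callback for the rebalance status notification so that the disconnect/error path has a valid function to invoke instead of jumping to 0x0.

/* Toy model of the fix: always pass a completion callback (not GlusterFS code). */
#include <stddef.h>
#include <stdio.h>

typedef int (*cbk_fn_t)(int status, void *frame);

struct pending_req {
        cbk_fn_t cbk;
        void    *frame;
};

/* No-op completion callback: the status notification is fire-and-forget,
 * so on error (e.g. network down) we only log and return. */
static int event_notify_cbk(int status, void *frame)
{
        (void)frame;
        if (status != 0)
                fprintf(stderr, "status notification failed: %d\n", status);
        return 0;
}

static void submit(struct pending_req *req, cbk_fn_t cbk, void *frame)
{
        req->cbk   = cbk;       /* previously left NULL by the caller */
        req->frame = frame;
}

static void on_disconnect(struct pending_req *req)
{
        if (req->cbk)           /* defensive NULL guard is also reasonable */
                req->cbk(-1, req->frame);
}

int main(void)
{
        struct pending_req req;
        submit(&req, event_notify_cbk, NULL);
        on_disconnect(&req);    /* no crash: callback absorbs the error */
        return 0;
}

Since the notification carries no response the caller needs, a logging no-op callback (plus a defensive NULL check in the disconnect path) is enough to make the network-down case safe.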
With the fix merged, the crash no longer occurs in this scenario.