Bug 822086

Summary: Crash in rebalance when network goes down
Product: [Community] GlusterFS
Reporter: shylesh <shmohan>
Component: core
Assignee: shishir gowda <sgowda>
Status: CLOSED CURRENTRELEASE
QA Contact: shylesh <shmohan>
Severity: medium
Docs Contact:
Priority: medium
Version: pre-release
CC: gluster-bugs, nsathyan
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-24 17:26:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: 3.3.0qa43
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 817967
Attachments: sos report (flags: none)

Description shylesh 2012-05-16 10:07:44 UTC
Created attachment 584921: sos report

Description of problem:
While a rebalance was running, the network was brought down on one node and the rebalance process crashed.

Version-Release number of selected component (if applicable):
3.3.0qa41

How reproducible:


Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume (4-node cluster).
2. Fill it with enough data that the rebalance takes a while to finish.
3. Add a brick and start the rebalance.
4. Run "service network stop" on one of the nodes.
  
Actual results:

[root@gqac022 mnt]# gluster volume rebalance giga status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost               25     25000000         3837            0      completed
                            10.16.157.66               30     30000000         1167            0         failed
                            10.16.157.72               29     29000000         3087            0      completed
                            10.16.157.69               16     16000000         3242            0      completed


After the node comes back, the status shows "failed" for that particular node. The rebalance process had crashed on that machine.
 
 
Additional info:
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0, proghdr=0x7f680c000070, 
    proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30, frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0, 
    rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1533
#2  0x0000000000407ff8 in mgmt_submit_request ()
#3  0x000000000040cca8 in glusterfs_rebalance_event_notify ()
#4  0x00007f6873812964 in gf_defrag_start_crawl (data=<value optimized out>) at dht-rebalance.c:1486
#5  0x0000003ee7a4b322 in synctask_wrap (old_task=<value optimized out>) at syncop.c:120
#6  0x000000358ea43610 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()


(gdb) p rsp_iobref
$1 = (struct iobref *) 0x0

Attached the sosreport. Volume name: giga
Log path: var/log/glusterfs/giga-rebalance.log

Comment 1 shylesh 2012-05-16 11:05:10 UTC
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0, proghdr=0x7f680c000070, 
    proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30, frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0, 
    rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1533



(gdb) p cbkfn
$3 = (fop_cbk_fn_t) 0

Comment 2 shishir gowda 2012-05-17 05:39:45 UTC
After rebalance runs to completion, it sends its status to the local glusterd. The crash happens on a network disconnect because the implementation does not pass a cbk_fn.
Setting the priority/severity to medium, as the crash happens after rebalance has completed, and then in the case where the network has been brought down.
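
The failure mode described in this comment can be illustrated with a minimal C sketch. This is not GlusterFS code: conn, reply_cbk_t, notify_cbk and submit_request are hypothetical stand-ins. The assumption, consistent with frame #0 sitting at address 0 and cbkfn=0 in the backtrace, is that the submit path unwinds a failed request through the caller's callback. Without a registered callback, or a NULL check, that unwind calls through a null function pointer.

```c
#include <stddef.h>

/* Hypothetical callback type; stands in for a reply/error handler. */
typedef int (*reply_cbk_t)(int op_ret, void *cookie);

/* Hypothetical connection state; not a GlusterFS structure. */
struct conn {
    int connected;
};

/* Counts callback invocations so a caller can observe the unwind. */
static int cbk_calls = 0;

/* A minimal callback: records the call and passes op_ret through. */
static int notify_cbk(int op_ret, void *cookie)
{
    (void)cookie;
    cbk_calls++;
    return op_ret;
}

/*
 * Simplified submit path (an assumption, not the real rpc_clnt_submit).
 * When the transport is down, the request is unwound through the
 * callback with an error. If the NULL check below were missing, a
 * caller passing cbk == NULL would jump to address 0.
 */
static int submit_request(struct conn *c, reply_cbk_t cbk, void *cookie)
{
    if (c->connected)
        return 0;   /* request queued; a reply would arrive later */

    if (cbk == NULL)
        return -1;  /* defensive: nothing to unwind to */

    cbk(-1, cookie); /* report the failure to the caller */
    return -1;
}
```

In this sketch there are two ways to avoid the crash: the caller always registers a callback such as notify_cbk, or the submit path checks for NULL before unwinding. The change title in Comment 3 suggests the actual fix took the first approach.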

Comment 3 Anand Avati 2012-05-19 02:28:59 UTC
CHANGE: http://review.gluster.com/3359 (glusterfs/rebalance: Register cbk for glusterfs_rebalance_event_notify) merged in master by Anand Avati (avati)

Comment 4 shylesh 2012-05-25 04:58:51 UTC
The crash no longer occurs.