Created attachment 584921 [details]
sos report

Description of problem:
While rebalance is running, bring down the network; the rebalance process crashes.

Version-Release number of selected component (if applicable):
3.3.0qa41

How reproducible:

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume (4 node cluster).
2. Fill it with enough data that rebalance takes a while to finish.
3. Add-brick and start rebalance.
4. Run "service network stop" on one of the nodes.

Actual results:
[root@gqac022 mnt]# gluster volume rebalance giga status
        Node     Rebalanced-files        size     scanned    failures      status
   ---------          -----------  ----------  ----------  ----------  ----------
   localhost                   25    25000000        3837           0   completed
10.16.157.66                   30    30000000        1167           0      failed
10.16.157.72                   29    29000000        3087           0   completed
10.16.157.69                   16    16000000        3242           0   completed

The status says "failed" for that particular node after the node comes back; the rebalance process had crashed on that machine.

Additional info:
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0, proghdr=0x7f680c000070,
    proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30, frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0,
    rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1533
#2  0x0000000000407ff8 in mgmt_submit_request ()
#3  0x000000000040cca8 in glusterfs_rebalance_event_notify ()
#4  0x00007f6873812964 in gf_defrag_start_crawl (data=<value optimized out>) at dht-rebalance.c:1486
#5  0x0000003ee7a4b322 in synctask_wrap (old_task=<value optimized out>) at syncop.c:120
#6  0x000000358ea43610 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) p rsp_iobref
$1 = (struct iobref *) 0x0

Attached the sosreport.
Volume name: giga
Log path: var/log/glusterfs/giga-rebalance.log
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0, proghdr=0x7f680c000070,
    proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30, frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0,
    rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1533

(gdb) p cbkfn
$3 = (fop_cbk_fn_t) 0
After rebalance runs to completion, it sends its status across to the local glusterd. The crash happens because there is a network disconnect at that point and the implementation does not register a cbk_fn for the request, so rpc_clnt_submit ends up invoking a NULL callback (frame #0 at 0x0000000000000000, cbkfn=0 above). Setting the priority/severity to medium, as the crash only happens after rebalance has completed, and only in the case of the network being brought down.
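For illustration, here is a minimal standalone C sketch of the failure mode (this is not the GlusterFS source; the types and names are invented for the example): a request is submitted with a NULL completion callback, and when the transport later reports a disconnect, the error path invokes that callback without a NULL check, jumping to address 0x0 exactly as frame #0 of the backtrace shows.

/* Toy model of the crash, assuming invented types (not GlusterFS code). */
#include <stddef.h>

typedef int (*cbk_fn_t)(int status, void *frame);

struct pending_req {
        cbk_fn_t cbk;    /* left NULL by the caller, mirroring cbkfn=0 in gdb */
        void    *frame;
};

/* Called when the connection drops: every pending request is "completed"
 * with an error status by calling its callback. */
static void on_disconnect(struct pending_req *req)
{
        /* Missing 'if (req->cbk)' guard: with cbk == NULL this is a call
         * through a null pointer, i.e. a jump to 0x0000000000000000. */
        req->cbk(-1, req->frame);
}

int main(void)
{
        struct pending_req req = { .cbk = NULL, .frame = NULL };
        on_disconnect(&req);   /* SIGSEGV with pc == 0x0 */
        return 0;
}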
CHANGE: http://review.gluster.com/3359 (glusterfs/rebalance: Register cbk for glusterfs_rebalance_event_notify) merged in master by Anand Avati (avati)
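A counterpart sketch of the idea behind the fix, in the same toy framework as above (again an assumption-laden illustration, not the merged patch; the function names are invented): register a callback for the rebalance status notification so that the disconnect/error path has a valid function to invoke instead of jumping to 0x0.

/* Toy model of the fix: always pass a completion callback (not GlusterFS code). */
#include <stddef.h>
#include <stdio.h>

typedef int (*cbk_fn_t)(int status, void *frame);

struct pending_req {
        cbk_fn_t cbk;
        void    *frame;
};

/* No-op completion callback: the status notification is fire-and-forget,
 * so on error (e.g. network down) we only log and return. */
static int event_notify_cbk(int status, void *frame)
{
        (void)frame;
        if (status != 0)
                fprintf(stderr, "status notification failed: %d\n", status);
        return 0;
}

static void submit(struct pending_req *req, cbk_fn_t cbk, void *frame)
{
        req->cbk   = cbk;       /* previously left NULL by the caller */
        req->frame = frame;
}

static void on_disconnect(struct pending_req *req)
{
        if (req->cbk)           /* defensive NULL guard is also reasonable */
                req->cbk(-1, req->frame);
}

int main(void)
{
        struct pending_req req;
        submit(&req, event_notify_cbk, NULL);
        on_disconnect(&req);    /* no crash: callback absorbs the error */
        return 0;
}

Since the notification carries no response the caller needs, a logging no-op callback (plus a defensive NULL check in the disconnect path) is enough to make the network-down case safe.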
With the fix merged, the crash no longer occurs in this scenario.