Bug 822086

| Field | Value |
| --- | --- |
| Summary | Crash in rebalance when network goes down |
| Product | [Community] GlusterFS |
| Component | core |
| Version | pre-release |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Priority | medium |
| Severity | medium |
| Reporter | shylesh <shmohan> |
| Assignee | shishir gowda <sgowda> |
| QA Contact | shylesh <shmohan> |
| CC | gluster-bugs, nsathyan |
| Fixed In Version | glusterfs-3.4.0 |
| Verified Versions | 3.3.0qa43 |
| Doc Type | Bug Fix |
| Type | Bug |
| Bug Blocks | 817967 |
| Last Closed | 2013-07-24 17:26:12 UTC |
| Attachments | sos report (attachment 584921) |
Created attachment 584921 [details]: sos report

Description of problem:
While rebalance was running, the network was brought down on one of the nodes and the rebalance process crashed.

Version-Release number of selected component (if applicable): 3.3.0qa41

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume on a 4-node cluster.
2. Fill it with enough data that rebalance takes a while to finish.
3. Add a brick and start rebalance.
4. Run "service network stop" on one of the nodes.

Actual results:

```
[root@gqac022 mnt]# gluster volume rebalance giga status
        Node  Rebalanced-files  size      scanned  failures  status
------------  ----------------  --------  -------  --------  ---------
   localhost  25                25000000  3837     0         completed
10.16.157.66  30                30000000  1167     0         failed
10.16.157.72  29                29000000  3087     0         completed
10.16.157.69  16                16000000  3242     0         completed
```

After the node comes back up, the status shows "failed" for that particular node; the rebalance process had crashed on that machine.

Additional info:

```
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0,
    proghdr=0x7f680c000070, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30,
    frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0)
    at rpc-clnt.c:1533
#2  0x0000000000407ff8 in mgmt_submit_request ()
#3  0x000000000040cca8 in glusterfs_rebalance_event_notify ()
#4  0x00007f6873812964 in gf_defrag_start_crawl (data=<value optimized out>) at dht-rebalance.c:1486
#5  0x0000003ee7a4b322 in synctask_wrap (old_task=<value optimized out>) at syncop.c:120
#6  0x000000358ea43610 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) p rsp_iobref
$1 = (struct iobref *) 0x0
```

The sosreport is attached.
Volume name: giga
Log path: /var/log/glusterfs/giga-rebalance.log
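Frame #0 sitting at address 0x0, together with cbkfn=0 in frame #1, indicates a call through a NULL function pointer: the submit path stored the (missing) callback, and the disconnect path later invoked it. Below is a minimal C sketch of that failure pattern; the type and function names (rpc_cbk_fn_t, submit_request, handle_disconnect) are illustrative stand-ins, not GlusterFS's actual rpc-clnt code.

```c
#include <stdlib.h>

/* Illustrative stand-ins, not the real GlusterFS rpc-clnt structures. */
typedef int (*rpc_cbk_fn_t)(void *reply, void *opaque);

struct pending_req {
    rpc_cbk_fn_t cbkfn;   /* callback saved at submit time */
    void        *opaque;
};

/* A submit path that stores whatever callback it is given --
 * including NULL -- without validating it. */
static struct pending_req *submit_request(rpc_cbk_fn_t cbkfn, void *opaque)
{
    struct pending_req *req = calloc(1, sizeof(*req));
    req->cbkfn  = cbkfn;
    req->opaque = opaque;
    return req;
}

/* On a network disconnect, each pending request is completed with an
 * error by invoking its saved callback. If no callback was ever set,
 * this jumps to address 0x0 -- frame #0 in the backtrace above. */
static void handle_disconnect(struct pending_req *req)
{
    req->cbkfn(NULL, req->opaque);   /* SIGSEGV when cbkfn == NULL */
}

int main(void)
{
    struct pending_req *req = submit_request(NULL /* cbkfn */, NULL);
    handle_disconnect(req);          /* crashes: call through NULL pointer */
    return 0;
}
```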
The frame where the crash originates, with the callback argument inspected in gdb:

```
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0,
    proghdr=0x7f680c000070, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30,
    frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0)
    at rpc-clnt.c:1533
(gdb) p cbkfn
$3 = (fop_cbk_fn_t) 0
```

After rebalance runs to completion, it sends its status across to the local glusterd. The crash happens when there is a network disconnect, because the implementation does not pass a cbk_fn.

Setting the priority/severity to medium, as the crash happens only after rebalance has completed, and only when the network is brought down.

CHANGE: http://review.gluster.com/3359 (glusterfs/rebalance: Register cbk for glusterfs_rebalance_event_notify) merged in master by Anand Avati (avati)

With this change, the crash no longer occurs.
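Per the change title, the fix registers a callback for glusterfs_rebalance_event_notify so the reply/disconnect path always has a valid function to call. A hedged sketch of that pattern, reusing the illustrative names from the sketch above; rebalance_notify_cbk here is a hypothetical stand-in, not the exact function added by the patch.

```c
#include <stdio.h>
#include <stdlib.h>

typedef int (*rpc_cbk_fn_t)(void *reply, void *opaque);

struct pending_req {
    rpc_cbk_fn_t cbkfn;
    void        *opaque;
};

/* Hypothetical callback: it only needs to exist so the disconnect
 * path has a valid target. A NULL reply signals failure. */
static int rebalance_notify_cbk(void *reply, void *opaque)
{
    (void)opaque;
    if (reply == NULL)
        fprintf(stderr, "status notify failed (peer disconnected)\n");
    return 0;   /* the status report is fire-and-forget */
}

static struct pending_req *submit_request(rpc_cbk_fn_t cbkfn, void *opaque)
{
    struct pending_req *req = calloc(1, sizeof(*req));
    req->cbkfn  = cbkfn;
    req->opaque = opaque;
    return req;
}

static void handle_disconnect(struct pending_req *req)
{
    req->cbkfn(NULL, req->opaque);   /* safe: cbkfn is never NULL now */
    free(req);
}

int main(void)
{
    struct pending_req *req = submit_request(rebalance_notify_cbk, NULL);
    handle_disconnect(req);          /* logs the failure instead of crashing */
    return 0;
}
```

The design point: even a fire-and-forget notification needs a registered completion callback, because the RPC layer unconditionally invokes whatever callback was saved when the transport goes down.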