Bug 822086

| Field | Value |
| --- | --- |
| Summary | Crash in rebalance when network goes down |
| Product | [Community] GlusterFS |
| Component | core |
| Version | pre-release |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Priority | medium |
| Severity | medium |
| Reporter | shylesh <shmohan> |
| Assignee | shishir gowda <sgowda> |
| QA Contact | shylesh <shmohan> |
| CC | gluster-bugs, nsathyan |
| Fixed In Version | glusterfs-3.4.0 |
| Verified Versions | 3.3.0qa43 |
| Doc Type | Bug Fix |
| Type | Bug |
| Bug Blocks | 817967 |
| Last Closed | 2013-07-24 17:26:12 UTC |
| Attachments | sos report (attachment 584921) |
Created attachment 584921 [details]: sos report

Description of problem:
While rebalance was running, the network was brought down on one of the nodes and the rebalance process crashed.

Version-Release number of selected component (if applicable): 3.3.0qa41

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume on a 4-node cluster.
2. Fill it with enough data that rebalance takes a while to finish.
3. Add a brick and start rebalance.
4. Run "service network stop" on one of the nodes.

Actual results:

```
[root@gqac022 mnt]# gluster volume rebalance giga status
        Node  Rebalanced-files  size      scanned  failures  status
------------  ----------------  --------  -------  --------  ---------
   localhost  25                25000000  3837     0         completed
10.16.157.66  30                30000000  1167     0         failed
10.16.157.72  29                29000000  3087     0         completed
10.16.157.69  16                16000000  3242     0         completed
```

After the node comes back up, the status shows "failed" for that particular node; the rebalance process had crashed on that machine.

Additional info:

```
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0,
    proghdr=0x7f680c000070, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30,
    frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0)
    at rpc-clnt.c:1533
#2  0x0000000000407ff8 in mgmt_submit_request ()
#3  0x000000000040cca8 in glusterfs_rebalance_event_notify ()
#4  0x00007f6873812964 in gf_defrag_start_crawl (data=<value optimized out>) at dht-rebalance.c:1486
#5  0x0000003ee7a4b322 in synctask_wrap (old_task=<value optimized out>) at syncop.c:120
#6  0x000000358ea43610 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) p rsp_iobref
$1 = (struct iobref *) 0x0
```

The sosreport is attached.
Volume name: giga
Log path: /var/log/glusterfs/giga-rebalance.log
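Frame #0 sitting at address 0x0, together with cbkfn=0 in frame #1, indicates a call through a NULL function pointer: the submit path stored the (missing) callback, and the disconnect path later invoked it. Below is a minimal C sketch of that failure pattern; the type and function names (rpc_cbk_fn_t, submit_request, handle_disconnect) are illustrative stand-ins, not GlusterFS's actual rpc-clnt code.

```c
#include <stdlib.h>

/* Illustrative stand-ins, not the real GlusterFS rpc-clnt structures. */
typedef int (*rpc_cbk_fn_t)(void *reply, void *opaque);

struct pending_req {
    rpc_cbk_fn_t cbkfn;   /* callback saved at submit time */
    void        *opaque;
};

/* A submit path that stores whatever callback it is given --
 * including NULL -- without validating it. */
static struct pending_req *submit_request(rpc_cbk_fn_t cbkfn, void *opaque)
{
    struct pending_req *req = calloc(1, sizeof(*req));
    req->cbkfn  = cbkfn;
    req->opaque = opaque;
    return req;
}

/* On a network disconnect, each pending request is completed with an
 * error by invoking its saved callback. If no callback was ever set,
 * this jumps to address 0x0 -- frame #0 in the backtrace above. */
static void handle_disconnect(struct pending_req *req)
{
    req->cbkfn(NULL, req->opaque);   /* SIGSEGV when cbkfn == NULL */
}

int main(void)
{
    struct pending_req *req = submit_request(NULL /* cbkfn */, NULL);
    handle_disconnect(req);          /* crashes: call through NULL pointer */
    return 0;
}
```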
The frame where the crash originates, with the callback argument inspected in gdb:

```
#1  0x0000003ee7e0fe05 in rpc_clnt_submit (rpc=<value optimized out>, prog=0x1c35a00, procnum=5, cbkfn=0,
    proghdr=0x7f680c000070, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f680c000c30,
    frame=0x7f6876ea8ea4, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0)
    at rpc-clnt.c:1533
(gdb) p cbkfn
$3 = (fop_cbk_fn_t) 0
```

After rebalance runs to completion, it sends its status across to the local glusterd. The crash happens when there is a network disconnect, because the implementation does not pass a cbk_fn.

Setting the priority/severity to medium, as the crash happens only after rebalance has completed, and only when the network is brought down.

CHANGE: http://review.gluster.com/3359 (glusterfs/rebalance: Register cbk for glusterfs_rebalance_event_notify) merged in master by Anand Avati (avati)

With this change, the crash no longer occurs.
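Per the change title, the fix registers a callback for glusterfs_rebalance_event_notify so the reply/disconnect path always has a valid function to call. A hedged sketch of that pattern, reusing the illustrative names from the sketch above; rebalance_notify_cbk here is a hypothetical stand-in, not the exact function added by the patch.

```c
#include <stdio.h>
#include <stdlib.h>

typedef int (*rpc_cbk_fn_t)(void *reply, void *opaque);

struct pending_req {
    rpc_cbk_fn_t cbkfn;
    void        *opaque;
};

/* Hypothetical callback: it only needs to exist so the disconnect
 * path has a valid target. A NULL reply signals failure. */
static int rebalance_notify_cbk(void *reply, void *opaque)
{
    (void)opaque;
    if (reply == NULL)
        fprintf(stderr, "status notify failed (peer disconnected)\n");
    return 0;   /* the status report is fire-and-forget */
}

static struct pending_req *submit_request(rpc_cbk_fn_t cbkfn, void *opaque)
{
    struct pending_req *req = calloc(1, sizeof(*req));
    req->cbkfn  = cbkfn;
    req->opaque = opaque;
    return req;
}

static void handle_disconnect(struct pending_req *req)
{
    req->cbkfn(NULL, req->opaque);   /* safe: cbkfn is never NULL now */
    free(req);
}

int main(void)
{
    struct pending_req *req = submit_request(rebalance_notify_cbk, NULL);
    handle_disconnect(req);          /* logs the failure instead of crashing */
    return 0;
}
```

The design point: even a fire-and-forget notification needs a registered completion callback, because the RPC layer unconditionally invokes whatever callback was saved when the transport goes down.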