Bug 772142

Summary: glusterd brick-ops hits a deadlock
Product: [Community] GlusterFS Reporter: shishir gowda <sgowda>
Component: glusterd Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED CURRENTRELEASE QA Contact: shylesh <shmohan>
Severity: high Docs Contact:
Priority: unspecified    
Version: mainline CC: amarts, gluster-bugs, nsathyan
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.4.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-07-24 17:58:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: glusterfs-3.3.0,master Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 817967    

Description shishir gowda 2012-01-06 06:08:43 UTC
Description of problem:
When rpc_clnt_submit fails, glusterd hits a deadlock in op_sm.

Version-Release number of selected component (if applicable):
mainline (but with custom patch)

How reproducible:
Fairly easy (not reproducible without the custom patch)

Steps to Reproduce:
1. make rpc_clnt_submit fail in brick-ops
(to reproduce, please ask for the patch file which exposes this issue)
  
Actual results:
glusterd hangs

Expected results:
glusterd should handle the failed brick-op submission gracefully; the op state machine should not deadlock.

Additional info:


(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007f48318bf1e5 in _L_lock_883 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f48318bf03a in __pthread_mutex_lock (mutex=0x7f482f23e700) at pthread_mutex_lock.c:61
#3  0x00007f482efc7b07 in glusterd_op_sm () at glusterd-op-sm.c:3377
#4  0x00007f482efdf03f in glusterd3_1_brick_op_cbk (req=0xb482dc, iov=0x0, count=0, myframe=0x7f48302e81f0) at glusterd-rpc-ops.c:1690
#5  0x00007f4831cfcc12 in rpc_clnt_submit (rpc=0x83e570, prog=0x7f482f23cfa0, procnum=4, cbkfn=0x7f482efded15 <glusterd3_1_brick_op_cbk>, 
    proghdr=0x7fffffbe5810, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x8374c0, frame=0x7f48302e81f0, rsphdr=0x0, rsphdr_count=0, 
    rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1511
#6  0x00007f482efc9690 in glusterd_submit_request (rpc=0x83e570, req=0x837490, frame=0x7f48302e81f0, prog=0x7f482f23cfa0, procnum=4, 
    iobref=0x8374c0, this=0x823710, cbkfn=0x7f482efded15 <glusterd3_1_brick_op_cbk>, xdrproc=0x7f4831adec3e <xdr_gd1_mgmt_brick_op_req>)
    at glusterd-utils.c:369
#7  0x00007f482efdf3e6 in glusterd3_1_brick_op (frame=0x0, this=0x823710, data=0x82a400) at glusterd-rpc-ops.c:1756
#8  0x00007f482efc71bf in glusterd_op_ac_send_brick_op (event=0x82a3d0, ctx=0x0) at glusterd-op-sm.c:2888
#9  0x00007f482efc7c85 in glusterd_op_sm () at glusterd-op-sm.c:3395
#10 0x00007f482eff7195 in glusterd_handle_defrag_volume (req=0x7f482eede02c) at glusterd-rebalance.c:557
#11 0x00007f4831cf11ef in rpcsvc_handle_rpc_call (svc=0x824ec0, trans=0x82db50, msg=0x82da20) at rpcsvc.c:507
#12 0x00007f4831cf156c in rpcsvc_notify (trans=0x82db50, mydata=0x824ec0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x82da20) at rpcsvc.c:603
#13 0x00007f4831cf718c in rpc_transport_notify (this=0x82db50, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x82da20) at rpc-transport.c:498
#14 0x00007f482ecd22e7 in socket_event_poll_in (this=0x82db50) at socket.c:1675
#15 0x00007f482ecd2850 in socket_event_handler (fd=15, idx=7, data=0x82db50, poll_in=1, poll_out=0, poll_err=0) at socket.c:1790
#16 0x00007f4831f4bcf0 in event_dispatch_epoll_handler (event_pool=0x81e500, events=0x82b6e0, i=0) at event.c:794
#17 0x00007f4831f4bf03 in event_dispatch_epoll (event_pool=0x81e500) at event.c:856
#18 0x00007f4831f4c276 in event_dispatch (event_pool=0x81e500) at event.c:956
#19 0x0000000000407db8 in main (argc=1, argv=0x7fffffbe5eb8) at glusterfsd.c:1606

The lock is first taken in frame #9, glusterd_op_sm():
(void) pthread_mutex_lock (&gd_op_sm_lock);

It is then acquired again, without having been released, in frame #3, glusterd_op_sm():
(void) pthread_mutex_lock (&gd_op_sm_lock); <--- deadlock
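
The pattern is easiest to see in a small standalone sketch (illustrative names only, not the actual glusterd code): the state machine takes a plain, non-recursive mutex, the brick-op submission fails, and the failure callback re-enters the state machine on the same thread, which then blocks on the lock it already holds.

----------
/* Minimal sketch of the deadlock pattern (illustrative names only, not the
 * actual glusterd code).  Running it hangs, by design. */
#include <pthread.h>

static pthread_mutex_t op_sm_lock = PTHREAD_MUTEX_INITIALIZER;
static int             failed_once;

static void op_sm (void);

/* Stands in for rpc_clnt_submit(): when submission fails, the callback is
 * invoked synchronously on the caller's own thread. */
static void
submit_request (void)
{
        if (!failed_once) {
                failed_once = 1;
                /* serialization failed: invoke the callback synchronously,
                 * which re-enters the state machine */
                op_sm ();
        }
}

static void
op_sm (void)
{
        pthread_mutex_lock (&op_sm_lock);       /* nested call blocks here */
        submit_request ();
        pthread_mutex_unlock (&op_sm_lock);
}

int
main (void)
{
        op_sm ();       /* never returns: the thread deadlocks on op_sm_lock */
        return 0;
}
----------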

Comment 1 Anand Avati 2012-02-03 15:40:52 UTC
CHANGE: http://review.gluster.com/2625 (glusterd: Changed op_sm_queue locking mechanism to accomodate nested calls to op_sm) merged in master by Vijay Bellur (vijay)
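
For reference, one way to let the same thread call into op_sm recursively is to make gd_op_sm_lock a recursive mutex. This is only an illustrative sketch (glusterd_op_sm_lock_init() below is a hypothetical helper name); the merged change above may use a different mechanism, for example queueing nested events and draining them from the outermost call.

----------
/* Illustrative only: initialize the state-machine lock as a recursive mutex
 * so a nested call from the same thread does not self-deadlock.  The actual
 * mechanism in http://review.gluster.com/2625 may differ. */
#include <pthread.h>

static pthread_mutex_t gd_op_sm_lock;

void
glusterd_op_sm_lock_init (void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init (&attr);
        pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init (&gd_op_sm_lock, &attr);
        pthread_mutexattr_destroy (&attr);
}
----------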

Comment 2 Amar Tumballi 2012-06-04 06:59:46 UTC
shishir, next time please don't file bugs with "ask for the patch to reproduce"; attach the patch to the bugzilla instead, as it helps with verifying the bug. Anyway, I am trying to verify this by making rpc_submit_reply fail in brick-op.

Comment 3 Amar Tumballi 2012-06-04 07:17:13 UTC
----------
diff --git a/xlators/mgmt/glusterd/src/glusterd-rpc-ops.c b/xlators/mgmt/glusterd/src/glusterd-rpc-ops.c
index a7ccda7..b06caf7 100644
--- a/xlators/mgmt/glusterd/src/glusterd-rpc-ops.c
+++ b/xlators/mgmt/glusterd/src/glusterd-rpc-ops.c
@@ -1958,6 +1958,7 @@ glusterd3_1_brick_op (call_frame_t *frame, xlator_t *this,
                         goto out;
                 }
 
+                req->name = NULL;
                 ret = glusterd_submit_request (rpc, req, dummy_frame,
                                                priv->gfs_mgmt,
                                                req->op, NULL,
---------


Tested with the above patch; glusterd no longer hangs on glusterfs-3.3.0 and the master branch.
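
As I understand the failure injection: gd1_mgmt_brick_op_req.name is a string field, and with glibc's built-in Sun RPC xdr_string() refuses to encode a NULL pointer (other XDR implementations may crash instead), so the serialization done for glusterd_submit_request() fails and rpc_clnt_submit() invokes the brick-op callback synchronously, which is exactly the nested op_sm path from the original backtrace. A quick standalone check of that assumption (requires glibc's Sun RPC headers; on newer systems build against libtirpc instead):

----------
/* Standalone check: xdr_string() fails (returns 0) when asked to encode a
 * NULL string.  Assumes glibc's built-in Sun RPC; other XDR implementations
 * may behave differently. */
#include <rpc/xdr.h>
#include <stdio.h>

int
main (void)
{
        XDR   xdrs;
        char  buf[64];
        char *name = NULL;

        xdrmem_create (&xdrs, buf, sizeof (buf), XDR_ENCODE);
        printf ("xdr_string (NULL) -> %d\n", xdr_string (&xdrs, &name, ~0u));
        xdr_destroy (&xdrs);
        return 0;
}
----------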