Description of problem:
When rpc_clnt_submit fails, glusterd hits a deadlock in op_sm.

Version-Release number of selected component (if applicable):
mainline (but with custom patch)

How reproducible:
Fairly easy

Steps to Reproduce:
1. Make rpc_clnt_submit fail in brick-ops (to reproduce, please ask for the patch file which exposes this issue).

Actual results:
glusterd hangs

Expected results:

Additional info:
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007f48318bf1e5 in _L_lock_883 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f48318bf03a in __pthread_mutex_lock (mutex=0x7f482f23e700) at pthread_mutex_lock.c:61
#3  0x00007f482efc7b07 in glusterd_op_sm () at glusterd-op-sm.c:3377
#4  0x00007f482efdf03f in glusterd3_1_brick_op_cbk (req=0xb482dc, iov=0x0, count=0, myframe=0x7f48302e81f0) at glusterd-rpc-ops.c:1690
#5  0x00007f4831cfcc12 in rpc_clnt_submit (rpc=0x83e570, prog=0x7f482f23cfa0, procnum=4, cbkfn=0x7f482efded15 <glusterd3_1_brick_op_cbk>, proghdr=0x7fffffbe5810, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x8374c0, frame=0x7f48302e81f0, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1511
#6  0x00007f482efc9690 in glusterd_submit_request (rpc=0x83e570, req=0x837490, frame=0x7f48302e81f0, prog=0x7f482f23cfa0, procnum=4, iobref=0x8374c0, this=0x823710, cbkfn=0x7f482efded15 <glusterd3_1_brick_op_cbk>, xdrproc=0x7f4831adec3e <xdr_gd1_mgmt_brick_op_req>) at glusterd-utils.c:369
#7  0x00007f482efdf3e6 in glusterd3_1_brick_op (frame=0x0, this=0x823710, data=0x82a400) at glusterd-rpc-ops.c:1756
#8  0x00007f482efc71bf in glusterd_op_ac_send_brick_op (event=0x82a3d0, ctx=0x0) at glusterd-op-sm.c:2888
#9  0x00007f482efc7c85 in glusterd_op_sm () at glusterd-op-sm.c:3395
#10 0x00007f482eff7195 in glusterd_handle_defrag_volume (req=0x7f482eede02c) at glusterd-rebalance.c:557
#11 0x00007f4831cf11ef in rpcsvc_handle_rpc_call (svc=0x824ec0, trans=0x82db50, msg=0x82da20) at rpcsvc.c:507
#12 0x00007f4831cf156c in rpcsvc_notify (trans=0x82db50, mydata=0x824ec0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x82da20) at rpcsvc.c:603
#13 0x00007f4831cf718c in rpc_transport_notify (this=0x82db50, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x82da20) at rpc-transport.c:498
#14 0x00007f482ecd22e7 in socket_event_poll_in (this=0x82db50) at socket.c:1675
#15 0x00007f482ecd2850 in socket_event_handler (fd=15, idx=7, data=0x82db50, poll_in=1, poll_out=0, poll_err=0) at socket.c:1790
#16 0x00007f4831f4bcf0 in event_dispatch_epoll_handler (event_pool=0x81e500, events=0x82b6e0, i=0) at event.c:794
#17 0x00007f4831f4bf03 in event_dispatch_epoll (event_pool=0x81e500) at event.c:856
#18 0x00007f4831f4c276 in event_dispatch (event_pool=0x81e500) at event.c:956
#19 0x0000000000407db8 in main (argc=1, argv=0x7fffffbe5eb8) at glusterfsd.c:1606

The lock is first taken in frame #9, glusterd_op_sm():
        (void) pthread_mutex_lock (&gd_op_sm_lock);
and then re-acquired by the same thread, without being released, in frame #3, glusterd_op_sm():
        (void) pthread_mutex_lock (&gd_op_sm_lock);   <--- deadlock
CHANGE: http://review.gluster.com/2625 (glusterd: Changed op_sm_queue locking mechanism to accommodate nested calls to op_sm) merged in master by Vijay Bellur (vijay)
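The merged patch itself is not quoted here; as a hedged sketch of the general idea (event names, helpers, and the running flag below are illustrative, not the actual glusterd code), a queue-based state machine can tolerate nested calls by having a re-entrant caller only enqueue its event and return, while the outermost call drains the queue:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative sketch only: a state machine whose injection routine may
 * be called again from inside an event handler on the same thread. The
 * lock protects only the queue and the running flag, never a handler. */
typedef struct event { struct event *next; int id; } event_t;

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static event_t *head, *tail;
static int sm_running;          /* protected by queue_lock */
static int handled_count;       /* for demonstration */

static void enqueue (event_t *ev)
{
        ev->next = NULL;
        if (tail) tail->next = ev; else head = ev;
        tail = ev;
}

static event_t *dequeue (void)
{
        event_t *ev = head;
        if (ev) { head = ev->next; if (!head) tail = NULL; }
        return ev;
}

void op_sm_inject (event_t *ev);        /* forward declaration */

static void handle (event_t *ev)
{
        handled_count++;
        if (ev->id == 1) {
                /* Nested injection from inside a handler: with a single
                 * "lock around everything" design this self-deadlocks. */
                static event_t nested = { NULL, 2 };
                op_sm_inject (&nested);
        }
}

void op_sm_inject (event_t *ev)
{
        pthread_mutex_lock (&queue_lock);
        enqueue (ev);
        if (sm_running) {       /* nested call: outer loop will drain it */
                pthread_mutex_unlock (&queue_lock);
                return;
        }
        sm_running = 1;
        event_t *cur;
        while ((cur = dequeue ()) != NULL) {
                pthread_mutex_unlock (&queue_lock);     /* run unlocked */
                handle (cur);
                pthread_mutex_lock (&queue_lock);
        }
        sm_running = 0;
        pthread_mutex_unlock (&queue_lock);
}
```

With this shape, injecting event 1 also processes the nested event 2 and returns instead of deadlocking, because the handler is never invoked while the queue lock is held.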
shishir, next time, don't file bugs with "please ask for the patch to reproduce"; attach it to the Bugzilla entry anyway. It helps to verify the bug. Anyway, I am trying to verify this by making rpc_submit_reply fail in brick-op.
----------
diff --git a/xlators/mgmt/glusterd/src/glusterd-rpc-ops.c b/xlators/mgmt/glusterd/src/glusterd-rpc-ops.c
index a7ccda7..b06caf7 100644
--- a/xlators/mgmt/glusterd/src/glusterd-rpc-ops.c
+++ b/xlators/mgmt/glusterd/src/glusterd-rpc-ops.c
@@ -1958,6 +1958,7 @@ glusterd3_1_brick_op (call_frame_t *frame, xlator_t *this,
                 goto out;
         }
 
+        req->name = NULL;
         ret = glusterd_submit_request (rpc, req, dummy_frame,
                                        priv->gfs_mgmt, req->op,
                                        NULL,
----------

Tested with the above patch; it doesn't crash now on glusterfs-3.3.0 or the master branch.