Description of problem:
=======================
The gluster volume creation command fails with a time-out error, and afterwards all gluster commands fail with "Another transaction is in progress" because of a deadlock.

Version-Release number of selected component (if applicable):
=============================================================
3.7dev-0.803.gitf64666f.el6.x86_64

How reproducible:
================
Intermittent

Steps to Reproduce:
1. Installed 3.7dev-0.803.gitf64666f.el6.x86_64 on the cluster.
2. Created a volume using the command below, which timed out:

[root@rhs-client38 ~]# gluster v create BitRot1 replica 3 rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1 rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1 rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
Error : Request timed out

3. After a while (10-15 min), a status check showed that all commands were failing:

[root@rhs-client38 ~]# gluster v create BitRot1 replica 3 rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1 rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1 rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
volume create: BitRot1: failed: Volume BitRot1 already exists
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
Actual results:
===============
Due to the deadlock, commands fail with "Another transaction is in progress".

Additional info:
===============
(gdb) bt
#0  0x0000003291a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003291a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003291a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96281958bb in rpc_clnt_disable (rpc=0x7f9618001860) at rpc-clnt.c:1712
#4  0x00007f962819587e in rpc_clnt_trigger_destroy (rpc=<value optimized out>) at rpc-clnt.c:1634
#5  rpc_clnt_unref (rpc=<value optimized out>) at rpc-clnt.c:1670
#6  0x00007f962819a765 in rpc_clnt_start_ping (rpc_ptr=0x7f9618001860) at rpc-clnt-ping.c:265
#7  0x00007f96283e1d30 in gf_timer_proc (ctx=0x2080010) at timer.c:183
#8  0x0000003291a079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x00000032912e88fd in clone () from /lib64/libc.so.6

log snippet:

[2015-03-26 07:35:49.746017] E [glusterd-volume-ops.c:321:__glusterd_handle_create_volume] 0-management: Volume BitRot1 already exists
[2015-03-26 07:35:57.377967] I [glusterd-handler.c:1321:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-03-26 07:38:53.196387] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:38:53.196419] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1

The identical W/E pair then repeats at 07:39:10.912649/.912682, 07:40:08.276495/.276534, and 07:40:57.076025/.076056, each time reporting the lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504.
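The backtrace makes the cycle visible: frame #6 (rpc_clnt_start_ping) is already inside the critical section guarded by conn->lock when it drops what turns out to be the last ref on the rpc object, and the destroy path it triggers (frames #3-#5: rpc_clnt_unref -> rpc_clnt_trigger_destroy -> rpc_clnt_disable) tries to take the same non-recursive mutex again. Below is a minimal C sketch of that pattern; conn_t, unref(), destroy() and ping_timer_expired() are illustrative stand-ins, not the actual GlusterFS symbols.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;     /* stands in for conn->lock */
    int             refcount;
} conn_t;

static void destroy(conn_t *conn)
{
    /* like rpc_clnt_disable(): needs conn->lock for teardown */
    pthread_mutex_lock(&conn->lock);   /* blocks forever: the same
                                          thread already holds it */
    /* ... teardown ... */
    pthread_mutex_unlock(&conn->lock);
}

static void unref(conn_t *conn)
{
    /* refcounting simplified to an atomic decrement */
    if (__sync_sub_and_fetch(&conn->refcount, 1) == 0)
        destroy(conn);                 /* last ref: tear down */
}

/* plays the role of rpc_clnt_start_ping() in the timer thread */
static void ping_timer_expired(conn_t *conn)
{
    pthread_mutex_lock(&conn->lock);   /* critical section entered */
    unref(conn);                       /* if this is the last ref,
                                          destroy() relocks conn->lock
                                          and the thread deadlocks */
    pthread_mutex_unlock(&conn->lock); /* never reached in that case */
}

int main(void)
{
    conn_t conn = { PTHREAD_MUTEX_INITIALIZER, 1 };
    ping_timer_expired(&conn);         /* hangs: self-deadlock */
    return 0;
}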
REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#2) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#4) for review on master by Niels de Vos (ndevos)
REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)
COMMIT: http://review.gluster.org/9613 committed in master by Vijay Bellur (vbellur)
------
commit d448fd187dde46bfb0d20354613912f6aa477904
Author: Krishnan Parthasarathi <kparthas>
Date:   Mon Feb 9 17:10:49 2015 +0530

    rpc: fix deadlock when unref is inside conn->lock

    In the ping-timer implementation, the timer event takes a ref on the
    rpc object, and this ref needs to be removed after every timeout
    event. The ping-timer mechanism could be holding the last ref, e.g.
    when a peer is detached and its rpc object was unref'd. In that case,
    the ping-timer mechanism would try to acquire conn->mutex to perform
    the 'last' unref while already being inside the critical section,
    resulting in a deadlock.

    Change-Id: I74f80dd08c9348bd320a1c6d12fc8cd544fa4aea
    BUG: 1206134
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/9613
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
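The commit message describes the remedy in principle: decide inside the critical section whether the timer's ref must be dropped, but perform the potentially-final unref only after conn->lock has been released, so the destroy path can safely reacquire it. A hedged sketch of that pattern follows, reusing the illustrative conn_t, unref() and destroy() from the earlier sketch; this shows the shape of the fix, not the actual patch.

/* fixed variant: the unref is moved outside the critical section */
static void ping_timer_expired_fixed(conn_t *conn)
{
    int need_unref = 0;

    pthread_mutex_lock(&conn->lock);
    /* ... ping bookkeeping under the lock ... */
    need_unref = 1;                    /* record the decision only */
    pthread_mutex_unlock(&conn->lock);

    if (need_unref)
        unref(conn);                   /* may be the last ref; destroy()
                                          can now take conn->lock freely */
}

The same shape applies anywhere a refcount drop can trigger destruction: never perform the drop while holding a lock that the destructor will need.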
The following link provides a test case written for the GlusterFS regression test framework. It wasn't merged into the repository since it is Linux-specific, but it can be used to recreate this issue: http://review.gluster.com/#/c/9613/4/tests/bugs/rpc/bug-1206134.t
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user