+++ This bug was initially created as a clone of Bug #1206134 +++

Description of problem:
=======================
The gluster volume creation command fails with a time out error, and after that gluster commands fail with the error "Another transaction is in progress" because of a deadlock.

Version-Release number of selected component (if applicable):
=============================================================
3.7dev-0.803.gitf64666f.el6.x86_64

How reproducible:
================
intermittent

Steps to Reproduce:
1. Installed 3.7dev-0.803.gitf64666f.el6.x86_64 on the cluster.
2. Created a volume using the command below, which gave a time out error:

[root@rhs-client38 ~]# gluster v create BitRot1 replica 3 rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1 rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1 rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
Error : Request timed out

3. After a while (10-15 min), checking the status showed that all commands were failing as below:

[root@rhs-client38 ~]# gluster v create BitRot1 replica 3 rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1 rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1 rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
volume create: BitRot1: failed: Volume BitRot1 already exists

[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.

[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.

[root@rhs-client38 ~]# less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.

Actual results:
===============
Due to the deadlock, commands are failing saying "Another transaction is in progress".

Additional info:
===============
(gdb) bt
#0  0x0000003291a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003291a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003291a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96281958bb in rpc_clnt_disable (rpc=0x7f9618001860) at rpc-clnt.c:1712
#4  0x00007f962819587e in rpc_clnt_trigger_destroy (rpc=<value optimized out>) at rpc-clnt.c:1634
#5  rpc_clnt_unref (rpc=<value optimized out>) at rpc-clnt.c:1670
#6  0x00007f962819a765 in rpc_clnt_start_ping (rpc_ptr=0x7f9618001860) at rpc-clnt-ping.c:265
#7  0x00007f96283e1d30 in gf_timer_proc (ctx=0x2080010) at timer.c:183
#8  0x0000003291a079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x00000032912e88fd in clone () from /lib64/libc.so.6

log snippet:

[2015-03-26 07:35:49.746017] E [glusterd-volume-ops.c:321:__glusterd_handle_create_volume] 0-management: Volume BitRot1 already exists
[2015-03-26 07:35:57.377967] I [glusterd-handler.c:1321:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-03-26 07:38:53.196387] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:38:53.196419] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:39:10.912649] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:39:10.912682] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:08.276495] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:08.276534] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:57.076025] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:57.076056] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1

--- Additional comment from Rachana Patel on 2015-03-26 08:25:09 EDT ---

sosreport@http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1206134/

--- Additional comment from Anand Avati on 2015-03-30 08:42:17 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#2) for review on master by Krishnan Parthasarathi (kparthas)

--- Additional comment from Anand Avati on 2015-04-06 08:31:26 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)

--- Additional comment from Anand Avati on 2015-04-06 13:16:17 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#4) for review on master by Niels de Vos (ndevos)

--- Additional comment from Anand Avati on 2015-04-07 02:27:54 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)

--- Additional comment from Anand Avati on 2015-04-09 23:05:24 EDT ---

COMMIT: http://review.gluster.org/9613 committed in master by Vijay Bellur (vbellur)
------
commit d448fd187dde46bfb0d20354613912f6aa477904
Author: Krishnan Parthasarathi <kparthas>
Date: Mon Feb 9 17:10:49 2015 +0530

rpc: fix deadlock when unref is inside conn->lock

In the ping-timer implementation, the timer event takes a ref on the rpc object.
This ref needs to be removed after every timeout event. The ping-timer mechanism could be holding the last ref. For e.g, when a peer is detached and its rpc object was unref'd. In this case, the ping-timer mechanism would try to acquire conn->mutex to perform the 'last' unref while being inside the critical section already. This will result in a deadlock.

Change-Id: I74f80dd08c9348bd320a1c6d12fc8cd544fa4aea
BUG: 1206134
Signed-off-by: Krishnan Parthasarathi <kparthas>
Reviewed-on: http://review.gluster.org/9613
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Vijay Bellur <vbellur>

--- Additional comment from krishnan parthasarathi on 2015-04-10 02:50:54 EDT ---

The following link provides a test case written for the GlusterFS regression test framework. It wasn't merged in the repo since it is Linux-specific. This test can be used as a representative for recreating this issue.
http://review.gluster.com/#/c/9613/4/tests/bugs/rpc/bug-1206134.t

--- Additional comment from John Skeoch on 2015-04-19 20:24:48 EDT ---

User racpatel's account has been closed.

--- Additional comment from Niels de Vos on 2015-05-14 13:27:03 EDT ---

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.
glusterfs-3.7.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
REVIEW: http://review.gluster.org/11303 (rpc: fix possible deadlock left behind in d448fd1) posted (#1) for review on release-3.7 by Krishnan Parthasarathi (kparthas)
COMMIT: http://review.gluster.org/11303 committed in release-3.7 by Raghavendra G (rgowdapp)
------
commit c8f0d11918cc7d3577d624f2d757dc975c8bfad7
Author: Krishnan Parthasarathi <kparthas>
Date: Tue May 19 15:01:08 2015 +0530

rpc: fix possible deadlock left behind in d448fd1

See http://review.gluster.org/9613 for more details.

BUG: 1233019
Change-Id: I05ac0267b8c6f4e9b354acbbdf5469835455fb10
Signed-off-by: Krishnan Parthasarathi <kparthas>
Reviewed-on: http://review.gluster.org/10821
Reviewed-by: Raghavendra G <rgowdapp>
Tested-by: Raghavendra G <rgowdapp>
(cherry picked from commit e902df70f8157db4db503b7ec3c2635b08b3dcb2)
Reviewed-on: http://review.gluster.org/11303
Tested-by: Gluster Build System <jenkins.com>
This has already been fixed and shipped in 3.7.3. Closing the bug.