Bug 1206134 - glusterd: after the volume create command times out, a deadlock is observed in glusterd and all subsequent commands fail with the error "Another transaction is in progress"
Summary: glusterd :- after volume create command time out, deadlock has been observed ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: krishnan parthasarathi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: qe_tracker_everglades 1233019
 
Reported: 2015-03-26 12:11 UTC by Rachana Patel
Modified: 2015-06-18 05:13 UTC (History)
6 users

Fixed In Version: glusterfs-3.7.0beta1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1233019 (view as bug list)
Environment:
Last Closed: 2015-05-14 17:27:03 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Rachana Patel 2015-03-26 12:11:29 UTC
Description of problem:
=======================
The gluster volume create command fails with a time-out error, and afterwards all gluster commands fail with the error "Another transaction is in progress" because of a deadlock in glusterd.

Version-Release number of selected component (if applicable):
=============================================================
3.7dev-0.803.gitf64666f.el6.x86_64

How reproducible:
================
intermittent


Steps to Reproduce:
1. Install 3.7dev-0.803.gitf64666f.el6.x86_64 on the cluster.

2. Create a volume with the command below, which fails with a time-out error:
[root@rhs-client38 ~]# gluster v create BitRot1 replica 3 rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1 rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1 rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
Error : Request timed out

3. After a while (10-15 minutes), checking status shows that all commands fail as below:

[root@rhs-client38 ~]# gluster v create BitRot1 replica 3 rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1 rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1 rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
volume create: BitRot1: failed: Volume BitRot1 already exists

[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log 
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.


Actual results:
===============
Because of the deadlock, all commands fail with "Another transaction is in progress".



Additional info:
===============
(gdb) bt
#0  0x0000003291a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003291a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003291a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96281958bb in rpc_clnt_disable (rpc=0x7f9618001860) at rpc-clnt.c:1712
#4  0x00007f962819587e in rpc_clnt_trigger_destroy (rpc=<value optimized out>) at rpc-clnt.c:1634
#5  rpc_clnt_unref (rpc=<value optimized out>) at rpc-clnt.c:1670
#6  0x00007f962819a765 in rpc_clnt_start_ping (rpc_ptr=0x7f9618001860) at rpc-clnt-ping.c:265
#7  0x00007f96283e1d30 in gf_timer_proc (ctx=0x2080010) at timer.c:183
#8  0x0000003291a079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x00000032912e88fd in clone () from /lib64/libc.so.6


log snippet:-
[2015-03-26 07:35:49.746017] E [glusterd-volume-ops.c:321:__glusterd_handle_create_volume] 0-management: Volume BitRot1 already exists
[2015-03-26 07:35:57.377967] I [glusterd-handler.c:1321:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-03-26 07:38:53.196387] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:38:53.196419] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:39:10.912649] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:39:10.912682] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:08.276495] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:08.276534] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:57.076025] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:57.076056] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1

Comment 2 Anand Avati 2015-03-30 12:42:17 UTC
REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#2) for review on master by Krishnan Parthasarathi (kparthas@redhat.com)

Comment 3 Anand Avati 2015-04-06 12:31:26 UTC
REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#3) for review on master by Krishnan Parthasarathi (kparthas@redhat.com)

Comment 4 Anand Avati 2015-04-06 17:16:17 UTC
REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#4) for review on master by Niels de Vos (ndevos@redhat.com)

Comment 5 Anand Avati 2015-04-07 06:27:54 UTC
REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#5) for review on master by Krishnan Parthasarathi (kparthas@redhat.com)

Comment 6 Anand Avati 2015-04-10 03:05:24 UTC
COMMIT: http://review.gluster.org/9613 committed in master by Vijay Bellur (vbellur@redhat.com) 
------
commit d448fd187dde46bfb0d20354613912f6aa477904
Author: Krishnan Parthasarathi <kparthas@redhat.com>
Date:   Mon Feb 9 17:10:49 2015 +0530

    rpc: fix deadlock when unref is inside conn->lock
    
    In ping-timer implementation, the timer event takes a ref on the rpc
    object. This ref needs to be removed after every timeout event.
    ping-timer mechanism could be holding the last ref. For e.g, when a peer
    is detached and its rpc object was unref'd. In this case, ping-timer
    mechanism would try to acquire conn->mutex to perform the 'last' unref
    while being inside the critical section already. This will result in a
    deadlock.
    
    Change-Id: I74f80dd08c9348bd320a1c6d12fc8cd544fa4aea
    BUG: 1206134
    Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
    Reviewed-on: http://review.gluster.org/9613
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>

Comment 7 krishnan parthasarathi 2015-04-10 06:50:54 UTC
The following link is a test case written for the GlusterFS regression test framework. It was not merged into the repository because it is Linux-specific, but it can be used to reproduce this issue.

http://review.gluster.com/#/c/9613/4/tests/bugs/rpc/bug-1206134.t

Comment 9 Niels de Vos 2015-05-14 17:27:03 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


