Bug 1233019 - glusterd: after a volume create command times out, a deadlock is observed in glusterd and all subsequent commands fail with the error "Another transaction is in progress"
Summary: glusterd: after a volume create command times out, a deadlock is observed ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.7.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard:
Depends On: 1206134
Blocks:
 
Reported: 2015-06-18 05:13 UTC by krishnan parthasarathi
Modified: 2016-01-11 04:58 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1206134
Environment:
Last Closed: 2016-01-11 04:58:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description krishnan parthasarathi 2015-06-18 05:13:12 UTC
+++ This bug was initially created as a clone of Bug #1206134 +++

Description of problem:
=======================
The gluster volume create command fails with a time-out error, after which all gluster commands fail with the error "Another transaction is in progress" because of a deadlock.

Version-Release number of selected component (if applicable):
=============================================================
3.7dev-0.803.gitf64666f.el6.x86_64

How reproducible:
================
intermittent


Steps to Reproduce:
1. Install 3.7dev-0.803.gitf64666f.el6.x86_64 on the cluster.

2. Create a volume using the command below, which returned a time-out error:
[root@rhs-client38 ~]# gluster v create BitRot1 replica 3 rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1 rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1 rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
Error : Request timed out

3. After a while (10-15 min), checking the status showed that all commands were failing as below:

[root@rhs-client38 ~]# gluster v create BitRot1 replica 3 rhs-client44:/pavanbrick6/br1 rhs-client38://pavanbrick6/br1 rhs-client37:/pavanbrick6/br1 rhs-client44:/pavanbrick7/br1 rhs-client38://pavanbrick7/br1 rhs-client37:/pavanbrick7/br1
volume create: BitRot1: failed: Volume BitRot1 already exists

[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log 
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.
[root@rhs-client38 ~]# gluster volume bitrot BitRot1 enable
Bitrot command failed : Another transaction is in progress for BitRot1. Please try again after sometime.


Actual results:
===============
Due to the deadlock, commands fail with "Another transaction is in progress".



Additional info:
===============
(gdb) bt
#0  0x0000003291a0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003291a09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003291a093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96281958bb in rpc_clnt_disable (rpc=0x7f9618001860) at rpc-clnt.c:1712
#4  0x00007f962819587e in rpc_clnt_trigger_destroy (rpc=<value optimized out>) at rpc-clnt.c:1634
#5  rpc_clnt_unref (rpc=<value optimized out>) at rpc-clnt.c:1670
#6  0x00007f962819a765 in rpc_clnt_start_ping (rpc_ptr=0x7f9618001860) at rpc-clnt-ping.c:265
#7  0x00007f96283e1d30 in gf_timer_proc (ctx=0x2080010) at timer.c:183
#8  0x0000003291a079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x00000032912e88fd in clone () from /lib64/libc.so.6
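
The backtrace above shows the timer thread blocked in pthread_mutex_lock() inside rpc_clnt_disable(), reached via rpc_clnt_unref() from rpc_clnt_start_ping() on the same thread; that is, the thread is trying to re-acquire a mutex it already holds. The following is a minimal, self-contained C sketch of that pattern only; the names (conn_t, timer_callback, destroy) are illustrative stand-ins, not the actual rpc-clnt code:

/* Self-deadlock sketch: a default (non-recursive) pthread mutex
 * hangs on Linux when the owning thread locks it a second time. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;   /* plays the role of conn->lock   */
    int             refcnt; /* plays the role of the rpc refs */
} conn_t;

/* Analogue of the teardown path (rpc_clnt_disable): it takes the
 * connection lock internally. */
static void destroy(conn_t *conn)
{
    pthread_mutex_lock(&conn->lock);   /* <- blocks forever: lock already held */
    /* ... teardown ... */
    pthread_mutex_unlock(&conn->lock);
}

/* Analogue of the ping-timer callback: it drops its ref while still
 * inside the critical section. */
static void timer_callback(conn_t *conn)
{
    pthread_mutex_lock(&conn->lock);
    if (--conn->refcnt == 0)
        destroy(conn);   /* last unref inside the lock: deadlock */
    pthread_mutex_unlock(&conn->lock);
}

int main(void)
{
    conn_t conn = { PTHREAD_MUTEX_INITIALIZER, 1 };
    timer_callback(&conn);   /* never returns */
    printf("unreachable\n");
    return 0;
}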


Log snippet:
[2015-03-26 07:35:49.746017] E [glusterd-volume-ops.c:321:__glusterd_handle_create_volume] 0-management: Volume BitRot1 already exists
[2015-03-26 07:35:57.377967] I [glusterd-handler.c:1321:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-03-26 07:38:53.196387] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:38:53.196419] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:39:10.912649] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:39:10.912682] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:08.276495] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:08.276534] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1
[2015-03-26 07:40:57.076025] W [glusterd-locks.c:550:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f96283c4540] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1ca)[0x7f961e158f3a] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x4ff)[0x7f961e1549df] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f961e154d1b] (--> /usr/lib64/glusterfs/3.7dev/xlator/mgmt/glusterd.so(__glusterd_handle_bitrot+0x2c2)[0x7f961e12c652] ))))) 0-management: Lock for BitRot1 held by d25fd6c1-bc55-4ba8-befb-3f0f7623a504
[2015-03-26 07:40:57.076056] E [glusterd-syncop.c:1694:gd_sync_task_begin] 0-management: Unable to acquire lock for BitRot1

--- Additional comment from Rachana Patel on 2015-03-26 08:25:09 EDT ---

sosreport@http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1206134/

--- Additional comment from Anand Avati on 2015-03-30 08:42:17 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#2) for review on master by Krishnan Parthasarathi (kparthas)

--- Additional comment from Anand Avati on 2015-04-06 08:31:26 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)

--- Additional comment from Anand Avati on 2015-04-06 13:16:17 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#4) for review on master by Niels de Vos (ndevos)

--- Additional comment from Anand Avati on 2015-04-07 02:27:54 EDT ---

REVIEW: http://review.gluster.org/9613 (rpc: fix deadlock when unref is inside conn->lock) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)

--- Additional comment from Anand Avati on 2015-04-09 23:05:24 EDT ---

COMMIT: http://review.gluster.org/9613 committed in master by Vijay Bellur (vbellur) 
------
commit d448fd187dde46bfb0d20354613912f6aa477904
Author: Krishnan Parthasarathi <kparthas>
Date:   Mon Feb 9 17:10:49 2015 +0530

    rpc: fix deadlock when unref is inside conn->lock
    
    In ping-timer implementation, the timer event takes a ref on the rpc
    object. This ref needs to be removed after every timeout event.
    ping-timer mechanism could be holding the last ref. For e.g, when a peer
    is detached and its rpc object was unref'd. In this case, ping-timer
    mechanism would try to acquire conn->mutex to perform the 'last' unref
    while being inside the critical section already. This will result in a
    deadlock.
    
    Change-Id: I74f80dd08c9348bd320a1c6d12fc8cd544fa4aea
    BUG: 1206134
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/9613
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
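
A hedged sketch of the fix pattern the commit message describes, reusing the conn_t/destroy stand-ins from the sketch above (illustrative only, not the actual review.gluster.org/9613 change): decide inside the critical section whether this is the last reference, but perform the destroying step only after releasing the lock.

/* Defer the destroying unref until after conn->lock is released. */
static void timer_callback_fixed(conn_t *conn)
{
    int last_ref = 0;

    pthread_mutex_lock(&conn->lock);
    if (--conn->refcnt == 0)
        last_ref = 1;          /* remember; do not destroy here */
    pthread_mutex_unlock(&conn->lock);

    if (last_ref)
        destroy(conn);         /* safe: the lock is no longer held */
}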

--- Additional comment from krishnan parthasarathi on 2015-04-10 02:50:54 EDT ---

The following link provides a test case written for the GlusterFS regression test framework. It wasn't merged into the repository because it is Linux-specific. The test can be used to reproduce this issue.

http://review.gluster.com/#/c/9613/4/tests/bugs/rpc/bug-1206134.t

--- Additional comment from John Skeoch on 2015-04-19 20:24:48 EDT ---

User racpatel's account has been closed

--- Additional comment from Niels de Vos on 2015-05-14 13:27:03 EDT ---

This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

Comment 1 Anand Avati 2015-06-18 05:14:22 UTC
REVIEW: http://review.gluster.org/11303 (rpc: fix possible deadlock left behind in d448fd1) posted (#1) for review on release-3.7 by Krishnan Parthasarathi (kparthas)

Comment 2 Vijay Bellur 2015-11-03 05:53:43 UTC
COMMIT: http://review.gluster.org/11303 committed in release-3.7 by Raghavendra G (rgowdapp) 
------
commit c8f0d11918cc7d3577d624f2d757dc975c8bfad7
Author: Krishnan Parthasarathi <kparthas>
Date:   Tue May 19 15:01:08 2015 +0530

    rpc: fix possible deadlock left behind in d448fd1
    
    See http://review.gluster.org/9613 for more details.
    
    BUG: 1233019
    Change-Id: I05ac0267b8c6f4e9b354acbbdf5469835455fb10
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/10821
    Reviewed-by: Raghavendra G <rgowdapp>
    Tested-by: Raghavendra G <rgowdapp>
    (cherry picked from commit e902df70f8157db4db503b7ec3c2635b08b3dcb2)
    Reviewed-on: http://review.gluster.org/11303
    Tested-by: Gluster Build System <jenkins.com>

Comment 5 Atin Mukherjee 2016-01-11 04:58:43 UTC
This has already been fixed and shipped in 3.7.3. Closing the bug.

