Bug 962619 - glusterd crashes on volume-stop
Summary: glusterd crashes on volume-stop
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
Assignee: krishnan parthasarathi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 962621
TreeView+ depends on / blocked
 
Reported: 2013-05-14 04:50 UTC by krishnan parthasarathi
Modified: 2015-11-03 23:05 UTC (History)
4 users (show)

Fixed In Version: glusterfs-3.6.0beta1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 962621 (view as bug list)
Environment:
Last Closed: 2014-11-11 08:23:48 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
Back trace of the crash (14.16 KB, text/plain)
2013-05-14 04:51 UTC, krishnan parthasarathi
no flags Details

Description krishnan parthasarathi 2013-05-14 04:50:09 UTC
Description of problem:
glusterd crashes on running volume-stop command. This crash was observed once while running regression tests, which is part of the codebase.

Version-Release number of selected component (if applicable):


How reproducible:
Inconsistent

Steps to Reproduce:
1. Run regression tests [1]
2.
3.
  
Actual results:
Glusterd crashes.

Expected results:
Glusterd shouldn't crash.


Additional info:
[1] - For further information on running regression tests, see https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/tests/README

Comment 1 krishnan parthasarathi 2013-05-14 04:51:24 UTC
Created attachment 747495 [details]
Back trace of the crash

Comment 2 Anand Avati 2013-05-14 04:53:22 UTC
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)

Comment 3 Anand Avati 2013-05-29 12:08:58 UTC
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#2) for review on master by Krishnan Parthasarathi (kparthas)

Comment 4 Anand Avati 2013-05-29 12:09:20 UTC
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)

Comment 5 Anand Avati 2013-05-30 06:32:26 UTC
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)

Comment 6 Anand Avati 2013-05-30 06:32:48 UTC
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#2) for review on master by Krishnan Parthasarathi (kparthas)

Comment 7 Anand Avati 2013-06-12 04:57:20 UTC
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)

Comment 8 Anand Avati 2013-06-14 07:04:02 UTC
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#4) for review on master by Krishnan Parthasarathi (kparthas)

Comment 9 Anand Avati 2013-06-14 12:38:51 UTC
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)

Comment 10 Anand Avati 2013-06-14 16:16:38 UTC
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#6) for review on master by Krishnan Parthasarathi (kparthas)

Comment 11 Anand Avati 2013-06-14 18:40:38 UTC
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#7) for review on master by Anand Avati (avati)

Comment 12 Anand Avati 2013-06-16 05:35:02 UTC
COMMIT: http://review.gluster.org/5107 committed in master by Anand Avati (avati) 
------
commit 74fe3057270fabb79f311414dd9c47c6245b52c7
Author: Krishnan Parthasarathi <kparthas>
Date:   Tue May 28 14:23:49 2013 +0530

    rpc: Cleanup rpc object in TRANSPORT_CLEANUP event
    
    rpc_transport object should be alive as long as the rpc_clnt object is
    alive. To ensure this, on rpc_clnt's last unref, we cleanup the
    corresponding rpc_transport object and complete the rpc_clnt cleanup
    later, in a bottom-up fashion.
    
    Introduced rpc_clnt_is_disabled, to allow higher layers to differentiate
    between the 'final'[1] disconnect triggered from upper layers, and a
    normal disconnect. This differentiation helps in cleaning up resources,
    at higher layers, in a race-free manner.
    
    [1] - 'final' here means that the rpc and the associated connection, is
    not going to be used anymore. eg - glusterd_brick_disconnect on
    volume-stop.
    
    Change-Id: I2ecf891a36e3b02cd9eacca964e659525d1bbc6e
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5107
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>

Comment 13 Anand Avati 2013-06-17 07:40:23 UTC
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#4) for review on master by Krishnan Parthasarathi (kparthas)

Comment 14 Anand Avati 2013-06-17 07:42:06 UTC
REVIEW: http://review.gluster.org/5213 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#1) for review on release-3.4 by Krishnan Parthasarathi (kparthas)

Comment 15 Anand Avati 2013-06-17 07:42:29 UTC
REVIEW: http://review.gluster.org/5214 (glusterd: Disable transport before cleaning up rpc object) posted (#1) for review on release-3.4 by Krishnan Parthasarathi (kparthas)

Comment 16 Anand Avati 2013-06-19 02:44:46 UTC
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)

Comment 17 Anand Avati 2013-06-19 04:53:43 UTC
COMMIT: http://review.gluster.org/5213 committed in release-3.4 by Vijay Bellur (vbellur) 
------
commit b3f480a8e451ff1b11761c4cfca6b798c35bfb04
Author: Krishnan Parthasarathi <kparthas>
Date:   Tue May 28 14:23:49 2013 +0530

    rpc: Cleanup rpc object in TRANSPORT_CLEANUP event
    
            Backport of http://review.gluster.org/5107 (upstream)
    
    This is to ensure that unref of rpc_clnt object doesn't race with the
    unref of the corresponding rpc_transport object.
    
    rpc_transport has ref_count 2, in normal scheme of things. One held by
    the socket layer and the other held by rpc layer. This inequality in
    ref_count between rpc_clnt and rpc_transport could lead to concurrent
    destruction of the objects and possibly lead to a crash.  To avoid this,
    we defer the clean up of rpc_clnt obj to TRANSPORT_CLEANUP event. ie,
    once rpc_transport's ref_count goes to zero.
    
    Introduced rpc_clnt_disabled, to allow higher layers to differentiate
    between the 'final'[1] disconnect, triggered from upper layers, and disconnect
    seen as a consequence of transport disconnect. This differentiation
    helps in cleaning up resources, at higher layers, in a race-free manner.
    
    [1] - 'final' here means that the rpc and the associated connection, is not
    to be used anymore. eg - glusterd_brick_disconnect on volume-stop.
    
    Change-Id: I2ecf891a36e3b02cd9eacca964e659525d1bbc6e
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5213
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 18 Anand Avati 2013-06-19 04:54:16 UTC
COMMIT: http://review.gluster.org/5214 committed in release-3.4 by Vijay Bellur (vbellur) 
------
commit 878bc03d7df8e18faca13fbf89a7ae55a29b0fdc
Author: Krishnan Parthasarathi <kparthas>
Date:   Tue May 14 09:59:45 2013 +0530

    glusterd: Disable transport before cleaning up rpc object
    
            Backport of http://review.gluster.org/5000
    
    Problem:
    rpc_transport object, which is part of rpc_clnt, is destroyed
    prematurely. This is because, rpc_transport object is ref'd by socket
    layer and rpc layer. These ref's, until the synctask'izing of
    operations, were unref'd sequentially in the epoll thread.
    With more threads at play, the sequential unref guarantee is off.
    
    Fix:
    Shutting down the transport before proceeding with cleaning up of
    rpc_clnt object would serialize the unref's on the rpc_transport object
    and thus eliminating the race.
    
    Also, we don't store the address of brickinfo in brick's rpc notify
    function, to avoid the possibility of referring a freed brickinfo.
    Instead we use a string based id to 'reach' the corresponding brickinfo.
    
    Change-Id: If2739e2eeaee1e8b071ab2b6754b7ea0f81cfceb
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5214
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 19 Anand Avati 2013-06-19 05:11:13 UTC
COMMIT: http://review.gluster.org/5000 committed in master by Vijay Bellur (vbellur) 
------
commit bb5ded9bee8cf7671bcb7c06e9ebca91f7bf8d67
Author: Krishnan Parthasarathi <kparthas>
Date:   Tue May 14 09:59:45 2013 +0530

    glusterd: Disable transport before cleaning up rpc object
    
    Problem:
    rpc_transport object, which is part of rpc_clnt, is destroyed
    prematurely. This is because, rpc_transport object is ref'd by socket
    layer and rpc layer. These ref's, until the synctask'izing of
    operations, were unref'd sequentially in the epoll thread.
    With more threads at play, the sequential unref guarantee is off.
    
    Fix:
    Shutting down the transport before proceeding with cleaning up of
    rpc_clnt object would serialize the unref's on the rpc_transport object
    and thus eliminating the race.
    
    Also, we don't store the address of brickinfo in brick's rpc notify
    function, to avoid the possibility of referring a freed brickinfo.
    Instead we use a string based id to 'reach' the corresponding brickinfo.
    
    Change-Id: If2739e2eeaee1e8b071ab2b6754b7ea0f81cfceb
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5000
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 20 Anand Avati 2013-07-11 09:52:16 UTC
REVIEW: http://review.gluster.org/5321 (glusterd: Give up biglock before brick's rpc unref) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)

Comment 21 Anand Avati 2013-07-11 23:10:01 UTC
COMMIT: http://review.gluster.org/5321 committed in master by Anand Avati (avati) 
------
commit 5bb136c4ca18cc4c058040ea6db312be13edb098
Author: Krishnan Parthasarathi <kparthas>
Date:   Thu Jul 11 14:28:41 2013 +0530

    glusterd: Give up biglock before brick's rpc unref
    
    This is to prevent the possibility of a deadlock when
    rpc_connection_cleanup being called in the same thread as rpc_clnt_unref
    
    Change-Id: Ia4dcc0a8a6e6158d4ddec68b780fccbc4cd64adb
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5321
    Reviewed-by: Amar Tumballi <amarts>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>

Comment 22 Anand Avati 2013-07-12 05:50:20 UTC
REVIEW: http://review.gluster.org/5326 (glusterd: Give up biglock before brick's rpc unref) posted (#1) for review on release-3.4 by Krishnan Parthasarathi (kparthas)

Comment 24 Anand Avati 2013-08-07 07:48:55 UTC
REVIEW: http://review.gluster.org/5512 (rpc: add destructor function for notify data) posted (#1) for review on master by Kaushal M (kaushal)

Comment 25 Anand Avati 2013-08-08 10:27:23 UTC
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#2) for review on master by Kaushal M (kaushal)

Comment 26 Anand Avati 2013-08-12 07:46:17 UTC
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Correctly clean rpc clnt on disconnect) posted (#3) for review on master by Kaushal M (kaushal)

Comment 27 Anand Avati 2013-08-12 13:37:07 UTC
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Correctly clean rpc clnt on disconnect) posted (#4) for review on master by Kaushal M (kaushal)

Comment 28 Anand Avati 2013-08-14 08:14:55 UTC
COMMIT: http://review.gluster.org/5326 committed in release-3.4 by Anand Avati (avati) 
------
commit c1c96e1b5836b7ed1c501cc176da563614e2081e
Author: Krishnan Parthasarathi <kparthas>
Date:   Thu Jul 11 14:28:41 2013 +0530

    glusterd: Give up biglock before brick's rpc unref
    
    This is to prevent the possibility of a deadlock when
    rpc_connection_cleanup being called in the same thread as rpc_clnt_unref
    
    Change-Id: Ia4dcc0a8a6e6158d4ddec68b780fccbc4cd64adb
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5326
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>

Comment 29 Anand Avati 2013-09-03 13:59:38 UTC
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#5) for review on master by Kaushal M (kaushal)

Comment 30 Anand Avati 2013-09-05 11:22:46 UTC
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#6) for review on master by Kaushal M (kaushal)

Comment 31 Anand Avati 2013-12-13 07:26:15 UTC
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#7) for review on master by Kaushal M (kaushal)

Comment 32 Anand Avati 2013-12-16 13:03:32 UTC
COMMIT: http://review.gluster.org/5512 committed in master by Vijay Bellur (vbellur) 
------
commit 40e13bc5b44d0b0cdaf7833c848d4a52352e0a13
Author: Kaushal M <kaushal>
Date:   Thu Aug 8 15:50:31 2013 +0530

    rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata
    
    rpc:
    - On a RPC_TRANSPORT_CLEANUP event, rpc_clnt_notify calls the registered
      notifyfn with a RPC_CLNT_DESTROY event. The notifyfn should properly
      cleanup the saved mydata on this event.
    - Break the reconnect chain when an rpc client is disabled. This will
      prevent new disconnect events which can lead to crashes.
    
    glusterd:
    - Added support for RPC_CLNT_DESTROY in glusterd_brick_rpc_notify
    - Use a common glusterd_rpc_clnt_unref() function throught glusterd in
      place of rpc_clnt_unref(). This function correctly gives up the
      big-lock before performing the unref.
    
    Change-Id: I93230441c5089039643fc9f5632477ef1b695348
    BUG: 962619
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/5512
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 33 Anand Avati 2013-12-23 08:59:55 UTC
REVIEW: http://review.gluster.org/6566 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#1) for review on release-3.5 by Krishnan Parthasarathi (kparthas)

Comment 34 Anand Avati 2013-12-23 14:58:25 UTC
COMMIT: http://review.gluster.org/6566 committed in release-3.5 by Vijay Bellur (vbellur) 
------
commit 8c4e79c446fdfea00c1589a625ba1f1a63fdecc5
Author: Krishnan Parthasarathi <kparthas>
Date:   Mon Dec 23 14:07:57 2013 +0530

    rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata
    
            Backport of http://review.gluster.org/5512
    
    rpc:
    - On a RPC_TRANSPORT_CLEANUP event, rpc_clnt_notify calls the registered
      notifyfn with a RPC_CLNT_DESTROY event. The notifyfn should properly
      cleanup the saved mydata on this event.
    - Break the reconnect chain when an rpc client is disabled. This will
      prevent new disconnect events which can lead to crashes.
    
    glusterd:
    - Added support for RPC_CLNT_DESTROY in glusterd_brick_rpc_notify
    - Use a common glusterd_rpc_clnt_unref() function throught glusterd in
      place of rpc_clnt_unref(). This function correctly gives up the
      big-lock before performing the unref.
    
    Change-Id: I93230441c5089039643fc9f5632477ef1b695348
    BUG: 962619
    Signed-off-by: Kaushal M <kaushal>
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/6566
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 35 krishnan parthasarathi 2014-01-22 19:15:46 UTC
The issue is still occurring intermittently in regression runs. 
tests/basic/mount.t fails occasionally with glusterd crashing after a volume-stop followed by a volume-delete is performed.

Comment 36 Anand Avati 2014-01-22 19:16:56 UTC
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)

Comment 37 Anand Avati 2014-01-23 04:24:57 UTC
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#4) for review on master by Krishnan Parthasarathi (kparthas)

Comment 38 Anand Avati 2014-01-23 06:55:13 UTC
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)

Comment 39 Anand Avati 2014-01-24 05:29:32 UTC
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#6) for review on master by Krishnan Parthasarathi (kparthas)

Comment 40 Anand Avati 2014-02-28 10:24:41 UTC
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#7) for review on master by Krishnan Parthasarathi (kparthas)

Comment 41 Anand Avati 2014-03-06 05:27:05 UTC
COMMIT: http://review.gluster.org/6751 committed in master by Anand Avati (avati) 
------
commit d6c1468b2779b6247e44b75276436021a3469a59
Author: Krishnan Parthasarathi <kparthas>
Date:   Tue Jan 21 23:41:07 2014 +0530

    rpc: transport may be destroyed while rpc isn't
    
    rpc_clnt object is destroyed after the corresponding transport object is
    destroyed. But rpc_clnt_reconnect, a timer driven function, refers to
    the transport object beyond its 'life'. Instead, using the embedded
    connection object prevents use after free problem wrt transport object.
    
    Also, access transport object under conn->lock.
    
    Change-Id: Iae28e8a657d02689963c510114ad7cb7e6764e62
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/6751
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>

Comment 42 Niels de Vos 2014-09-22 12:31:59 UTC
A beta release for GlusterFS 3.6.0 has been released. Please verify if the release solves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/

Comment 43 Niels de Vos 2014-11-11 08:23:48 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users


Note You need to log in before you can comment on or make changes to this bug.