Description of problem:
glusterd crashes on running the volume-stop command. This crash was observed once while running the regression tests, which are part of the codebase.

Version-Release number of selected component (if applicable):

How reproducible:
Inconsistent

Steps to Reproduce:
1. Run regression tests [1]

Actual results:
glusterd crashes.

Expected results:
glusterd shouldn't crash.

Additional info:
[1] - For further information on running regression tests, see https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/tests/README
Created attachment 747495
Back trace of the crash
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#2) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#2) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#4) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#6) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5107 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#7) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/5107 committed in master by Anand Avati (avati)
------
commit 74fe3057270fabb79f311414dd9c47c6245b52c7
Author: Krishnan Parthasarathi <kparthas>
Date: Tue May 28 14:23:49 2013 +0530

    rpc: Cleanup rpc object in TRANSPORT_CLEANUP event

    rpc_transport object should be alive as long as the rpc_clnt object is alive. To ensure this, on rpc_clnt's last unref, we cleanup the corresponding rpc_transport object and complete the rpc_clnt cleanup later, in a bottom-up fashion.

    Introduced rpc_clnt_is_disabled, to allow higher layers to differentiate between the 'final'[1] disconnect triggered from upper layers, and a normal disconnect. This differentiation helps in cleaning up resources, at higher layers, in a race-free manner.

    [1] - 'final' here means that the rpc and the associated connection, is not going to be used anymore. eg - glusterd_brick_disconnect on volume-stop.

    Change-Id: I2ecf891a36e3b02cd9eacca964e659525d1bbc6e
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5107
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#4) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5213 (rpc: Cleanup rpc object in TRANSPORT_CLEANUP event) posted (#1) for review on release-3.4 by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5214 (glusterd: Disable transport before cleaning up rpc object) posted (#1) for review on release-3.4 by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)
COMMIT: http://review.gluster.org/5213 committed in release-3.4 by Vijay Bellur (vbellur)
------
commit b3f480a8e451ff1b11761c4cfca6b798c35bfb04
Author: Krishnan Parthasarathi <kparthas>
Date: Tue May 28 14:23:49 2013 +0530

    rpc: Cleanup rpc object in TRANSPORT_CLEANUP event

    Backport of http://review.gluster.org/5107 (upstream)

    This is to ensure that unref of rpc_clnt object doesn't race with the unref of the corresponding rpc_transport object. rpc_transport has ref_count 2, in normal scheme of things. One held by the socket layer and the other held by rpc layer. This inequality in ref_count between rpc_clnt and rpc_transport could lead to concurrent destruction of the objects and possibly lead to a crash. To avoid this, we defer the clean up of rpc_clnt obj to TRANSPORT_CLEANUP event. ie, once rpc_transport's ref_count goes to zero.

    Introduced rpc_clnt_disabled, to allow higher layers to differentiate between the 'final'[1] disconnect, triggered from upper layers, and disconnect seen as a consequence of transport disconnect. This differentiation helps in cleaning up resources, at higher layers, in a race-free manner.

    [1] - 'final' here means that the rpc and the associated connection, is not to be used anymore. eg - glusterd_brick_disconnect on volume-stop.

    Change-Id: I2ecf891a36e3b02cd9eacca964e659525d1bbc6e
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5213
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
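The deferred cleanup described in this commit message can be pictured with a small single-threaded model. This is only a sketch under stated assumptions, not the GlusterFS code: `struct transport`, `struct clnt`, and every function below are hypothetical stand-ins, and plain `int` counters replace the locking or atomics a real multi-threaded implementation would need. The point it illustrates is that the client's last unref only drops the rpc layer's reference on the transport, and the client itself is freed from the transport's cleanup path.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for rpc_clnt and rpc_transport. */
struct clnt;

struct transport {
    int refcount;        /* one ref from the socket layer, one from the rpc layer */
    struct clnt *owner;  /* client whose cleanup is completed on transport cleanup */
};

struct clnt {
    int refcount;
    struct transport *trans;
};

/* Counters that make the destruction order observable. */
int clnt_destroyed;
int transport_destroyed;
int clnt_destroyed_after_transport;

/* Stand-in for handling the transport cleanup event: finish client cleanup. */
void clnt_on_transport_cleanup(struct clnt *c)
{
    clnt_destroyed_after_transport = (transport_destroyed == 1);
    clnt_destroyed++;
    free(c);
}

void transport_unref(struct transport *t)
{
    assert(t->refcount > 0);
    if (--t->refcount > 0)
        return;
    struct clnt *owner = t->owner;
    transport_destroyed++;
    free(t);
    if (owner != NULL)
        clnt_on_transport_cleanup(owner);  /* bottom-up: transport first, then client */
}

void clnt_unref(struct clnt *c)
{
    assert(c->refcount > 0);
    if (--c->refcount > 0)
        return;
    /* Last unref: drop the rpc layer's transport ref, but do not free the client here. */
    transport_unref(c->trans);
}

struct clnt *clnt_new(void)
{
    struct clnt *c = malloc(sizeof(*c));
    struct transport *t = malloc(sizeof(*t));
    if (c == NULL || t == NULL) {
        free(c);
        free(t);
        return NULL;
    }
    t->refcount = 2;  /* socket layer + rpc layer */
    t->owner = c;
    c->refcount = 1;
    c->trans = t;
    return c;
}
```

In this model, calling `clnt_unref()` on the last client reference leaves both objects alive until the socket layer's `transport_unref()` arrives, so the two destructions can no longer run concurrently.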
COMMIT: http://review.gluster.org/5214 committed in release-3.4 by Vijay Bellur (vbellur)
------
commit 878bc03d7df8e18faca13fbf89a7ae55a29b0fdc
Author: Krishnan Parthasarathi <kparthas>
Date: Tue May 14 09:59:45 2013 +0530

    glusterd: Disable transport before cleaning up rpc object

    Backport of http://review.gluster.org/5000

    Problem:
    rpc_transport object, which is part of rpc_clnt, is destroyed prematurely. This is because, rpc_transport object is ref'd by socket layer and rpc layer. These ref's, until the synctask'izing of operations, were unref'd sequentially in the epoll thread. With more threads at play, the sequential unref guarantee is off.

    Fix:
    Shutting down the transport before proceeding with cleaning up of rpc_clnt object would serialize the unref's on the rpc_transport object and thus eliminating the race.

    Also, we don't store the address of brickinfo in brick's rpc notify function, to avoid the possibility of referring a freed brickinfo. Instead we use a string based id to 'reach' the corresponding brickinfo.

    Change-Id: If2739e2eeaee1e8b071ab2b6754b7ea0f81cfceb
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5214
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
COMMIT: http://review.gluster.org/5000 committed in master by Vijay Bellur (vbellur)
------
commit bb5ded9bee8cf7671bcb7c06e9ebca91f7bf8d67
Author: Krishnan Parthasarathi <kparthas>
Date: Tue May 14 09:59:45 2013 +0530

    glusterd: Disable transport before cleaning up rpc object

    Problem:
    rpc_transport object, which is part of rpc_clnt, is destroyed prematurely. This is because, rpc_transport object is ref'd by socket layer and rpc layer. These ref's, until the synctask'izing of operations, were unref'd sequentially in the epoll thread. With more threads at play, the sequential unref guarantee is off.

    Fix:
    Shutting down the transport before proceeding with cleaning up of rpc_clnt object would serialize the unref's on the rpc_transport object and thus eliminating the race.

    Also, we don't store the address of brickinfo in brick's rpc notify function, to avoid the possibility of referring a freed brickinfo. Instead we use a string based id to 'reach' the corresponding brickinfo.

    Change-Id: If2739e2eeaee1e8b071ab2b6754b7ea0f81cfceb
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5000
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
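The string-based id mentioned in the fix can be sketched as follows. This is an illustration under assumptions, not the glusterd implementation: the fixed-size `bricks` table, `brick_add`, `brick_find`, `brick_remove`, `notify_data_new`, and `brick_notify_disconnect` are all hypothetical names. The idea shown is that the notify callback owns a copy of the id and looks the brick up on every event, so a brick that has already been removed yields "not found" instead of a dangling pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BRICKS 8

/* Hypothetical stand-in for glusterd's brickinfo. */
struct brickinfo {
    char id[64];
    int connected;
    int in_use;
};

struct brickinfo bricks[MAX_BRICKS];

struct brickinfo *brick_find(const char *id)
{
    for (int i = 0; i < MAX_BRICKS; i++)
        if (bricks[i].in_use && strcmp(bricks[i].id, id) == 0)
            return &bricks[i];
    return NULL;
}

struct brickinfo *brick_add(const char *id)
{
    for (int i = 0; i < MAX_BRICKS; i++) {
        if (!bricks[i].in_use) {
            snprintf(bricks[i].id, sizeof(bricks[i].id), "%s", id);
            bricks[i].connected = 0;
            bricks[i].in_use = 1;
            return &bricks[i];
        }
    }
    return NULL;  /* table full */
}

void brick_remove(const char *id)
{
    struct brickinfo *b = brick_find(id);
    if (b != NULL)
        b->in_use = 0;
}

/* The notify data is a private copy of the id, never the brickinfo address. */
char *notify_data_new(const char *id)
{
    size_t len = strlen(id) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, id, len);
    return copy;
}

/* Returns 0 if the brick was found and updated, -1 if it no longer exists. */
int brick_notify_disconnect(const char *brickid)
{
    assert(brickid != NULL);
    struct brickinfo *b = brick_find(brickid);
    if (b == NULL)
        return -1;  /* brick already gone: nothing to touch */
    b->connected = 0;
    return 0;
}
```

The lookup costs a search per event, which is the price paid for never dereferencing a stored brickinfo pointer after the brick has been freed.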
REVIEW: http://review.gluster.org/5321 (glusterd: Give up biglock before brick's rpc unref) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)
COMMIT: http://review.gluster.org/5321 committed in master by Anand Avati (avati)
------
commit 5bb136c4ca18cc4c058040ea6db312be13edb098
Author: Krishnan Parthasarathi <kparthas>
Date: Thu Jul 11 14:28:41 2013 +0530

    glusterd: Give up biglock before brick's rpc unref

    This is to prevent the possibility of a deadlock when rpc_connection_cleanup is called in the same thread as rpc_clnt_unref

    Change-Id: Ia4dcc0a8a6e6158d4ddec68b780fccbc4cd64adb
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5321
    Reviewed-by: Amar Tumballi <amarts>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
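A rough picture of why the lock is given up first, as a sketch only. It assumes that the cleanup triggered by the last unref ends up taking the same big lock; the commit message does not spell out the exact path, so treat that as an assumption. A plain `pthread_mutex_t` stands in for glusterd's big lock, and `connection_cleanup`, `clnt_unref`, and `clnt_unref_dropping_big_lock` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for glusterd's big lock (assumption: modelled as a non-recursive mutex). */
pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
int cleanup_ran;

struct clnt {
    int refcount;
};

/* Assumed cleanup path: it takes the big lock itself. */
void connection_cleanup(void)
{
    pthread_mutex_lock(&big_lock);
    cleanup_ran++;
    pthread_mutex_unlock(&big_lock);
}

void clnt_unref(struct clnt *c)
{
    assert(c->refcount > 0);
    if (--c->refcount == 0)
        connection_cleanup();  /* runs in the caller's thread */
}

/*
 * Caller must hold big_lock. The lock is released around the unref so that
 * connection_cleanup() can take it without self-deadlocking, then re-acquired.
 */
void clnt_unref_dropping_big_lock(struct clnt *c)
{
    pthread_mutex_unlock(&big_lock);
    clnt_unref(c);
    pthread_mutex_lock(&big_lock);
}
```

Calling `clnt_unref()` directly while holding `big_lock` would block forever in this model, because the same thread would try to lock a mutex it already owns.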
REVIEW: http://review.gluster.org/5326 (glusterd: Give up biglock before brick's rpc unref) posted (#1) for review on release-3.4 by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/5512 (rpc: add destructor function for notify data) posted (#1) for review on master by Kaushal M (kaushal)
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#2) for review on master by Kaushal M (kaushal)
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Correctly clean rpc clnt on disconnect) posted (#3) for review on master by Kaushal M (kaushal)
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Correctly clean rpc clnt on disconnect) posted (#4) for review on master by Kaushal M (kaushal)
COMMIT: http://review.gluster.org/5326 committed in release-3.4 by Anand Avati (avati)
------
commit c1c96e1b5836b7ed1c501cc176da563614e2081e
Author: Krishnan Parthasarathi <kparthas>
Date: Thu Jul 11 14:28:41 2013 +0530

    glusterd: Give up biglock before brick's rpc unref

    This is to prevent the possibility of a deadlock when rpc_connection_cleanup is called in the same thread as rpc_clnt_unref

    Change-Id: Ia4dcc0a8a6e6158d4ddec68b780fccbc4cd64adb
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/5326
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#5) for review on master by Kaushal M (kaushal)
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#6) for review on master by Kaushal M (kaushal)
REVIEW: http://review.gluster.org/5512 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#7) for review on master by Kaushal M (kaushal)
COMMIT: http://review.gluster.org/5512 committed in master by Vijay Bellur (vbellur)
------
commit 40e13bc5b44d0b0cdaf7833c848d4a52352e0a13
Author: Kaushal M <kaushal>
Date: Thu Aug 8 15:50:31 2013 +0530

    rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata

    rpc:
    - On a RPC_TRANSPORT_CLEANUP event, rpc_clnt_notify calls the registered notifyfn with a RPC_CLNT_DESTROY event. The notifyfn should properly cleanup the saved mydata on this event.
    - Break the reconnect chain when an rpc client is disabled. This will prevent new disconnect events which can lead to crashes.

    glusterd:
    - Added support for RPC_CLNT_DESTROY in glusterd_brick_rpc_notify
    - Use a common glusterd_rpc_clnt_unref() function throughout glusterd in place of rpc_clnt_unref(). This function correctly gives up the big-lock before performing the unref.

    Change-Id: I93230441c5089039643fc9f5632477ef1b695348
    BUG: 962619
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/5512
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Reviewed-by: Vijay Bellur <vbellur>
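The destroy notification described in this commit can be sketched as a callback that receives one final event and frees its own data. This is a simplified model with hypothetical names (`enum clnt_event`, `clnt_notify_fn`, `struct brick_ctx`, `brick_notify`, `clnt_on_transport_cleanup`); it does not reproduce the rpc_clnt API, only the ownership rule that the notify function is the one place where mydata is released.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical event set; CLNT_DESTROY plays the role of RPC_CLNT_DESTROY. */
enum clnt_event { CLNT_CONNECT, CLNT_DISCONNECT, CLNT_DESTROY };

typedef int (*clnt_notify_fn)(void *mydata, enum clnt_event event);

struct clnt {
    clnt_notify_fn notify;
    void *mydata;  /* owned by the registrant, released on CLNT_DESTROY */
};

/* Hypothetical per-brick context saved as mydata. */
struct brick_ctx {
    int connected;
};

int ctx_freed;

int brick_notify(void *mydata, enum clnt_event event)
{
    struct brick_ctx *ctx = mydata;
    assert(ctx != NULL);
    switch (event) {
    case CLNT_CONNECT:
        ctx->connected = 1;
        break;
    case CLNT_DISCONNECT:
        ctx->connected = 0;
        break;
    case CLNT_DESTROY:
        free(ctx);  /* last event for this client: release mydata exactly once */
        ctx_freed++;
        break;
    }
    return 0;
}

struct clnt *clnt_new(clnt_notify_fn fn, void *mydata)
{
    struct clnt *c = malloc(sizeof(*c));
    if (c == NULL)
        return NULL;
    c->notify = fn;
    c->mydata = mydata;
    return c;
}

/* Stand-in for the client's reaction to the transport cleanup event. */
void clnt_on_transport_cleanup(struct clnt *c)
{
    if (c->notify != NULL)
        c->notify(c->mydata, CLNT_DESTROY);
    c->mydata = NULL;
    free(c);
}
```

Because no further events are delivered after the destroy event in this model, the callback never sees freed data.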
REVIEW: http://review.gluster.org/6566 (rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata) posted (#1) for review on release-3.5 by Krishnan Parthasarathi (kparthas)
COMMIT: http://review.gluster.org/6566 committed in release-3.5 by Vijay Bellur (vbellur)
------
commit 8c4e79c446fdfea00c1589a625ba1f1a63fdecc5
Author: Krishnan Parthasarathi <kparthas>
Date: Mon Dec 23 14:07:57 2013 +0530

    rpc,glusterd: Use rpc_clnt notifyfn to cleanup mydata

    Backport of http://review.gluster.org/5512

    rpc:
    - On a RPC_TRANSPORT_CLEANUP event, rpc_clnt_notify calls the registered notifyfn with a RPC_CLNT_DESTROY event. The notifyfn should properly cleanup the saved mydata on this event.
    - Break the reconnect chain when an rpc client is disabled. This will prevent new disconnect events which can lead to crashes.

    glusterd:
    - Added support for RPC_CLNT_DESTROY in glusterd_brick_rpc_notify
    - Use a common glusterd_rpc_clnt_unref() function throughout glusterd in place of rpc_clnt_unref(). This function correctly gives up the big-lock before performing the unref.

    Change-Id: I93230441c5089039643fc9f5632477ef1b695348
    BUG: 962619
    Signed-off-by: Kaushal M <kaushal>
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/6566
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
The issue is still occurring intermittently in regression runs. tests/basic/mount.t fails occasionally with glusterd crashing after a volume-stop followed by a volume-delete is performed.
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#3) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#4) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#5) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#6) for review on master by Krishnan Parthasarathi (kparthas)
REVIEW: http://review.gluster.org/6751 (rpc: transport may be destroyed while rpc isn't) posted (#7) for review on master by Krishnan Parthasarathi (kparthas)
COMMIT: http://review.gluster.org/6751 committed in master by Anand Avati (avati)
------
commit d6c1468b2779b6247e44b75276436021a3469a59
Author: Krishnan Parthasarathi <kparthas>
Date: Tue Jan 21 23:41:07 2014 +0530

    rpc: transport may be destroyed while rpc isn't

    rpc_clnt object is destroyed after the corresponding transport object is destroyed. But rpc_clnt_reconnect, a timer driven function, refers to the transport object beyond its 'life'. Instead, using the embedded connection object prevents use after free problem wrt transport object. Also, access transport object under conn->lock.

    Change-Id: Iae28e8a657d02689963c510114ad7cb7e6764e62
    BUG: 962619
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/6751
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
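One way to picture the reconnect fix, as a sketch under assumptions rather than the actual patch: the timer callback is handed the connection object embedded in the client, which outlives the transport, and it reads the transport pointer only while holding the connection lock. The structures and the functions `reconnect_timer_cb` and `connection_detach_transport` are hypothetical, and clearing the pointer to NULL on transport teardown is an assumption made for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for rpc_transport, the embedded connection, and rpc_clnt. */
struct transport {
    int connect_attempts;
};

struct connection {
    pthread_mutex_t lock;
    struct transport *trans;  /* assumed to be set to NULL once the transport is gone */
};

struct clnt {
    struct connection conn;   /* embedded: lives exactly as long as the client */
};

/*
 * Timer callback. It receives the connection, not the transport, so the
 * pointer it was armed with stays valid after the transport is destroyed.
 * Returns 1 if a reconnect was attempted, 0 if there was no transport left.
 */
int reconnect_timer_cb(void *data)
{
    struct connection *conn = data;
    int attempted = 0;

    assert(conn != NULL);
    pthread_mutex_lock(&conn->lock);
    if (conn->trans != NULL) {
        conn->trans->connect_attempts++;  /* stand-in for a real connect call */
        attempted = 1;
    }
    pthread_mutex_unlock(&conn->lock);
    return attempted;
}

/* Called on transport teardown, before the transport memory is released. */
void connection_detach_transport(struct connection *conn)
{
    pthread_mutex_lock(&conn->lock);
    conn->trans = NULL;
    pthread_mutex_unlock(&conn->lock);
}
```

A timer that fires after `connection_detach_transport()` finds no transport and returns without touching freed memory.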
A beta release for GlusterFS 3.6.0 has been released [1]. Please verify if the release solves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users