+++ This bug was initially created as a clone of Bug #1212660 +++

I looked at seven core dumps from five recently failed regression tests. Here's a summary.

http://build.gluster.org/job/rackspace-regression-2GB-triggered/7052/console
generated by: tests/geo-rep/georep-rsync-hybrid.t
crash details: in python (gsyncd)

http://build.gluster.org/job/rackspace-regression-2GB-triggered/7038/console
generated by: tests/basic/cdc.t
crash details: in glusterfsd
    pthread_spin_lock
    __gf_free
    log_buf_destroy
    _gf_msg_internal
    _gf_msg "accepted client from %s (version: %s)"
    server_setvolume

http://build.gluster.org/job/rackspace-regression-2GB-triggered/7035/console
generated by: tests/basic/mgmt_v3-locks.t
crash details: in glusterfs
    log_buf_destroy
    gf_log_flush_list
    gf_log_flush_extra_msgs
    gf_log_set_log_buf_size
    gf_log_disable_suppression_before_exit
    cleanup_and_exit
    glusterfs_process_volfp

http://build.gluster.org/job/rackspace-regression-2GB-triggered/7030/console
generated by: tests/basic/cdc.t
crash details: in glusterfsd
    same as previous server_setvolume

http://build.gluster.org/job/rackspace-regression-2GB-triggered/7029/console
generated by: tests/basic/volume-snapshot-clone.t (three core files)
crash details: in glusterfs
    all three same as previous glusterfs_process_volfp crash

That's six out of seven going through log_buf_destroy - different tests, different daemons, different code paths, but all converging there. Could it be a coincidence that this is the same logging infrastructure we've recently started using more heavily? That seems unlikely. It's entirely possible that log_buf_destroy is the victim (of heap corruption) rather than a culprit, but chances are that the bug is somewhere in related code.

--- Additional comment from Justin Clift on 2015-04-17 06:54:00 EDT ---

Cool, keep going. Let's nail this sucker! :)

--- Additional comment from Jeff Darcy on 2015-04-21 11:26:43 EDT ---

This turns out to be a relative of both bug 1211749 and bug 1211473 - a memory object allocated in a translator has persisted past the lifetime of that translator. The translator pointer in that memory object's header is therefore no longer valid, and when the memory tracking code tries to dereference through that pointer . . . BOOM.

In those other cases, the problem had to do with a temporary graph created for option validation. In this case it has to do with the list we use to detect and coalesce duplicate log messages. While the log_buf objects themselves are allocated from a pool, various elements are copied via gf_strdup, using THIS from the current context as the owning translator. The solution is going to be rather similar to that for 1211749:

http://review.gluster.org/#/c/10238/

It's hacky, but it gets us past having our daemons blow up effectively at random.

--- Additional comment from Anand Avati on 2015-04-21 11:50:53 EDT ---

REVIEW: http://review.gluster.org/10319 (core: avoid crashes in gf_msg dup-detection code) posted (#1) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Justin Clift on 2015-04-21 12:00:15 EDT ---

Awesome. :)

--- Additional comment from Anand Avati on 2015-04-22 02:15:43 EDT ---

COMMIT: http://review.gluster.org/10319 committed in master by Vijay Bellur (vbellur)
------
commit 765849ee00f6661c9059122ff2346b03b224745f
Author: Jeff Darcy <jdarcy>
Date:   Tue Apr 21 11:48:15 2015 -0400

    core: avoid crashes in gf_msg dup-detection code

    Use global_xlator for allocations so that we don't try to free objects
    belonging to an already-deleted translator (which will crash).

    Change-Id: Ie72a546e7770cf5cb8a8370e22448c8d09e3ab37
    BUG: 1212660
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: http://review.gluster.org/10319
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: NetBSD Build System
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Atin Mukherjee <amukherj>
    Reviewed-by: Vijay Bellur <vbellur>
REVIEW: http://review.gluster.org/10330 (core: avoid crashes in gf_msg dup-detection code) posted (#1) for review on release-3.7 by Vijay Bellur (vbellur)
COMMIT: http://review.gluster.org/10330 committed in release-3.7 by Vijay Bellur (vbellur)
------
commit 24422a6f1599597b3a378fa2ff392aa40f5a33f5
Author: Jeff Darcy <jdarcy>
Date:   Tue Apr 21 11:48:15 2015 -0400

    core: avoid crashes in gf_msg dup-detection code

    Use global_xlator for allocations so that we don't try to free objects
    belonging to an already-deleted translator (which will crash).

    Change-Id: Ie72a546e7770cf5cb8a8370e22448c8d09e3ab37
    BUG: 1214220
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: http://review.gluster.org/10319
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Atin Mukherjee <amukherj>
    Reviewed-by: Vijay Bellur <vbellur>
    Reviewed-on: http://review.gluster.org/10330
    Tested-by: NetBSD Build System
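
For readers following the fix, here is a minimal, self-contained sketch of the failure mode described in this bug and of the pattern the commit message names (charging long-lived allocations to a global owner). It is a toy model only, not GlusterFS code: toy_xlator, toy_header, toy_global, toy_strdup, toy_free and toy_demo are hypothetical names invented for illustration, standing in loosely for a translator, the allocation header, global_xlator, gf_strdup and __gf_free. The real structures and accounting differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a translator that tracks its own allocations. */
struct toy_xlator {
    const char *name;
    size_t      live_bytes;   /* bytes currently charged to this owner */
};

/* Hypothetical header stored in front of every tracked allocation. */
struct toy_header {
    struct toy_xlator *owner; /* dereferenced again when the block is freed */
    size_t             size;
};

/* Hypothetical owner that lives as long as the process does. */
static struct toy_xlator toy_global = { "global", 0 };

/* Copy a string and charge its size to the given owner. */
static char *toy_strdup(struct toy_xlator *owner, const char *src)
{
    size_t len = strlen(src) + 1;
    struct toy_header *hdr = malloc(sizeof(*hdr) + len);

    if (hdr == NULL)
        return NULL;
    hdr->owner = owner;
    hdr->size = len;
    owner->live_bytes += len;
    memcpy(hdr + 1, src, len);
    return (char *)(hdr + 1);
}

/* Free a tracked string. This touches the owner recorded in the header,
 * so it is only valid while that owner still exists. */
static void toy_free(char *ptr)
{
    struct toy_header *hdr;

    if (ptr == NULL)
        return;
    hdr = (struct toy_header *)ptr - 1;
    hdr->owner->live_bytes -= hdr->size;
    free(hdr);
}

/* Returns the bytes still charged to toy_global (0 when balanced). */
static size_t toy_demo(void)
{
    struct toy_xlator *temp = calloc(1, sizeof(*temp));
    char *msg;

    if (temp == NULL)
        return (size_t)-1;
    temp->name = "temporary";

    /* Charge the long-lived string to toy_global, not to temp. Passing
     * temp here instead would make toy_free() below read freed memory. */
    msg = toy_strdup(&toy_global, "accepted client");
    free(temp);      /* the temporary owner goes away first */
    toy_free(msg);   /* still safe: the header points at toy_global */
    return toy_global.live_bytes;
}

#ifdef TOY_DEMO_MAIN
int main(void)
{
    assert(toy_demo() == 0);
    puts("freed after owner teardown without touching freed memory");
    return 0;
}
#endif
```

Building with -DTOY_DEMO_MAIN gives a standalone program. The trade-off matches the "hacky" caveat in the comments above: objects charged to the global owner are no longer attributed to the translator that created them, but they can safely outlive it.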
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user