1214220 – Crashes in logging code

Summary: Crashes in logging code

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	core
Sub Component:
Version:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Jeff Darcy
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1212660
Blocks:	glusterfs-3.7.0

Reported:	2015-04-22 09:30 UTC by Vijay Bellur
Modified:	2015-05-14 17:35 UTC
CC List:	6 users
Fixed In Version:	glusterfs-3.7.0beta1
Clone Of:	1212660
Environment:
Last Closed:	2015-05-14 17:27:23 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Vijay Bellur 2015-04-22 09:30:53 UTC

+++ This bug was initially created as a clone of Bug #1212660 +++

I looked at seven core dumps from five recently failed regression tests.  Here's a summary.

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7052/console
   generated by: tests/geo-rep/georep-rsync-hybrid.t
   crash details: in python (gsyncd)

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7038/console
   generated by: tests/basic/cdc.t
   crash details: in glusterfsd
      pthread_spin_lock
      __gf_free
      log_buf_destroy
      _gf_msg_internal
      _gf_msg "accepted client from %s (version: %s)"
      server_setvolume

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7035/console
   generated by: tests/basic/mgmt_v3-locks.t
   crash details: in glusterfs
      log_buf_destroy
      gf_log_flush_list
      gf_log_flush_extra_msgs
      gf_log_set_log_buf_size
      gf_log_disable_suppression_before_exit
      cleanup_and_exit
      glusterfs_process_volfp

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7030/console
   generated by: tests/basic/cdc.t
   crash details: in glusterfsd
      same as previous server_setvolume

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7029/console
   generated by: tests/basic/volume-snapshot-clone.t (three core files)
   crash details: in glusterfs
      all three same as previous glusterfs_process_volfp crash

That's six out of seven going through log_buf_destroy - different tests,
different daemons, different code paths, but all converging there.
Could it be a coincidence that this is the same logging infrastructure
we've recently started using more heavily?  That seems unlikely.  It's
entirely possible that log_buf_destroy is the victim (of heap
corruption) rather than a culprit, but chances are that the bug is
somewhere in related code.

--- Additional comment from Justin Clift on 2015-04-17 06:54:00 EDT ---

Cool, keep going.  Let's nail this sucker! :)

--- Additional comment from Jeff Darcy on 2015-04-21 11:26:43 EDT ---

This turns out to be a relative of both bug 1211749 and bug 1211473 - a memory object allocated in a translator has persisted past the lifetime of that translator.  The translator pointer in that memory object's header is therefore no longer valid, and when the memory tracking code tries to dereference through that pointer . . . BOOM.

In those other cases, the problem had to do with a temporary graph created for option validation.  In this case it has to do with the list we use to detect and coalesce duplicate log messages.  While the log_buf objects themselves are allocated from a pool, various elements are copied via gf_strdup, using THIS from the current context as the owning translator.  The solution is going to be rather similar to that for 1211749:

    http://review.gluster.org/#/c/10238/

It's hacky, but it gets us past having our daemons blow up effectively at random.
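
The failure mode described in this comment can be sketched with a small model. Everything in it is hypothetical and invented for illustration (`struct xl`, `struct mem_hdr`, `tracked_strdup`, `tracked_free`, `global_xl`); it is not the GlusterFS memory-accounting code. It assumes only what the comment states: each tracked allocation carries a header that points at its owning translator, and freeing the allocation goes back through that pointer. The model marks a translator as dead instead of actually freeing it, so the stale owner can be reported safely rather than reproduced as a crash.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a translator; not the real xlator_t. */
struct xl {
    const char *name;
    int alive;            /* set to 0 once the translator is torn down */
    size_t bytes_in_use;  /* per-translator accounting counter */
};

/* Header stored in front of every tracked allocation. */
struct mem_hdr {
    struct xl *owner;     /* translator charged for this allocation */
    size_t size;
};

/* Long-lived owner that is never torn down in this model. */
static struct xl global_xl = { "global", 1, 0 };

/* Duplicate a string and record `owner` in the header preceding it. */
static char *tracked_strdup(struct xl *owner, const char *s)
{
    size_t len = strlen(s) + 1;
    struct mem_hdr *h = malloc(sizeof(*h) + len);

    if (!h)
        return NULL;
    h->owner = owner;
    h->size = len;
    owner->bytes_in_use += len;
    memcpy(h + 1, s, len);
    return (char *)(h + 1);
}

/*
 * Free a tracked string. Returns -1 if the owner is already dead.
 * In this model the dead owner is only flagged; an accounting layer that
 * had really freed it would be dereferencing invalid memory at this point.
 */
static int tracked_free(char *p)
{
    struct mem_hdr *h = (struct mem_hdr *)p - 1;
    int ret = 0;

    if (!h->owner->alive)
        ret = -1;
    else
        h->owner->bytes_in_use -= h->size;
    free(h);
    return ret;
}

#ifdef DEMO_MAIN  /* build with -DDEMO_MAIN to run the sketch standalone */
int main(void)
{
    struct xl temp = { "temporary", 1, 0 };
    char *bad = tracked_strdup(&temp, "accepted client");
    char *good = tracked_strdup(&global_xl, "accepted client");

    if (!bad || !good)
        return 1;
    temp.alive = 0;  /* translator goes away; both strings live on */

    printf("owned by temp:   %s\n", tracked_free(bad) ? "stale owner" : "ok");
    printf("owned by global: %s\n", tracked_free(good) ? "stale owner" : "ok");
    return 0;
}
#endif
```

In the sketch, the string charged to the short-lived translator is the one that trips the stale-owner check, while the string charged to the long-lived owner is freed cleanly. That is the reasoning behind moving these allocations to a global owner.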

--- Additional comment from Anand Avati on 2015-04-21 11:50:53 EDT ---

REVIEW: http://review.gluster.org/10319 (core: avoid crashes in gf_msg dup-detection code) posted (#1) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Justin Clift on 2015-04-21 12:00:15 EDT ---

Awesome. :)

--- Additional comment from Anand Avati on 2015-04-22 02:15:43 EDT ---

COMMIT: http://review.gluster.org/10319 committed in master by Vijay Bellur (vbellur) 
------
commit 765849ee00f6661c9059122ff2346b03b224745f
Author: Jeff Darcy <jdarcy>
Date:   Tue Apr 21 11:48:15 2015 -0400

    core: avoid crashes in gf_msg dup-detection code
    
    Use global_xlator for allocations so that we don't try to free objects
    belonging to an already-deleted translator (which will crash).
    
    Change-Id: Ie72a546e7770cf5cb8a8370e22448c8d09e3ab37
    BUG: 1212660
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: http://review.gluster.org/10319
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: NetBSD Build System
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Atin Mukherjee <amukherj>
    Reviewed-by: Vijay Bellur <vbellur>
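
One way to read "use global_xlator for allocations" is as a save, switch, restore pattern around the code that copies strings into the long-lived log buffer. The sketch below is an assumption about the shape of such a change, not the committed patch: `struct xl`, `current_xl` (a stand-in for THIS), `global_xl`, `tracked_dup`, and `log_entry_init` are hypothetical names, and a plain static replaces whatever per-thread storage the real code uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical translator stand-in; not the real xlator_t. */
struct xl { const char *name; };

static struct xl global_xl = { "global" };   /* lives for the whole process */
static struct xl *current_xl = &global_xl;   /* stand-in for THIS */

/* A copied string plus the owner the accounting code would charge. */
struct tracked_str {
    struct xl *owner;
    char *text;
};

/* Allocation helper that charges whatever current_xl points at. */
static int tracked_dup(struct tracked_str *out, const char *s)
{
    size_t len = strlen(s) + 1;

    out->text = malloc(len);
    if (!out->text)
        return -1;
    memcpy(out->text, s, len);
    out->owner = current_xl;
    return 0;
}

/*
 * Copy a message into a long-lived log entry. The current context is
 * switched to the global owner only for the duration of the allocation
 * and restored afterwards, whether or not the allocation succeeded.
 */
static int log_entry_init(struct tracked_str *entry, const char *msg)
{
    struct xl *saved = current_xl;
    int ret;

    current_xl = &global_xl;
    ret = tracked_dup(entry, msg);
    current_xl = saved;
    return ret;
}
```

The caller's context is untouched once `log_entry_init` returns, so only the objects that may outlive the calling translator are charged to the global owner.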

Comment 1 Anand Avati 2015-04-22 09:46:13 UTC

REVIEW: http://review.gluster.org/10330 (core: avoid crashes in gf_msg dup-detection code) posted (#1) for review on release-3.7 by Vijay Bellur (vbellur)

Comment 2 Anand Avati 2015-04-22 13:18:12 UTC

COMMIT: http://review.gluster.org/10330 committed in release-3.7 by Vijay Bellur (vbellur) 
------
commit 24422a6f1599597b3a378fa2ff392aa40f5a33f5
Author: Jeff Darcy <jdarcy>
Date:   Tue Apr 21 11:48:15 2015 -0400

    core: avoid crashes in gf_msg dup-detection code
    
    Use global_xlator for allocations so that we don't try to free objects
    belonging to an already-deleted translator (which will crash).
    
    Change-Id: Ie72a546e7770cf5cb8a8370e22448c8d09e3ab37
    BUG: 1214220
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: http://review.gluster.org/10319
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Atin Mukherjee <amukherj>
    Reviewed-by: Vijay Bellur <vbellur>
    Reviewed-on: http://review.gluster.org/10330
    Tested-by: NetBSD Build System

Comment 3 Niels de Vos 2015-05-14 17:27:23 UTC

This bug is being closed because a release that should address the reported issue is now available. If the problem persists with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
