Bug 1441476 - Glusterd crashes when restarted with many volumes
Summary: Glusterd crashes when restarted with many volumes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jeff Darcy
QA Contact:
URL:
Whiteboard:
Depends On: 1432542
Blocks: 1438052 glusterfs-3.10.2
 
Reported: 2017-04-12 04:01 UTC by Jeff Darcy
Modified: 2017-05-31 20:45 UTC
CC List: 4 users

Fixed In Version: glusterfs-3.10.2
Clone Of: 1432542
Environment:
Last Closed: 2017-05-31 20:45:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Jeff Darcy 2017-04-12 04:01:37 UTC
+++ This bug was initially created as a clone of Bug #1432542 +++

This was actually found in a test for bug 1430860, which was about *glusterfsd* crashing under the same load.  That test never reproduced the original bug, but it could reliably reproduce this one.  Examples include:

  https://build.gluster.org/job/centos6-regression/3607/consoleFull
  two crashes in attach_brick/send_attach_req  

  https://build.gluster.org/job/centos6-regression/3608/consoleFull
  first crash in attach_brick/send_attach_req
  second crash in __synclock_unlock/list_del_init

  https://build.gluster.org/job/centos6-regression/3609/consoleFull
  one crash in __synclock_unlock/list_del_init

Investigation showed that the problem was with the lock manipulation done around retries in attach_brick. It turns out that isn't safe after all, especially when it allows a new operation to overlap with one already in progress, as this test does. A fix, which moves the retries into a separate thread and tracks connection states more carefully, is forthcoming.
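
For reference, the retry pattern at issue looks roughly like the toy model below. This is a sketch only, not the actual attach_brick code: big_lock is modeled as a plain mutex rather than a synclock, the retry bound is arbitrary, and toy_brick and the send_attach_req declaration are stand-ins for the real structures and RPC call.

    /* Simplified model of the unsafe retry loop: the big lock is dropped
     * between attach attempts, which opens a window in which a concurrent
     * operation (e.g. a volume delete) can free or relink the very brick
     * entry this loop still holds a pointer to. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <unistd.h>

    struct toy_brick { struct toy_brick *next; /* ... */ };

    enum { MAX_ATTACH_TRIES = 15 };                    /* arbitrary bound for the sketch */

    extern pthread_mutex_t big_lock;                   /* held by the caller   */
    extern bool send_attach_req(struct toy_brick *b);  /* stand-in for the RPC */

    bool toy_attach_brick(struct toy_brick *brick)
    {
        for (int tries = 0; tries < MAX_ATTACH_TRIES; tries++) {
            if (send_attach_req(brick))
                return true;

            pthread_mutex_unlock(&big_lock);   /* <-- window opens here      */
            sleep(1);                          /* another op can now run and */
            pthread_mutex_lock(&big_lock);     /* invalidate *brick          */
        }
        return false;
    }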

--- Additional comment from Worker Ant on 2017-03-15 11:43:13 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#1) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 13:08:31 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#2) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 13:52:18 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#3) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 14:35:48 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#4) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 15:06:05 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#5) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 15:29:01 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#6) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 15:53:35 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#7) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 16:21:23 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#8) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 16:43:02 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#9) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 17:12:46 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#10) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Jeff Darcy on 2017-03-15 18:49:06 EDT ---

Down the rabbit hole we go.

Moving the retries to a separate thread made the attach_brick/send_attach_req crashes go away, but the __synclock_unlock/list_del_init crashes were still there.  It's hard to tell from a small sample, but if anything they seemed to be worse.  After looking at several cores and then running some experiments, I now strongly suspect that there's a bug in our synclock_lock/synclock_unlock code.  I'm sure that will come as a shock to just about nobody.

One of the hallmarks of these crashes is that a synclock's waitq, specifically the one for glusterd's big_lock, has one task on it, but that task has clearly been freed. It has 0xdeadcode all over it, indicating that not only has it been freed but parts of it have since been reallocated. Somehow the problem seems to be related to the fact that __synclock_unlock will wake up both a waiting thread and a waiting synctask if both exist. If I change it to wake up only one, giving preference to synctasks, then the crashes go away but I hit some sort of deadlock (hard to diagnose precisely when none of this has ever been reproducible anywhere other than on our regression machines). If I change it to wake up only one, preferring the actual thread, then things seem much better. The locking and related bookkeeping in the synclock_lock/synclock_unlock code *look* reasonable, but there must be either some gap or some piece of code that goes around it.

For now, the threads-get-priority change as implemented in patchset 10 seems to be avoiding the problem, but if we don't finish tracking it down then there's a high probability we'll just hit it again in some other context some day.
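
For concreteness, the "wake only one, preferring the actual thread" policy amounts to something like the toy unlock path below. This is illustrative only, not gluster's synclock code; in particular the waiting_threads counter and the resume callback are invented for the sketch.

    /* Toy model of the unlock policy described above: wake at most one
     * waiter, and prefer a plain thread over a queued synctask-like task. */
    #include <pthread.h>
    #include <stddef.h>

    struct toy_task {
        struct toy_task *next;                 /* waitq linkage          */
        void (*resume)(struct toy_task *);     /* how to wake this task  */
    };

    struct toy_synclock {
        pthread_mutex_t  guard;
        pthread_cond_t   cond;                 /* plain threads wait here     */
        int              held;
        int              waiting_threads;      /* threads blocked in lock()   */
        struct toy_task *waitq;                /* queued task waiters         */
    };

    void toy_synclock_unlock(struct toy_synclock *l)
    {
        struct toy_task *task = NULL;

        pthread_mutex_lock(&l->guard);
        l->held = 0;
        if (l->waiting_threads > 0) {
            pthread_cond_signal(&l->cond);     /* wake exactly one thread      */
        } else if (l->waitq != NULL) {
            task = l->waitq;                   /* otherwise, first queued task */
            l->waitq = task->next;
            task->next = NULL;
        }
        pthread_mutex_unlock(&l->guard);

        if (task)
            task->resume(task);                /* e.g. a synctask_wake()-style call */
    }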

--- Additional comment from Worker Ant on 2017-03-15 20:39:53 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#11) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-15 22:50:02 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#12) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Jeff Darcy on 2017-03-15 23:08:14 EDT ---

OK, the latest theory is that someone calls synctask_set to set the task pointer in thread-local storage (TLS), then the task exits, but the task pointer in TLS is still set to the now-freed task structure. Subsequently, someone else calls into synclock_lock and mistakenly believes we're in that now-deleted task, because that's what synctask_get told us. So we put the bogus task on the waitq, and everything goes downhill from there. I'm not *exactly* sure how we get to task exit before the next call to synctask_set, but there are lots of paths through this code, and the fact that it jumps between tasks and threads makes it even harder to follow. What I do know is that better hygiene around calling synctask_set, including synctask_set(NULL), whenever we call swapcontext seems to have helped.
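
The hygiene in question boils down to something like the toy model below (not the syncop.c code; all names here are invented): keep the per-thread "current task" pointer accurate across every swapcontext, and clear it as soon as the scheduler regains control, so nothing can later read back a pointer to a task that has already exited and been freed.

    #include <pthread.h>
    #include <stdlib.h>
    #include <ucontext.h>

    struct toy_task {
        ucontext_t ctx;     /* the task's own context         */
        ucontext_t sched;   /* scheduler context to return to */
        int        done;    /* set by the task when it exits  */
    };

    static pthread_key_t  task_key;
    static pthread_once_t task_key_once = PTHREAD_ONCE_INIT;

    static void task_key_init(void) { pthread_key_create(&task_key, NULL); }

    void toy_task_set(struct toy_task *t)      /* models synctask_set() */
    {
        pthread_once(&task_key_once, task_key_init);
        pthread_setspecific(task_key, t);
    }

    struct toy_task *toy_task_get(void)        /* models synctask_get() */
    {
        pthread_once(&task_key_once, task_key_init);
        return pthread_getspecific(task_key);
    }

    /* Scheduler thread switches into a task and back again. */
    void toy_switchto(struct toy_task *t)
    {
        toy_task_set(t);                  /* entering the task: record it in TLS */
        swapcontext(&t->sched, &t->ctx);  /* run the task until it yields/exits  */

        /* Back on the scheduler thread.  The task may be finished and about
         * to be freed, so clear the TLS pointer *before* anything else can
         * call toy_task_get() and trust a dangling pointer. */
        toy_task_set(NULL);

        if (t->done)
            free(t);    /* safe: no TLS reference to it remains */
    }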

--- Additional comment from Worker Ant on 2017-03-16 08:56:50 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#13) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-16 11:14:53 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (#14) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Jeff Darcy on 2017-03-18 23:58:47 EDT ---

Further investigation shows that all symptoms, including cases where I was able to observe multiple threads/synctasks executing concurrently even though they were all supposed to be under glusterd's big_lock, could be explained by memory corruption.  What's happening is that the lock/unlock behavior in attach_brick allows other operations to overlap.  In particular, if we're calling attach_brick from glusterd_restart_bricks (as we do when starting up), an overlapping volume delete will modify the very list we're in the middle of walking, so we'll either be working on a pointer to a freed brickinfo_t or we'll follow its list pointers off into space.  Either way, *all sorts* of crazy and horrible things can happen.

Moving the attach-RPC retries into a separate thread/task really doesn't help, and introduces its own problems.  What has worked fairly well is forcing volume deletes to wait if glusterd_restart_bricks is still in progress, but the approach I'm currently using is neither as complete nor as airtight as it really needs to be.  I'm still trying to think of a better solution, but at least now it's clear what the problem is.
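
The shape of that wait is roughly the toy model below. This is illustrative only: the real change has to work with glusterd's big_lock (a synclock) and its conf structure rather than a plain mutex/condvar, and the names here are invented.

    #include <pthread.h>
    #include <stdbool.h>

    struct toy_conf {
        pthread_mutex_t big_lock;
        pthread_cond_t  restart_done;
        bool            restarting_bricks;   /* true while the restart pass runs */
    };

    void toy_restart_bricks(struct toy_conf *conf)
    {
        pthread_mutex_lock(&conf->big_lock);
        conf->restarting_bricks = true;

        /* ... walk every volume and brick, attaching/starting them; the lock
         *     may be dropped between attach retries, but the flag stays set
         *     for the whole pass ... */

        conf->restarting_bricks = false;
        pthread_cond_broadcast(&conf->restart_done);
        pthread_mutex_unlock(&conf->big_lock);
    }

    void toy_delete_volume(struct toy_conf *conf /*, which volume */)
    {
        pthread_mutex_lock(&conf->big_lock);

        /* Hold off: never modify the volume/brick tree while the restart
         * pass may still be walking it. */
        while (conf->restarting_bricks)
            pthread_cond_wait(&conf->restart_done, &conf->big_lock);

        /* ... now it is safe to unlink and free the volinfo/brickinfo ... */

        pthread_mutex_unlock(&conf->big_lock);
    }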

BTW, debugging and testing are being severely hampered by constant disconnections between glusterd and glusterfsd.  This might be related to the sharp increase in client/server disconnect/reconnect cycles that others have reported dating back approximately to RHGS 3.1.2; we really need to figure out why our RPC got flakier around that time.

--- Additional comment from Worker Ant on 2017-03-20 12:45:19 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes while still restarting bricks) posted (#1) for review on master by Jeff Darcy (jdarcy)

--- Additional comment from Worker Ant on 2017-03-20 15:06:14 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes while still restarting bricks) posted (#2) for review on master by Jeff Darcy (jeff.us)

--- Additional comment from Worker Ant on 2017-03-20 17:19:35 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes while still restarting bricks) posted (#3) for review on master by Jeff Darcy (jeff.us)

--- Additional comment from Worker Ant on 2017-03-23 08:51:28 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes while still restarting bricks) posted (#4) for review on master by Jeff Darcy (jeff.us)

--- Additional comment from Worker Ant on 2017-03-23 14:30:41 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes while still restarting bricks) posted (#5) for review on master by Jeff Darcy (jeff.us)

--- Additional comment from Worker Ant on 2017-03-30 23:33:35 EDT ---

COMMIT: https://review.gluster.org/16927 committed in master by Atin Mukherjee (amukherj) 
------
commit a7ce0548b7969050644891cd90c0bf134fa1594c
Author: Jeff Darcy <jdarcy>
Date:   Mon Mar 20 12:32:33 2017 -0400

    glusterd: hold off volume deletes while still restarting bricks
    
    We need to do this because modifying the volume/brick tree while
    glusterd_restart_bricks is still walking it can lead to segfaults.
    Without waiting we could accidentally "slip in" while attach_brick has
    released big_lock between retries and make such a modification.
    
    Change-Id: I30ccc4efa8d286aae847250f5d4fb28956a74b03
    BUG: 1432542
    Signed-off-by: Jeff Darcy <jeff.us>
    Reviewed-on: https://review.gluster.org/16927
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

Comment 1 Worker Ant 2017-04-12 04:15:55 UTC
REVIEW: https://review.gluster.org/17044 (glusterd: hold off volume deletes while still restarting bricks) posted (#2) for review on release-3.10 by Jeff Darcy (jeff.us)

Comment 2 Worker Ant 2017-04-13 16:10:11 UTC
REVIEW: https://review.gluster.org/17044 (glusterd: hold off volume deletes while still restarting bricks) posted (#3) for review on release-3.10 by Jeff Darcy (jeff.us)

Comment 3 Worker Ant 2017-04-24 15:49:03 UTC
COMMIT: https://review.gluster.org/17044 committed in release-3.10 by Raghavendra Talur (rtalur) 
------
commit e6045103b7e010779549bb486c00a07b3c3eb0fc
Author: Jeff Darcy <jdarcy>
Date:   Mon Mar 20 12:32:33 2017 -0400

    glusterd: hold off volume deletes while still restarting bricks
    
    We need to do this because modifying the volume/brick tree while
    glusterd_restart_bricks is still walking it can lead to segfaults.
    Without waiting we could accidentally "slip in" while attach_brick has
    released big_lock between retries and make such a modification.
    
    Backport of:
    > Commit a7ce0548b7969050644891cd90c0bf134fa1594c
    > BUG: 1432542
    > Reviewed-on: https://review.gluster.org/16927
    
    Change-Id: I30ccc4efa8d286aae847250f5d4fb28956a74b03
    BUG: 1441476
    Signed-off-by: Jeff Darcy <jeff.us>
    Reviewed-on: https://review.gluster.org/17044
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

Comment 4 Raghavendra Talur 2017-05-31 20:45:17 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.2, please open a new bug report.

