Bug 1432542
| Field | Value |
|---|---|
| Summary | Glusterd crashes when restarted with many volumes |
| Product | [Community] GlusterFS |
| Component | glusterd |
| Reporter | Jeff Darcy <jdarcy> |
| Assignee | Jeff Darcy <jeff> |
| Status | CLOSED CURRENTRELEASE |
| Severity | unspecified |
| Priority | unspecified |
| Version | mainline |
| CC | amukherj, bugs |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | glusterfs-3.11.0 |
| Last Closed | 2017-05-30 18:47:35 UTC |
| Type | Bug |
| Bug Blocks | 1438052, 1441476 |
Description
Jeff Darcy
2017-03-15 15:39:03 UTC
REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (patchsets #1 through #10) for review on master by Jeff Darcy (jdarcy)

Down the rabbit hole we go. Moving the retries to a separate thread made the attach_brick/send_attach_req crashes go away, but the __synclock_unlock/list_del_init crashes were still there. It's hard to tell from a small sample, but if anything they seemed to be worse.
After looking at several cores and then running some experiments, I now strongly suspect that there's a bug in our synclock_lock/synclock_unlock code. I'm sure that will come as a shock to just about nobody. One of the hallmarks of these crashes is that a synclock's waitq - specifically the one for glusterd's big_lock - has one task on it, but that task has clearly been freed: 0xdeadcode all over it, indicating that not only has it been freed but parts of it have since been reallocated.

Somehow the problem seems to be related to the fact that __synclock_unlock will wake up both a waiting thread and a waiting synctask if both exist. If I change it to wake up only one, giving preference to synctasks, then the crashes go away but I hit some sort of deadlock (hard to diagnose precisely when none of this has ever been reproducible other than on our regression machines). If I change it to wake up only one, preferring the actual thread, then things seem much better. The locking etc. in the synclock_lock/synclock_unlock code *looks* reasonable, but there must be either some gap or some piece of code that's going around it. For now, the threads-get-priority change as implemented in patchset 10 seems to be avoiding the problem, but if we don't finish tracking it down then there's a high probability we'll just hit it again in some other context some day.

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (patchsets #11 and #12) for review on master by Jeff Darcy (jdarcy)

OK, latest theory is that someone is calling synctask_set to set the task pointer in thread-local storage (TLS); then the task exits, but the task pointer in TLS is still set to the now-freed task structure.
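To make the suspected failure mode concrete, here is a stripped-down sketch of a thread-local "current task" pointer that outlives its task. All names here (task_set, task_get, finish_task_*) are invented for illustration; this is not the actual syncop code, just the shape of the hazard and of the clear-the-TLS-pointer hygiene that helped:

```c
/* Illustrative sketch only -- not the glusterfs syncop implementation. */
#include <assert.h>
#include <stdlib.h>

/* One "current task" pointer per thread, like synctask_get()/synctask_set(). */
static __thread void *current_task;

static void task_set(void *t) { current_task = t; }
static void *task_get(void)   { return current_task; }

/* Dangerous pattern: free the task but leave TLS pointing at it.
 * A later caller that trusts task_get() will enqueue a freed task. */
void *finish_task_leaky(void *task)
{
    free(task);
    return task_get();  /* dangling pointer value */
}

/* Hygienic pattern: clear the TLS pointer before the task goes away,
 * mirroring the synctask_set(NULL)-around-swapcontext fix described above. */
void *finish_task_clean(void *task)
{
    task_set(NULL);
    free(task);
    return task_get();  /* NULL: later code can tell there is no task */
}
```

With the clean variant, a subsequent lock attempt on the same thread sees NULL instead of a stale pointer and can fall back to the plain-thread path.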
Subsequently, someone else calls into synclock_lock and mistakenly believes that we're in that now-deleted task, because that's what synctask_get told us. So we put the bogus task on the waitq, and everything goes downhill from there. I'm not *exactly* sure how we get to task exit before the next call to synctask_set, but there are lots of paths through this code, and the fact that it's jumping between tasks and threads makes it even harder to follow. What I do know is that better hygiene around calling synctask_set - including synctask_set(NULL) - whenever we call swapcontext seems to have helped.

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (patchsets #13 and #14) for review on master by Jeff Darcy (jdarcy)

Further investigation shows that all symptoms - including cases where I was able to observe multiple threads/synctasks executing concurrently even though they were both supposed to be under glusterd's big_lock - could be explained by memory corruption. What's happening is that the lock/unlock behavior in attach_brick is allowing other operations to overlap. In particular, if we're calling attach_brick from glusterd_restart_bricks (as we do when starting up), an overlapping volume delete will modify the very list we're in the middle of walking, so we'll either be working on a pointer to a freed brickinfo_t or we'll follow its list pointers off into space. Either way, *all sorts* of crazy and horrible things can happen. Moving the attach-RPC retries into a separate thread/task really doesn't help, and introduces its own problems. What has worked fairly well is forcing volume deletes to wait if glusterd_restart_bricks is still in progress, but the approach I'm currently using is neither as complete nor as airtight as it really needs to be.
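The list-corruption scenario can be modeled with a toy brick list. The names here (brick_t, restart_walk, volume_delete) are made up for this sketch and are not glusterd's real structures; the point is only where the hazard sits:

```c
/* Toy model of the hazard, not glusterd code. */
#include <assert.h>
#include <stdlib.h>

typedef struct brick {
    struct brick *next;
    int id;
} brick_t;

/* The restart path: walks the brick list, conceptually dropping the
 * big lock between elements (as attach_brick does between RPC retries). */
int restart_walk(const brick_t *head)
{
    int visited = 0;
    for (const brick_t *b = head; b != NULL; b = b->next) {
        /* If a volume delete ran at this point and freed 'b', the
         * 'b->next' read above would touch freed memory, and the walk
         * would follow a wild pointer -- the crash described above. */
        visited++;
    }
    return visited;
}

/* The delete path: frees every brick.  Safe only if no walker is
 * active, which is exactly what the eventual fix enforces. */
void volume_delete(brick_t *head)
{
    while (head != NULL) {
        brick_t *next = head->next;
        free(head);
        head = next;
    }
}
```

Run sequentially the two functions are fine; interleaved, the walker ends up holding a pointer into freed memory, which matches the 0xdeadcode-filled task observations.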
I'm still trying to think of a better solution, but at least now it's clear what the problem is.

BTW, debugging and testing are being severely hampered by constant disconnections between glusterd and glusterfsd. This might be related to the sharp increase in client/server disconnect/reconnect cycles that others have reported dating back approximately to RHGS 3.1.2, and we really need to figure out why our RPC got flakier around that time.

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes while still restarting bricks) posted (patchset #1 by Jeff Darcy (jdarcy); patchsets #2 through #5 by Jeff Darcy (jeff.us)) for review on master

COMMIT: https://review.gluster.org/16927 committed in master by Atin Mukherjee (amukherj)

commit a7ce0548b7969050644891cd90c0bf134fa1594c
Author: Jeff Darcy <jdarcy>
Date: Mon Mar 20 12:32:33 2017 -0400

glusterd: hold off volume deletes while still restarting bricks

We need to do this because modifying the volume/brick tree while glusterd_restart_bricks is still walking it can lead to segfaults. Without waiting we could accidentally "slip in" while attach_brick has released big_lock between retries and make such a modification.
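A minimal sketch of the hold-off idea from this commit, using an ordinary flag and condition variable. This is an illustration under assumed names (restart_bricks_begin/end, volume_delete_wait, restart_active), not the actual glusterd change, which has to cover more entry points:

```c
/* Hypothetical sketch of "hold off deletes while restarting bricks". */
#include <pthread.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  state_cond = PTHREAD_COND_INITIALIZER;
static int restart_in_progress;

/* Called when glusterd_restart_bricks starts walking the brick list. */
void restart_bricks_begin(void)
{
    pthread_mutex_lock(&state_lock);
    restart_in_progress = 1;
    pthread_mutex_unlock(&state_lock);
}

/* Called when the restart walk is done: release any waiting deletes. */
void restart_bricks_end(void)
{
    pthread_mutex_lock(&state_lock);
    restart_in_progress = 0;
    pthread_cond_broadcast(&state_cond);
    pthread_mutex_unlock(&state_lock);
}

/* Query helper, mainly for testing the sketch. */
int restart_active(void)
{
    pthread_mutex_lock(&state_lock);
    int r = restart_in_progress;
    pthread_mutex_unlock(&state_lock);
    return r;
}

/* A volume delete blocks here until no restart walk is active, so it
 * can never modify the brick list out from under the walker -- even if
 * attach_brick has temporarily released big_lock between retries. */
void volume_delete_wait(void)
{
    pthread_mutex_lock(&state_lock);
    while (restart_in_progress)
        pthread_cond_wait(&state_cond, &state_lock);
    pthread_mutex_unlock(&state_lock);
    /* ... proceed with the actual delete here ... */
}
```

The key property is that the delete waits on the whole restart phase rather than on big_lock alone, closing the window that the lock/unlock cycling in attach_brick opens.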
Change-Id: I30ccc4efa8d286aae847250f5d4fb28956a74b03
BUG: 1432542
Signed-off-by: Jeff Darcy <jeff.us>
Reviewed-on: https://review.gluster.org/16927
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Atin Mukherjee <amukherj>

REVIEW: https://review.gluster.org/17044 (glusterd: hold off volume deletes while still restarting bricks) posted (#1) for review on release-3.10 by Jeff Darcy (jeff.us)

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/