Bug 1432542
| Field | Value |
|---|---|
| Summary | Glusterd crashes when restarted with many volumes |
| Product | [Community] GlusterFS |
| Component | glusterd |
| Reporter | Jeff Darcy <jdarcy> |
| Assignee | Jeff Darcy <jeff> |
| Status | CLOSED CURRENTRELEASE |
| Severity | unspecified |
| Priority | unspecified |
| Version | mainline |
| CC | amukherj, bugs |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | glusterfs-3.11.0 |
| Last Closed | 2017-05-30 18:47:35 UTC |
| Type | Bug |
| Bug Blocks | 1438052, 1441476 |
Description
Jeff Darcy
2017-03-15 15:39:03 UTC
REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (patchsets #1 through #10) for review on master by Jeff Darcy (jdarcy)

Down the rabbit hole we go. Moving the retries to a separate thread made the attach_brick/send_attach_req crashes go away, but the __synclock_unlock/list_del_init crashes were still there. It's hard to tell from a small sample, but if anything they seemed to be worse.
After looking at several cores and then running some experiments, I now strongly suspect that there's a bug in our synclock_lock/synclock_unlock code. I'm sure that will come as a shock to just about nobody. One of the hallmarks of these crashes is that a synclock's waitq - specifically the one for glusterd's big_lock - has one task on it, but that task has clearly been freed: 0xdeadcode all over it, indicating that not only has it been freed but parts of it have since been reallocated.

Somehow the problem seems to be related to the fact that __synclock_unlock will wake up both a waiting thread and a waiting synctask if both exist. If I change it to wake up only one, giving preference to synctasks, then the crashes go away but I hit some sort of deadlock (hard to diagnose precisely when none of this has ever been reproducible other than on our regression machines). If I change it to wake up only one, preferring the actual thread, then things seem much better. The locking etc. in the synclock_lock/synclock_unlock code *looks* reasonable, but there must be either some gap or some piece of code that's going around it. For now, the threads-get-priority change as implemented in patchset 10 seems to be avoiding the problem, but if we don't finish tracking it down then there's a high probability we'll just hit it again in some other context some day.

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (patchsets #11 and #12) for review on master by Jeff Darcy (jdarcy)

OK, latest theory is that someone is calling synctask_set to set the task pointer in thread-local storage (TLS); then the task exits, but the task pointer in TLS is still set to the now-freed task structure.
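To make the suspected failure mode concrete, here is a stripped-down sketch of a thread-local "current task" pointer that outlives its task. All names here (task_set, task_get, finish_task_*) are invented for illustration; this is not the actual syncop code, just the shape of the hazard and of the clear-the-TLS-pointer hygiene that helped:

```c
/* Illustrative sketch only -- not the glusterfs syncop implementation. */
#include <assert.h>
#include <stdlib.h>

/* One "current task" pointer per thread, like synctask_get()/synctask_set(). */
static __thread void *current_task;

static void task_set(void *t) { current_task = t; }
static void *task_get(void)   { return current_task; }

/* Dangerous pattern: free the task but leave TLS pointing at it.
 * A later caller that trusts task_get() will enqueue a freed task. */
void *finish_task_leaky(void *task)
{
    free(task);
    return task_get();  /* dangling pointer value */
}

/* Hygienic pattern: clear the TLS pointer before the task goes away,
 * mirroring the synctask_set(NULL)-around-swapcontext fix described above. */
void *finish_task_clean(void *task)
{
    task_set(NULL);
    free(task);
    return task_get();  /* NULL: later code can tell there is no task */
}
```

With the clean variant, a subsequent lock attempt on the same thread sees NULL instead of a stale pointer and can fall back to the plain-thread path.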
Subsequently, someone else calls into synclock_lock and mistakenly believes that we're in that now-deleted task, because that's what synctask_get told us. So we put the bogus task on the waitq, and everything goes downhill from there. I'm not *exactly* sure how we get to task exit before the next call to synctask_set, but there are lots of paths through this code, and the fact that it's jumping between tasks and threads makes it even harder to follow. What I do know is that better hygiene around calling synctask_set - including synctask_set(NULL) - whenever we call swapcontext seems to have helped.

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries to a separate thread) posted (patchsets #13 and #14) for review on master by Jeff Darcy (jdarcy)

Further investigation shows that all symptoms - including cases where I was able to observe multiple threads/synctasks executing concurrently even though they were both supposed to be under glusterd's big_lock - could be explained by memory corruption. What's happening is that the lock/unlock behavior in attach_brick is allowing other operations to overlap. In particular, if we're calling attach_brick from glusterd_restart_bricks (as we do when starting up), an overlapping volume delete will modify the very list we're in the middle of walking, so we'll either be working on a pointer to a freed brickinfo_t or we'll follow its list pointers off into space. Either way, *all sorts* of crazy and horrible things can happen. Moving the attach-RPC retries into a separate thread/task really doesn't help, and introduces its own problems. What has worked fairly well is forcing volume deletes to wait if glusterd_restart_bricks is still in progress, but the approach I'm currently using is neither as complete nor as airtight as it really needs to be.
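The list-corruption scenario can be modeled with a toy brick list. The names here (brick_t, restart_walk, volume_delete) are made up for this sketch and are not glusterd's real structures; the point is only where the hazard sits:

```c
/* Toy model of the hazard, not glusterd code. */
#include <assert.h>
#include <stdlib.h>

typedef struct brick {
    struct brick *next;
    int id;
} brick_t;

/* The restart path: walks the brick list, conceptually dropping the
 * big lock between elements (as attach_brick does between RPC retries). */
int restart_walk(const brick_t *head)
{
    int visited = 0;
    for (const brick_t *b = head; b != NULL; b = b->next) {
        /* If a volume delete ran at this point and freed 'b', the
         * 'b->next' read above would touch freed memory, and the walk
         * would follow a wild pointer -- the crash described above. */
        visited++;
    }
    return visited;
}

/* The delete path: frees every brick.  Safe only if no walker is
 * active, which is exactly what the eventual fix enforces. */
void volume_delete(brick_t *head)
{
    while (head != NULL) {
        brick_t *next = head->next;
        free(head);
        head = next;
    }
}
```

Run sequentially the two functions are fine; interleaved, the walker ends up holding a pointer into freed memory, which matches the 0xdeadcode-filled task observations.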
I'm still trying to think of a better solution, but at least now it's clear what the problem is.

BTW, debugging and testing are being severely hampered by constant disconnections between glusterd and glusterfsd. This might be related to the sharp increase in client/server disconnect/reconnect cycles that others have reported dating back approximately to RHGS 3.1.2, and we really need to figure out why our RPC got flakier around that time.

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes while still restarting bricks) posted (patchset #1 by Jeff Darcy (jdarcy); patchsets #2 through #5 by Jeff Darcy (jeff.us)) for review on master

COMMIT: https://review.gluster.org/16927 committed in master by Atin Mukherjee (amukherj)

commit a7ce0548b7969050644891cd90c0bf134fa1594c
Author: Jeff Darcy <jdarcy>
Date: Mon Mar 20 12:32:33 2017 -0400

glusterd: hold off volume deletes while still restarting bricks

We need to do this because modifying the volume/brick tree while glusterd_restart_bricks is still walking it can lead to segfaults. Without waiting we could accidentally "slip in" while attach_brick has released big_lock between retries and make such a modification.
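A minimal sketch of the hold-off idea from this commit, using an ordinary flag and condition variable. This is an illustration under assumed names (restart_bricks_begin/end, volume_delete_wait, restart_active), not the actual glusterd change, which has to cover more entry points:

```c
/* Hypothetical sketch of "hold off deletes while restarting bricks". */
#include <pthread.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  state_cond = PTHREAD_COND_INITIALIZER;
static int restart_in_progress;

/* Called when glusterd_restart_bricks starts walking the brick list. */
void restart_bricks_begin(void)
{
    pthread_mutex_lock(&state_lock);
    restart_in_progress = 1;
    pthread_mutex_unlock(&state_lock);
}

/* Called when the restart walk is done: release any waiting deletes. */
void restart_bricks_end(void)
{
    pthread_mutex_lock(&state_lock);
    restart_in_progress = 0;
    pthread_cond_broadcast(&state_cond);
    pthread_mutex_unlock(&state_lock);
}

/* Query helper, mainly for testing the sketch. */
int restart_active(void)
{
    pthread_mutex_lock(&state_lock);
    int r = restart_in_progress;
    pthread_mutex_unlock(&state_lock);
    return r;
}

/* A volume delete blocks here until no restart walk is active, so it
 * can never modify the brick list out from under the walker -- even if
 * attach_brick has temporarily released big_lock between retries. */
void volume_delete_wait(void)
{
    pthread_mutex_lock(&state_lock);
    while (restart_in_progress)
        pthread_cond_wait(&state_cond, &state_lock);
    pthread_mutex_unlock(&state_lock);
    /* ... proceed with the actual delete here ... */
}
```

The key property is that the delete waits on the whole restart phase rather than on big_lock alone, closing the window that the lock/unlock cycling in attach_brick opens.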
Change-Id: I30ccc4efa8d286aae847250f5d4fb28956a74b03
BUG: 1432542
Signed-off-by: Jeff Darcy <jeff.us>
Reviewed-on: https://review.gluster.org/16927
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Atin Mukherjee <amukherj>

REVIEW: https://review.gluster.org/17044 (glusterd: hold off volume deletes while still restarting bricks) posted (#1) for review on release-3.10 by Jeff Darcy (jeff.us)

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/