Bug 1421721
| Field | Value |
| --- | --- |
| Summary | volume start command hangs |
| Product | [Community] GlusterFS |
| Reporter | Atin Mukherjee <amukherj> |
| Component | glusterd |
| Assignee | Jeff Darcy <jeff> |
| Status | CLOSED CURRENTRELEASE |
| QA Contact | |
| Severity | high |
| Docs Contact | |
| Priority | high |
| Version | mainline |
| CC | amukherj, bugs, jdarcy |
| Target Milestone | --- |
| Keywords | Triaged |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | brick-multiplexing-testing |
| Fixed In Version | glusterfs-3.11.0 |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| | 1431175 (view as bug list) |
| Environment | |
| Last Closed | 2017-05-30 18:42:26 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1431175 |
Description (Atin Mukherjee, 2017-02-13 14:16:49 UTC) and follow-up comments
Looks a bit racy. I ran this sequence a few times and didn't see any failures. This part of the trace (thread 3) seems most important:

    #4  0x00007f42a4c5daf0 in gf_print_trace (signum=11, ctx=0x1a31010) at common-utils.c:714
    #5  <signal handler called>

So we're probably looking at some sort of list/memory corruption. Is it possible that the rpc isn't completely valid (even though the pointer is non-null), or that the timer has already been removed from its list?

REVIEW: https://review.gluster.org/16650 (tests: add test for toggling MPX and restarting a volume) posted (#1) for review on master by Jeff Darcy (jdarcy)

I have tested this more than a hundred times, with zero failures. You can see the script here: https://review.gluster.org/#/c/16650/ Is there something else that you did that's missing from the script?

There is one place in the multiplexing code where I added a call to gf_timer_call_cancel. It's in glusterd_volume_start_glusterfs, which is part of the stack above. Suspiciously, this is done without holding conn->lock, unlike other places (e.g. rpc_clnt_reconnect_cleanup). This fits with the theory that it's a race. In gf_timer_call_cancel we call list_del instead of list_del_init, and there seems to be no other protection against being called twice, so I suspect that when we hit the race (because of the missing lock) we corrupt the list. Unfortunately, since the test already passes consistently for me, I won't be able to test whether the fix has any effect.

REVIEW: https://review.gluster.org/16662 (glusterd: take conn->lock around operations on conn->reconnect) posted (#1) for review on master by Jeff Darcy (jdarcy)

REVIEW: https://review.gluster.org/16662 (glusterd: take conn->lock around operations on conn->reconnect) posted (#2) for review on master by Jeff Darcy (jdarcy)

COMMIT: https://review.gluster.org/16662 committed in master by Atin Mukherjee (amukherj)

    commit 4e0d4b15717da1f6466133158a26927fb91384b8
    Author: Jeff Darcy <jdarcy>
    Date:   Fri Feb 17 09:42:46 2017 -0500

        glusterd: take conn->lock around operations on conn->reconnect

        Failure to do this could lead to a race in which a timer would be
        removed twice concurrently, corrupting the timer list (because
        gf_timer_call_cancel has no internal protection against this) and
        possibly causing a crash.

        Change-Id: Ic1a8b612d436daec88fd6cee935db0ae81a47d5c
        BUG: 1421721
        Signed-off-by: Jeff Darcy <jdarcy>
        Reviewed-on: https://review.gluster.org/16662
        Smoke: Gluster Build System <jenkins.org>
        NetBSD-regression: NetBSD Build System <jenkins.org>
        CentOS-regression: Gluster Build System <jenkins.org>
        Reviewed-by: Atin Mukherjee <amukherj>

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/
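To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the pattern discussed in the comments above. It is not GlusterFS code: the list helpers, the conn_lock mutex, the reconnect_pending flag, and the cancel_reconnect_* functions are illustrative stand-ins for the timer list, conn->lock, conn->reconnect, and the cancel path through gf_timer_call_cancel. It shows why an unlocked cancel that uses a plain list_del can corrupt the list when two threads race, and why serializing on the connection lock (here combined with an idempotent unlink) avoids that.

```c
/* Minimal sketch of the race described above (not actual GlusterFS code).
 * Compile with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>

/* Kernel-style intrusive doubly-linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Unlink a node but leave its pointers at the old neighbours. A second
 * delete of the same node rewrites memory through those stale pointers,
 * which is the kind of corruption suspected in gf_timer_call_cancel. */
static void list_del(struct list_head *e) {
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Unlink and re-point the node at itself, so a repeated delete is a no-op. */
static void list_del_init(struct list_head *e) {
    list_del(e);
    list_init(e);
}

/* Illustrative stand-ins for conn->lock and the pending reconnect timer. */
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
static struct list_head timer_list;
static struct list_head reconnect_timer;
static int reconnect_pending = 1;

/* The buggy shape: no lock, plain list_del. Two threads can both pass the
 * check and both unlink the same node. */
static void cancel_reconnect_unsafe(void) {
    if (reconnect_pending) {
        list_del(&reconnect_timer);
        reconnect_pending = 0;
    }
}

/* Roughly the shape of the fix: serialize on the connection lock so the
 * timer is removed from the list at most once. */
static void cancel_reconnect_safe(void) {
    pthread_mutex_lock(&conn_lock);
    if (reconnect_pending) {
        list_del_init(&reconnect_timer);
        reconnect_pending = 0;
    }
    pthread_mutex_unlock(&conn_lock);
}

int main(void) {
    /* Queue the "reconnect" timer as the only entry on the timer list. */
    list_init(&timer_list);
    list_init(&reconnect_timer);
    reconnect_timer.next = &timer_list;
    reconnect_timer.prev = &timer_list;
    timer_list.next = &reconnect_timer;
    timer_list.prev = &reconnect_timer;

    (void) cancel_reconnect_unsafe;   /* shown only for contrast */

    cancel_reconnect_safe();
    cancel_reconnect_safe();          /* harmless second cancel */
    printf("timer list intact: %s\n",
           timer_list.next == &timer_list ? "yes" : "no");
    return 0;
}
```

Run single-threaded this just prints "timer list intact: yes"; the point is the contrast between the two cancel functions. Note that the actual fix in https://review.gluster.org/16662 only added the locking around operations on conn->reconnect; the list_del_init in this sketch is extra hardening included here for illustration.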