Bug 1540607

Summary: glusterd fails to attach brick during restart of the node
Product: [Community] GlusterFS
Reporter: Atin Mukherjee <amukherj>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: mainline
CC: bmekala, bugs, rhs-bugs, storage-qa-internal, vbellur
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Whiteboard: brick-multiplexing
Fixed In Version: glusterfs-v4.1.0
Clone Of: 1540600
Clones: 1543706, 1543708
Last Closed: 2018-06-20 17:58:44 UTC
Type: Bug
Bug Blocks: 1535732, 1540600, 1543706, 1543708, 1556670

Comment 1 Atin Mukherjee 2018-01-31 14:12:09 UTC
Description of problem:
In a 3-node cluster with brick multiplexing enabled, if one of the nodes is down while a volume goes through option changes via volume set, then on reboot of that node all the bricks fail to attach and the brick multiplexing feature is effectively lost. Another observation is that the entire handshake process becomes very slow and can take hours; if someone brings down glusterd in the middle of it, certain volume info files will be lost.


Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Create a 3-node cluster, enable brick multiplexing, set up 20 1x3 volumes, and start them.
2. Bring down glusterd on the first node and perform a volume set operation on all 20 volumes from any of the other nodes.
3. Bring back the glusterd instance on the first node.

Actual results:
Bricks fail to attach and multiplexing mode is lost. The handshake also becomes extremely slow.

Expected results:
Bricks should come up in multiplexed mode.

Comment 2 Worker Ant 2018-02-01 06:03:43 UTC
REVIEW: https://review.gluster.org/19357 (glusterd: import volumes in separate synctask) posted (#3) for review on master by Atin Mukherjee

Comment 3 Worker Ant 2018-02-09 03:27:33 UTC
COMMIT: https://review.gluster.org/19357 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd: import volumes in separate synctask

With brick multiplexing, the prerequisite for attaching a brick to an existing brick process is that the compatible brick has finished its initialization and portmap sign-in; hence the thread may have to go to sleep and context-switch the synctask to allow the brick process to communicate with glusterd. In the normal code path this works fine, as glusterd_restart_bricks () is launched through a separate synctask.

If there is a mismatch of the volume when glusterd restarts, glusterd_import_friend_volume is invoked and then tries to call glusterd_start_bricks () from the main thread, which may land in the same situation. Since this is not done through a separate synctask, the first brick never gets its turn to finish its handshaking and, as a consequence, all the other bricks fail to get attached to it.

Solution: execute the volume import and glusterd restart bricks in a separate synctask. Importing snaps also had to be done through a synctask, because the parent volume needs to be available for the snap import to work.

Change-Id: I290b244d456afcc9b913ab30be4af040d340428c
BUG: 1540607
Signed-off-by: Atin Mukherjee <amukherj>
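
A minimal, self-contained C sketch of the deadlock described above. It uses plain pthreads rather than glusterd's synctask framework, and every name in it (the "event loop", wait_for_brick_signin, import_and_restart_bricks) is illustrative only, not a glusterd API:

/* Illustration only: main() below stands in for the thread that must
 * stay free to process the first brick's portmap sign-in, and the
 * worker thread stands in for the separate synctask added by the fix. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool signed_in = false; /* set once the sign-in has been processed */

/* Analogue of waiting for the first multiplexed brick to finish its
 * initialization and portmap sign-in before further bricks attach. */
static void
wait_for_brick_signin(void)
{
    pthread_mutex_lock(&lock);
    while (!signed_in)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    printf("remaining bricks can now be attached\n");
}

/* Analogue of running the import/restart work in a separate synctask. */
static void *
import_and_restart_bricks(void *arg)
{
    (void)arg;
    wait_for_brick_signin(); /* safe: we are not the event-processing thread */
    return NULL;
}

int
main(void)
{
    pthread_t task;

    /* The fix: dispatch the work off the event-processing thread.
     * Calling wait_for_brick_signin() directly here instead would hang
     * forever, because this thread is the only one that can process
     * the sign-in below. */
    pthread_create(&task, NULL, import_and_restart_bricks, NULL);

    /* "Event loop": eventually processes the brick's sign-in request. */
    sleep(1);
    pthread_mutex_lock(&lock);
    signed_in = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(task, NULL);
    return 0;
}

In the actual patch the import and glusterd_restart_bricks () run through glusterd's synctask framework rather than a raw thread, but the deadlock-avoidance reasoning is the same.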

Comment 4 Worker Ant 2018-02-09 18:25:39 UTC
REVIEW: https://review.gluster.org/19536 (glusterd/snapshot : fix the compare snap logic) posted (#2) for review on master by Atin Mukherjee

Comment 5 Worker Ant 2018-02-10 12:40:43 UTC
COMMIT: https://review.gluster.org/19536 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterd/snapshot : fix the compare snap logic

In commit cb0339f there is one particular case where, after removing the old snap, the new snap version was not written, and this caused one of the tests to fail spuriously.

Change-Id: I3e83435fb62d6bba3bbe227e40decc6ce37ea77b
BUG: 1540607
Signed-off-by: Atin Mukherjee <amukherj>

Comment 6 Worker Ant 2018-02-12 02:20:03 UTC
REVIEW: https://review.gluster.org/19539 (tests: fix spurious test failure) posted (#1) for review on master by Atin Mukherjee

Comment 7 Worker Ant 2018-02-13 05:36:27 UTC
COMMIT: https://review.gluster.org/19539 committed in master by "Atin Mukherjee" <amukherj> with a commit message- tests: fix spurious test failure

In bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t, check for the peer count after starting the glusterd instance on node 2.

Change-Id: I3f92013719d94b6d92fb5db25efef1fb4b41d510
BUG: 1540607
Signed-off-by: Atin Mukherjee <amukherj>

Comment 8 Shyamsundar 2018-06-20 17:58:44 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/