1508283 – Stale brick processes are created and volume status shows bricks as down (pkill glusterfsd glusterfs, glusterd restart)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	glusterd
Sub Component:
Version:	3.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Assignee:	Atin Mukherjee
QA Contact:
Docs Contact:
URL:
Whiteboard:	brick-multiplexing
Depends On:	1506513
Blocks:	1505363 1526368

Reported:	2017-11-01 03:57 UTC by Atin Mukherjee
Modified:	2017-12-15 09:55 UTC
CC List:	7 users
Fixed In Version:	glusterfs-glusterfs-3.12.3
Clone Of:	1506513
Environment:
Last Closed:	2017-11-29 05:53:24 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Comment 1 Atin Mukherjee 2017-11-01 03:58:23 UTC

On a 3-node cluster with brick multiplexing enabled and 12 1x3 volumes, restarting all the gluster processes leaves some of the bricks showing as offline in volume status.

Steps to Reproduce:
1. Enabled brick multiplexing, with the maximum bricks per process set to 3.
2. Created about 12 volumes, of which about 10 were 1x3 and 2 were 2x2, for a total of 17 bricks per node.
3. Ran pkill on glusterfsd and glusterfs, then service glusterd stop.
4. Ran service glusterd start.


Actual results:
==============
Found about 11-18 glusterfsd processes running (the number varied between attempts), while only 7 are supposed to be created.

Volume status also shows some of the bricks as offline.

However, there was no I/O impact.
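
One way to check the process count described above is sketched below. This is only an illustration: the helper names are hypothetical, the expected count is a parameter supplied by the caller (7 in this report), and the result only says how many extra glusterfsd processes exist, not which of them are stale.

```python
import subprocess


def count_processes(ps_comm_output, name="glusterfsd"):
    """Count lines of `ps -e -o comm=` output that equal the given name."""
    return sum(1 for line in ps_comm_output.splitlines()
               if line.strip() == name)


def extra_brick_processes(ps_comm_output, expected):
    """Return how many glusterfsd processes exceed the expected count."""
    return max(0, count_processes(ps_comm_output) - expected)


def live_ps_output():
    """Collect command names of all processes on the local node."""
    result = subprocess.run(["ps", "-e", "-o", "comm="],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

On a node from this report, `extra_brick_processes(live_ps_output(), expected=7)` would be non-zero after the restart if the problem is present.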


We would hit this in the upgrade path.

--- Additional comment from Worker Ant on 2017-10-26 05:19:43 EDT ---

REVIEW: https://review.gluster.org/18577 (glusterd: fix brick restart parallelism) posted (#1) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-10-26 09:12:47 EDT ---

REVIEW: https://review.gluster.org/18577 (glusterd: fix brick restart parallelism) posted (#2) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-10-30 05:17:39 EDT ---

REVIEW: https://review.gluster.org/18577 (glusterd: fix brick restart parallelism) posted (#3) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-10-31 23:42:08 EDT ---

COMMIT: https://review.gluster.org/18577 committed in master by  

------------- glusterd: fix brick restart parallelism

glusterd's brick restart logic is not always sequential, as there are
at least three different ways in which bricks are restarted:
1. through friend-sm and glusterd_spawn_daemons ()
2. through friend-sm and handling volume quorum action
3. through friend handshaking when there is a mismatch on quorum on
friend import.

In a brick multiplexing setup, glusterd ended up trying to spawn the
same brick process a couple of times, because two threads hit
glusterd_brick_start () within a fraction of a millisecond of each
other. glusterd had no way to reject either of them, since the brick
start criteria were met in both cases.

As a solution, it is better to control this madness with two different
mechanisms. The first is a boolean called start_triggered, which
indicates that a brick start has been triggered and stays true until
the brick dies or is killed. The second is a mutex lock, which ensures
that for a particular brick we don't enter glusterd_brick_start () more
than once at the same point in time.

Change-Id: I292f1e58d6971e111725e1baea1fe98b890b43e2
BUG: 1506513
Signed-off-by: Atin Mukherjee <amukherj>
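
The idea in the commit message above can be illustrated with a minimal sketch. It is not the actual patch, which is C code in glusterd. Only start_triggered is named in the commit message; the Brick class, the restart_mutex field, spawn_count, and the helper functions are hypothetical names used for illustration.

```python
import threading


class Brick:
    """Hypothetical stand-in for glusterd's per-brick bookkeeping."""

    def __init__(self, path):
        self.path = path
        self.start_triggered = False            # true once a start is issued
        self.restart_mutex = threading.Lock()   # serialises start attempts
        self.spawn_count = 0                    # stands in for spawned processes


def brick_start(brick):
    """Start the brick unless a start has already been triggered."""
    with brick.restart_mutex:
        if brick.start_triggered:
            return False                        # another caller got here first
        brick.start_triggered = True
        brick.spawn_count += 1                  # placeholder for the real spawn
        return True


def brick_died(brick):
    """Clear the flag so that a later start attempt is allowed again."""
    with brick.restart_mutex:
        brick.start_triggered = False


def race(brick, callers=3):
    """Have several threads call brick_start() at nearly the same moment."""
    results = []
    barrier = threading.Barrier(callers)

    def worker():
        barrier.wait()                          # release all callers together
        results.append(brick_start(brick))

    threads = [threading.Thread(target=worker) for _ in range(callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With three callers standing in for the three restart paths, exactly one call spawns the brick and the others return without doing anything. After brick_died() the brick can be started again.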

Comment 2 Worker Ant 2017-11-01 03:59:38 UTC

REVIEW: https://review.gluster.org/18603 (glusterd: fix brick restart parallelism) posted (#1) for review on release-3.12 by Atin Mukherjee

Comment 3 Worker Ant 2017-11-06 06:09:46 UTC

COMMIT: https://review.gluster.org/18603 committed in release-3.12 by  

------------- glusterd: fix brick restart parallelism

glusterd's brick restart logic is not always sequential, as there are
at least three different ways in which bricks are restarted:
1. through friend-sm and glusterd_spawn_daemons ()
2. through friend-sm and handling volume quorum action
3. through friend handshaking when there is a mismatch on quorum on
friend import.

In a brick multiplexing setup, glusterd ended up trying to spawn the
same brick process a couple of times, because two threads hit
glusterd_brick_start () within a fraction of a millisecond of each
other. glusterd had no way to reject either of them, since the brick
start criteria were met in both cases.

As a solution, it is better to control this madness with two different
mechanisms. The first is a boolean called start_triggered, which
indicates that a brick start has been triggered and stays true until
the brick dies or is killed. The second is a mutex lock, which ensures
that for a particular brick we don't enter glusterd_brick_start () more
than once at the same point in time.

Change-Id: I292f1e58d6971e111725e1baea1fe98b890b43e2
BUG: 1508283
Signed-off-by: Atin Mukherjee <amukherj>
(cherry picked from commit 82be66ef8e9e3127d41a4c843daf74c1d8aec4aa)

Comment 4 Jiffin 2017-11-29 05:53:24 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-glusterfs-3.12.3, please open a new bug report.

glusterfs-glusterfs-3.12.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-devel/2017-November/053983.html
[2] https://www.gluster.org/pipermail/gluster-users/
