On a 3-node cluster with brick multiplexing enabled and twelve 1x3 volumes, restarting all the gluster processes leaves some bricks showing as offline in volume status.

Steps to Reproduce:
1. Enable brick multiplexing and set max bricks per process to 3.
2. Create about 12 volumes: roughly 10 of type 1x3 and 2 of type 2x2, i.e. 17 bricks per node in total.
3. Run pkill glusterfsd, pkill glusterfs, and service glusterd stop.
4. Run service glusterd start.

Actual results:
==============
Found about 11-18 glusterfsd processes running (different tries gave different numbers), while only 7 are supposed to be created. Volume status also shows some of the bricks as offline. There is, however, no I/O impact.

We would hit this in the upgrade path.
REVIEW: https://review.gluster.org/18577 (glusterd: fix brick restart parallelism) posted (#1) for review on master by Atin Mukherjee (amukherj)
REVIEW: https://review.gluster.org/18577 (glusterd: fix brick restart parallelism) posted (#2) for review on master by Atin Mukherjee (amukherj)
REVIEW: https://review.gluster.org/18577 (glusterd: fix brick restart parallelism) posted (#3) for review on master by Atin Mukherjee (amukherj)
COMMIT: https://review.gluster.org/18577 committed in master by ------------- glusterd: fix brick restart parallelism

glusterd's brick restart logic is not always sequential, as there are at least three different ways bricks get restarted:
1. through friend-sm and glusterd_spawn_daemons ()
2. through friend-sm and handling volume quorum action
3. through friend handshaking when there is a mismatch on quorum on friend import.

In a brick multiplexing setup, glusterd ended up trying to spawn the same brick process more than once, because two threads could hit glusterd_brick_start () within a fraction of a millisecond of each other and glusterd had no basis for rejecting either of them, since the brick start criteria were met in both cases. As a solution, this is now controlled by two mechanisms: a boolean flag, start_triggered, which indicates that a brick start has been triggered and stays true until the brick dies or is killed, and a mutex lock which ensures that, for a particular brick, we do not enter glusterd_brick_start () more than once at the same point in time.

Change-Id: I292f1e58d6971e111725e1baea1fe98b890b43e2
BUG: 1506513
Signed-off-by: Atin Mukherjee <amukherj>
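For illustration only, here is a minimal sketch of the flag-plus-mutex guard pattern the commit message describes. The structure brickinfo_t, its fields, and the functions spawn_brick_process () and brick_start_guarded () below are hypothetical simplifications, not the actual glusterd code:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for glusterd's per-brick state. */
typedef struct brickinfo {
    bool            start_triggered;  /* true once a start has been triggered;
                                         cleared when the brick dies or is killed */
    pthread_mutex_t restart_mutex;    /* serializes concurrent start attempts */
} brickinfo_t;

/* Placeholder for the real brick spawn logic. */
static int spawn_brick_process(brickinfo_t *brick)
{
    (void)brick;
    printf("spawning brick process\n");
    return 0;
}

/* Only one caller at a time passes the gate; a second thread arriving
 * a few milliseconds later sees start_triggered == true and returns
 * without spawning a duplicate process. */
static int brick_start_guarded(brickinfo_t *brick)
{
    int ret = 0;

    pthread_mutex_lock(&brick->restart_mutex);
    if (!brick->start_triggered) {
        ret = spawn_brick_process(brick);
        if (ret == 0)
            brick->start_triggered = true;
    }
    pthread_mutex_unlock(&brick->restart_mutex);

    return ret;
}

int main(void)
{
    brickinfo_t brick = {
        .start_triggered = false,
        .restart_mutex = PTHREAD_MUTEX_INITIALIZER,
    };

    /* Two back-to-back start attempts: only the first one spawns. */
    brick_start_guarded(&brick);
    brick_start_guarded(&brick);
    return 0;
}

The point of pairing the flag with the mutex is that the flag alone would still leave a window between checking and setting it; taking the lock around both the check and the spawn closes that race.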
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report. glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution. [1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html [2] https://www.gluster.org/pipermail/gluster-users/