Brick-mux nightly regression has been failing on master for more than 11 weeks (since around Feb 22nd, 2018): https://build.gluster.org/job/regression-test-with-multiplex/
NOTE: This bug is opened to track the dependent failures, or newer ones.
Over time the problems seem to have morphed, or moved, due to other fixes in between. An analysis of the last 13 builds (Jenkins job numbers 725-738, IOW https://build.gluster.org/job/regression-test-with-multiplex/[725..738]) shows the following problems that need to be addressed. Each is filed as a separate bug (or will be as the case progresses).
1) Test timeouts without anything really having run in the test (as far as I can tell at present); the timeout hits at 200/300 seconds. (The number in brackets is the number of times the test failed in the analyzed runs.)
2) ./tests/bugs/replicate/bug-1363721.t (4)
lk-quorum failures are easily observable in tests run on local machines as well. The issue seems to be that the brick process for one of the bricks either does not come up or is not listening, causing the checks for up counts to fail.
bug-1363721.t failures seem to be related to the same.
3) ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
The above failure seems to occur consistently when adding a previously added brick (but using force), failing in glusterd with transport endpoint not connected.
REVIEW: https://review.gluster.org/20022 (glusterd: address test failures with brick mux enabled) posted (#2) for review on master by Atin Mukherjee
REVIEW: https://review.gluster.org/20036 (afr: fix bug-1363721.t failure) posted (#1) for review on master by Ravishankar N
REVIEW: https://review.gluster.org/20037 (changelog: fix br-state-check.t failure for brick_mux) posted (#1) for review on master by MOHIT AGRAWAL
COMMIT: https://review.gluster.org/20036 committed in master by "Ravishankar N" <email@example.com> with a commit message- afr: fix bug-1363721.t failure
In the .t, when the only good brick was brought down, writes on the fd were
still succeeding on the bad bricks. The inflight split-brain check was
marking the write as failure but since the write succeeded on all the
bad bricks, afr_txn_nothing_failed() was set to true and we were
unwinding writev with success to DHT and then catching the failure in
post-op in the background.
Don't wind the FOP phase if the write_subvol (which is populated with readable
subvols obtained in pre-op cbk) does not have at least 1 good brick which was up
when the transaction started.
Note: This fix is not related to brick multiplexing. I ran the .t
10 times with this fix and brick-mux enabled without any failures.
Signed-off-by: Ravishankar N <firstname.lastname@example.org>
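The check described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual AFR code: the function names, the return strings, and the boolean-list representation of write_subvol and of the bricks that were up at transaction start are all assumptions made for the example.

```python
from typing import Sequence


def has_good_brick_up(write_subvol: Sequence[bool],
                      up_at_start: Sequence[bool]) -> bool:
    """Return True if at least one readable (good) brick was up when
    the transaction started. Both arguments are hypothetical per-brick
    flags, indexed by brick number."""
    return any(good and up for good, up in zip(write_subvol, up_at_start))


def run_write_txn(write_subvol: Sequence[bool],
                  up_at_start: Sequence[bool]) -> str:
    """Hypothetical transaction driver: decide whether to wind the FOP."""
    if not has_good_brick_up(write_subvol, up_at_start):
        # Skip the FOP phase, so the failure reaches the caller instead
        # of being discovered later in a background post-op.
        return "fail-early"
    return "wind-fop"
```

With one good brick that is down and two bad bricks that are up, the sketch fails early instead of letting the write succeed on the bad bricks only.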
COMMIT: https://review.gluster.org/20037 committed in master by "Amar Tumballi" <email@example.com> with a commit message- changelog: fix br-state-check.t failure for brick_mux
Problem: Sometimes br-state-check.t crashes while running
for brick multiplex, and a command in the test case
takes 2 minutes to detach a brick
Solution: Update code in the changelog xlator to wait
on all connections before cleaning up rpc threads, and
clean up the rpc object only in the non brick mux scenario
Signed-off-by: Mohit Agrawal <firstname.lastname@example.org>
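The ordering described in the solution above (drain connections first, then clean up, and skip the rpc object cleanup under brick mux) might look like the sketch below. It is only an illustration of the idea using Python threading primitives; the class, the function, and the action names are hypothetical and do not correspond to the changelog xlator's real code.

```python
import threading


class ConnectionTracker:
    """Hypothetical counter of active connections."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0

    def connect(self):
        with self._cond:
            self._active += 1

    def disconnect(self):
        with self._cond:
            self._active -= 1
            if self._active == 0:
                self._cond.notify_all()

    def wait_for_drain(self, timeout=None):
        # Block until no connection is active; False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._active == 0, timeout)


def cleanup(tracker, brick_mux_enabled, timeout=None):
    """Return the cleanup steps that would run, in order."""
    actions = []
    if not tracker.wait_for_drain(timeout):
        return actions  # connections still active: clean up nothing
    actions.append("cleanup_rpc_threads")
    if not brick_mux_enabled:
        # Assumption from the commit message: the rpc object is only
        # cleaned up when brick mux is off.
        actions.append("cleanup_rpc_object")
    return actions
```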
COMMIT: https://review.gluster.org/20022 committed in master by "Amar Tumballi" <email@example.com> with a commit message- glusterd: address test failures with brick mux enabled
This patch addresses the following:
1. On volume stop, for the last brick, pmap_registry_remove () is
invoked by glusterd.
2. If a brick process is sigkilled, remove all the associated brick
instances from the portmap.
3. Bump up PROCESS_UP_TIMEOUT to 45.
4. gf_attach to kill a brick takes more time in mux (which is an
issue that needs a fix), but in the interim, give br-state-check.t
more time to complete (there are 2 kill_bricks, each taking 120
seconds, and the test usually passes in 30 odd seconds, hence bumping
this up to 350 seconds)
5. The test bug-1559004-EMLINK-handling.t takes ~950 seconds at
times on master without mux. In mux cases, when it fails, it is almost
at the last iteration, hence bumping the timeout for this test case
to reduce regression error rates
Signed-off-by: Atin Mukherjee <firstname.lastname@example.org>
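Points 1 and 2 of the commit message above concern portmap bookkeeping when several bricks share one multiplexed process. A rough model of that bookkeeping is sketched below. The PortmapRegistry class and its methods are invented for illustration and are not glusterd's pmap API; the assumption is that each port belongs to one process and may carry several brick instances.

```python
class PortmapRegistry:
    """Hypothetical port -> (pid, brick paths) table."""

    def __init__(self):
        self._entries = {}

    def register(self, port, pid, brick_path):
        entry = self._entries.setdefault(port, {"pid": pid, "bricks": set()})
        entry["bricks"].add(brick_path)

    def remove_brick(self, port, brick_path):
        # Removing the last brick on a port drops the port entry too.
        entry = self._entries.get(port)
        if entry is None:
            return
        entry["bricks"].discard(brick_path)
        if not entry["bricks"]:
            del self._entries[port]

    def remove_process(self, pid):
        # A killed process takes every brick it hosted with it, so all
        # of its brick instances are removed in one pass.
        removed = []
        for port in [p for p, e in self._entries.items() if e["pid"] == pid]:
            removed.extend(sorted(self._entries[port]["bricks"]))
            del self._entries[port]
        return removed

    def ports(self):
        return sorted(self._entries)
```

In this model, killing a process that hosts two bricks leaves no stale entries behind for either brick, which is the situation point 2 is meant to address.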
Master branch is now stable across all brick-mux regressions!
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.
glusterfs-5.0 has been announced on the Gluster mailing lists; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list and the update infrastructure for your distribution.