Brick-mux nightly regression has been failing on master for over 11 weeks (since around Feb 22, 2018): https://build.gluster.org/job/regression-test-with-multiplex/

NOTE: This bug is opened to track the dependent failures, and any newer ones.

Over time the problems seem to have morphed, or moved, due to other fixes in between. An analysis of the last 13 builds (Jenkins job numbers 725-738, IOW https://build.gluster.org/job/regression-test-with-multiplex/[725..738]) shows the following problems that need to be addressed. Each is filed as a separate bug (or will be as the case progresses).

1) Test timeouts without anything really having run in the test (as far as I can tell at present) (timeout in 200/300 seconds)
(the number in brackets is the number of times the test failed in the analyzed runs)
./tests/basic/afr/entry-self-heal.t (4)
./tests/bitrot/br-state-check.t (8)
./tests/bugs/core/bug-1432542-mpx-restart-crash.t (1)
./tests/bugs/index/bug-1559004-EMLINK-handling.t (2)

2)
./tests/bugs/replicate/bug-1363721.t (4)
./tests/basic/afr/lk-quorum.t (5)

lk-quorum failures are easily observable in tests run on local machines as well. The issue seems to be that the brick process is not listening, or not coming up, for one of the bricks, causing the checks for up counts to fail. bug-1363721.t failures seem to be related to the same issue.

3)
./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)

The above failure seems to occur consistently when adding a previously added brick (but using force), and it fails in glusterd with transport endpoint not connected.
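To help prioritize, the counts above can be turned into a per-test failure rate over the 13 analyzed builds. The following is only a small sketch: the counts and the build total are copied from the description above, while the function name `rank_by_failure_rate` is made up for illustration and is not part of any Gluster tooling.

```python
# Failure counts copied from the bug description above. ANALYZED_BUILDS is the
# number of analyzed runs stated there. Everything else is illustrative.
ANALYZED_BUILDS = 13

FAILURES = {
    "./tests/basic/afr/entry-self-heal.t": 4,
    "./tests/bitrot/br-state-check.t": 8,
    "./tests/bugs/core/bug-1432542-mpx-restart-crash.t": 1,
    "./tests/bugs/index/bug-1559004-EMLINK-handling.t": 2,
    "./tests/bugs/replicate/bug-1363721.t": 4,
    "./tests/basic/afr/lk-quorum.t": 5,
    "./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t": 7,
}


def rank_by_failure_rate(failures, builds):
    """Return (test, count, percent) tuples, most frequent failure first.

    Ties are broken alphabetically by test path so the order is stable.
    """
    ranked = sorted(failures.items(), key=lambda item: (-item[1], item[0]))
    return [(test, count, round(100 * count / builds)) for test, count in ranked]


if __name__ == "__main__":
    for test, count, percent in rank_by_failure_rate(FAILURES, ANALYZED_BUILDS):
        print(f"{percent:3d}%  {count:2d}/{ANALYZED_BUILDS}  {test}")
```

Sorted this way, br-state-check.t (8 of 13) and add-brick-and-validate-replicated-volume-options.t (7 of 13) come out on top.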
REVIEW: https://review.gluster.org/20022 (glusterd: address test failures with brick mux enabled) posted (#2) for review on master by Atin Mukherjee
REVIEW: https://review.gluster.org/20036 (afr: fix bug-1363721.t failure) posted (#1) for review on master by Ravishankar N
REVIEW: https://review.gluster.org/20037 (changelog: fix br-state-check.t failure for brick_mux) posted (#1) for review on master by MOHIT AGRAWAL
COMMIT: https://review.gluster.org/20036 committed in master by "Ravishankar N" <ravishankar> with a commit message-

afr: fix bug-1363721.t failure

Problem:
In the .t, when the only good brick was brought down, writes on the fd were still succeeding on the bad bricks. The inflight split-brain check was marking the write as a failure, but since the write succeeded on all the bad bricks, afr_txn_nothing_failed() was set to true and we were unwinding writev with success to DHT and then catching the failure in post-op in the background.

Fix:
Don't wind the FOP phase if the write_subvol (which is populated with readable subvols obtained in pre-op cbk) does not have at least 1 good brick which was up when the transaction started.

Note: This fix is not related to brick multiplexing. I ran the .t 10 times with this fix and brick-mux enabled without any failures.

Change-Id: I915c9c366aa32cd342b1565827ca2d83cb02ae85
updates: bz#1577672
Signed-off-by: Ravishankar N <ravishankar>
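The condition described in the fix can be modelled roughly as below. This is a simplified sketch in Python, not the actual AFR C code: `should_wind_fop` and `up_at_txn_start` are hypothetical names, and the per-brick boolean lists are an assumed representation of write_subvol and of the brick up state recorded when the transaction started.

```python
from typing import Sequence


def should_wind_fop(write_subvol: Sequence[bool],
                    up_at_txn_start: Sequence[bool]) -> bool:
    """Decide whether the FOP phase may be wound (illustrative model only).

    write_subvol[i]    -- True if brick i was readable (good) per the pre-op cbk.
    up_at_txn_start[i] -- True if brick i was up when the transaction started.

    The write goes ahead only if at least one brick is both good and up.
    """
    if len(write_subvol) != len(up_at_txn_start):
        raise ValueError("both sequences must describe the same set of bricks")
    return any(good and up for good, up in zip(write_subvol, up_at_txn_start))


if __name__ == "__main__":
    # Hypothetical replica with one good brick (index 0) and two bad bricks.
    good_bricks = [True, False, False]

    # The only good brick is down: do not wind, fail the write up front.
    print(should_wind_fop(good_bricks, [False, True, True]))

    # The good brick is up: the write may proceed.
    print(should_wind_fop(good_bricks, [True, True, True]))
```

The point of checking before winding is that the write fails in the foreground, instead of being unwound as a success and failing later in the background post-op.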
COMMIT: https://review.gluster.org/20037 committed in master by "Amar Tumballi" <amarts> with a commit message-

changelog: fix br-state-check.t failure for brick_mux

Problem: Sometimes br-state-check.t crashes while running with brick multiplexing, and a command in the test case takes 2 minutes to detach a brick.

Solution: Update the changelog xlator code to wait on all connections before cleaning up rpc threads, and clean up the rpc object only in the non brick mux scenario.

BUG: 1577672
Change-Id: I16e257c1e127744a815000b87bd8b7b8d9c51e1b
fixes: bz#1577672
Signed-off-by: Mohit Agrawal <moagrawa>
COMMIT: https://review.gluster.org/20022 committed in master by "Amar Tumballi" <amarts> with a commit message-

glusterd: address test failures with brick mux enabled

This patch addresses the following:

1. On volume stop, for the last brick, pmap_registry_remove () is invoked by glusterd.
2. If a brick process is sigkilled, remove all the associated brick instances from the portmap.
3. Bump up PROCESS_UP_TIMEOUT to 45.
4. gf_attach to kill a brick takes more time in mux (which is an issue that needs a fix), but in the interim, give br-state-check.t more time to complete (there are 2 kill_bricks, each taking 120 seconds, and the test usually passes in 30 odd seconds, hence bumping this up to 350 seconds).
5. The test bug-1559004-EMLINK-handling.t takes ~950 seconds at times on master without mux. In mux cases, when it fails, it is almost at the last iteration, hence bumping the timeout for this test case to reduce regression error rates.

Updates: bz#1577672
Change-Id: I1922675e112baca4c125c4c094eaa42a11e34e67
Signed-off-by: Atin Mukherjee <amukherj>
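The reasoning behind the 350 second figure in item 4 can be checked with a little arithmetic. This is just a sketch: the numbers come from the commit message above (with "30 odd seconds" taken as 30), and the function and parameter names are made up for illustration.

```python
def timeout_headroom(kill_brick_calls: int,
                     seconds_per_kill_brick: int,
                     typical_runtime: int,
                     timeout: int) -> tuple:
    """Return (worst_case_runtime, headroom) in seconds.

    worst_case_runtime assumes every kill_brick takes its full duration on top
    of the usual runtime of the rest of the test.
    """
    worst_case = kill_brick_calls * seconds_per_kill_brick + typical_runtime
    return worst_case, timeout - worst_case


if __name__ == "__main__":
    # Values from the commit message: 2 kill_bricks at 120 seconds each, a
    # usual runtime of about 30 seconds, and a new timeout of 350 seconds.
    worst_case, headroom = timeout_headroom(2, 120, 30, 350)
    print(f"worst case: {worst_case}s, headroom: {headroom}s")
```

With these inputs the worst case is 2 * 120 + 30 = 270 seconds, which leaves 80 seconds of slack under the 350 second timeout.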
https://build.gluster.org/job/regression-test-with-multiplex/

The master branch is now clear of all brick-mux regression failures!
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/