Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1577672

Summary:	Brick-mux regressions failing for over 8+ weeks on master
Product:	[Community] GlusterFS	Reporter:	Shyamsundar <srangana>
Component:	tests	Assignee:	bugs <bugs>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	mainline	CC:	atumball, bugs
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-5.0	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1582286		Environment:
Last Closed:	2018-10-05 04:34:24 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Shyamsundar 2018-05-13 21:59:39 UTC

The brick-mux nightly regression has been failing on master for 11+ weeks (since around February 22, 2018): https://build.gluster.org/job/regression-test-with-multiplex/

NOTE: This bug is opened to track the dependent failures, or newer ones.

Over time the problems seem to have morphed, or moved, because of other fixes made in between. An analysis of the last 13 builds (Jenkins job numbers 725-738, IOW https://build.gluster.org/job/regression-test-with-multiplex/[725..738]) shows the following problems that need to be addressed. Each is, or will be, filed as a separate bug as the work progresses.

1) Tests timing out without anything really having run in the test, as far as I can tell at present (timeout of 200/300 seconds). The number in brackets is the number of times the test failed in the analyzed runs.
./tests/basic/afr/entry-self-heal.t (4)
./tests/bitrot/br-state-check.t (8)
./tests/bugs/core/bug-1432542-mpx-restart-crash.t (1)
./tests/bugs/index/bug-1559004-EMLINK-handling.t (2)

2) ./tests/bugs/replicate/bug-1363721.t (4)
   ./tests/basic/afr/lk-quorum.t (5)

lk-quorum failures are easily observable in tests run on local machines as well. The issue seems to be that the brick process for one of the bricks either does not come up or is not listening, which causes the checks on the up counts to fail.

bug-1363721.t failures seem to be related to the same.

3) ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)

The above failure seems to occur consistently when adding a previously added brick (using force); the operation fails in glusterd with a "transport endpoint not connected" error.

Comment 1 Worker Ant 2018-05-15 04:58:19 UTC

REVIEW: https://review.gluster.org/20022 (glusterd: address test failures with brick mux enabled) posted (#2) for review on master by Atin Mukherjee

Comment 2 Worker Ant 2018-05-18 10:40:54 UTC

REVIEW: https://review.gluster.org/20036 (afr: fix bug-1363721.t failure) posted (#1) for review on master by Ravishankar N

Comment 3 Worker Ant 2018-05-18 14:39:29 UTC

REVIEW: https://review.gluster.org/20037 (changelog: fix br-state-check.t failure for brick_mux) posted (#1) for review on master by MOHIT AGRAWAL

Comment 4 Worker Ant 2018-05-22 06:05:52 UTC

COMMIT: https://review.gluster.org/20036 committed in master by "Ravishankar N" <ravishankar> with a commit message- afr: fix bug-1363721.t failure

Problem:
In the .t, when the only good brick was brought down, writes on the fd were
still succeeding on the bad bricks. The inflight split-brain check was
marking the write as failure but since the write succeeded on all the
bad bricks, afr_txn_nothing_failed() was set to true and we were
unwinding writev with success to DHT and then catching the failure in
post-op in the background.

Fix:
Don't wind the FOP phase if the write_subvol (which is populated with readable
subvols obtained in pre-op cbk) does not have at least 1 good brick which was up
when the transaction started.

Note: This fix is not related to brick multiplexing. I ran the .t
10 times with this fix and brick-mux enabled without any failures.

Change-Id: I915c9c366aa32cd342b1565827ca2d83cb02ae85
updates: bz#1577672
Signed-off-by: Ravishankar N <ravishankar>
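
The check described under "Fix:" above can be illustrated with a minimal sketch. This is not the AFR source: `struct txn_model`, its fields, and `txn_may_wind_fop()` are hypothetical names introduced only for illustration. The assumption is that the transaction remembers which subvolumes were readable ("good") in the pre-op callback and which bricks were up when the transaction started.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one AFR write transaction.
 * None of these names come from the GlusterFS source. */
struct txn_model {
    size_t      child_count;   /* number of bricks in the replica set */
    const bool *readable;      /* "good" bricks, as recorded in pre-op */
    const bool *up_at_start;   /* bricks that were up when the txn began */
};

/* Return true only if at least one brick is both good and was up when
 * the transaction started; otherwise the FOP phase should not be wound. */
bool txn_may_wind_fop(const struct txn_model *txn)
{
    for (size_t i = 0; i < txn->child_count; i++) {
        if (txn->readable[i] && txn->up_at_start[i])
            return true;
    }
    return false;
}
```

In the failing test scenario the only good brick is down, so a check of this shape returns false and the write would be failed up front instead of being unwound as a success and failing later in post-op.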

Comment 5 Worker Ant 2018-05-25 07:15:54 UTC

COMMIT: https://review.gluster.org/20037 committed in master by "Amar Tumballi" <amarts> with a commit message- changelog: fix br-state-check.t failure for brick_mux

Problem: br-state-check.t sometimes crashes while running
         with brick multiplexing, and a command in the test case
         takes 2 minutes to detach a brick

Solution: Update the changelog xlator code to wait for all
          connections before cleaning up the rpc threads, and
          clean up the rpc object only in the non-brick-mux scenario

BUG: 1577672
Change-Id: I16e257c1e127744a815000b87bd8b7b8d9c51e1b
fixes: bz#1577672
Signed-off-by: Mohit Agrawal <moagrawa>
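
The "wait for all connections before cleanup" idea in the solution above can be sketched as a counter guarded by a mutex and a condition variable. This is a generic pattern under stated assumptions, not the changelog xlator's code: `struct conn_tracker` and the four functions are hypothetical names, and the real patch may track connections differently.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical tracker for live RPC connections (illustration only). */
struct conn_tracker {
    pthread_mutex_t lock;
    pthread_cond_t  drained;   /* signalled when active drops to zero */
    int             active;    /* number of connections still open */
};

void conn_tracker_init(struct conn_tracker *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->drained, NULL);
    t->active = 0;
}

void conn_opened(struct conn_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    t->active++;
    pthread_mutex_unlock(&t->lock);
}

void conn_closed(struct conn_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    if (--t->active == 0)
        pthread_cond_broadcast(&t->drained);
    pthread_mutex_unlock(&t->lock);
}

/* Block until every connection has closed, then return the remaining
 * count (always 0). Thread cleanup would only start after this returns. */
int wait_for_drain(struct conn_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->active > 0)
        pthread_cond_wait(&t->drained, &t->lock);
    int remaining = t->active;
    pthread_mutex_unlock(&t->lock);
    return remaining;
}
```

A teardown path built this way would call `wait_for_drain()` first and stop the rpc threads afterwards. Per the commit message, the rpc object itself is cleaned up only when brick multiplexing is not in use.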

Comment 6 Worker Ant 2018-05-31 04:27:56 UTC

COMMIT: https://review.gluster.org/20022 committed in master by "Amar Tumballi" <amarts> with a commit message- glusterd: address test failures with brick mux enabled

This patch addresses the following:
1. On volume stop, for the last brick, pmap_registry_remove () is
invoked by glusterd.
2. If a brick process is sigkilled, remove all the associated brick
instances from the portmap.
3. Bump up PROCESS_UP_TIMEOUT to 45.
4. gf_attach to kill a brick takes more time in mux (which is an
issue that needs a fix), but in the interim, give br-state-check.t
more time to complete (there are 2 kill_bricks, each taking 120
seconds, and the test usually passes in 30 odd seconds, hence bumping
this up to 350 seconds)
5. The test bug-1559004-EMLINK-handling.t takes ~950 seconds at
times on master without mux. In mux cases, when it fails, it is almost
at the last iteration, hence bumping the timeout for this test case
to reduce regression error rates.

Updates: bz#1577672
Change-Id: I1922675e112baca4c125c4c094eaa42a11e34e67
Signed-off-by: Atin Mukherjee <amukherj>
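
Item 2 of the patch description above (dropping every brick instance of a killed process from the portmap) can be pictured with a small model. This sketch is not glusterd's pmap code: `struct port_entry`, `MAX_BRICKS_PER_PORT`, and `registry_remove_by_pid()` are hypothetical, and the assumption is simply that with multiplexing one process, and therefore one port, can serve several bricks.

```c
#include <stddef.h>

#define MAX_BRICKS_PER_PORT 8   /* illustrative limit, not a glusterd constant */

/* Hypothetical registry entry: one port owned by one brick process,
 * which may host several multiplexed bricks. */
struct port_entry {
    int         port;
    int         pid;                           /* owning process, 0 if free */
    const char *bricks[MAX_BRICKS_PER_PORT];   /* NULL marks an empty slot */
};

/* Remove every brick registered under a killed process and mark its
 * port entries free. Returns the number of brick instances removed. */
size_t registry_remove_by_pid(struct port_entry *entries, size_t n, int dead_pid)
{
    size_t removed = 0;

    for (size_t i = 0; i < n; i++) {
        if (entries[i].pid != dead_pid)
            continue;
        for (size_t j = 0; j < MAX_BRICKS_PER_PORT; j++) {
            if (entries[i].bricks[j] != NULL) {
                entries[i].bricks[j] = NULL;
                removed++;
            }
        }
        entries[i].pid = 0;   /* the port no longer has an owner */
    }
    return removed;
}
```

The point of the model is that removing only the one brick whose check noticed the dead process would leave stale entries for its multiplexed siblings, which a later port lookup could still return.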

Comment 7 Amar Tumballi 2018-10-05 04:34:24 UTC

https://build.gluster.org/job/regression-test-with-multiplex/

The master branch is now stable for all brick-mux regression tests!

Comment 8 Shyamsundar 2018-10-23 15:08:41 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/