Bug 1613807 - Fix spurious failures in tests/basic/afr/granular-esh/replace-brick.t
Summary: Fix spurious failures in tests/basic/afr/granular-esh/replace-brick.t
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: tests
Version: mainline
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-08-08 10:29 UTC by Pranith Kumar K
Modified: 2018-10-23 15:16 UTC (History)

Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-23 15:16:35 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Pranith Kumar K 2018-08-08 10:29:06 UTC
Description of problem:
Shd keeps doing heals in a loop as long as it healed at least one entry in the previous run. A heal counts as successful only if that healer completes both the metadata heal and the entry/data heal, i.e. the entry needs to be completely healed by just that healer.

In the tests/basic/afr/granular-esh/replace-brick.t test, brick-0 is old and brick-1 is new. After replace-brick only root-gfid will be present in brick-0's index.
1) The shd thread corresponding to brick-0 does the metadata heal, which creates root-gfid in brick-0's 'dirty' index.
2) Both healer threads, corresponding to brick-0 and brick-1, now try to heal root-gfid, and brick-1 gets the heal-domain lock. brick-0's shd thread experiences a failure and goes back to waiting for 10 minutes (cluster.heal-timeout).
3) When brick-1's healer thread completes healing root-gfid, it creates 5 files, which create indices in brick-0. Until brick-0's healer triggers one more heal, these files won't be healed. $HEAL_TIMEOUT is set to 120 seconds, which is less than cluster.heal-timeout, so the test stops waiting before the next heal starts. The fix is to decrease cluster.heal-timeout to 5 seconds so that the next heal is triggered in time and does the heals.
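
The timing argument above can be illustrated with a small toy model. This is only a sketch, not glusterfs code: the function name pending_after_wait, its parameters, and the event times (brick-0's healer going idle at t=1s, the 5 index entries appearing at t=2s) are assumptions made up for illustration, and the heal itself is treated as instantaneous.

```python
def pending_after_wait(heal_timeout, wait_limit,
                       idle_since=1.0, entries_created_at=2.0, pending=5):
    """Toy model: pending heal count seen by the test after `wait_limit` seconds.

    Assumptions (hypothetical, for illustration only):
    - brick-0's healer failed to get the lock and went idle at `idle_since`;
    - brick-1's healer created `pending` index entries at `entries_created_at`;
    - the idle healer wakes up every `heal_timeout` seconds and heals all
      entries that exist at that moment, instantly.
    """
    wake = idle_since + heal_timeout
    # Only a crawl that starts after the entries exist can pick them up.
    while wake < entries_created_at:
        wake += heal_timeout
    # If the healer wakes within the test's wait window, everything is healed.
    return 0 if wake <= wait_limit else pending


# Default cluster.heal-timeout (10 minutes) vs. the 120-second $HEAL_TIMEOUT wait.
print(pending_after_wait(heal_timeout=600, wait_limit=120))
# With cluster.heal-timeout reduced to 5 seconds.
print(pending_after_wait(heal_timeout=5, wait_limit=120))
```

Under these assumptions the first call leaves all 5 entries pending, which corresponds to the "Got "5" instead of "0"" failure in the log, while the second call reports 0 because the healer wakes up well inside the 120-second window.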

Comment 1 Pranith Kumar K 2018-08-08 10:29:49 UTC
Log of the failure run

17:27:53 [17:27:57] Running tests in file ./tests/basic/afr/granular-esh/replace-brick.t
17:29:30 ./tests/basic/afr/granular-esh/replace-brick.t .. 
17:29:30 1..34
17:29:30 ok 1, LINENUM:7
17:29:30 ok 2, LINENUM:8
17:29:30 ok 3, LINENUM:9
17:29:30 ok 4, LINENUM:10
17:29:30 ok 5, LINENUM:11
17:29:30 ok 6, LINENUM:12
17:29:30 ok 7, LINENUM:13
17:29:30 ok 8, LINENUM:14
17:29:30 ok 9, LINENUM:15
17:29:30 ok 10, LINENUM:17
17:29:30 ok 11, LINENUM:26
17:29:30 ok 12, LINENUM:29
17:29:30 ok 13, LINENUM:32
17:29:30 ok 14, LINENUM:35
17:29:30 ok 15, LINENUM:38
17:29:30 ok 16, LINENUM:41
17:29:30 ok 17, LINENUM:43
17:29:30 ok 18, LINENUM:44
17:29:30 ok 19, LINENUM:46
17:29:30 ok 20, LINENUM:47
17:29:30 ok 21, LINENUM:48
17:29:30 ok 22, LINENUM:49
17:29:30 ok 23, LINENUM:50
17:29:30 not ok 24 Got "5" instead of "0", LINENUM:53
17:29:30 FAILED COMMAND: 0 get_pending_heal_count patchy
17:29:30 ok 25, LINENUM:56
17:29:30 ok 26, LINENUM:59
17:29:30 ok 27, LINENUM:60
17:29:30 not ok 28 , LINENUM:63
17:29:30 FAILED COMMAND: diff /d/backends/patchy0/file1.txt /d/backends/patchy1_new/file1.txt
17:29:30 ok 29, LINENUM:65
17:29:30 not ok 30 Got "" instead of "qwerty", LINENUM:68
17:29:30 FAILED COMMAND: qwerty get_text_xattr user.test /d/backends/patchy1_new/file5.txt
17:29:30 ok 31, LINENUM:69
17:29:30 ok 32, LINENUM:71
17:29:30 ok 33, LINENUM:72
17:29:30 ok 34, LINENUM:73
17:29:30 Failed 3/34 subtests

Comment 2 Worker Ant 2018-08-09 11:32:41 UTC
COMMIT: https://review.gluster.org/20681 committed in master by "Atin Mukherjee" <amukherj> with a commit message- tests: Set heal-timeout to 5 seconds

Shd keeps doing heals in a loop as long as it healed at least one entry in the
previous run. A heal counts as successful only if that healer completes both the
metadata heal and the entry/data heal, i.e. the entry needs to be completely
healed by just that healer.

In the tests/basic/afr/granular-esh/replace-brick.t test, brick-0 is old and
brick-1 is new. After replace-brick only root-gfid will be present in brick-0's
index.
1) The shd thread corresponding to brick-0 does the metadata heal, which creates
root-gfid in brick-0's 'dirty' index.
2) Both healer threads, corresponding to brick-0 and brick-1, now try to heal
root-gfid, and brick-1 gets the heal-domain lock. brick-0's shd thread
experiences a failure and goes back to waiting for 10 minutes
(cluster.heal-timeout).
3) When brick-1's healer thread completes healing root-gfid, it creates 5 files,
which create indices in brick-0. Until brick-0's healer triggers one more heal,
these files won't be healed. $HEAL_TIMEOUT is set to 120 seconds, which is less
than cluster.heal-timeout, so decrease cluster.heal-timeout to 5 seconds so that
the next heal is triggered in time and does the heals.

fixes bz#1613807
Change-Id: I881133fc28880d8615fbc4558a0dfa0dc63d7798
Signed-off-by: Pranith Kumar K <pkarampu>

Comment 3 Shyamsundar 2018-10-23 15:16:35 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

