Description of problem:

Shd keeps retrying heals in a loop as long as it healed at least one entry in the previous run. A heal is counted as successful only if it completes both the metadata and the entry/data heal, i.e. the entry must be fully healed by that one healer.

In tests/basic/afr/granular-esh/replace-brick.t, brick-0 is the old brick and brick-1 is the new one. After replace-brick, only the root gfid is present in brick-0's index.
1) The shd thread corresponding to brick-0 performs the metadata heal, which creates the root gfid in brick-0's 'dirty' index.
2) The healer threads corresponding to brick-0 and brick-1 both try to heal the root gfid, and brick-1 wins the heal-domain lock. brick-0's shd thread sees this as a failure and goes back to waiting for 10 minutes (cluster.heal-timeout).
3) When brick-1's healer thread finishes healing the root gfid, it creates 5 files, which create index entries on brick-0. Until brick-0 triggers one more heal pass, those entries will not be healed.

$HEAL_TIMEOUT in the test framework is 120 seconds, which is less than cluster.heal-timeout, so decrease cluster.heal-timeout to 5 seconds so that the next heal pass is triggered and completes the remaining heals (a sketch of the change follows below).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
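The fix committed below boils down to lowering the volume's heal retry interval inside the test. A minimal sketch of what that looks like in the .t harness, assuming the standard helpers from the test framework ($CLI, $V0, $HEAL_TIMEOUT, EXPECT_WITHIN, get_pending_heal_count); the exact lines are in the commit linked in the next comment:

    # Drop shd's retry interval from its 600-second default to 5 seconds,
    # so brick-0's healer re-scans its index well within the test's window.
    TEST $CLI volume set $V0 cluster.heal-timeout 5

    # The index entries created by brick-1's heal of the root gfid are now
    # picked up on the next shd pass, so the pending count drains to zero.
    EXPECT_WITHIN $HEAL_TIMEOUT "0" get_pending_heal_count $V0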
Log of the failure run:

17:27:53 [17:27:57] Running tests in file ./tests/basic/afr/granular-esh/replace-brick.t
17:29:30 ./tests/basic/afr/granular-esh/replace-brick.t ..
17:29:30 1..34
17:29:30 ok 1, LINENUM:7
17:29:30 ok 2, LINENUM:8
17:29:30 ok 3, LINENUM:9
17:29:30 ok 4, LINENUM:10
17:29:30 ok 5, LINENUM:11
17:29:30 ok 6, LINENUM:12
17:29:30 ok 7, LINENUM:13
17:29:30 ok 8, LINENUM:14
17:29:30 ok 9, LINENUM:15
17:29:30 ok 10, LINENUM:17
17:29:30 ok 11, LINENUM:26
17:29:30 ok 12, LINENUM:29
17:29:30 ok 13, LINENUM:32
17:29:30 ok 14, LINENUM:35
17:29:30 ok 15, LINENUM:38
17:29:30 ok 16, LINENUM:41
17:29:30 ok 17, LINENUM:43
17:29:30 ok 18, LINENUM:44
17:29:30 ok 19, LINENUM:46
17:29:30 ok 20, LINENUM:47
17:29:30 ok 21, LINENUM:48
17:29:30 ok 22, LINENUM:49
17:29:30 ok 23, LINENUM:50
17:29:30 not ok 24 Got "5" instead of "0", LINENUM:53
17:29:30 FAILED COMMAND: 0 get_pending_heal_count patchy
17:29:30 ok 25, LINENUM:56
17:29:30 ok 26, LINENUM:59
17:29:30 ok 27, LINENUM:60
17:29:30 not ok 28 , LINENUM:63
17:29:30 FAILED COMMAND: diff /d/backends/patchy0/file1.txt /d/backends/patchy1_new/file1.txt
17:29:30 ok 29, LINENUM:65
17:29:30 not ok 30 Got "" instead of "qwerty", LINENUM:68
17:29:30 FAILED COMMAND: qwerty get_text_xattr user.test /d/backends/patchy1_new/file5.txt
17:29:30 ok 31, LINENUM:69
17:29:30 ok 32, LINENUM:71
17:29:30 ok 33, LINENUM:72
17:29:30 ok 34, LINENUM:73
17:29:30 Failed 3/34 subtests
COMMIT: https://review.gluster.org/20681 committed in master by "Atin Mukherjee" <amukherj> with a commit message:

tests: Set heal-timeout to 5 seconds

Shd keeps retrying heals in a loop as long as it healed at least one entry in the previous run. A heal is counted as successful only if it completes both the metadata and the entry/data heal, i.e. the entry must be fully healed by that one healer.

In tests/basic/afr/granular-esh/replace-brick.t, brick-0 is the old brick and brick-1 is the new one. After replace-brick, only the root gfid is present in brick-0's index.
1) The shd thread corresponding to brick-0 performs the metadata heal, which creates the root gfid in brick-0's 'dirty' index.
2) The healer threads corresponding to brick-0 and brick-1 both try to heal the root gfid, and brick-1 wins the heal-domain lock. brick-0's shd thread sees this as a failure and goes back to waiting for 10 minutes (cluster.heal-timeout).
3) When brick-1's healer thread finishes healing the root gfid, it creates 5 files, which create index entries on brick-0. Until brick-0 triggers one more heal pass, those entries will not be healed.

$HEAL_TIMEOUT is 120 seconds, which is less than cluster.heal-timeout, so decrease cluster.heal-timeout to 5 seconds so that the next heal pass is triggered and completes the remaining heals.

fixes bz#1613807
Change-Id: I881133fc28880d8615fbc4558a0dfa0dc63d7798
Signed-off-by: Pranith Kumar K <pkarampu>
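For anyone checking this behaviour outside the test harness, the same option can be inspected and tuned with the plain gluster CLI (the volume name patchy matches the log above; substitute your own as needed):

    # Show the current self-heal retry interval (defaults to 600 seconds).
    gluster volume get patchy cluster.heal-timeout

    # Lower it so shd retries soon after a failed heal attempt.
    gluster volume set patchy cluster.heal-timeout 5

    # Watch the pending-heal backlog drain on the next shd pass.
    gluster volume heal patchy info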
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/