+++ This bug was initially created as a clone of Bug #1332054 +++

Description of problem:
tests/bugs/disperse/bug-1236065.t failed several times on different Jenkins slaves:
* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20316/console
* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20320/console
* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20321/console

Version-Release number of selected component (if applicable):
current master branch

How reproducible:
way too often

Steps to Reproduce:
1. run tests/bugs/disperse/bug-1236065.t as regression test on Jenkins

Actual results:
Sometimes test 24 fails, sometimes test 25.

13:25:28 [20:25:28] Running tests in file ./tests/bugs/disperse/bug-1236065.t
13:26:13 cp: accessing `13.o': Input/output error
13:26:13 cp: accessing `14.o': Input/output error
13:26:13 cp: accessing `15.o': Input/output error
13:26:13 cp: accessing `16.o': Input/output error
13:26:13 cp: accessing `17.o': Input/output error
13:26:14 cp: accessing `18.o': Input/output error
13:26:14 cp: accessing `19.o': Input/output error
13:26:14 cp: accessing `1.o': Input/output error
13:26:14 cp: accessing `2.o': Input/output error
13:26:15 cp: accessing `3.o': Input/output error
13:26:15 cp: accessing `4.o': Input/output error
13:26:15 cp: accessing `5.o': Input/output error
13:26:15 cp: accessing `6.o': Input/output error
13:26:15 cp: accessing `7.o': Input/output error
13:26:16 cp: accessing `8.o': Input/output error
13:26:16 cp: accessing `9.o': Input/output error
13:27:28 tar: Removing leading `/' from member names
13:27:28 ./tests/bugs/disperse/bug-1236065.t ..
13:27:28 1..41
13:27:28 ok 1, LINENUM:28
13:27:28 ok 2, LINENUM:29
13:27:28 ok 3, LINENUM:30
13:27:28 ok 4, LINENUM:31
13:27:28 ok 5, LINENUM:32
13:27:28 ok 6, LINENUM:33
13:27:28 ok 7, LINENUM:36
13:27:28 ok 8, LINENUM:39
13:27:28 ok 9, LINENUM:42
13:27:28 ok 10, LINENUM:43
13:27:28 ok 11, LINENUM:44
13:27:28 ok 12, LINENUM:46
13:27:28 ok 13, LINENUM:47
13:27:28 ok 14, LINENUM:50
13:27:28 ok 15, LINENUM:51
13:27:28 ok 16, LINENUM:54
13:27:28 ok 17, LINENUM:55
13:27:28 ok 18, LINENUM:56
13:27:28 ok 19, LINENUM:58
13:27:28 ok 20, LINENUM:59
13:27:28 ok 21, LINENUM:62
13:27:28 ok 22, LINENUM:63
13:27:28 ok 23, LINENUM:64
13:27:28 ok 24, LINENUM:66
13:27:28 not ok 25 , LINENUM:67
13:27:28 FAILED COMMAND: ec_test_make
13:27:28 ok 26, LINENUM:69
13:27:28 ok 27, LINENUM:72
13:27:28 ok 28, LINENUM:73
13:27:28 ok 29, LINENUM:76
13:27:28 ok 30, LINENUM:77
13:27:28 ok 31, LINENUM:78
13:27:28 ok 32, LINENUM:80
13:27:28 ok 33, LINENUM:81
13:27:28 ok 34, LINENUM:83
13:27:28 ok 35, LINENUM:84
13:27:28 ok 36, LINENUM:85
13:27:28 ok 37, LINENUM:86
13:27:28 ok 38, LINENUM:90
13:27:28 ok 39, LINENUM:91
13:27:28 ok 40, LINENUM:92
13:27:28 ok 41, LINENUM:93
13:27:28 Failed 1/41 subtests
13:27:28
13:27:28 Test Summary Report
13:27:28 -------------------
13:27:28 ./tests/bugs/disperse/bug-1236065.t (Wstat: 0 Tests: 41 Failed: 1)
13:27:28 Failed test: 25
13:27:28 Files=1, Tests=41, 120 wallclock secs ( 0.03 usr 0.00 sys + 5.74 cusr 2.62 csys = 8.39 CPU)
13:27:28 Result: FAIL
13:27:28 End of test ./tests/bugs/disperse/bug-1236065.t
13:27:28 ================================================================================
13:27:28
13:27:28
13:27:28 Run complete
13:27:28 ================================================================================
13:27:28 Number of tests found: 177
13:27:28 Number of tests selected for run based on pattern: 177
13:27:28 Number of tests skipped as they were marked bad: 7
13:27:28 Number of tests skipped because of known_issues: 1
13:27:28 Number of tests that were run: 169
13:27:28
13:27:28 1 test(s) failed
13:27:28 ./tests/bugs/disperse/bug-1236065.t
13:27:28
13:27:28 0 test(s) generated core

--- Additional comment from Niels de Vos on 2016-05-01 16:47:18 EDT ---

Adding the 'tracking' keyword so that our bug-status-check-script does not trip over it. Please remove the keyword when progress on this bug is made.

--- Additional comment from Vijay Bellur on 2016-05-01 16:53:02 EDT ---

REVIEW: http://review.gluster.org/14138 (disperse: mark bug-1236065.t as bad_test) posted (#1) for review on master by Niels de Vos (ndevos)

--- Additional comment from Xavier Hernandez on 2016-05-02 04:56:07 EDT ---

I'm unable to reproduce the problem. However, the logs seem to indicate that healing operations are still running after a successful completion of test 'EXPECT_WITHIN $HEAL_TIMEOUT "0" get_pending_heal_count $V0'. Since additional bricks are killed after this test finishes, some files might get damaged, as more than redundancy bricks will be bad, causing the I/O errors.

Most probably the root cause is that EXPECT_WITHIN uses a regular expression, and a simple "0" matches many values, for example "10". This means that if exactly 10 files still need to be healed when the test is run, the test will finish successfully, but self-healing won't have finished yet.

I'll post a patch to solve this problem.
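
For illustration, the false match described in the comment above can be reproduced outside the test framework. This is only a minimal sketch: `matches` is a hypothetical helper, and `grep -E` stands in for the matching done by EXPECT_WITHIN, whose actual implementation is not shown in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Gluster test framework):
# succeeds when VALUE ($2) matches the extended regex PATTERN ($1).
# grep -E is an assumed stand-in for the harness's own matching.
matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# An unanchored "0" matches any pending-heal count that contains a 0.
for count in 0 7 10 100; do
    if matches "0" "$count"; then
        echo "pattern 0 matches pending count $count"
    else
        echo "pattern 0 does not match pending count $count"
    fi
done
```

Under this assumption, counts such as 10 or 100 are accepted as if healing had finished, while a count of 7 is correctly rejected.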
--- Additional comment from Vijay Bellur on 2016-05-02 05:04:11 EDT ---

REVIEW: http://review.gluster.org/14145 (cluster/ec: Fix spurious failure of test bug-1236065.t) posted (#1) for review on master by Xavier Hernandez (xhernandez)

--- Additional comment from Vijay Bellur on 2016-05-02 07:42:53 EDT ---

COMMIT: http://review.gluster.org/14138 committed in master by Jeff Darcy (jdarcy)
------
commit 70a889489d79c41edfed52fdbdfa6869869906aa
Author: Niels de Vos <ndevos>
Date: Sun May 1 22:49:57 2016 +0200

    disperse: mark bug-1236065.t as bad_test

    tests/bugs/disperse/bug-1236065.t failed several times on different Jenkins slaves:
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20316/console
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20320/console
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20321/console

    BUG: 1332054
    Change-Id: Ie1934f09f843c2089c187e9295288c16c01913d2
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: http://review.gluster.org/14138
    Reviewed-by: Susant Palai <spalai>
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Vijay Bellur <vbellur>
    CentOS-regression: Gluster Build System <jenkins.com>

--- Additional comment from Pranith Kumar K on 2016-05-02 09:09:18 EDT ---

(In reply to Xavier Hernandez from comment #3)
> I'm unable to reproduce the problem, however logs seem to indicate that
> healing operations are still running after a successful completion of test
> 'EXPECT_WITHIN $HEAL_TIMEOUT "0" get_pending_heal_count $V0'. Since
> additional bricks are killed after this test finishes, some files might get
> damaged as more than redundancy bricks will be bad, causing the I/O errors.
>
> Most probably the root cause is that EXPECT_WITHIN uses a regular expression
> and a simple "0" matches many values, for example "10". This means that if
> exactly 10 files still need to be healed when the test is run, the test will
> finish successfully, but self-healing won't have finished yet.
>
> I'll post a patch to solve this problem.

Good catch! It could very well be this issue.

--- Additional comment from Vijay Bellur on 2016-07-22 05:27:03 EDT ---

REVIEW: http://review.gluster.org/14985 (tests: Fix pending-heal-count checks) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Vijay Bellur on 2016-07-22 13:01:50 EDT ---

COMMIT: http://review.gluster.org/14985 committed in master by Jeff Darcy (jdarcy)
------
commit c5bf5d98594a4237a72cf0d3c72925d5a5aa0f69
Author: Pranith Kumar K <pkarampu>
Date: Fri Jul 22 13:58:22 2016 +0530

    tests: Fix pending-heal-count checks

    EXPECT_WITHIN takes a regular expression to match the count, so even
    when there are, say, 10 entries to heal, it would think that the heal
    is complete. Fixed checking pending heal count with correct regex.

    Thanks to Xavi for finding this problem.

    Change-Id: Ic593d22468b2b586bfca864962ffa0eda96b1d1f
    BUG: 1332054
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/14985
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

--- Additional comment from Vijay Bellur on 2016-07-25 10:34:41 EDT ---

REVIEW: http://review.gluster.org/15006 (tests: Fix get_pending_heal_count check in ec) posted (#1) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Vijay Bellur on 2016-07-27 00:48:45 EDT ---

REVIEW: http://review.gluster.org/15006 (tests: Fix get_pending_heal_count check in ec) posted (#2) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/15023 (tests: Fix pending-heal-count checks) posted (#1) for review on release-3.8 by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/15023 committed in release-3.8 by Pranith Kumar Karampuri (pkarampu)
------
commit 56ca0b14aaf4e6daddc2b787765db659b1c2ff1b
Author: Pranith Kumar K <pkarampu>
Date: Fri Jul 22 13:58:22 2016 +0530

    tests: Fix pending-heal-count checks

    EXPECT_WITHIN takes regular expression to match the count, so even
    when there are say 10 entries to heal, it would think that the heal
    is complete. Fixed checking pending heal count with correct regex.

    Thanks to Xavi for finding this problem.

    >Change-Id: Ic593d22468b2b586bfca864962ffa0eda96b1d1f
    >BUG: 1332054
    >Signed-off-by: Pranith Kumar K <pkarampu>
    >Reviewed-on: http://review.gluster.org/14985
    >Smoke: Gluster Build System <jenkins.org>
    >Reviewed-by: Xavier Hernandez <xhernandez>
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>

    BUG: 1360574
    Change-Id: I310f8d492bb576224797d9090658ca1e6367861c
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/15023
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Ravishankar N <ravishankar>
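
The commit message above does not quote the corrected regex. One plausible way to make such a check exact is to anchor the pattern, as sketched here. Both `matches` and `wait_for` are hypothetical helpers written for this illustration; they are not the framework's EXPECT_WITHIN, `grep -E` is an assumed stand-in for its matching, and `echo` stands in for get_pending_heal_count.

```shell
#!/bin/sh
# Hypothetical stand-in for the harness's regex check.
matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Hypothetical polling helper: wait_for PATTERN TIMEOUT_SECONDS COMMAND...
# Runs COMMAND about once per second until the last line of its output
# matches PATTERN (returns 0) or TIMEOUT_SECONDS have elapsed (returns 1).
wait_for() {
    pattern=$1
    timeout=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        value=$("$@" | tail -n 1)
        if matches "$pattern" "$value"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# With anchors, only the exact string "0" is accepted.
wait_for '^0$' 1 echo 0 && echo "anchored: heal complete at count 0"
wait_for '^0$' 1 echo 10 || echo "anchored: still waiting at count 10"
```

Anchoring with `^` and `$` keeps the regex-based interface unchanged while rejecting counts such as 10 or 100; comparing the value numerically would be another option.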
REVIEW: http://review.gluster.org/15047 (tests: Fix get_pending_heal_count check in ec) posted (#1) for review on release-3.8 by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/15047 committed in release-3.8 by Xavier Hernandez (xhernandez)
------
commit 07d7dec8ec307e68cf8f9690174ef0e9c6497085
Author: Ravishankar N <ravishankar>
Date: Fri Jul 29 20:43:36 2016 +0530

    tests: Fix get_pending_heal_count check in ec

    Backport of http://review.gluster.org/#/c/15006/

    Change-Id: I3d274bdc2036392af942a17a0e0bf28f431c947b
    BUG: 1360574
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/15047
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.2, please open a new bug report.

glusterfs-3.8.2 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://www.gluster.org/pipermail/announce/2016-August/000058.html
[2] https://www.gluster.org/pipermail/gluster-users/