Bug 1360574 - multiple failures of tests/bugs/disperse/bug-1236065.t
Summary: multiple failures of tests/bugs/disperse/bug-1236065.t
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: 3.8.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On: 1332054
Blocks:
Reported: 2016-07-27 05:13 UTC by Pranith Kumar K
Modified: 2016-08-12 09:47 UTC (History)

Fixed In Version: glusterfs-3.8.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1332054
Environment:
Last Closed: 2016-08-12 09:47:50 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Pranith Kumar K 2016-07-27 05:13:53 UTC
+++ This bug was initially created as a clone of Bug #1332054 +++

Description of problem:
tests/bugs/disperse/bug-1236065.t failed several times on different Jenkins slaves:

* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20316/console
* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20320/console
* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20321/console

Version-Release number of selected component (if applicable):
current master branch

How reproducible:
way too often

Steps to Reproduce:
1. run tests/bugs/disperse/bug-1236065.t as regression test on Jenkins

Actual results:

Sometimes test 24 fails, sometimes test 25.

13:25:28 [20:25:28] Running tests in file ./tests/bugs/disperse/bug-1236065.t
13:26:13 cp: accessing `13.o': Input/output error
13:26:13 cp: accessing `14.o': Input/output error
13:26:13 cp: accessing `15.o': Input/output error
13:26:13 cp: accessing `16.o': Input/output error
13:26:13 cp: accessing `17.o': Input/output error
13:26:14 cp: accessing `18.o': Input/output error
13:26:14 cp: accessing `19.o': Input/output error
13:26:14 cp: accessing `1.o': Input/output error
13:26:14 cp: accessing `2.o': Input/output error
13:26:15 cp: accessing `3.o': Input/output error
13:26:15 cp: accessing `4.o': Input/output error
13:26:15 cp: accessing `5.o': Input/output error
13:26:15 cp: accessing `6.o': Input/output error
13:26:15 cp: accessing `7.o': Input/output error
13:26:16 cp: accessing `8.o': Input/output error
13:26:16 cp: accessing `9.o': Input/output error
13:27:28 tar: Removing leading `/' from member names
13:27:28 ./tests/bugs/disperse/bug-1236065.t .. 
13:27:28 1..41
13:27:28 ok 1, LINENUM:28
13:27:28 ok 2, LINENUM:29
13:27:28 ok 3, LINENUM:30
13:27:28 ok 4, LINENUM:31
13:27:28 ok 5, LINENUM:32
13:27:28 ok 6, LINENUM:33
13:27:28 ok 7, LINENUM:36
13:27:28 ok 8, LINENUM:39
13:27:28 ok 9, LINENUM:42
13:27:28 ok 10, LINENUM:43
13:27:28 ok 11, LINENUM:44
13:27:28 ok 12, LINENUM:46
13:27:28 ok 13, LINENUM:47
13:27:28 ok 14, LINENUM:50
13:27:28 ok 15, LINENUM:51
13:27:28 ok 16, LINENUM:54
13:27:28 ok 17, LINENUM:55
13:27:28 ok 18, LINENUM:56
13:27:28 ok 19, LINENUM:58
13:27:28 ok 20, LINENUM:59
13:27:28 ok 21, LINENUM:62
13:27:28 ok 22, LINENUM:63
13:27:28 ok 23, LINENUM:64
13:27:28 ok 24, LINENUM:66
13:27:28 not ok 25 , LINENUM:67
13:27:28 FAILED COMMAND: ec_test_make
13:27:28 ok 26, LINENUM:69
13:27:28 ok 27, LINENUM:72
13:27:28 ok 28, LINENUM:73
13:27:28 ok 29, LINENUM:76
13:27:28 ok 30, LINENUM:77
13:27:28 ok 31, LINENUM:78
13:27:28 ok 32, LINENUM:80
13:27:28 ok 33, LINENUM:81
13:27:28 ok 34, LINENUM:83
13:27:28 ok 35, LINENUM:84
13:27:28 ok 36, LINENUM:85
13:27:28 ok 37, LINENUM:86
13:27:28 ok 38, LINENUM:90
13:27:28 ok 39, LINENUM:91
13:27:28 ok 40, LINENUM:92
13:27:28 ok 41, LINENUM:93
13:27:28 Failed 1/41 subtests 
13:27:28 
13:27:28 Test Summary Report
13:27:28 -------------------
13:27:28 ./tests/bugs/disperse/bug-1236065.t (Wstat: 0 Tests: 41 Failed: 1)
13:27:28   Failed test:  25
13:27:28 Files=1, Tests=41, 120 wallclock secs ( 0.03 usr  0.00 sys +  5.74 cusr  2.62 csys =  8.39 CPU)
13:27:28 Result: FAIL
13:27:28 End of test ./tests/bugs/disperse/bug-1236065.t
13:27:28 ================================================================================
13:27:28 
13:27:28 
13:27:28 Run complete
13:27:28 ================================================================================
13:27:28 Number of tests found:                             177
13:27:28 Number of tests selected for run based on pattern: 177
13:27:28 Number of tests skipped as they were marked bad:   7
13:27:28 Number of tests skipped because of known_issues:   1
13:27:28 Number of tests that were run:                     169
13:27:28 
13:27:28 1 test(s) failed 
13:27:28 ./tests/bugs/disperse/bug-1236065.t
13:27:28 
13:27:28 0 test(s) generated core

--- Additional comment from Niels de Vos on 2016-05-01 16:47:18 EDT ---

Adding the 'tracking' keyword so that our bug-status-check-script does not trip over it. Please remove the keyword when progress on this bug is made.

--- Additional comment from Vijay Bellur on 2016-05-01 16:53:02 EDT ---

REVIEW: http://review.gluster.org/14138 (disperse: mark bug-1236065.t as bad_test) posted (#1) for review on master by Niels de Vos (ndevos)

--- Additional comment from Xavier Hernandez on 2016-05-02 04:56:07 EDT ---

I'm unable to reproduce the problem; however, the logs seem to indicate that healing operations are still running after a successful completion of the test 'EXPECT_WITHIN $HEAL_TIMEOUT "0" get_pending_heal_count $V0'. Since additional bricks are killed after this test finishes, some files might get damaged because more bricks than the redundancy count will be bad, causing the I/O errors.

Most probably the root cause is that EXPECT_WITHIN uses a regular expression and a simple "0" matches many values, for example "10". This means that if exactly 10 files still need to be healed when the test is run, the test will finish successfully, but self-healing won't have finished yet.

I'll post a patch to solve this problem.
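
The matching problem described above can be sketched in a few lines of shell. This is only an illustration, assuming the comparison behaves like an unanchored extended-regex search; the helper name `matches` is hypothetical and is not part of the GlusterFS test framework.

```shell
#!/bin/sh
# Hypothetical helper (not from the GlusterFS test framework): succeeds
# when the extended regex in $1 matches anywhere in the value $2, the way
# an unanchored regex comparison would.
matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# An unanchored "0" is found inside "10", so a check that expects "0"
# would pass while 10 entries are still waiting to be healed.
if matches '0' '10'; then
    echo 'pattern 0 matches value 10'
fi

# Anchoring both ends only accepts the exact value "0".
if ! matches '^0$' '10'; then
    echo 'pattern ^0$ rejects value 10'
fi
if matches '^0$' '0'; then
    echo 'pattern ^0$ matches value 0'
fi
```

Under that assumption, anchoring the expected value (for example "^0$" instead of "0") is one way to make the pending-heal check pass only when the count is really zero.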

--- Additional comment from Vijay Bellur on 2016-05-02 05:04:11 EDT ---

REVIEW: http://review.gluster.org/14145 (cluster/ec: Fix spurious failure of test bug-1236065.t) posted (#1) for review on master by Xavier Hernandez (xhernandez)

--- Additional comment from Vijay Bellur on 2016-05-02 07:42:53 EDT ---

COMMIT: http://review.gluster.org/14138 committed in master by Jeff Darcy (jdarcy) 
------
commit 70a889489d79c41edfed52fdbdfa6869869906aa
Author: Niels de Vos <ndevos>
Date:   Sun May 1 22:49:57 2016 +0200

    disperse: mark bug-1236065.t as bad_test
    
    tests/bugs/disperse/bug-1236065.t failed several times on different
    Jenkins slaves:
    
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20316/console
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20320/console
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20321/console
    
    BUG: 1332054
    Change-Id: Ie1934f09f843c2089c187e9295288c16c01913d2
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: http://review.gluster.org/14138
    Reviewed-by: Susant Palai <spalai>
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Vijay Bellur <vbellur>
    CentOS-regression: Gluster Build System <jenkins.com>

--- Additional comment from Pranith Kumar K on 2016-05-02 09:09:18 EDT ---

(In reply to Xavier Hernandez from comment #3)
> I'm unable to reproduce the problem, however logs seem to indicate that
> healing operations are still running after a successful completion of test
> 'EXPECT_WITHIN $HEAL_TIMEOUT "0" get_pending_heal_count $V0'. Since
> additional bricks are killed after this test finishes, some files might get
> damaged as more than redundancy bricks will be bad, causing the I/O errors.
> 
> Most probably the root cause is that EXPECT_WITHIN uses a regular expression
> and a simple "0" matches many values, for example "10". This means that if
> exactly 10 files still need to be healed when the test is run, the test will
> finish successfully, but self-healing won't have finished yet.
> 
> I'll post a patch to solve this problem.

Good catch! It could very well be this issue.

--- Additional comment from Vijay Bellur on 2016-07-22 05:27:03 EDT ---

REVIEW: http://review.gluster.org/14985 (tests: Fix pending-heal-count checks) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Vijay Bellur on 2016-07-22 13:01:50 EDT ---

COMMIT: http://review.gluster.org/14985 committed in master by Jeff Darcy (jdarcy) 
------
commit c5bf5d98594a4237a72cf0d3c72925d5a5aa0f69
Author: Pranith Kumar K <pkarampu>
Date:   Fri Jul 22 13:58:22 2016 +0530

    tests: Fix pending-heal-count checks
    
    EXPECT_WITHIN takes regular expression to match the count,
    so even when there are say 10 entries to heal, it would
    think that the heal is complete. Fixed checking
    pending heal count with correct regex.
    
    Thanks to Xavi for finding this problem.
    
    Change-Id: Ic593d22468b2b586bfca864962ffa0eda96b1d1f
    BUG: 1332054
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/14985
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

--- Additional comment from Vijay Bellur on 2016-07-25 10:34:41 EDT ---

REVIEW: http://review.gluster.org/15006 (tests: Fix get_pending_heal_count check in ec) posted (#1) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Vijay Bellur on 2016-07-27 00:48:45 EDT ---

REVIEW: http://review.gluster.org/15006 (tests: Fix get_pending_heal_count check in ec) posted (#2) for review on master by Ravishankar N (ravishankar)

Comment 1 Vijay Bellur 2016-07-27 05:52:58 UTC
REVIEW: http://review.gluster.org/15023 (tests: Fix pending-heal-count checks) posted (#1) for review on release-3.8 by Pranith Kumar Karampuri (pkarampu)

Comment 2 Vijay Bellur 2016-07-28 14:02:58 UTC
COMMIT: http://review.gluster.org/15023 committed in release-3.8 by Pranith Kumar Karampuri (pkarampu) 
------
commit 56ca0b14aaf4e6daddc2b787765db659b1c2ff1b
Author: Pranith Kumar K <pkarampu>
Date:   Fri Jul 22 13:58:22 2016 +0530

    tests: Fix pending-heal-count checks
    
    EXPECT_WITHIN takes regular expression to match the count,
    so even when there are say 10 entries to heal, it would
    think that the heal is complete. Fixed checking
    pending heal count with correct regex.
    
    Thanks to Xavi for finding this problem.
    
     >Change-Id: Ic593d22468b2b586bfca864962ffa0eda96b1d1f
     >BUG: 1332054
     >Signed-off-by: Pranith Kumar K <pkarampu>
     >Reviewed-on: http://review.gluster.org/14985
     >Smoke: Gluster Build System <jenkins.org>
     >Reviewed-by: Xavier Hernandez <xhernandez>
     >NetBSD-regression: NetBSD Build System <jenkins.org>
     >CentOS-regression: Gluster Build System <jenkins.org>
    
    BUG: 1360574
    Change-Id: I310f8d492bb576224797d9090658ca1e6367861c
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/15023
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Ravishankar N <ravishankar>

Comment 3 Vijay Bellur 2016-07-29 15:31:00 UTC
REVIEW: http://review.gluster.org/15047 (tests: Fix get_pending_heal_count check in ec) posted (#1) for review on release-3.8 by Ravishankar N (ravishankar)

Comment 4 Vijay Bellur 2016-07-30 11:40:16 UTC
COMMIT: http://review.gluster.org/15047 committed in release-3.8 by Xavier Hernandez (xhernandez) 
------
commit 07d7dec8ec307e68cf8f9690174ef0e9c6497085
Author: Ravishankar N <ravishankar>
Date:   Fri Jul 29 20:43:36 2016 +0530

    tests: Fix get_pending_heal_count check in ec
    
    Backport of http://review.gluster.org/#/c/15006/
    
    Change-Id: I3d274bdc2036392af942a17a0e0bf28f431c947b
    BUG: 1360574
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/15047
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>

Comment 5 Niels de Vos 2016-08-12 09:47:50 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.2, please open a new bug report.

glusterfs-3.8.2 has been announced on the Gluster mailing lists [1]. Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://www.gluster.org/pipermail/announce/2016-August/000058.html
[2] https://www.gluster.org/pipermail/gluster-users/

