Bug 1332054 - multiple failures of tests/bugs/disperse/bug-1236065.t
Summary: multiple failures of tests/bugs/disperse/bug-1236065.t
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1360574
Reported: 2016-05-01 20:46 UTC by Niels de Vos
Modified: 2016-11-23 07:23 UTC (History)

Fixed In Version: glusterfs-3.9.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1360574 (view as bug list)
Environment:
Last Closed: 2016-11-23 07:23:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Niels de Vos 2016-05-01 20:46:40 UTC
Description of problem:
tests/bugs/disperse/bug-1236065.t failed several times on different Jenkins slaves:

* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20316/console
* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20320/console
* https://build.gluster.org/job/rackspace-regression-2GB-triggered/20321/console

Version-Release number of selected component (if applicable):
current master branch

How reproducible:
way too often

Steps to Reproduce:
1. run tests/bugs/disperse/bug-1236065.t as regression test on Jenkins

Actual results:

Sometimes test 24 fails, sometimes test 25.

13:25:28 [20:25:28] Running tests in file ./tests/bugs/disperse/bug-1236065.t
13:26:13 cp: accessing `13.o': Input/output error
13:26:13 cp: accessing `14.o': Input/output error
13:26:13 cp: accessing `15.o': Input/output error
13:26:13 cp: accessing `16.o': Input/output error
13:26:13 cp: accessing `17.o': Input/output error
13:26:14 cp: accessing `18.o': Input/output error
13:26:14 cp: accessing `19.o': Input/output error
13:26:14 cp: accessing `1.o': Input/output error
13:26:14 cp: accessing `2.o': Input/output error
13:26:15 cp: accessing `3.o': Input/output error
13:26:15 cp: accessing `4.o': Input/output error
13:26:15 cp: accessing `5.o': Input/output error
13:26:15 cp: accessing `6.o': Input/output error
13:26:15 cp: accessing `7.o': Input/output error
13:26:16 cp: accessing `8.o': Input/output error
13:26:16 cp: accessing `9.o': Input/output error
13:27:28 tar: Removing leading `/' from member names
13:27:28 ./tests/bugs/disperse/bug-1236065.t .. 
13:27:28 1..41
13:27:28 ok 1, LINENUM:28
13:27:28 ok 2, LINENUM:29
13:27:28 ok 3, LINENUM:30
13:27:28 ok 4, LINENUM:31
13:27:28 ok 5, LINENUM:32
13:27:28 ok 6, LINENUM:33
13:27:28 ok 7, LINENUM:36
13:27:28 ok 8, LINENUM:39
13:27:28 ok 9, LINENUM:42
13:27:28 ok 10, LINENUM:43
13:27:28 ok 11, LINENUM:44
13:27:28 ok 12, LINENUM:46
13:27:28 ok 13, LINENUM:47
13:27:28 ok 14, LINENUM:50
13:27:28 ok 15, LINENUM:51
13:27:28 ok 16, LINENUM:54
13:27:28 ok 17, LINENUM:55
13:27:28 ok 18, LINENUM:56
13:27:28 ok 19, LINENUM:58
13:27:28 ok 20, LINENUM:59
13:27:28 ok 21, LINENUM:62
13:27:28 ok 22, LINENUM:63
13:27:28 ok 23, LINENUM:64
13:27:28 ok 24, LINENUM:66
13:27:28 not ok 25 , LINENUM:67
13:27:28 FAILED COMMAND: ec_test_make
13:27:28 ok 26, LINENUM:69
13:27:28 ok 27, LINENUM:72
13:27:28 ok 28, LINENUM:73
13:27:28 ok 29, LINENUM:76
13:27:28 ok 30, LINENUM:77
13:27:28 ok 31, LINENUM:78
13:27:28 ok 32, LINENUM:80
13:27:28 ok 33, LINENUM:81
13:27:28 ok 34, LINENUM:83
13:27:28 ok 35, LINENUM:84
13:27:28 ok 36, LINENUM:85
13:27:28 ok 37, LINENUM:86
13:27:28 ok 38, LINENUM:90
13:27:28 ok 39, LINENUM:91
13:27:28 ok 40, LINENUM:92
13:27:28 ok 41, LINENUM:93
13:27:28 Failed 1/41 subtests 
13:27:28 
13:27:28 Test Summary Report
13:27:28 -------------------
13:27:28 ./tests/bugs/disperse/bug-1236065.t (Wstat: 0 Tests: 41 Failed: 1)
13:27:28   Failed test:  25
13:27:28 Files=1, Tests=41, 120 wallclock secs ( 0.03 usr  0.00 sys +  5.74 cusr  2.62 csys =  8.39 CPU)
13:27:28 Result: FAIL
13:27:28 End of test ./tests/bugs/disperse/bug-1236065.t
13:27:28 ================================================================================
13:27:28 
13:27:28 
13:27:28 Run complete
13:27:28 ================================================================================
13:27:28 Number of tests found:                             177
13:27:28 Number of tests selected for run based on pattern: 177
13:27:28 Number of tests skipped as they were marked bad:   7
13:27:28 Number of tests skipped because of known_issues:   1
13:27:28 Number of tests that were run:                     169
13:27:28 
13:27:28 1 test(s) failed 
13:27:28 ./tests/bugs/disperse/bug-1236065.t
13:27:28 
13:27:28 0 test(s) generated core

Comment 1 Niels de Vos 2016-05-01 20:47:18 UTC
Adding the 'tracking' keyword so that our bug-status-check-script does not trip over it. Please remove the keyword when progress on this bug is made.

Comment 2 Vijay Bellur 2016-05-01 20:53:02 UTC
REVIEW: http://review.gluster.org/14138 (disperse: mark bug-1236065.t as bad_test) posted (#1) for review on master by Niels de Vos (ndevos)

Comment 3 Xavi Hernandez 2016-05-02 08:56:07 UTC
I'm unable to reproduce the problem; however, the logs seem to indicate that healing operations are still running after a successful completion of test 'EXPECT_WITHIN $HEAL_TIMEOUT "0" get_pending_heal_count $V0'. Since additional bricks are killed after this test finishes, some files might get damaged because more bricks than the redundancy allows will be bad, causing the I/O errors.

Most probably the root cause is that EXPECT_WITHIN uses a regular expression and a simple "0" matches many values, for example "10". This means that if exactly 10 files still need to be healed when the test is run, the test will finish successfully, but self-healing won't have finished yet.

I'll post a patch to solve this problem.
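
The matching behaviour described above can be illustrated with a minimal sketch. It assumes only that the expected value is applied as an unanchored extended regular expression; the `matches` helper is hypothetical and is not part of the GlusterFS test framework.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the GlusterFS tests): reports whether
# a value matches a pattern when the pattern is used as an unanchored
# extended regular expression.
matches() {
    local pattern=$1 value=$2
    if [[ "$value" =~ $pattern ]]; then
        echo "match"
    else
        echo "no match"
    fi
}

matches '0' '0'      # match
matches '0' '10'     # match: "0" occurs inside "10"
matches '^0$' '10'   # no match: anchors require the whole value to be "0"
matches '^0$' '0'    # match
```

Anchoring the pattern with `^` and `$` is one way to make a pending count of 10 (or 20, 100, ...) stop looking like 0.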

Comment 4 Vijay Bellur 2016-05-02 09:04:11 UTC
REVIEW: http://review.gluster.org/14145 (cluster/ec: Fix spurious failure of test bug-1236065.t) posted (#1) for review on master by Xavier Hernandez (xhernandez)

Comment 5 Vijay Bellur 2016-05-02 11:42:53 UTC
COMMIT: http://review.gluster.org/14138 committed in master by Jeff Darcy (jdarcy) 
------
commit 70a889489d79c41edfed52fdbdfa6869869906aa
Author: Niels de Vos <ndevos>
Date:   Sun May 1 22:49:57 2016 +0200

    disperse: mark bug-1236065.t as bad_test
    
    tests/bugs/disperse/bug-1236065.t failed several times on different
    Jenkins slaves:
    
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20316/console
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20320/console
    * https://build.gluster.org/job/rackspace-regression-2GB-triggered/20321/console
    
    BUG: 1332054
    Change-Id: Ie1934f09f843c2089c187e9295288c16c01913d2
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: http://review.gluster.org/14138
    Reviewed-by: Susant Palai <spalai>
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Vijay Bellur <vbellur>
    CentOS-regression: Gluster Build System <jenkins.com>

Comment 6 Pranith Kumar K 2016-05-02 13:09:18 UTC
(In reply to Xavier Hernandez from comment #3)
> I'm unable to reproduce the problem, however logs seem to indicate that
> healing operations are still running after a successful completion of test
> 'EXPECT_WITHIN $HEAL_TIMEOUT "0" get_pending_heal_count $V0'. Since
> additional bricks are killed after this test finishes, some files might get
> damaged as more than redundancy bricks will be bad, causing the I/O errors.
> 
> Most probably the root cause is that EXPECT_WITHIN uses a regular expression
> and a simple "0" matches many values, for example "10". This means that if
> exactly 10 files still need to be healed when the test is run, the test will
> finish successfully, but self-healing won't have finished yet.
> 
> I'll post a patch to solve this problem.

Good catch! It could very well be this issue.

Comment 7 Vijay Bellur 2016-07-22 09:27:03 UTC
REVIEW: http://review.gluster.org/14985 (tests: Fix pending-heal-count checks) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 8 Vijay Bellur 2016-07-22 17:01:50 UTC
COMMIT: http://review.gluster.org/14985 committed in master by Jeff Darcy (jdarcy) 
------
commit c5bf5d98594a4237a72cf0d3c72925d5a5aa0f69
Author: Pranith Kumar K <pkarampu>
Date:   Fri Jul 22 13:58:22 2016 +0530

    tests: Fix pending-heal-count checks
    
    EXPECT_WITHIN takes regular expression to match the count,
    so even when there are say 10 entries to heal, it would
    think that the heal is complete. Fixed checking
    pending heal count with correct regex.
    
    Thanks to Xavi for finding this problem.
    
    Change-Id: Ic593d22468b2b586bfca864962ffa0eda96b1d1f
    BUG: 1332054
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/14985
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
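
The effect of the regex on a polling check can be sketched as follows. This is a simplified stand-in written for illustration, not the real EXPECT_WITHIN; `wait_for_match` and `fake_pending_heal_count` are hypothetical names, and the fake counter always reports 10 pending entries.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a polling check; NOT the real EXPECT_WITHIN.
# Re-runs a command until the last line of its output matches the
# regex, or until the timeout (in seconds) expires.
wait_for_match() {
    local timeout=$1 pattern=$2
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    local out
    while :; do
        out=$("$@" | tail -n 1)
        [[ "$out" =~ $pattern ]] && return 0
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1
    done
}

# Hypothetical fake heal counter: 10 entries are always pending.
fake_pending_heal_count() { echo 10; }

# Unanchored "0" matches "10", so the check passes while heals remain.
wait_for_match 0 '0' fake_pending_heal_count && echo "unanchored: passed too early"

# Anchored "^0$" does not match "10", so the check keeps waiting and times out.
wait_for_match 0 '^0$' fake_pending_heal_count || echo "anchored: heal still pending"
```

With the anchored pattern the check only succeeds once the reported count is exactly 0, which is the condition the test needs before more bricks are killed.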

Comment 9 Vijay Bellur 2016-07-25 14:34:41 UTC
REVIEW: http://review.gluster.org/15006 (tests: Fix get_pending_heal_count check in ec) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 10 Vijay Bellur 2016-07-27 04:48:45 UTC
REVIEW: http://review.gluster.org/15006 (tests: Fix get_pending_heal_count check in ec) posted (#2) for review on master by Ravishankar N (ravishankar)

Comment 11 Vijay Bellur 2016-07-29 08:10:56 UTC
COMMIT: http://review.gluster.org/15006 committed in master by Xavier Hernandez (xhernandez) 
------
commit 6c43efbb6b01726e450b71d274c3b45b56cc7916
Author: Ravishankar N <ravishankar>
Date:   Mon Jul 25 19:58:01 2016 +0530

    tests: Fix get_pending_heal_count check in ec
    
    Continuation of http://review.gluster.org/#/c/14985.
    Also renamed tests/bugs/disperse to tests/bugs/ec for a better
    correlation to tests/basic/ec and xlators/cluster/ec
    
    Change-Id: I662b3477c12af8a0b94597769e8f00f354b1168c
    BUG: 1332054
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/15006
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Reviewed-by: Xavier Hernandez <xhernandez>

