Bug 1330132 - Disperse volume fails on high load and logs show some assertion failures
Summary: Disperse volume fails on high load and logs show some assertion failures
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: 3.7.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On: 1331254 1339465
Blocks: 1330997 1344836 1360576 1361402
 
Reported: 2016-04-25 12:41 UTC by Xavi Hernandez
Modified: 2016-07-29 02:59 UTC
CC List: 3 users

Fixed In Version: glusterfs-3.7.12
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1331254
Environment:
Last Closed: 2016-06-28 12:15:11 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Xavi Hernandez 2016-04-25 12:41:28 UTC
Description of problem:

A distributed iozone test run over multiple NFS mounts on different machines fails, and several assertion failures appear in the logs:

[2016-04-21 19:29:58.096645] E [ec-inode-read.c:1157:ec_readv_rebuild] (-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(__ec_manager+0x5b) [0x7f9e4e8f18bb] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_readv+0x107) [0x7f9e4e908197] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_readv_rebuild+0x236) [0x7f9e4e907f26] ) 0-: Assertion failed: ec_get_inode_size(fop, fop->fd->inode, &cbk->iatt[0].ia_size)
[2016-04-21 19:29:58.126547] E [ec-common.c:1641:ec_lock_unfreeze] (-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_inodelk+0x155) [0x7f9e4e8fc305] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_unlocked+0x35) [0x7f9e4e8f3c25] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_lock_unfreeze+0x100) [0x7f9e4e8f3ab0] ) 0-: Assertion failed: list_empty(&lock->waiting) && list_empty(&lock->owners)
[2016-04-21 19:30:05.998568] E [ec-inode-read.c:1612:ec_manager_stat] (-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_resume+0x88) [0x7f9e4e8f1a68] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(__ec_manager+0x5b) [0x7f9e4e8f18bb] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_stat+0x315) [0x7f9e4e905ed5] ) 0-: Assertion failed: ec_get_inode_size(fop, fop->locks[0].lock->loc.inode, &cbk->iatt[0].ia_size)
[2016-04-21 19:30:05.999146] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-8: remote operation failed [Invalid argument]
[2016-04-21 19:30:05.999132] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-10: remote operation failed [Invalid argument]
[2016-04-21 19:30:05.999237] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-11: remote operation failed [Invalid argument]
[2016-04-21 19:30:05.999259] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-7: remote operation failed [Invalid argument]
[2016-04-21 19:30:05.999326] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-9: remote operation failed [Invalid argument]
[2016-04-21 19:30:06.047496] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-6: remote operation failed [Invalid argument]
[2016-04-21 19:30:06.047559] W [MSGID: 122015] [ec-common.c:1675:ec_unlocked] 0-test-disperse-1: entry/inode unlocking failed (FSTAT) [Invalid argument]

Version-Release number of selected component (if applicable): mainline


How reproducible:

It happens randomly after the distributed iozone test has been running for some time.

Steps to Reproduce:
1. Create a disperse volume and mount it over NFS on several machines.
2. Run a distributed iozone test across the mounts.
3. Wait: the failure shows up randomly under sustained load.

Actual results:

Volume access fails and iozone quits with an error.

Expected results:

iozone should complete the test successfully.

Additional info:

This is probably related to a race when cancelling the lock-release timeout while its callback is already executing. In that case the new fop is not placed on the correct waiting list.
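
For illustration, a minimal C sketch of this kind of timer-cancellation race follows (hypothetical names and structure, not the actual ec code): whether the new fop may safely reuse the lock depends entirely on whether the delayed-unlock timer was cancelled before its callback started.

/* Minimal sketch of the suspected race (hypothetical names, not the
 * actual ec code): a new fop races with the delayed-unlock timer.
 * If cancelling the timer fails because its callback already started,
 * the lock is being released and must not be reused directly. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mtx;
    bool timer_armed;   /* delayed-unlock timer still pending */
} eager_lock_t;

/* Delayed-unlock timeout: starts releasing the lock on the bricks. */
static void timer_cb(eager_lock_t *lock)
{
    pthread_mutex_lock(&lock->mtx);
    lock->timer_armed = false;
    pthread_mutex_unlock(&lock->mtx);
    /* ... unlock request sent to the bricks from here ... */
}

/* Returns true only if the timer was cancelled before it ran. */
static bool timer_cancel(eager_lock_t *lock)
{
    pthread_mutex_lock(&lock->mtx);
    bool cancelled = lock->timer_armed;  /* false: callback already ran */
    lock->timer_armed = false;
    pthread_mutex_unlock(&lock->mtx);
    return cancelled;
}

/* A new fop wants the lock that is pending delayed unlock. */
static void fop_acquire(eager_lock_t *lock)
{
    if (timer_cancel(lock)) {
        /* Safe: the unlock never started; reuse the held lock. */
        printf("reusing still-held lock\n");
    } else {
        /* The unlock is (or was) in flight. The fop must go on the
         * lock's waiting list and reacquire later; treating the lock
         * as held here leads to the assertions seen above. */
        printf("queueing fop until unlock completes\n");
    }
}

int main(void)
{
    eager_lock_t lock = { .mtx = PTHREAD_MUTEX_INITIALIZER,
                          .timer_armed = true };
    timer_cb(&lock);     /* timeout fires first ... */
    fop_acquire(&lock);  /* ... then the fop tries to cancel: too late */
    return 0;
}

Built with a C compiler and -pthread, the sketch prints "queueing fop until unlock completes": the cancel fails because the timeout has already fired, so the fop has to wait instead of reusing the lock.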

Comment 1 Vijay Bellur 2016-05-03 06:22:27 UTC
REVIEW: http://review.gluster.org/14174 (cluster/ec: Fix issues with eager locking) posted (#1) for review on release-3.7 by Xavier Hernandez (xhernandez)

Comment 2 Vijay Bellur 2016-05-04 11:29:07 UTC
COMMIT: http://review.gluster.org/14174 committed in release-3.7 by Jeff Darcy (jdarcy) 
------
commit 6e1de9e46b12b25d27d852d3cccadc51768e1150
Author: Xavier Hernandez <xhernandez>
Date:   Thu Apr 28 08:42:40 2016 +0200

    cluster/ec: Fix issues with eager locking
    
    Due to a race in timer cancellation, in some cases it was possible
    to unlock the lock while another concurrent fop that needed it
    continues execution as if it were not released.
    
    This patch also fixes an issue that caused a lock to not be released
    if an error was found while preparing ec_update_size_version().
    
    > Change-Id: I1344a3f5ecfc333f05a09e62653838264c9c26b1
    > BUG: 1331254
    > Signed-off-by: Xavier Hernandez <xhernandez>
    > Reviewed-on: http://review.gluster.org/14112
    > Smoke: Gluster Build System <jenkins.com>
    > CentOS-regression: Gluster Build System <jenkins.com>
    > Reviewed-by: Chen Chen <chenchen>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    
    Change-Id: I21edd17d914dfa8d2f98e6bbde50830496e12a92
    BUG: 1330132
    Signed-off-by: Xavier Hernandez <xhernandez>
    Reviewed-on: http://review.gluster.org/14174
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>
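
The commit message also mentions a lock leaked on an error while preparing ec_update_size_version(). A minimal sketch of that class of fix follows (illustrative names, not the actual patch): every error path taken while preparing the update must still converge on the same lock release.

/* Hypothetical sketch (illustrative names, not the actual patch):
 * the error path and the success path must both release the lock. */
#include <stdio.h>

typedef struct { int held; } elock_t;

static int prepare_update(int fail) { return fail ? -1 : 0; }

static void lock_release(elock_t *l)
{
    l->held = 0;
    puts("lock released");
}

static int update_size_version(elock_t *l, int fail)
{
    int ret = prepare_update(fail);
    if (ret != 0)
        goto out;   /* before the fix, an early error return here
                     * skipped the release and leaked the lock */
    puts("size/version update sent");
out:
    lock_release(l);  /* single exit: release on success and on error */
    return ret;
}

int main(void)
{
    elock_t l = { .held = 1 };
    update_size_version(&l, 1);  /* error path still releases the lock */
    return 0;
}

The goto-based single-exit cleanup is a common pattern in C code bases such as GlusterFS; the point here is only that success and failure share one release path.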

Comment 3 Kaushal 2016-06-28 12:15:11 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.12, please open a new bug report.

glusterfs-3.7.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-devel/2016-June/049918.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

