
Bug 1331254

Summary:	Disperse volume fails on high load and logs show some assertion failures
Product:	[Community] GlusterFS
Reporter:	Xavi Hernandez <jahernan>
Component:	disperse
Assignee:	Xavi Hernandez <jahernan>
Status:	CLOSED CURRENTRELEASE
Severity:	high
Priority:	unspecified
Version:	mainline
CC:	aspandey, bugs, pkarampu
Keywords:	Triaged
Hardware:	Unspecified
OS:	Unspecified
Fixed In Version:	glusterfs-3.9.0
Doc Type:	Bug Fix
Clone Of:	1330132
Clones:	1332845, 1339465
Last Closed:	2016-11-23 07:25:07 UTC
Type:	Bug
Bug Blocks:	1330132, 1332845, 1339465

Description Xavi Hernandez 2016-04-28 06:30:26 UTC

+++ This bug was initially created as a clone of Bug #1330132 +++

Description of problem:

A distributed iozone test running over multiple NFS mounts on different machines fails, and several assertion failures appear in the logs:

[2016-04-21 19:29:58.096645] E [ec-inode-read.c:1157:ec_readv_rebuild] (-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(__ec_manager+0x5b) [0x7f9e4e8f18bb] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_readv+0x107) [0x7f9e4e908197] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_readv_rebuild+0x236) [0x7f9e4e907f26] ) 0-: Assertion failed: ec_get_inode_size(fop, fop->fd->inode, &cbk->iatt[0].ia_size)
[2016-04-21 19:29:58.126547] E [ec-common.c:1641:ec_lock_unfreeze] (-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_inodelk+0x155) [0x7f9e4e8fc305] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_unlocked+0x35) [0x7f9e4e8f3c25] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_lock_unfreeze+0x100) [0x7f9e4e8f3ab0] ) 0-: Assertion failed: list_empty(&lock->waiting) && list_empty(&lock->owners)
[2016-04-21 19:30:05.998568] E [ec-inode-read.c:1612:ec_manager_stat] (-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_resume+0x88) [0x7f9e4e8f1a68] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(__ec_manager+0x5b) [0x7f9e4e8f18bb] -->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_stat+0x315) [0x7f9e4e905ed5] ) 0-: Assertion failed: ec_get_inode_size(fop, fop->locks[0].lock->loc.inode, &cbk->iatt[0].ia_size)
[2016-04-21 19:30:05.999146] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-8: remote operation failed [Invalid argument]
[2016-04-21 19:30:05.999132] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-10: remote operation failed [Invalid argument]
[2016-04-21 19:30:05.999237] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-11: remote operation failed [Invalid argument]
[2016-04-21 19:30:05.999259] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-7: remote operation failed [Invalid argument]
[2016-04-21 19:30:05.999326] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-9: remote operation failed [Invalid argument]
[2016-04-21 19:30:06.047496] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-6: remote operation failed [Invalid argument]
[2016-04-21 19:30:06.047559] W [MSGID: 122015] [ec-common.c:1675:ec_unlocked] 0-test-disperse-1: entry/inode unlocking failed (FSTAT) [Invalid argument]

Version-Release number of selected component (if applicable): mainline


How reproducible:

It happens randomly after some time running the distributed iozone test.

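Steps to Reproduce:
1. Run a distributed iozone test against the disperse volume over multiple NFS mounts on different machines.
2. Keep the test running; the failure appears randomly after some time.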

Actual results:

Volume access fails and iozone quits with an error.

Expected results:

iozone should complete the test successfully.

Additional info:

Probably related to a race in cancelling the lock release timeout: the cancellation is attempted while the timer callback is already executing. In that case the new fop is not placed in the correct waiting list.
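
The suspected race can be sketched with a heavily simplified lock state machine. This is only an illustration under stated assumptions, not the actual ec code: every name here (`sketch_lock`, `sketch_timer_fired`, `sketch_timer_cancel`, `sketch_fop_arrives`) is hypothetical, and the real lock keeps lists of fops rather than counters. The point it shows is that a new fop may only reuse the lock if the timer was really cancelled; if the callback has already started the release, the fop has to wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified lock states; not the real ec_lock_t. */
enum sketch_state {
    SKETCH_HELD,            /* lock held, fops may run under it       */
    SKETCH_DELAYED_RELEASE, /* idle, release timer armed              */
    SKETCH_RELEASING        /* timer callback started, unlock ongoing */
};

struct sketch_lock {
    enum sketch_state state;
    int owners;  /* fops currently running under the lock      */
    int waiting; /* fops that must wait for a new acquisition  */
};

/* Timer callback: the release timeout expired, so the unlock begins. */
static void sketch_timer_fired(struct sketch_lock *lk)
{
    /* The timer is only armed while no fop owns the lock. */
    assert(lk->owners == 0);
    if (lk->state == SKETCH_DELAYED_RELEASE)
        lk->state = SKETCH_RELEASING;
}

/* Returns true only if the timer was cancelled before its callback ran. */
static bool sketch_timer_cancel(struct sketch_lock *lk)
{
    if (lk->state != SKETCH_DELAYED_RELEASE)
        return false; /* too late: the callback is already executing */
    lk->state = SKETCH_HELD;
    return true;
}

/* A new fop arrives. Returns true if it may run now, false if it waits. */
static bool sketch_fop_arrives(struct sketch_lock *lk)
{
    if (lk->state == SKETCH_HELD || sketch_timer_cancel(lk)) {
        lk->owners++;
        return true;
    }
    /* Treating a failed cancellation as success here would let the
     * fop continue as an owner of a lock that is being released. */
    lk->waiting++;
    return false;
}
```

In this sketch the result of the cancellation decides which list the fop joins. Ignoring that result would reproduce the situation described above, where a fop keeps running as if the lock were still held.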

Comment 1 Vijay Bellur 2016-04-29 09:19:27 UTC

REVIEW: http://review.gluster.org/14112 (cluster/ec: Fix issues with eager locking) posted (#1) for review on master by Xavier Hernandez (xhernandez)

Comment 2 Vijay Bellur 2016-05-02 14:45:05 UTC

COMMIT: http://review.gluster.org/14112 committed in master by Jeff Darcy (jdarcy) 
------
commit 209985e861f4d8a22bfdb457c0e8d7045ab44553
Author: Xavier Hernandez <xhernandez>
Date:   Thu Apr 28 08:42:40 2016 +0200

    cluster/ec: Fix issues with eager locking
    
    Due to a race in timer cancellation, in some cases it was possible
    to unlock the lock while another concurrent fop that needed it
    continues execution as if it were not released.
    
    This patch also fixes an issue that caused a lock to not be released
    if an error was found while preparing ec_update_size_version().
    
    Change-Id: I1344a3f5ecfc333f05a09e62653838264c9c26b1
    BUG: 1331254
    Signed-off-by: Xavier Hernandez <xhernandez>
    Reviewed-on: http://review.gluster.org/14112
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Chen Chen <chenchen>
    NetBSD-regression: NetBSD Build System <jenkins.org>
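
The second fix mentioned in the commit message (a lock left held when an error occurs while preparing ec_update_size_version()) follows a common C pattern: route both the success and the error path through a single unlock point. The following is a minimal sketch of that pattern only, not the code from the patch. `sketch_held_lock`, `sketch_prepare_update` and `sketch_update_and_unlock` are hypothetical names, and the failure is simulated with a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a held lock; not the real ec_lock_t. */
struct sketch_held_lock {
    bool held;
};

/* Hypothetical preparation step that can fail before anything is sent. */
static int sketch_prepare_update(bool simulate_failure)
{
    return simulate_failure ? -1 : 0;
}

/* Prepare the update and release the lock, whether or not preparation
 * succeeded. Returns 0 on success and -1 on a preparation error. */
static int sketch_update_and_unlock(struct sketch_held_lock *lk,
                                    bool simulate_failure)
{
    int ret;

    assert(lk->held);

    ret = sketch_prepare_update(simulate_failure);
    if (ret != 0)
        goto unlock; /* an early return here would leak the lock */

    /* On success the real code would send the update at this point. */

unlock:
    lk->held = false;
    return ret;
}
```

With a single exit label, a failure in the preparation step still reaches the unlock, so later fops are not blocked by a lock that nobody will release.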