Bug 1743573

Summary: fuse client hung when issued a lookup "ls" on an ec volume
Product: [Community] GlusterFS
Reporter: Pranith Kumar K <pkarampu>
Component: disperse
Assignee: bugs <bugs>
Status: CLOSED NEXTRELEASE
QA Contact:
Severity: urgent
Docs Contact:
Priority: unspecified
Version: mainline
CC: amukherj, aspandey, bugs, csaba, nchilaka, pkarampu, rgowdapp, rhs-bugs, sheggodu, storage-qa-internal, vdas
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1731896
Environment:
Last Closed: 2019-09-12 06:38:01 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1731896    
Bug Blocks:    

Comment 1 Pranith Kumar K 2019-08-20 08:58:46 UTC
(gdb) p $4->locks[0]
$5 = {lock = 0x7f3da4abc1d8, fop = 0x7f3d74317e18, owner_list = {next = 0x7f3d74317ed0, prev = 0x7f3d74317ed0}, wait_list = {next = 0x7f3da4abc208, prev = 0x7f3da4abc208}, update = {false, false}, dirty = { false, false}, optimistic_changelog = false, base = 0x0, size = 0, waiting_flags = 0, fl_start = 0, fl_end = 9223372036854775807}
(gdb) p $4->locks[0].lock
$6 = (ec_lock_t *) 0x7f3da4abc1d8
(gdb) p *$4->locks[0].lock
$7 = {ctx = 0x7f3db7cbff70, timer = 0x0, owners = {next = 0x7f3da4abc1e8, prev = 0x7f3da4abc1e8}, waiting = {next = 0x7f3da4abc1f8, prev = 0x7f3da4abc1f8}, frozen = {next = 0x7f3d74317ee0, prev = 0x7f3d74317ee0}, mask = 0, good_mask = 18446744073709551615, healing = 0, refs_owners = 0, refs_pending = 0, waiting_flags = 0, acquired = false, unlock_now = false, release = true, query = true, fd = 0x0, loc = {path = 0x7f3d75084a40 "/IOs/kernel/rhs-client45.lab.eng.blr.redhat.com/dir.2/linux-5.2.7/Documentation/devicetree/bindings/rtc", name = 0x7f3d75084aa4 "rtc", inode = 0x7f3d98014768, parent = 0x7f3d99faad38, gfid = "\310\a\376|-\205K\v\215\000\b\363>\241\021i", pargfid = "\345\330}\212\242{Nr\233\064\373\030MD\361", <incomplete sequence \360>}, {type = ENTRYLK_WRLCK, flock = { l_type = 1, l_whence = 0, l_start = 0, l_len = 0, l_pid = 0, l_owner = {len = 0, data = '\000' <repeats 1023 times>}}}}
(gdb) p &$4->locks[0].lock->owners
$8 = (struct list_head *) 0x7f3da4abc1e8
(gdb) p &$4->locks[0].lock->waiting
$9 = (struct list_head *) 0x7f3da4abc1f8
(gdb) p &$4->locks[0].lock->frozen
$10 = (struct list_head *) 0x7f3da4abc208

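The lock's owners and waiting lists are empty: their next/prev pointers point back at their own addresses ($8 and $9). The frozen list is not: its next/prev point to 0x7f3d74317ee0 rather than to its own address ($10 = 0x7f3da4abc208), and the link's wait_list points back at 0x7f3da4abc208. This seems to suggest that the fop is stuck in the frozen list, which can only happen if lock->release is set to true. The dump does show release = true while acquired = false.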
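The pointer comparison above can be replayed mechanically. The following is a minimal sketch, not GlusterFS code: `addr_list` and both helper functions are hypothetical names, and the only inputs are the addresses printed in the gdb session. It assumes the usual intrusive-list convention that an empty list head points at itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a list head as gdb prints it: two raw addresses. */
struct addr_list {
    uint64_t next;
    uint64_t prev;
};

/* An intrusive list is empty when the head's next and prev both point at
 * the head's own address (the value gdb prints for &list). */
static bool list_is_empty_at(uint64_t head_addr, struct addr_list head)
{
    return head.next == head_addr && head.prev == head_addr;
}

/* A list holds exactly one node when head.next == head.prev and that
 * node's next/prev both point back at the head. */
static bool list_has_single_node(uint64_t head_addr, struct addr_list head,
                                 struct addr_list node)
{
    return !list_is_empty_at(head_addr, head) &&
           head.next == head.prev &&
           node.next == head_addr && node.prev == head_addr;
}
```

Fed with the values from $5, $7, $8, $9 and $10, `list_is_empty_at` holds for owners and waiting but not for frozen, and `list_has_single_node` holds for frozen together with the link's wait_list.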



    Problem:
    Mount-1                                  Mount-2
    1) Tries to acquire lock on 'dir1'       1) Tries to acquire lock on 'dir1'
    2) Lock is granted on brick-0            2) Lock gets EAGAIN on brick-0, which
                                                leads to a blocking lock on brick-0
    3) Gets a lock-contention                3) Doesn't matter what happens on
       notification, sets lock->release         mount-2 from here on.
       to true.
    4) New fop comes on 'dir1', which is
       put in the frozen list because
       lock->release is set to true.
    5) Lock acquisition from step 2 fails
       because 3 bricks went down in the
       4+2 setup.

    The fop on mount-1 that was put in the frozen list hangs because no codepath
    moves it from the frozen list to any other list, and the lock is not retried.
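The sequence on mount-1 can be sketched as a tiny state model. This is an illustration only, not the ec xlator code: `model_lock`, `on_contention`, `assign_fop`, `on_acquire_failed` and `replay` are hypothetical names, and the `only_when_acquired` guard is an assumption about the idea behind the review titled "Mark release only when it is acquired", not a copy of that patch.

```c
#include <stdbool.h>

enum fop_state { FOP_OWNER, FOP_WAITING, FOP_FROZEN };

/* Hypothetical, heavily reduced view of a lock: just the fields
 * that matter for the scenario. */
struct model_lock {
    bool acquired;     /* lock held on enough bricks */
    bool release;      /* lock is marked for release after contention */
    int  frozen_count; /* fops parked in the frozen list */
};

/* Step 3: a contention notification arrives. With the guard enabled,
 * release is set only if the lock has actually been acquired. */
static void on_contention(struct model_lock *l, bool only_when_acquired)
{
    if (only_when_acquired && !l->acquired)
        return;
    l->release = true;
}

/* Step 4: a new fop arrives and is queued according to the lock state. */
static enum fop_state assign_fop(struct model_lock *l)
{
    if (l->release) {
        l->frozen_count++;
        return FOP_FROZEN;
    }
    return l->acquired ? FOP_OWNER : FOP_WAITING;
}

/* Step 5: acquisition fails (too many bricks down). As in the report,
 * nothing on this path looks at the frozen list. */
static void on_acquire_failed(struct model_lock *l)
{
    l->acquired = false;
}

/* Replays steps 3 to 5 and returns how many fops are left in frozen. */
static int replay(bool only_when_acquired)
{
    struct model_lock l = { false, false, 0 };
    on_contention(&l, only_when_acquired);
    assign_fop(&l);
    on_acquire_failed(&l);
    return l.frozen_count;
}
```

In this model `replay(false)` leaves one fop stranded in the frozen list, matching the hang, while `replay(true)` leaves none because the fop is queued as waiting instead. What happens to waiting fops when acquisition fails is outside the model.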

Comment 2 Worker Ant 2019-08-20 09:02:47 UTC
REVIEW: https://review.gluster.org/23272 (cluster/ec: Mark release only when it is acquired) posted (#1) for review on master by Pranith Kumar Karampuri

Comment 3 Worker Ant 2019-09-12 06:38:01 UTC
REVIEW: https://review.gluster.org/23272 (cluster/ec: Mark release only when it is acquired) merged (#5) on master by Pranith Kumar Karampuri