Bug 1743573 - fuse client hung when issued a lookup "ls" on an ec volume
Summary: fuse client hung when issued a lookup "ls" on an ec volume
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1731896
Blocks:
Reported: 2019-08-20 08:56 UTC by Pranith Kumar K
Modified: 2019-09-12 06:38 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1731896
Environment:
Last Closed: 2019-09-12 06:38:01 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 23272 0 None Merged cluster/ec: Mark release only when it is acquired 2019-09-12 06:38:00 UTC

Comment 1 Pranith Kumar K 2019-08-20 08:58:46 UTC
(gdb) p $4->locks[0]
$5 = {lock = 0x7f3da4abc1d8, fop = 0x7f3d74317e18, owner_list = {next = 0x7f3d74317ed0, prev = 0x7f3d74317ed0}, wait_list = {next = 0x7f3da4abc208, prev = 0x7f3da4abc208}, update = {false, false}, dirty = { false, false}, optimistic_changelog = false, base = 0x0, size = 0, waiting_flags = 0, fl_start = 0, fl_end = 9223372036854775807}
(gdb) p $4->locks[0].lock
$6 = (ec_lock_t *) 0x7f3da4abc1d8
(gdb) p *$4->locks[0].lock
$7 = {ctx = 0x7f3db7cbff70, timer = 0x0, owners = {next = 0x7f3da4abc1e8, prev = 0x7f3da4abc1e8}, waiting = {next = 0x7f3da4abc1f8, prev = 0x7f3da4abc1f8}, frozen = {next = 0x7f3d74317ee0, prev = 0x7f3d74317ee0}, mask = 0, good_mask = 18446744073709551615, healing = 0, refs_owners = 0, refs_pending = 0, waiting_flags = 0, acquired = false, unlock_now = false, release = true, query = true, fd = 0x0, loc = {path = 0x7f3d75084a40 "/IOs/kernel/rhs-client45.lab.eng.blr.redhat.com/dir.2/linux-5.2.7/Documentation/devicetree/bindings/rtc", name = 0x7f3d75084aa4 "rtc", inode = 0x7f3d98014768, parent = 0x7f3d99faad38, gfid = "\310\a\376|-\205K\v\215\000\b\363>\241\021i", pargfid = "\345\330}\212\242{Nr\233\064\373\030MD\361", <incomplete sequence \360>}, {type = ENTRYLK_WRLCK, flock = { l_type = 1, l_whence = 0, l_start = 0, l_len = 0, l_pid = 0, l_owner = {len = 0, data = '\000' <repeats 1023 times>}}}}
(gdb) p &$4->locks[0].lock->owners
$8 = (struct list_head *) 0x7f3da4abc1e8
(gdb) p &$4->locks[0].lock->waiting
$9 = (struct list_head *) 0x7f3da4abc1f8
(gdb) p &$4->locks[0].lock->frozen
$10 = (struct list_head *) 0x7f3da4abc208

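The link's wait_list points to 0x7f3da4abc208, which is &lock->frozen, while the lock's owners and waiting lists are empty (each points back to itself). So the fop is stuck in the frozen list, which can only happen if lock->release is set to true. The dump confirms this: release = true while acquired = false.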
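
The list-membership reasoning can be checked mechanically. The C sketch below copies the addresses from the gdb output into a hypothetical head_view_t helper (an illustration only, not a GlusterFS type) and relies on the usual convention that an empty circular list head points at itself. The names head_is_empty and fop_is_on_frozen are likewise assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Address-only view of a struct list_head as printed by gdb.
 * Hypothetical helper type for this example, not part of GlusterFS. */
typedef struct {
    uint64_t self; /* address of the list head itself */
    uint64_t next; /* value of head->next */
    uint64_t prev; /* value of head->prev */
} head_view_t;

/* An empty circular list head has next and prev pointing at itself. */
static bool head_is_empty(head_view_t h)
{
    return h.next == h.self && h.prev == h.self;
}

/* Values taken from $7 (list contents) and $8..$10 (head addresses). */
static const head_view_t lock_owners  = {0x7f3da4abc1e8, 0x7f3da4abc1e8, 0x7f3da4abc1e8};
static const head_view_t lock_waiting = {0x7f3da4abc1f8, 0x7f3da4abc1f8, 0x7f3da4abc1f8};
static const head_view_t lock_frozen  = {0x7f3da4abc208, 0x7f3d74317ee0, 0x7f3d74317ee0};

/* wait_list.next of the fop's lock link, taken from $5. */
static const uint64_t link_wait_next = 0x7f3da4abc208;

/* The fop is parked on the frozen list if that list is non-empty and the
 * link's wait_list points back at the frozen head. */
static bool fop_is_on_frozen(void)
{
    return !head_is_empty(lock_frozen) && link_wait_next == lock_frozen.self;
}
```

With these values, owners and waiting come out empty and the link is found on the frozen list, which matches the conclusion drawn from the dump.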



    Problem:
    Mount-1                                Mount-2
    1) Tries to acquire lock on 'dir1'.    1) Tries to acquire lock on 'dir1'.
    2) Lock is granted on brick-0.         2) Lock attempt gets EAGAIN on brick-0,
                                              which leads to a blocking lock on
                                              brick-0.
    3) Gets a lock-contention              3) What happens on mount-2 from here
       notification and sets                  on doesn't matter.
       lock->release to true.
    4) A new fop arrives on 'dir1' and
       is put in the frozen list because
       lock->release is true.
    5) The lock acquisition from step 2
       fails because 3 bricks went down
       in a 4+2 setup.
    
    The fop on mount-1 that was put in the frozen list hangs, because no code path
    moves it from the frozen list to any other list and the lock is never retried.
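
The sequence above can be replayed with a toy state machine. This is a minimal sketch and not the GlusterFS ec code: only the field names acquired and release come from the gdb dump, while model_lock_t, the model_* functions and the counters are hypothetical. It also assumes that fops on the waiting list are woken when acquisition fails, which the report does not state. The only_if_acquired flag switches between the behaviour described in the report and the idea named in the patch title ("Mark release only when it is acquired").

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified model of the lock state. */
typedef struct {
    bool acquired; /* lock is held on enough bricks */
    bool release;  /* contention seen, lock should be released */
    int waiting;   /* fops queued behind the in-flight acquisition */
    int frozen;    /* fops parked until the lock is released */
    int woken;     /* fops resumed to fail or retry (assumption) */
} model_lock_t;

/* Step 3: a lock-contention notification arrives. */
static void model_contention(model_lock_t *l, bool only_if_acquired)
{
    if (only_if_acquired && !l->acquired)
        return; /* patched idea: ignore contention for an unacquired lock */
    l->release = true;
}

/* Step 4: a new fop arrives on the same inode. */
static void model_new_fop(model_lock_t *l)
{
    if (l->release)
        l->frozen++;  /* parked until the release completes */
    else
        l->waiting++; /* queued behind the in-flight acquisition */
}

/* Step 5: acquisition fails. Waiting fops are woken (assumption);
 * nothing touches the frozen list, as described in the report. */
static void model_acquire_failed(model_lock_t *l)
{
    l->woken += l->waiting;
    l->waiting = 0;
}

/* Replays steps 1-5 and returns the number of fops left stuck on frozen. */
static int model_run(bool only_if_acquired)
{
    model_lock_t l = {0}; /* steps 1-2: acquisition in flight, not acquired */
    model_contention(&l, only_if_acquired);
    model_new_fop(&l);
    model_acquire_failed(&l);
    return l.frozen;
}
```

In this model, model_run(false) leaves one fop on the frozen list with nothing left to wake it, while model_run(true) leaves none because the contention notification is ignored until the lock is actually acquired.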

Comment 2 Worker Ant 2019-08-20 09:02:47 UTC
REVIEW: https://review.gluster.org/23272 (cluster/ec: Mark release only when it is acquired) posted (#1) for review on master by Pranith Kumar Karampuri

Comment 3 Worker Ant 2019-09-12 06:38:01 UTC
REVIEW: https://review.gluster.org/23272 (cluster/ec: Mark release only when it is acquired) merged (#5) on master by Pranith Kumar Karampuri

