I am able to recreate the issue on latest master. Steps to recreate the issue:

1. Create an EC volume (2+1 configuration) and FUSE mount it.

2. Touch a file on the mount point.

3. From a terminal (term-1), acquire a lock on the file and write some data; don't unlock.

[root@server1 setup]# ./lock /LAB/fuse_mounts/mount/file
opening /LAB/fuse_mounts/mount/file
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeded
locked; hit Enter to unlock...

4. From another terminal (term-2), try acquiring a lock on the same region of the file.

[root@server1 setup]# ./lock /LAB/fuse_mounts/mount/file
opening /LAB/fuse_mounts/mount/file
opened; hit Enter to lock...
locking

5. Try to unlock the file from term-1 (step 3).

Step 5 should go through successfully, but on the problematic code base it hangs.

The issue was introduced as part of the code change below:

++++++++++++++++++
commit f2406fa6155267fa747d9342092ee7709a2531a9
Author: Pranith Kumar K <pkarampu>
Date:   Fri Jan 27 16:17:49 2017 +0530

    cluster/ec: Fix cthon failures observed with ec volumes

    Since EC already winds one write after other there is no need
    to align application fcntl locks with ec blocks. Also added this
    locking to be done as a transaction to prevent partial
    upgrade/downgrade of locks happening.

    BUG: 1410425
    Change-Id: I7ce8955c2174f62b11e5cb16140e30ff0f7c4c31
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: https://review.gluster.org/16445
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
++++++++++++++++++

Note: use attachment "Lock Script" to generate the binary "lock".
Created attachment 1281848 [details] Lock script
[xlator.features.locks.testvol-locks.inode]
path=/file
mandatory=0
inodelk-count=1
lock-dump.domain.domain=testvol-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 20283, owner=70e500f8957f0000, client=0x7f6ff80b08b0, connection-id=apandey-20134-2017/05/24-07:57:36:665823-testvol-client-0-0-0, granted at 2017-05-24 07:59:24   <<<<<<<<<<<<< EC lock taken by the second lock request
posixlk-count=2
posixlk.posixlk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=10, pid = 20266, owner=2f33c5d2250866c3, client=0x7f6ff80b08b0, connection-id=(null), granted at 2017-05-24 07:58:18   <<<<<<<<<<<<<< Posix lock taken by the first lock request
posixlk.posixlk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=10, pid = 20283, owner=9f4f2ac611372d51, client=0x7f6ff80b08b0, connection-id=(null), blocked at 2017-05-24 07:59:24   <<<<<<<<<<<<<< Second posix lock request BLOCKED

Now, to release the first posix lock we have to take the EC lock, which cannot be taken because it is already held by the second request. That causes the deadlock.
With https://review.gluster.org/#/c/17542/:

Term-1:
root@dhcp35-190 - /mnt/ec2
17:48:27 :) ⚡ /root/a.out a
opening a
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeded
locked; hit Enter to unlock...
unlocking

Term-2:
root@dhcp35-190 - /mnt/ec2
17:49:02 :) ⚡ /root/a.out a
opening a
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeded
locked; hit Enter to unlock...
unlocking

Will also run the cthon tests.
** PARENT pass 1 results: 49/49 pass, 1/1 warn, 0/0 fail (pass/total).
** CHILD pass 1 results: 64/64 pass, 0/0 warn, 0/0 fail (pass/total).
Congratulations, you passed the locking tests!

All tests completed
REVIEW: https://review.gluster.org/17542 (cluster/ec: lk shouldn't be a transaction) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: https://review.gluster.org/17542 committed in master by Xavier Hernandez (xhernandez)
------
commit 26ca39ccf0caf0d55c88b05396883dd10ab66dc4
Author: Pranith Kumar K <pkarampu>
Date:   Tue Jun 13 23:35:40 2017 +0530

    cluster/ec: lk shouldn't be a transaction

    Problem:
    When the application sends a blocking lock, the lk fop actually waits
    under inodelk. This can lead to a dead-lock.
    1) Let's say app-1 takes an exclusive fcntl lock on the file.
    2) app-2 attempts an exclusive fcntl lock on the file, which goes to
       the blocking stage. Note: app-2 is blocked inside a transaction
       which holds an inode lock.
    3) app-1 tries to perform a write, which needs the inode lock, so it
       gets blocked on app-2 to release the inodelk, while app-2 is
       blocked on app-1 to release the fcntl lock.

    Fix:
    The correct way to fix this issue and make fcntl locks perform well
    would be to introduce 2-phase locking for the fcntl lock:
    1) Implement a try-lock phase in which the locks xlator will not
       merge the lk call with existing calls until a commit-lock phase.
    2) If in the try-lock phase we get a quorum number of successes
       without any EAGAIN error, send a commit-lock, which will merge
       the locks.
    3) In case there are any errors, unlock should just delete the
       lock-object that was tried earlier and shouldn't touch the
       committed locks.
    Unfortunately this is a sizeable feature and needs to be thought
    through for any corner cases. Until then, remove the transaction
    from the lk call.

    BUG: 1455049
    Change-Id: I18a782903ba0eb43f1e6526fb0cf8c626c460159
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: https://review.gluster.org/17542
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Ashish Pandey <aspandey>
    Reviewed-by: Xavier Hernandez <xhernandez>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/