Description of problem:
When client 1 holds a lock on a file (say file1) and another client tries to take a lock on the same file, the lock is correctly not granted because client 1 still holds it. However, when client 1 then tries to release the lock, the unlock never completes and the process hangs.

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.8.4-23.el7rhgs.x86_64

How reproducible:
Consistently

Steps to Reproduce:
1. Create a 6-node gluster cluster.
2. Create an EC volume, 2 x (4 + 2), and enable GNFS on the volume.
3. Mount the volume on 2 clients.
4. From client 1, create a file (say file1) of 512 bytes.
5. From client 1, take a lock on that file.
6. From client 2, try taking a lock on the same file. (The lock is not granted to client 2 because it is already held by client 1.)
7. Release the lock from client 1.

Client 1:
-----
[root@dhcp37-192 home]# ./a.out /mnt/disperse/file1
opening /mnt/disperse/file1
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeeded
locked; hit Enter to unlock...
unlocking
-----

Client 2:
-----
[root@dhcp37-142 home]# ./a.out /mnt/disperse1/file1
opening /mnt/disperse1/file1
opened; hit Enter to lock...
locking
-----

Actual results:
The unlock from client 1 never completes and the process hangs.

Expected results:
Client 1 should be able to release the lock.

Additional info:
I was trying to clear the needinfo on bug https://bugzilla.redhat.com/show_bug.cgi?id=1411338 (to retest with the latest gluster build and check whether the issue still persists), but since it got stuck at the first step itself, I am unable to proceed. This used to pass earlier with build glusterfs-3.8.4-10.el7rhgs.x86_64.

Note: This issue is only observed with EC+GNFS.
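For reference, the a.out used in the transcripts above is essentially a thin interactive wrapper around fcntl(2) POSIX record locks. The core lock/unlock operations look like the sketch below; this is a reconstruction from the program's output, not the attached source, and the helper names are mine:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Acquire a blocking write lock on the whole file (what "locking" does;
 * F_SETLKW blocks until the lock is granted, which is why client 2 waits). */
static int lock_whole_file(int fd)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                      /* 0 = to EOF, i.e. the whole file */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Release the lock (what "unlocking" does; over GNFS this turns into an
 * NLM/UNLOCK call, and that is the step that hangs on an EC volume). */
static int unlock_whole_file(int fd)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}
```

On a healthy mount both calls return 0; in the bug, the unlock's fcntl() never returns because the NLM reply is never sent.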
This is reproducible for me too, with recent builds (where gluster/nfs does not segfault anymore).
Created attachment 1280044 [details] Simplified automated test script
This script makes the issue reproducible on a single NFS client. It is only reproducible when a disperse volume is used.
The NLM/UNLOCK call never seems to get a reply, and Wireshark shows retransmissions of the NLM/UNLOCK call (retransmitted by the NFS-client). It is not yet clear to me where the callback/unwind gets stuck or dropped.
Created attachment 1280370 [details] statedump of gluster/nfs that is waiting on calling dht_lk_cbk() for lock release
Created attachment 1280371 [details] Statedumps of all the bricks (6) and the gluster/nfs server
Created attachment 1280416 [details] Statedumps of all the bricks (6) and the gluster/nfs server + nfs.log
Statedumps from all bricks and the gluster/nfs server, obtained while the test script was hanging on releasing the 1st lock.
Created attachment 1280421 [details] systemtap script for following the nfs+dht+disperse functions
Created attachment 1280423 [details] output of systemtap script when running the test against a one-brick volume
Created attachment 1280426 [details] output of systemtap script when running the test against a disperse volume
Hi Pranith (continuation from our earlier IRC chat), here are the statedumps of the bricks and gluster/nfs server (attachment 1280416 [details]). The statedump of gluster/nfs was taken once the test script was hung after trying to release the lock. When checking with systemtap, I can see that in the one-brick case dht_lk_cbk is called, which results in NLM sending a reply to the NFS-client (which then continues and obtains the 2nd lock). The same systemtap script run against a disperse volume shows no dht_lk_cbk call. Based on this and the brick statedumps, I assume the unlock reply is stuck somewhere in ec. If you need more details for debugging, or help in reproducing, let me know. Thanks for your assistance!
Created attachment 1280993 [details] Lock Script
I am able to recreate the issue on latest master.

Steps to recreate the issue:
1. Create an EC volume (2+1 configuration) and FUSE-mount it.
2. Touch a file on the mount point.
3. From a terminal (term-1), acquire a lock on the file and write some data; do not unlock.

[root@server1 setup]# ./lock /LAB/fuse_mounts/mount/file
opening /LAB/fuse_mounts/mount/file
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeeded
locked; hit Enter to unlock...

4. From another terminal (term-2), try acquiring a lock on the same region of the file.

[root@server1 setup]# ./lock /LAB/fuse_mounts/mount/file
opening /LAB/fuse_mounts/mount/file
opened; hit Enter to lock...
locking

5. Try to unlock the file from term-1 (step 3).

Step 5 should go through successfully, but on the problematic code base it hangs. The issue was introduced as part of the code change below:
++++++++++++++++++
commit f2406fa6155267fa747d9342092ee7709a2531a9
Author: Pranith Kumar K <pkarampu>
Date: Fri Jan 27 16:17:49 2017 +0530

    cluster/ec: Fix cthon failures observed with ec volumes

    Since EC already winds one write after other there is no need to align
    application fcntl locks with ec blocks. Also added this locking to be
    done as a transaction to prevent partial upgrade/downgrade of locks
    happening.

    BUG: 1410425
    Change-Id: I7ce8955c2174f62b11e5cb16140e30ff0f7c4c31
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: https://review.gluster.org/16445
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
++++++++++++++++++

Note: Use attachment "Lock Script" to generate the "lock" binary.
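The two-terminal contention in steps 3-5 can be reproduced in a single process with fork(), since POSIX record locks are owned per-process and the child therefore behaves like a second client. The sketch below shows the expected handoff on a local filesystem: the child's non-blocking attempt fails while the lock is held, its blocking attempt returns only after the parent unlocks. The function names (`do_fcntl`, `lock_handoff_demo`) are mine, not from the attachment; on the broken build over EC+GNFS, the parent's unlock is the call that never returns.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set or clear a whole-file record lock (l_start = l_len = 0). */
static int do_fcntl(int fd, int cmd, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    return fcntl(fd, cmd, &fl);
}

/* Returns 0 if the lock handoff behaves as expected. */
static int lock_handoff_demo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* "term-1" (step 3): take the write lock first */
    if (do_fcntl(fd, F_SETLKW, F_WRLCK) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* "term-2" (step 4): a non-blocking attempt must fail with
         * EAGAIN/EACCES while the parent holds the lock... */
        int rc = do_fcntl(fd, F_SETLK, F_WRLCK);
        if (!(rc == -1 && (errno == EAGAIN || errno == EACCES)))
            _exit(1);
        /* ...and a blocking attempt returns only after the parent unlocks */
        _exit(do_fcntl(fd, F_SETLKW, F_WRLCK) == 0 ? 0 : 2);
    }

    sleep(1);                   /* give the child time to block in F_SETLKW */
    /* "term-1" (step 5): release the lock; this is the hanging call in
     * the bug */
    if (do_fcntl(fd, F_SETLK, F_UNLCK) != 0)
        return -1;

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    close(fd);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Pointing `path` at a file on an EC+GNFS mount of an affected build should hang in the parent's F_UNLCK call instead of returning.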
Verified this bug on glusterfs-3.8.4-33.el7rhgs.x86_64.

Steps:
1. Mount the volume on 2 different clients, say client1 and client2, via GNFS.
2. Create a 1G file from client 1.
3. Take a lock on that file from client 1 -> lock is granted.

Client 1
# ./a.out /mnt/GNFS_mani/File1
opening /mnt/GNFS_mani/File1
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeeded
locked; hit Enter to unlock...

4. Try taking a lock on the same file from client 2 -> lock is not granted.

Client 2
# ./a.out /mnt/GNFS_mani/File1
opening /mnt/GNFS_mani/File1
opened; hit Enter to lock...
locking

5. Release the lock from client 1 -> lock is granted to client 2.

Client 1:
locked; hit Enter to unlock...
unlocking

Client 2:
Write succeeeded
locked; hit Enter to unlock...
unlocking

Moving this bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774