This is an issue we observed in internal testing: locks were being acquired at the moment bricks were going down because of ping timeouts. 4 of the 6 bricks went down at that time. On the remaining 2 bricks the locks were never released and were left stale.

Steps to recreate the issue:
1) Create a plain disperse volume (6 = 4+2).
2) Put a breakpoint at ec_wind_inodelk in gdb.
3) From the fuse mount, issue ls -laR <mount>.
4) As soon as the breakpoint is hit in gdb, kill 4 of the 6 bricks from another terminal.
5) Quit gdb.
6) Wait a second or two, then check the remaining bricks for stale locks.
7) In my case there were stale locks, so I issued ls -laR on the mount again and it hung.

Relevant logs that led to this conclusion (these failures were on disperse-2 of the 6 = 4+2 setup):

[2016-06-10 17:21:44.690734] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7feed0cd5c32] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7feed0aa084e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7feed0aa095e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7feed0aa22ea] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7feed0aa2b18] ))))) 0-ec-nfsganesha-client-15: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2016-06-10 17:21:44.537422 (xid=0x274d7)
[2016-06-10 17:21:44.771235] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7feed0cd5c32] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7feed0aa084e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7feed0aa095e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7feed0aa22ea] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7feed0aa2b18] ))))) 0-ec-nfsganesha-client-17: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2016-06-10 17:21:44.537520 (xid=0x2740b)
[2016-06-10 17:21:44.773164] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7feed0cd5c32] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7feed0aa084e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7feed0aa095e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7feed0aa22ea] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7feed0aa2b18] ))))) 0-ec-nfsganesha-client-16: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2016-06-10 17:21:44.537487 (xid=0x2740b)
[2016-06-10 17:21:44.808576] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7feed0cd5c32] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7feed0aa084e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7feed0aa095e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7feed0aa22ea] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7feed0aa2b18] ))))) 0-ec-nfsganesha-client-14: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2016-06-10 17:21:44.537377 (xid=0x2740d)

And in the statedump of the first two bricks we see the following ACTIVE locks:

[root@dhcp35-191 gluster]# vi rhs-brick3-ec-nfsganesha.14996.dump.1465582581
[xlator.features.locks.ec-nfsganesha-locks.inode]
path=/126
mandatory=0
inodelk-count=3
lock-dump.domain.domain=ec-nfsganesha-disperse-2:self-heal
lock-dump.domain.domain=ec-nfsganesha-disperse-2
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 6267, owner=682574c0ee7f0000, client=0x7f9d0c22b620, connection-id=dhcp35-98.lab.eng.blr.redhat.com-6267-2016/06/10-17:02:59:402489-ec-nfsganesha-client-12-0-0, granted at 2016-06-10 17:21:44
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 26625, owner=68757bd6777f0000, client=0x7f9d77bac630, connection-id=dhcp35-114.lab.eng.blr.redhat.com-26625-2016/06/10-12:00:18:416748-ec-nfsganesha-client-12-0-0, blocked at 2016-06-10 17:21:45
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 6267, owner=f0ba74c0ee7f0000, client=0x7f9d0c22b620, connection-id=dhcp35-98.lab.eng.blr.redhat.com-6267-2016/06/10-17:02:59:402489-ec-nfsganesha-client-12-0-0, blocked at 2016-06-10 17:32:25
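Checking for stale locks by hand across several brick statedumps is tedious; a small helper can pull out just the ACTIVE inodelk entries. This is a hedged sketch, not part of GlusterFS: the function name find_active_inodelks is illustrative, and the parsing assumes only the path= and inodelk.inodelk[N](STATE)= field layout visible in the dump above.

```python
import re

def find_active_inodelks(statedump_text):
    """Return (path, lock-details) pairs for ACTIVE inodelks in a brick
    statedump, assuming the field layout shown in the dump above."""
    active = []
    path = None
    for line in statedump_text.splitlines():
        line = line.strip()
        # Remember the most recent path= line; it names the inode the
        # subsequent inodelk entries belong to.
        if line.startswith("path="):
            path = line[len("path="):]
        m = re.match(r"inodelk\.inodelk\[\d+\]\(ACTIVE\)=(.*)", line)
        if m:
            active.append((path, m.group(1)))
    return active
```

Run against the dump above, this would report the single ACTIVE WRITE lock on /126 held by pid 6267 while skipping the two BLOCKED waiters.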
REVIEW: http://review.gluster.org/14703 (cluster/ec: Unlock stale locks when inodelk/entrylk/lk fails) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/14703 (cluster/ec: Unlock stale locks when inodelk/entrylk/lk fails) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/14703 (cluster/ec: Unlock stale locks when inodelk/entrylk/lk fails) posted (#3) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/14703 committed in master by Jeff Darcy (jdarcy)
------
commit d25237709278f9530c7a3989a37254c628539375
Author: Pranith Kumar K <pkarampu>
Date: Sat Jun 11 18:43:42 2016 +0530

cluster/ec: Unlock stale locks when inodelk/entrylk/lk fails

Thanks to Rafi for hinting a while back that he had once seen this kind of problem. I didn't think the theory was valid; it could have been caught earlier if I had tested it.

Change-Id: Iac6ffcdba2950aa6f8cf94f8994adeed6e6a9c9b
BUG: 1344836
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/14703
Reviewed-by: Xavier Hernandez <xhernandez>
Smoke: Gluster Build System <jenkins.org>
Tested-by: mohammed rafi kc <rkavunga>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
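The shape of the fix, per the commit subject, is to unlock on the bricks that did grant the lock when the locking fop as a whole fails. The following is a hedged Python sketch of that idea only, not the actual ec xlator code from the patch (which is C); Brick, acquire_with_cleanup, and the quorum check are illustrative names and simplifications.

```python
class Brick:
    """Toy stand-in for a brick connection (illustrative, not GlusterFS API)."""
    def __init__(self, up=True):
        self.up = up
        self.locked = False

    def inodelk(self):
        if not self.up:
            # Models the forced frame unwind seen in the logs when a brick
            # goes down on ping timeout.
            raise ConnectionError("brick is down")
        self.locked = True

    def unlock(self):
        self.locked = False


def acquire_with_cleanup(bricks, quorum):
    """Try to lock every brick; if too few grants come back, release the
    locks that were granted instead of leaving them stale."""
    granted = []
    for b in bricks:
        try:
            b.inodelk()
            granted.append(b)
        except ConnectionError:
            pass
    if len(granted) < quorum:
        # The essence of the fix: without this cleanup loop, the bricks in
        # `granted` would keep ACTIVE locks forever, blocking later fops.
        for b in granted:
            b.unlock()
        raise IOError("lock failed: not enough bricks granted the lock")
    return granted
```

In the scenario from the bug, 4 of 6 bricks go down mid-acquire: the pre-fix behavior corresponds to skipping the cleanup loop, which leaves the 2 surviving bricks holding the ACTIVE locks seen in the statedump and hangs the next ls -laR.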
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/