+++ This bug was initially created as a clone of Bug #1087563 +++

Description of problem:

There was a segfault in the brick process on the node being mounted via SMB during the kernel-compile part of our sanity test suite:

Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: [2014-04-14 15:37:48.175765] C [mem-pool.c:497:mem_put] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xed) [0x7f4e9809b20d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_finodelk_cbk+0xad) [0x7f4e93ddfe9d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_submit_reply+0x21c) [0x7f4e93dd366c]))) 0-mem-pool: mem_put called on freed ptr 0x7f4ea3e59e84 of mem pool 0x6378b0
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: pending frames:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(30)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: patchset: git://git.gluster.com/glusterfs.git
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: signal received: 11
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: time of crash:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: 2014-04-14 15:37:48
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: configuration details:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: argp 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: backtrace 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: dlfcn 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: fdatasync 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: libpthread 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: llistxattr 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: setfsid 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: spinlock 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: epoll.h 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: xattr.h 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: st_atim.tv_nsec 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: package-string: glusterfs 3.5qa2
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: ---------

Version-Release number of selected component (if applicable):
glusterfs-3.5qa2-0.340.gitc193996.el6rhs.x86_64

How reproducible:
I hit this the first time I ran FS sanity on the latest build.

Steps to Reproduce:
1. Create a 6x2 volume
2. Mount -t cifs from a client
3. Compile the Linux kernel

Actual results:
Segfault of the brick process

Expected results:
The kernel compiles correctly

Additional info:

--- Additional comment from Ben Turner on 2014-04-14 12:49:33 EDT ---

BT from gdb:

(gdb) bt
#0  inodelk_overlap (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:118
#1  inodelk_conflict (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:134
#2  __inodelk_grantable (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:147
#3  __lock_inodelk (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:206
#4  0x00007f4e98b0d03a in __grant_blocked_inode_locks (this=0x6584b0, pl_inode=0x7f4e74004fe0, dom=0x7f4e7400f320) at inodelk.c:329
#5  grant_blocked_inode_locks (this=0x6584b0, pl_inode=0x7f4e74004fe0, dom=0x7f4e7400f320) at inodelk.c:351
#6  0x00007f4e98b0d529 in pl_inodelk_client_cleanup (this=0x6584b0, ctx=0x7f4e840009b0) at inodelk.c:451
#7  0x00007f4e98b02eca in pl_client_disconnect_cbk (this=0x6584b0, client=<value optimized out>) at posix.c:2549
#8  0x00007f4ea6022fdd in gf_client_disconnect (client=0x6a90d0) at client_t.c:368
#9  0x00007f4e93dd7718 in server_connection_cleanup (this=0x65f1c0, client=0x6a90d0, flags=<value optimized out>) at server-helpers.c:244
#10 0x00007f4e93dd2fcc in server_rpc_notify (rpc=<value optimized out>, xl=0x65f1c0, event=<value optimized out>, data=0x6a83f0) at server.c:558
#11 0x00007f4ea5da3d55 in rpcsvc_handle_disconnect (svc=0x661010, trans=0x6a83f0) at rpcsvc.c:676
#12 0x00007f4ea5da5890 in rpcsvc_notify (trans=0x6a83f0, mydata=<value optimized out>, event=<value optimized out>, data=0x6a83f0) at rpcsvc.c:714
#13 0x00007f4ea5da6f68 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#14 0x00007f4e9b9a0941 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x6a83f0, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:1071
#15 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x6a83f0, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:2240
#16 0x00007f4ea6025347 in event_dispatch_epoll_handler (event_pool=0x637730) at event-epoll.c:384
#17 event_dispatch_epoll (event_pool=0x637730) at event-epoll.c:445
#18 0x0000000000407ba2 in main (argc=19, argv=0x7fffa341dbc8) at glusterfsd.c:1965

--- Additional comment from krishnan parthasarathi on 2014-04-15 05:43:48 EDT ---

Ben,

Could you attach the logs of the brick process that crashed? Could you also check whether any other bricks or clients (SMB/FUSE/NFS) accessing this volume crashed around the same time?

In the log snippet that you provided with the description, the following messages suggest that there might be some memory corruption. Having the log files would provide more context on when or what could have caused the possible memory corruption or triggered the crash.

<snip>
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: [2014-04-14 15:37:48.175765] C [mem-pool.c:497:mem_put] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xed) [0x7f4e9809b20d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_finodelk_cbk+0xad) [0x7f4e93ddfe9d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_submit_reply+0x21c) [0x7f4e93dd366c]))) 0-mem-pool: mem_put called on freed ptr 0x7f4ea3e59e84 of mem pool 0x6378b0
</snip>

Thanks.

--- Additional comment from Krutika Dhananjay on 2014-04-15 11:04:49 EDT ---

This is similar to the bug reported in http://lists.gnu.org/archive/html/gluster-devel/2014-04/msg00077.html, which I have been looking into for four days now while Pranith is away.

There are two issues here:
1. Why did the client disconnect from the server?
2. Blocked inodelk list corruption

I was able to root-cause the list-corruption part, as follows: during disconnect, the bricks release all blocked inodelks associated with the disconnecting client. But in the current code (in pl_inodelk_client_cleanup()), each blocked inodelk object is freed without being deleted from the blocked_inodelks list of the domain object, so that list ends up containing stale inodelks. When the same list is traversed again at some later point, the brick process crashes.

Since I am looking into it anyway, I am taking the bug in my name.
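To make the root cause above concrete, here is a minimal, self-contained C sketch of the pattern involved (illustrative names only, not GlusterFS source): a blocked-lock object linked into a per-domain blocked_locks list must be unlinked before it is freed. The pre-fix cleanup freed the object without unlinking it, leaving the list pointing at freed memory, so the next traversal crashed.

/* Minimal sketch (not GlusterFS source) of the list-corruption pattern
 * described above, using illustrative names. The pre-fix cleanup freed a
 * blocked-lock object while it was still linked into the domain's
 * blocked_locks list; the fix unlinks it first so later traversals never
 * touch freed memory. */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
        n->prev = h->prev; n->next = h;
        h->prev->next = n; h->prev = n;
}

static void list_del(struct list_head *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
        n->next = n->prev = n;
}

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* stand-in for a blocked inodelk object */
struct blocked_lock {
        int              client_id;
        struct list_head blocked_list;   /* linked into domain->blocked_locks */
};

int main(void)
{
        struct list_head blocked_locks;          /* per-domain list */
        list_init(&blocked_locks);

        struct blocked_lock *l = calloc(1, sizeof(*l));
        l->client_id = 42;
        list_add_tail(&l->blocked_list, &blocked_locks);

        /* Client 42 disconnects. The buggy cleanup effectively did only
         * free(l) here, leaving blocked_locks pointing at freed memory; the
         * next walk of the list then crashed. The correct order is:
         * unlink first, then free. */
        list_del(&l->blocked_list);
        free(l);

        /* Safe: the list is empty, nothing stale to visit. */
        for (struct list_head *p = blocked_locks.next; p != &blocked_locks;
             p = p->next) {
                struct blocked_lock *b =
                        container_of(p, struct blocked_lock, blocked_list);
                printf("blocked lock of client %d\n", b->client_id);
        }
        printf("cleanup done without touching freed memory\n");
        return 0;
}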
REVIEW: http://review.gluster.org/7512 (features/locks: Remove stale inodelk objects from 'blocked_locks' list) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/7512 (features/locks: Remove stale inodelk objects from 'blocked_locks' list) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/7531 (rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7531 (rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling) posted (#3) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/7531 committed in master by Anand Avati (avati)
------
commit bc434b3ca961757ade8c6093f4ff0dbe4b3a4672
Author: Pranith Kumar K <pkarampu>
Date: Wed Apr 23 14:05:10 2014 +0530

rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling

Problem:
When iozone is in progress, number of blocking inodelks sometimes becomes greater than the threshold number of rpc requests allowed for that client (RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT). Subsequent requests from that client will not be read until all the outstanding requests are processed and replied to. But because no more requests are read from that client, unlocks on the already granted locks will never come, thus the number of outstanding requests would never come down. This leads to a ping-timeout on the client.

Fix:
Do not account INODELK/ENTRYLK/LK for throttling

Change-Id: I59c6b54e7ec24ed7375ff977e817a9cb73469806
BUG: 1089470
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/7531
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Krutika Dhananjay <kdhananj>
Reviewed-by: Anand Avati <avati>
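The commit above changes what counts toward the per-client outstanding-RPC limit. Below is a hedged, self-contained C sketch of that idea with made-up names (client_conn, counts_toward_throttle, and the limit constant are illustrative, not the real rpcsvc API): blocked lock requests can legitimately remain outstanding for a long time, so exempting INODELK/ENTRYLK/LK from the count keeps the server reading from the client and lets the eventual unlocks through.

/* Hedged sketch of the throttling exemption described above, with made-up
 * names; the real change lives in GlusterFS's rpcsvc code. */
#include <stdbool.h>
#include <stdio.h>

#define OUTSTANDING_RPC_LIMIT 64  /* stand-in for RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT */

enum fop { FOP_WRITE, FOP_INODELK, FOP_ENTRYLK, FOP_LK, FOP_OTHER };

struct client_conn {
        int outstanding;          /* non-lock requests currently in flight */
};

/* Lock fops are exempt: a blocked lock may stay outstanding indefinitely,
 * and the unlock that releases it must still be readable from the socket. */
static bool counts_toward_throttle(enum fop op)
{
        return op != FOP_INODELK && op != FOP_ENTRYLK && op != FOP_LK;
}

/* True if the server should keep reading requests from this client. */
static bool may_read_next_request(const struct client_conn *c)
{
        return c->outstanding < OUTSTANDING_RPC_LIMIT;
}

static void request_received(struct client_conn *c, enum fop op)
{
        if (counts_toward_throttle(op))
                c->outstanding++;
}

static void request_replied(struct client_conn *c, enum fop op)
{
        if (counts_toward_throttle(op))
                c->outstanding--;
}

int main(void)
{
        struct client_conn c = { 0 };

        /* Pile up many blocked INODELKs: with the exemption they do not
         * consume the limit, so the client's unlocks can still be read. */
        for (int i = 0; i < 1000; i++)
                request_received(&c, FOP_INODELK);

        request_received(&c, FOP_WRITE);
        printf("outstanding=%d, may read next request: %s\n",
               c.outstanding, may_read_next_request(&c) ? "yes" : "no");
        request_replied(&c, FOP_WRITE);
        return 0;
}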
COMMIT: http://review.gluster.org/7512 committed in master by Anand Avati (avati)
------
commit 500a656c91558dd7913f572369f20b8550e9e98d
Author: Krutika Dhananjay <kdhananj>
Date: Sat Apr 19 20:03:38 2014 +0530

features/locks: Remove stale inodelk objects from 'blocked_locks' list

* In the event of a DISCONNECT from a client, as part of cleanup, inodelk objects are not removed from the blocked_locks list before being unref'd and freed, causing the brick process to crash at some point when the (now) stale object is accessed again in the list.

* Also during cleanup, it is pointless to try and grant lock to a previously blocked inodelk (say L1) as part of releasing another conflicting lock (L2), (which is a side-effect of L1 not being deleted from blocked_locks list before grant_blocked_inode_locks() in cleanup) if L1 is also associated with the DISCONNECTing client. This patch fixes the problem.

* Also, the codepath in cleanup of entrylks seems to be granting blocked inodelks, when it should be attempting to grant blocked entrylks, which is fixed in this patch.

Change-Id: I8493365c33020333b3f61aa15f505e4e7e6a9891
BUG: 1089470
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: http://review.gluster.org/7512
Reviewed-by: Raghavendra G <rgowdapp>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Krishnan Parthasarathi <kparthas>
Reviewed-by: Anand Avati <avati>
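For illustration, here is a small self-contained C sketch of the cleanup behaviour the commit message describes, using invented types and a simplified singly-linked list rather than the real pl_inodelk_client_cleanup() code: locks of the disconnecting client are unlinked before they are freed, and only locks belonging to surviving clients remain as candidates to be granted afterwards.

/* Hedged sketch (illustrative names, not GlusterFS source) of the two points
 * in the commit above: (1) unlink a blocked lock before freeing it, and
 * (2) do not try to grant a blocked lock that belongs to the disconnecting
 * client itself. */
#include <stdio.h>
#include <stdlib.h>

struct lock {
        int          client_id;
        struct lock *next;        /* singly linked list, for brevity */
};

/* Remove every lock owned by `client_id` from the list; return how many
 * remaining blocked locks are still candidates to be granted. */
static int client_cleanup(struct lock **blocked_locks, int client_id)
{
        int retry = 0;
        struct lock **pp = blocked_locks;

        while (*pp) {
                struct lock *l = *pp;
                if (l->client_id == client_id) {
                        /* Point 1: unlink BEFORE freeing, so the list never
                         * holds a pointer to freed memory. */
                        *pp = l->next;
                        free(l);
                } else {
                        /* Point 2: only locks of surviving clients are worth
                         * retrying once the dead client's locks are gone. */
                        retry++;
                        pp = &l->next;
                }
        }
        return retry;
}

static void add(struct lock **head, int client_id)
{
        struct lock *l = calloc(1, sizeof(*l));
        l->client_id = client_id;
        l->next = *head;
        *head = l;
}

int main(void)
{
        struct lock *blocked_locks = NULL;
        add(&blocked_locks, 1);
        add(&blocked_locks, 2);   /* client 2 is about to disconnect */
        add(&blocked_locks, 2);

        int retry = client_cleanup(&blocked_locks, 2);
        printf("locks left to retry after cleanup: %d\n", retry);  /* prints 1 */

        while (blocked_locks) {   /* tidy up the demo list */
                struct lock *l = blocked_locks;
                blocked_locks = l->next;
                free(l);
        }
        return 0;
}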
REVIEW: http://review.gluster.org/7560 (features/locks: Remove stale entrylk objects from 'blocked_locks' list) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/7560 committed in master by Anand Avati (avati)
------
commit 6a188c6b2c95d16c1bb6391c9fcb8ef808c2141b
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 24 16:37:05 2014 +0530

features/locks: Remove stale entrylk objects from 'blocked_locks' list

* In the event of a DISCONNECT from a client, as part of cleanup, entrylk objects are not removed from the blocked_locks list before being unref'd and freed, causing the brick process to crash at some point when the (now) stale object is accessed again in the list.

* Also during cleanup, it is pointless to try and grant lock to a previously blocked entrylk (say L1) as part of releasing another conflicting lock (L2), (which is a side-effect of L1 not being deleted from blocked_locks list before grant_blocked_entry_locks() in cleanup) if L1 is also associated with the DISCONNECTing client. This patch fixes the problem.

Change-Id: I3d684c6bafc7e6db89ba68f0a2ed1dcb333791c6
BUG: 1089470
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: http://review.gluster.org/7560
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Anand Avati <avati>
All relevant patches have been merged upstream. Moving the bug to MODIFIED state.
REVIEW: http://review.gluster.org/7570 (rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling) posted (#1) for review on release-3.5 by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/7575 (features/locks: Remove stale inodelk objects from 'blocked_locks' list) posted (#1) for review on release-3.5 by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/7576 (features/locks: Remove stale entrylk objects from 'blocked_locks' list) posted (#1) for review on release-3.5 by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/7575 committed in release-3.5 by Vijay Bellur (vbellur)
------
commit 82c244a390679e03ea25830abbb90b0dcc7a69cc
Author: Krutika Dhananjay <kdhananj>
Date: Sat Apr 19 20:03:38 2014 +0530

features/locks: Remove stale inodelk objects from 'blocked_locks' list

Backport of http://review.gluster.org/7512

* In the event of a DISCONNECT from a client, as part of cleanup, inodelk objects are not removed from the blocked_locks list before being unref'd and freed, causing the brick process to crash at some point when the (now) stale object is accessed again in the list.

* Also during cleanup, it is pointless to try and grant lock to a previously blocked inodelk (say L1) as part of releasing another conflicting lock (L2), (which is a side-effect of L1 not being deleted from blocked_locks list before grant_blocked_inode_locks() in cleanup) if L1 is also associated with the DISCONNECTing client. This patch fixes the problem.

Change-Id: I84d884e203761d0b071183860ffe8cae1f212ea5
BUG: 1089470
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: http://review.gluster.org/7575
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Vijay Bellur <vbellur>
COMMIT: http://review.gluster.org/7570 committed in release-3.5 by Niels de Vos (ndevos)
------
commit 45a0322066513259e61c7a4b1b1ed1a0bd3a0827
Author: Pranith Kumar K <pkarampu>
Date: Wed Apr 23 14:05:10 2014 +0530

rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling

Problem:
When iozone is in progress, number of blocking inodelks sometimes becomes greater than the threshold number of rpc requests allowed for that client (RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT). Subsequent requests from that client will not be read until all the outstanding requests are processed and replied to. But because no more requests are read from that client, unlocks on the already granted locks will never come, thus the number of outstanding requests would never come down. This leads to a ping-timeout on the client.

Fix:
Do not account INODELK/ENTRYLK/LK for throttling

BUG: 1089470
Change-Id: I9cc2c259d2462159cea913d95f98a565acb8e0c0
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/7570
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Krutika Dhananjay <kdhananj>
Reviewed-by: Niels de Vos <ndevos>
COMMIT: http://review.gluster.org/7576 committed in release-3.5 by Niels de Vos (ndevos)
------
commit d8014f53c22a7da2d7b38e7d3215aa83e3e51d0d
Author: Krutika Dhananjay <kdhananj>
Date: Thu Apr 24 16:37:05 2014 +0530

features/locks: Remove stale entrylk objects from 'blocked_locks' list

Backport of http://review.gluster.org/7560

* In the event of a DISCONNECT from a client, as part of cleanup, entrylk objects are not removed from the blocked_locks list before being unref'd and freed, causing the brick process to crash at some point when the (now) stale object is accessed again in the list.

* Also during cleanup, it is pointless to try and grant lock to a previously blocked entrylk (say L1) as part of releasing another conflicting lock (L2), (which is a side-effect of L1 not being deleted from blocked_locks list before grant_blocked_entry_locks() in cleanup) if L1 is also associated with the DISCONNECTing client. This patch fixes the problem.

Change-Id: Ie077f8eeb61c5505f047a8fdaac67db32e5d4270
BUG: 1089470
Signed-off-by: Krutika Dhananjay <kdhananj>
Reviewed-on: http://review.gluster.org/7576
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Niels de Vos <ndevos>
Verified on glusterfs-3.6.0-1.0.el6rhs.x86_64.
The first (and last?) Beta for GlusterFS 3.5.1 has been released [1]. Please verify whether this release resolves the bug for you. If the glusterfs-3.5.1beta release does not contain a fix for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure (possibly an "updates-testing" repository) for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-May/040377.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.5.1, please reopen this bug report.

glusterfs-3.5.1 has been announced on the Gluster Users mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-June/040723.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
*** Bug 1108850 has been marked as a duplicate of this bug. ***