Description of problem:

There was a segfault in the brick process of the node being mounted via SMB during the kernel-compile portion of our sanity test suite:

Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: [2014-04-14 15:37:48.175765] C [mem-pool.c:497:mem_put] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xed) [0x7f4e9809b20d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_finodelk_cbk+0xad) [0x7f4e93ddfe9d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_submit_reply+0x21c) [0x7f4e93dd366c]))) 0-mem-pool: mem_put called on freed ptr 0x7f4ea3e59e84 of mem pool 0x6378b0
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: pending frames:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(30)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: patchset: git://git.gluster.com/glusterfs.git
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: signal received: 11
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: time of crash:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: 2014-04-14 15:37:48
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: configuration details:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: argp 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: backtrace 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: dlfcn 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: fdatasync 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: libpthread 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: llistxattr 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: setfsid 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: spinlock 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: epoll.h 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: xattr.h 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: st_atim.tv_nsec 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: package-string: glusterfs 3.5qa2
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: ---------

Version-Release number of selected component (if applicable):
glusterfs-3.5qa2-0.340.gitc193996.el6rhs.x86_64

How reproducible:
I hit this the first time I ran FS sanity on the latest build.

Steps to Reproduce:
1. Create a 6x2 volume
2. mount -t cifs from a client
3. Compile the Linux kernel

Actual results:
Segfault of the brick process

Expected results:
The kernel compiles correctly

Additional info:
BT from gdb:

(gdb) bt
#0  inodelk_overlap (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:118
#1  inodelk_conflict (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:134
#2  __inodelk_grantable (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:147
#3  __lock_inodelk (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:206
#4  0x00007f4e98b0d03a in __grant_blocked_inode_locks (this=0x6584b0, pl_inode=0x7f4e74004fe0, dom=0x7f4e7400f320) at inodelk.c:329
#5  grant_blocked_inode_locks (this=0x6584b0, pl_inode=0x7f4e74004fe0, dom=0x7f4e7400f320) at inodelk.c:351
#6  0x00007f4e98b0d529 in pl_inodelk_client_cleanup (this=0x6584b0, ctx=0x7f4e840009b0) at inodelk.c:451
#7  0x00007f4e98b02eca in pl_client_disconnect_cbk (this=0x6584b0, client=<value optimized out>) at posix.c:2549
#8  0x00007f4ea6022fdd in gf_client_disconnect (client=0x6a90d0) at client_t.c:368
#9  0x00007f4e93dd7718 in server_connection_cleanup (this=0x65f1c0, client=0x6a90d0, flags=<value optimized out>) at server-helpers.c:244
#10 0x00007f4e93dd2fcc in server_rpc_notify (rpc=<value optimized out>, xl=0x65f1c0, event=<value optimized out>, data=0x6a83f0) at server.c:558
#11 0x00007f4ea5da3d55 in rpcsvc_handle_disconnect (svc=0x661010, trans=0x6a83f0) at rpcsvc.c:676
#12 0x00007f4ea5da5890 in rpcsvc_notify (trans=0x6a83f0, mydata=<value optimized out>, event=<value optimized out>, data=0x6a83f0) at rpcsvc.c:714
#13 0x00007f4ea5da6f68 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#14 0x00007f4e9b9a0941 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x6a83f0, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:1071
#15 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x6a83f0, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:2240
#16 0x00007f4ea6025347 in event_dispatch_epoll_handler (event_pool=0x637730) at event-epoll.c:384
#17 event_dispatch_epoll (event_pool=0x637730) at event-epoll.c:445
#18 0x0000000000407ba2 in main (argc=19, argv=0x7fffa341dbc8) at glusterfsd.c:1965
Created attachment 886179 [details] bt
Created attachment 886180 [details] core info
Ben,

Could you attach the logs of the brick process that crashed? Could you also check whether any other bricks or clients (SMB/FUSE/NFS) accessing this volume crashed around the same time?

In the log snippet that you provided along with the description, the following message suggests that there might be some memory corruption. Having the log files would provide more context on when or what could have caused the possible memory corruption or triggered the crash.

<snip>
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: [2014-04-14 15:37:48.175765] C [mem-pool.c:497:mem_put] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xed) [0x7f4e9809b20d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_finodelk_cbk+0xad) [0x7f4e93ddfe9d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_submit_reply+0x21c) [0x7f4e93dd366c]))) 0-mem-pool: mem_put called on freed ptr 0x7f4ea3e59e84 of mem pool 0x6378b0
</snip>

Thanks.
This is similar to the bug reported in http://lists.gnu.org/archive/html/gluster-devel/2014-04/msg00077.html, which I've been looking into for 4 days now while Pranith is away.

There are two issues here:
1. Why did the client disconnect from the server?
2. Blocked inodelk list corruption

I was able to root-cause the list corruption part as follows: during a disconnect, the bricks release all blocked inodelks associated with the disconnecting client. But in the current code (in pl_inodelk_client_cleanup()), each blocked inodelk object is freed without first being deleted from the blocked_inodelks list of the domain object, leaving stale inodelks on that list. When the same list is traversed again at some later point, the brick crashes.

Since I am looking into it anyway, taking the bug in my name.
Note: This issue also exists with blocked_entrylks (in pl_entrylk_client_cleanup()).
This was found during FS sanity daily runs; the systems were reinstalled to run other component automation, and the logs from that event are lost. I am attempting to repro now but thus far have been unable to. One thing to note is that I see a lot of Samba disconnects in the /var/log/messages file:

Apr 15 12:52:04 gqac012 smbd[23120]: [2014/04/15 12:52:04.562926, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:52:04 gqac012 smbd[23120]: testvol: Initialized volume from server localhost
Apr 15 12:52:25 gqac012 logger: 2014-04-15 12:52:25 /usr/bin/rhts-test-runner.sh 86529 2220 hearbeat...
Apr 15 12:53:25 gqac012 logger: 2014-04-15 12:53:25 /usr/bin/rhts-test-runner.sh 86529 2280 hearbeat...
Apr 15 12:54:10 gqac012 smbd[24134]: [2014/04/15 12:54:10.720573, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:54:10 gqac012 smbd[24134]: testvol: Initialized volume from server localhost
Apr 15 12:54:26 gqac012 logger: 2014-04-15 12:54:26 /usr/bin/rhts-test-runner.sh 86529 2340 hearbeat...
Apr 15 12:55:25 gqac012 logger: 2014-04-15 12:55:25 /usr/bin/rhts-test-runner.sh 86529 2400 hearbeat...
Apr 15 12:56:16 gqac012 smbd[25170]: [2014/04/15 12:56:16.890376, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:56:16 gqac012 smbd[25170]: testvol: Initialized volume from server localhost
Apr 15 12:56:25 gqac012 logger: 2014-04-15 12:56:25 /usr/bin/rhts-test-runner.sh 86529 2460 hearbeat...
Apr 15 12:57:26 gqac012 logger: 2014-04-15 12:57:26 /usr/bin/rhts-test-runner.sh 86529 2520 hearbeat...
Apr 15 12:58:23 gqac012 smbd[26192]: [2014/04/15 12:58:23.047897, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:58:23 gqac012 smbd[26192]: testvol: Initialized volume from server localhost
Apr 15 12:58:25 gqac012 logger: 2014-04-15 12:58:25 /usr/bin/rhts-test-runner.sh 86529 2580 hearbeat...
Apr 15 12:59:25 gqac012 logger: 2014-04-15 12:59:25 /usr/bin/rhts-test-runner.sh 86529 2640 hearbeat...
Apr 15 13:00:25 gqac012 logger: 2014-04-15 13:00:25 /usr/bin/rhts-test-runner.sh 86529 2700 hearbeat...
Apr 15 13:00:29 gqac012 smbd[27232]: [2014/04/15 13:00:29.215589, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 13:00:29 gqac012 smbd[27232]: testvol: Initialized volume from server localhost

I'll keep attempting to repro and will update with my findings.
In the brick logs I am also seeing:

[2014-04-15 17:25:43.591607] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-7395-2014/04/15-17:25:43:384130-testvol-client-4-0-0
[2014-04-15 17:26:27.943657] E [index.c:268:check_delete_stale_index_file] 0-testvol-index: Base index is not created under index/base_indices_holder
[2014-04-15 17:26:27.944127] E [index.c:268:check_delete_stale_index_file] 0-testvol-index: Base index is not created under index/base_indices_holder
[2014-04-15 17:27:49.912334] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-0
[2014-04-15 17:27:49.912409] I [server-handshake.c:577:server_setvolume] 0-testvol-server: accepted client from gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-0 (version: 3.5qa2)
[2014-04-15 17:27:49.912610] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-

As soon as I repro I'll gather sosreports.
The following upstream patch fixes issue #2 described in comment #6 and is under review:
http://review.gluster.org/#/c/7512/

And the following upstream patch, sent by Pranith, fixes issue #1 and is under review:
http://review.gluster.org/#/c/7531/
Setting flags required to add BZs to RHS 3.0 Errata
Verified on: glusterfs-3.6.0.9-1.el6.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html