Bug 1087563 - SMB: Crash on brick process during compile kernel.
Summary: SMB: Crash on brick process during compile kernel.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Krutika Dhananjay
QA Contact: Ben Turner
URL:
Whiteboard:
Depends On:
Blocks: 1089470 1111759
 
Reported: 2014-04-14 15:59 UTC by Ben Turner
Modified: 2016-09-17 14:38 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.5qa2-0.425.git9360107
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1089470
Environment:
Last Closed: 2014-09-22 19:35:37 UTC
Embargoed:


Attachments
bt (8.91 KB, text/plain), 2014-04-14 16:50 UTC, Ben Turner
core info (111 bytes, text/plain), 2014-04-14 16:55 UTC, Ben Turner


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description Ben Turner 2014-04-14 15:59:03 UTC
Description of problem:

The brick process on the node being mounted via SMB segfaulted during the kernel compile in our sanity test suite:

Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: [2014-04-14 15:37:48.175765] C [mem-pool.c:497:mem_put] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xed) [0x7f4e9809b20d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_finodelk_cbk+0xad) [0x7f4e93ddfe9d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_submit_reply+0x21c) [0x7f4e93dd366c]))) 0-mem-pool: mem_put called on freed ptr 0x7f4ea3e59e84 of mem pool 0x6378b0
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: pending frames:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(30)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: patchset: git://git.gluster.com/glusterfs.git
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: signal received: 11
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: time of crash: 
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: 2014-04-14 15:37:48
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: configuration details:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: argp 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: backtrace 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: dlfcn 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: fdatasync 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: libpthread 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: llistxattr 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: setfsid 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: spinlock 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: epoll.h 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: xattr.h 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: st_atim.tv_nsec 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: package-string: glusterfs 3.5qa2
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: ---------

Version-Release number of selected component (if applicable):

glusterfs-3.5qa2-0.340.gitc193996.el6rhs.x86_64

How reproducible:

I hit this the first time I ran FS sanity on the latest build. 

Steps to Reproduce:
1.  Create a 6x2 volume
2.  Mount the volume from a client using mount -t cifs
3.  Compile the Linux kernel

Actual results:

Segfault of the brick process

Expected results:

The kernel compiles successfully and the brick process does not crash

Additional info:

Comment 1 Ben Turner 2014-04-14 16:49:33 UTC
BT from gdb:

(gdb) bt
#0  inodelk_overlap (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:118
#1  inodelk_conflict (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:134
#2  __inodelk_grantable (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:147
#3  __lock_inodelk (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:206
#4  0x00007f4e98b0d03a in __grant_blocked_inode_locks (this=0x6584b0, pl_inode=0x7f4e74004fe0, dom=0x7f4e7400f320) at inodelk.c:329
#5  grant_blocked_inode_locks (this=0x6584b0, pl_inode=0x7f4e74004fe0, dom=0x7f4e7400f320) at inodelk.c:351
#6  0x00007f4e98b0d529 in pl_inodelk_client_cleanup (this=0x6584b0, ctx=0x7f4e840009b0) at inodelk.c:451
#7  0x00007f4e98b02eca in pl_client_disconnect_cbk (this=0x6584b0, client=<value optimized out>) at posix.c:2549
#8  0x00007f4ea6022fdd in gf_client_disconnect (client=0x6a90d0) at client_t.c:368
#9  0x00007f4e93dd7718 in server_connection_cleanup (this=0x65f1c0, client=0x6a90d0, flags=<value optimized out>) at server-helpers.c:244
#10 0x00007f4e93dd2fcc in server_rpc_notify (rpc=<value optimized out>, xl=0x65f1c0, event=<value optimized out>, data=0x6a83f0) at server.c:558
#11 0x00007f4ea5da3d55 in rpcsvc_handle_disconnect (svc=0x661010, trans=0x6a83f0) at rpcsvc.c:676
#12 0x00007f4ea5da5890 in rpcsvc_notify (trans=0x6a83f0, mydata=<value optimized out>, event=<value optimized out>, data=0x6a83f0) at rpcsvc.c:714
#13 0x00007f4ea5da6f68 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#14 0x00007f4e9b9a0941 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x6a83f0, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:1071
#15 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x6a83f0, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:2240
#16 0x00007f4ea6025347 in event_dispatch_epoll_handler (event_pool=0x637730) at event-epoll.c:384
#17 event_dispatch_epoll (event_pool=0x637730) at event-epoll.c:445
#18 0x0000000000407ba2 in main (argc=19, argv=0x7fffa341dbc8) at glusterfsd.c:1965

Comment 2 Ben Turner 2014-04-14 16:50:11 UTC
Created attachment 886179 [details]
bt

Comment 4 Ben Turner 2014-04-14 16:55:10 UTC
Created attachment 886180 [details]
core info

Comment 5 krishnan parthasarathi 2014-04-15 09:43:48 UTC
Ben,
Could you attach the logs of the brick process that crashed? Could you also check whether any other bricks or clients (SMB/FUSE/NFS) accessing this volume crashed around the same time?

In the log snippet that you provided along with the description, the following message suggests that there might be some memory corruption.
Having the log files would give more context on when or what could have caused the possible memory corruption or triggered the crash.

<snip>

Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: [2014-04-14 15:37:48.175765] C [mem-pool.c:497:mem_put] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xed) [0x7f4e9809b20d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_finodelk_cbk+0xad) [0x7f4e93ddfe9d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_submit_reply+0x21c) [0x7f4e93dd366c]))) 0-mem-pool: mem_put called on freed ptr 0x7f4ea3e59e84 of mem pool 0x6378b0

</snip>
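
For illustration only, the sketch below uses a hypothetical mini pool, not the GlusterFS mem-pool implementation. A "mem_put called on freed ptr" warning is the typical symptom of releasing the same pooled object twice; the second release corrupts the pool's free list, after which later allocations can hand out the same memory and the damage surfaces much later as an apparently unrelated crash:

/* Hypothetical mini pool, for illustration only (not mem-pool.c). A double
 * release of the same object corrupts the free list, so later gets can
 * return the same memory twice. */
#include <stdio.h>
#include <stdlib.h>

struct pool_obj {
        struct pool_obj *next_free;  /* free-list link */
        int              in_use;
};

static struct pool_obj *free_list;

static struct pool_obj *pool_get(void)
{
        struct pool_obj *o = free_list;

        if (o) {
                free_list = o->next_free;
                o->in_use = 1;
        }
        return o;
}

static void pool_put(struct pool_obj *o)
{
        if (!o->in_use)   /* analogous to the warning in the log above */
                fprintf(stderr, "pool_put called on freed ptr %p\n", (void *)o);

        o->in_use = 0;
        o->next_free = free_list;
        free_list = o;
}

int main(void)
{
        static struct pool_obj objs[2] = { { &objs[1], 0 }, { NULL, 0 } };
        struct pool_obj *o;

        free_list = &objs[0];
        o = pool_get();

        pool_put(o);
        pool_put(o);   /* double release: prints the warning and makes the
                          free list point at o itself, so every later get
                          returns the same object */
        return 0;
}

This is only one way such a message can arise; pinning down the actual cause here needs the full brick logs, as requested above.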

thanks.

Comment 6 Krutika Dhananjay 2014-04-15 15:04:49 UTC
This is similar to the bug reported in http://lists.gnu.org/archive/html/gluster-devel/2014-04/msg00077.html, which I have been looking into for 4 days now while Pranith is away.

There are two issues here:

1. Why did the client disconnect from the server?

2. Blocked inodelk list corruption

I was able to root-cause the list corruption part, as follows:

During disconnect, the brick(s) release all blocked inodelks associated with the disconnecting client. But in the current code (in pl_inodelk_client_cleanup()), each blocked inodelk object is freed without first being deleted from the domain's blocked_inodelks list, leaving stale entries in that list. When that list is traversed again later, the process crashes.
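
For illustration only, here is a minimal, self-contained C sketch of the pattern described above; the types and names are hypothetical, not the actual pl_inodelk code. Freeing a lock that is still linked on the domain's blocked list leaves the list pointing at freed memory, and unlinking the node before freeing it is the missing step:

/* Hypothetical sketch, not the actual locks xlator code: freeing a node
 * that is still linked on an intrusive list leaves the list pointing at
 * freed memory, and the next traversal crashes. */
#include <stdio.h>
#include <stdlib.h>

struct blocked_lock {
        struct blocked_lock *prev, *next;  /* intrusive circular-list links */
        int                  client_id;
};

static void list_unlink(struct blocked_lock *l)
{
        l->prev->next = l->next;
        l->next->prev = l->prev;
        l->prev = l->next = l;
}

static void list_push(struct blocked_lock *head, struct blocked_lock *l)
{
        l->next = head->next;
        l->prev = head;
        head->next->prev = l;
        head->next = l;
}

/* Release every blocked lock owned by the disconnecting client. */
static void client_cleanup(struct blocked_lock *head, int client_id)
{
        struct blocked_lock *l = head->next;

        while (l != head) {
                struct blocked_lock *next = l->next;

                if (l->client_id == client_id) {
                        /* The step missing in the buggy cleanup path: unlink
                         * the lock from the blocked list before freeing it,
                         * so later traversals never touch freed memory. */
                        list_unlink(l);
                        free(l);
                }
                l = next;
        }
}

int main(void)
{
        struct blocked_lock head = { &head, &head, -1 };
        struct blocked_lock *a = calloc(1, sizeof(*a));
        struct blocked_lock *b = calloc(1, sizeof(*b));

        a->client_id = 1;
        b->client_id = 2;
        list_push(&head, a);
        list_push(&head, b);

        client_cleanup(&head, 1);   /* client 1 disconnects */

        /* Traversing the blocked list again is now safe. */
        for (struct blocked_lock *l = head.next; l != &head; l = l->next)
                printf("still blocked: client %d\n", l->client_id);

        free(b);
        return 0;
}

The upstream fix follows the same idea: delete the blocked inodelk from the domain's blocked_inodelks list before destroying it.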

Since I am anyway looking into it, taking the bug in my name.

Comment 7 Krutika Dhananjay 2014-04-15 15:41:28 UTC
Note: This issue also exists with blocked_entrylks (in pl_entrylk_client_cleanup()).

Comment 8 Ben Turner 2014-04-15 17:23:50 UTC
This was found during the FS sanity daily runs; the systems were reinstalled to run other component automation, and the logs from that event are lost.  I am attempting to reproduce it now but so far I have been unable to.  One thing to note is that I see a lot of Samba disconnects in the /var/log/messages file:

Apr 15 12:52:04 gqac012 smbd[23120]: [2014/04/15 12:52:04.562926,  0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:52:04 gqac012 smbd[23120]:   testvol: Initialized volume from server localhost
Apr 15 12:52:25 gqac012 logger: 2014-04-15 12:52:25 /usr/bin/rhts-test-runner.sh 86529 2220 hearbeat...
Apr 15 12:53:25 gqac012 logger: 2014-04-15 12:53:25 /usr/bin/rhts-test-runner.sh 86529 2280 hearbeat...
Apr 15 12:54:10 gqac012 smbd[24134]: [2014/04/15 12:54:10.720573,  0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:54:10 gqac012 smbd[24134]:   testvol: Initialized volume from server localhost
Apr 15 12:54:26 gqac012 logger: 2014-04-15 12:54:26 /usr/bin/rhts-test-runner.sh 86529 2340 hearbeat...
Apr 15 12:55:25 gqac012 logger: 2014-04-15 12:55:25 /usr/bin/rhts-test-runner.sh 86529 2400 hearbeat...
Apr 15 12:56:16 gqac012 smbd[25170]: [2014/04/15 12:56:16.890376,  0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:56:16 gqac012 smbd[25170]:   testvol: Initialized volume from server localhost
Apr 15 12:56:25 gqac012 logger: 2014-04-15 12:56:25 /usr/bin/rhts-test-runner.sh 86529 2460 hearbeat...
Apr 15 12:57:26 gqac012 logger: 2014-04-15 12:57:26 /usr/bin/rhts-test-runner.sh 86529 2520 hearbeat...
Apr 15 12:58:23 gqac012 smbd[26192]: [2014/04/15 12:58:23.047897,  0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:58:23 gqac012 smbd[26192]:   testvol: Initialized volume from server localhost
Apr 15 12:58:25 gqac012 logger: 2014-04-15 12:58:25 /usr/bin/rhts-test-runner.sh 86529 2580 hearbeat...
Apr 15 12:59:25 gqac012 logger: 2014-04-15 12:59:25 /usr/bin/rhts-test-runner.sh 86529 2640 hearbeat...
Apr 15 13:00:25 gqac012 logger: 2014-04-15 13:00:25 /usr/bin/rhts-test-runner.sh 86529 2700 hearbeat...
Apr 15 13:00:29 gqac012 smbd[27232]: [2014/04/15 13:00:29.215589,  0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 13:00:29 gqac012 smbd[27232]:   testvol: Initialized volume from server localhost

I'll keep attempting to repro and update with my findings.

Comment 9 Ben Turner 2014-04-15 17:30:39 UTC
In the brick logs I am also seeing:

[2014-04-15 17:25:43.591607] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-7395-2014/04/15-17:25:43:384130-testvol-client-4-0-0
[2014-04-15 17:26:27.943657] E [index.c:268:check_delete_stale_index_file] 0-testvol-index: Base index is not created under index/base_indices_holder
[2014-04-15 17:26:27.944127] E [index.c:268:check_delete_stale_index_file] 0-testvol-index: Base index is not created under index/base_indices_holder
[2014-04-15 17:27:49.912334] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-0
[2014-04-15 17:27:49.912409] I [server-handshake.c:577:server_setvolume] 0-testvol-server: accepted client from gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-0 (version: 3.5qa2)
[2014-04-15 17:27:49.912610] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-

As soon as I reproduce it I'll gather sosreports.

Comment 10 Krutika Dhananjay 2014-04-23 09:26:44 UTC
The following upstream patch fixes issue #2 described in comment #6 and is under review:
                  http://review.gluster.org/#/c/7512/

The following upstream patch, sent by Pranith, fixes issue #1 and is also under review:
                  http://review.gluster.org/#/c/7531/

Comment 13 Nagaprasad Sathyanarayana 2014-05-19 10:56:34 UTC
Setting flags required to add BZs to RHS 3.0 Errata

Comment 15 Ben Turner 2014-05-29 02:39:52 UTC
Verified on:

glusterfs-3.6.0.9-1.el6.x86_64

Comment 17 errata-xmlrpc 2014-09-22 19:35:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

