Description of problem:

There was a segfault in the brick process of the node being mounted via SMB during the kernel-compile portion of our sanity test suite:

Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: [2014-04-14 15:37:48.175765] C [mem-pool.c:497:mem_put] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xed) [0x7f4e9809b20d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_finodelk_cbk+0xad) [0x7f4e93ddfe9d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_submit_reply+0x21c) [0x7f4e93dd366c]))) 0-mem-pool: mem_put called on freed ptr 0x7f4ea3e59e84 of mem pool 0x6378b0
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: pending frames:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(30)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: frame : type(0) op(29)
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: patchset: git://git.gluster.com/glusterfs.git
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: signal received: 11
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: time of crash:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: 2014-04-14 15:37:48
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: configuration details:
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: argp 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: backtrace 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: dlfcn 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: fdatasync 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: libpthread 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: llistxattr 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: setfsid 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: spinlock 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: epoll.h 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: xattr.h 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: st_atim.tv_nsec 1
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: package-string: glusterfs 3.5qa2
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: ---------

Version-Release number of selected component (if applicable):
glusterfs-3.5qa2-0.340.gitc193996.el6rhs.x86_64

How reproducible:
I hit this the first time I ran FS sanity on the latest build.

Steps to Reproduce:
1. Create a 6x2 volume
2. mount -t cifs from a client
3. Compile the Linux kernel

Actual results:
Segfault of the brick process

Expected results:
The kernel compiles correctly

Additional info:
BT from gdb:

(gdb) bt
#0  inodelk_overlap (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:118
#1  inodelk_conflict (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:134
#2  __inodelk_grantable (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:147
#3  __lock_inodelk (this=0x6584b0, pl_inode=<value optimized out>, lock=0x7f4e64005ff0, can_block=1, dom=0x7f4e7400f320) at inodelk.c:206
#4  0x00007f4e98b0d03a in __grant_blocked_inode_locks (this=0x6584b0, pl_inode=0x7f4e74004fe0, dom=0x7f4e7400f320) at inodelk.c:329
#5  grant_blocked_inode_locks (this=0x6584b0, pl_inode=0x7f4e74004fe0, dom=0x7f4e7400f320) at inodelk.c:351
#6  0x00007f4e98b0d529 in pl_inodelk_client_cleanup (this=0x6584b0, ctx=0x7f4e840009b0) at inodelk.c:451
#7  0x00007f4e98b02eca in pl_client_disconnect_cbk (this=0x6584b0, client=<value optimized out>) at posix.c:2549
#8  0x00007f4ea6022fdd in gf_client_disconnect (client=0x6a90d0) at client_t.c:368
#9  0x00007f4e93dd7718 in server_connection_cleanup (this=0x65f1c0, client=0x6a90d0, flags=<value optimized out>) at server-helpers.c:244
#10 0x00007f4e93dd2fcc in server_rpc_notify (rpc=<value optimized out>, xl=0x65f1c0, event=<value optimized out>, data=0x6a83f0) at server.c:558
#11 0x00007f4ea5da3d55 in rpcsvc_handle_disconnect (svc=0x661010, trans=0x6a83f0) at rpcsvc.c:676
#12 0x00007f4ea5da5890 in rpcsvc_notify (trans=0x6a83f0, mydata=<value optimized out>, event=<value optimized out>, data=0x6a83f0) at rpcsvc.c:714
#13 0x00007f4ea5da6f68 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#14 0x00007f4e9b9a0941 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x6a83f0, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:1071
#15 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x6a83f0, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:2240
#16 0x00007f4ea6025347 in event_dispatch_epoll_handler (event_pool=0x637730) at event-epoll.c:384
#17 event_dispatch_epoll (event_pool=0x637730) at event-epoll.c:445
#18 0x0000000000407ba2 in main (argc=19, argv=0x7fffa341dbc8) at glusterfsd.c:1965
Created attachment 886179 [details] bt
Created attachment 886180 [details] core info
Ben,

Could you attach the logs of the brick process that crashed? Could you also check whether any other bricks or clients (SMB/FUSE/NFS) accessing this volume crashed around the same time?

In the log snippet that you provided along with the description, the following message suggests that there might be some memory corruption. Having the log files would provide more context on when or what could have caused the possible memory corruption or triggered the crash.

<snip>
Apr 14 11:37:48 gqac012 bricks-testvol_brick4[8167]: [2014-04-14 15:37:48.175765] C [mem-pool.c:497:mem_put] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xed) [0x7f4e9809b20d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_finodelk_cbk+0xad) [0x7f4e93ddfe9d] (-->/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_submit_reply+0x21c) [0x7f4e93dd366c]))) 0-mem-pool: mem_put called on freed ptr 0x7f4ea3e59e84 of mem pool 0x6378b0
</snip>

Thanks.
This is similar to the bug reported in http://lists.gnu.org/archive/html/gluster-devel/2014-04/msg00077.html, which I've been looking into for 4 days now while Pranith is away.

There are two issues here:
1. Why did the client disconnect from the server?
2. Blocked inodelk list corruption

I was able to root-cause the list corruption part as follows: during a disconnect, the bricks release all blocked inodelks associated with the disconnecting client. But in the current code (in pl_inodelk_client_cleanup()), each blocked inodelk object is freed without first being deleted from the blocked_inodelks list of the domain object, leaving stale inodelks on that list. When the same list is traversed again at some later point, the brick crashes.

Since I am looking into it anyway, taking the bug in my name.
Note: This issue also exists with blocked_entrylks (in pl_entrylk_client_cleanup()).
This was found during FS sanity daily runs; the systems were reinstalled to run other component automation, and the logs from that event are lost. I am attempting to repro now but thus far have been unable to. One thing to note is that I see a lot of Samba disconnects in the /var/log/messages file:

Apr 15 12:52:04 gqac012 smbd[23120]: [2014/04/15 12:52:04.562926, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:52:04 gqac012 smbd[23120]: testvol: Initialized volume from server localhost
Apr 15 12:52:25 gqac012 logger: 2014-04-15 12:52:25 /usr/bin/rhts-test-runner.sh 86529 2220 hearbeat...
Apr 15 12:53:25 gqac012 logger: 2014-04-15 12:53:25 /usr/bin/rhts-test-runner.sh 86529 2280 hearbeat...
Apr 15 12:54:10 gqac012 smbd[24134]: [2014/04/15 12:54:10.720573, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:54:10 gqac012 smbd[24134]: testvol: Initialized volume from server localhost
Apr 15 12:54:26 gqac012 logger: 2014-04-15 12:54:26 /usr/bin/rhts-test-runner.sh 86529 2340 hearbeat...
Apr 15 12:55:25 gqac012 logger: 2014-04-15 12:55:25 /usr/bin/rhts-test-runner.sh 86529 2400 hearbeat...
Apr 15 12:56:16 gqac012 smbd[25170]: [2014/04/15 12:56:16.890376, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:56:16 gqac012 smbd[25170]: testvol: Initialized volume from server localhost
Apr 15 12:56:25 gqac012 logger: 2014-04-15 12:56:25 /usr/bin/rhts-test-runner.sh 86529 2460 hearbeat...
Apr 15 12:57:26 gqac012 logger: 2014-04-15 12:57:26 /usr/bin/rhts-test-runner.sh 86529 2520 hearbeat...
Apr 15 12:58:23 gqac012 smbd[26192]: [2014/04/15 12:58:23.047897, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 12:58:23 gqac012 smbd[26192]: testvol: Initialized volume from server localhost
Apr 15 12:58:25 gqac012 logger: 2014-04-15 12:58:25 /usr/bin/rhts-test-runner.sh 86529 2580 hearbeat...
Apr 15 12:59:25 gqac012 logger: 2014-04-15 12:59:25 /usr/bin/rhts-test-runner.sh 86529 2640 hearbeat...
Apr 15 13:00:25 gqac012 logger: 2014-04-15 13:00:25 /usr/bin/rhts-test-runner.sh 86529 2700 hearbeat...
Apr 15 13:00:29 gqac012 smbd[27232]: [2014/04/15 13:00:29.215589, 0] modules/vfs_glusterfs.c:282(vfs_gluster_connect)
Apr 15 13:00:29 gqac012 smbd[27232]: testvol: Initialized volume from server localhost

I'll keep attempting to repro and will update with my findings.
In the brick logs I am also seeing:

[2014-04-15 17:25:43.591607] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-7395-2014/04/15-17:25:43:384130-testvol-client-4-0-0
[2014-04-15 17:26:27.943657] E [index.c:268:check_delete_stale_index_file] 0-testvol-index: Base index is not created under index/base_indices_holder
[2014-04-15 17:26:27.944127] E [index.c:268:check_delete_stale_index_file] 0-testvol-index: Base index is not created under index/base_indices_holder
[2014-04-15 17:27:49.912334] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-0
[2014-04-15 17:27:49.912409] I [server-handshake.c:577:server_setvolume] 0-testvol-server: accepted client from gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-0 (version: 3.5qa2)
[2014-04-15 17:27:49.912610] I [client_t.c:184:gf_client_get] 0-testvol-server: client_uid=gqac012.sbu.lab.eng.bos.redhat.com-8418-2014/04/15-17:27:49:815187-testvol-client-4-0-

As soon as I repro I'll gather sosreports.
The following upstream patch fixes issue #2 described in comment #6 and is under review:
http://review.gluster.org/#/c/7512/

And the following upstream patch, sent by Pranith, fixes issue #1 and is under review:
http://review.gluster.org/#/c/7531/
Setting flags required to add BZs to RHS 3.0 Errata
Verified on: glusterfs-3.6.0.9-1.el6.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html