Bug 1099270 - Gluster 3.5.0 NFS server crashes under load
Summary: Gluster 3.5.0 NFS server crashes under load
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: nfs
Version: 3.5.0
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On: 1010241 1067256 1166278 1196898
Blocks: glusterfs-3.5.10
TreeView+ depends on / blocked
 
Reported: 2014-05-20 01:13 UTC by Franco Broi
Modified: 2016-06-17 16:24 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-17 16:24:04 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Franco Broi 2014-05-20 01:13:28 UTC
Description of problem:

Gluster NFS server crashes with core dump


Version-Release number of selected component (if applicable):

3.5.0


How reproducible:

Crashed twice now, can probably reproduce the crash again.

Steps to Reproduce:
1. Start the NFS server
2. Using a single client, start 7+ random read/write tests
3. Wait for crash

Actual results:

Crashes after 30-40 minutes. core file is 16GB, 48MB after compression.

Expected results:


Additional info:

[2014-05-19 08:13:00.084506] W [rpcsvc.c:211:rpcsvc_program_actor] 0-rpc-service: RPC program version not available (req 100003 4)
[2014-05-19 08:13:00.084537] E [rpcsvc.c:495:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-05-19 08:50:36configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.5.0

Traceback crash 1
#0  0x00007f0e3d924925 in raise () from /lib64/libc.so.6
#1  0x00007f0e3d926105 in abort () from /lib64/libc.so.6
#2  0x00007f0e3d962837 in __libc_message () from /lib64/libc.so.6
#3  0x00007f0e3d968166 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f0e3f0e2e0f in rpcsvc_drc_op_destroy (drc=0x21e2780, reply=0x7f0e302ee470) at rpc-drc.c:47
#5  0x00007f0e3f331bc1 in rb_destroy (tree=0x7f0e302ee02c, destroy=0x7f0e3f0e2e80 <rpcsvc_drc_rb_op_destroy>) at ../../contrib/rbtree/rb.c:876
#6  0x00007f0e3f0e2b5f in rpcsvc_remove_drc_client (drc=0x21e2780, client=0x230e540) at rpc-drc.c:84
#7  rpcsvc_drc_client_unref (drc=0x21e2780, client=0x230e540) at rpc-drc.c:316
#8  0x00007f0e3f0e2c98 in rpcsvc_drc_notify (svc=<value optimized out>, xl=<value optimized out>, event=<value optimized out>, data=0x230f670)
    at rpc-drc.c:683
#9  0x00007f0e3f0d9d35 in rpcsvc_handle_disconnect (svc=0x21a6990, trans=0x230f670) at rpcsvc.c:682
#10 0x00007f0e3f0db880 in rpcsvc_notify (trans=0x230f670, mydata=<value optimized out>, event=<value optimized out>, data=0x230f670)
    at rpcsvc.c:720
#11 0x00007f0e3f0dcf98 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:512
#12 0x00007f0e3a93c9a1 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x230f670, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1071
#13 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x230f670, poll_in=<value optimized out>, poll_out=0, 
    poll_err=0) at socket.c:2239
#14 0x00007f0e3f3512f7 in event_dispatch_epoll_handler (event_pool=0x2186ef0) at event-epoll.c:384
#15 event_dispatch_epoll (event_pool=0x2186ef0) at event-epoll.c:445
#16 0x00000000004075e4 in main (argc=11, argv=0x7fffabef9e38) at glusterfsd.c:1983

#4  0x00007f0e3f0e2e0f in rpcsvc_drc_op_destroy (drc=0x21e2780, reply=0x7f0e302ee470) at rpc-drc.c:47
47	                GF_FREE (reply->msg.rpchdr);

(gdb) p *reply->msg.rpchdr
$4 = {iov_base = 0x2100000000000000, iov_len = 1152921504606846976}

Traceback crash 2
#0  0x00007f67b8194272 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00007f67b8197405 in _int_malloc () from /lib64/libc.so.6
#2  0x00007f67b8198626 in calloc () from /lib64/libc.so.6
#3  0x00007f67b9daad2f in _dl_new_object () from /lib64/ld-linux-x86-64.so.2
#4  0x00007f67b9da71be in _dl_map_object_from_fd () from /lib64/ld-linux-x86-64.so.2
#5  0x00007f67b9da837a in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#6  0x00007f67b9db2a44 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#7  0x00007f67b9dae1b6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#8  0x00007f67b9db24fa in _dl_open () from /lib64/ld-linux-x86-64.so.2
#9  0x00007f67b8244e10 in do_dlopen () from /lib64/libc.so.6
#10 0x00007f67b9dae1b6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#11 0x00007f67b8244f67 in __libc_dlopen_mode () from /lib64/libc.so.6
#12 0x00007f67b821cb25 in init () from /lib64/libc.so.6
#13 0x00007f67b889ed33 in pthread_once () from /lib64/libpthread.so.0
#14 0x00007f67b821cc54 in backtrace () from /lib64/libc.so.6
#15 0x00007f67b9b421f7 in gf_print_trace (signum=11, ctx=0x23dc010) at common-utils.c:588
#16 <signal handler called>
#17 list_del (drc=0x2453780, reply=0x7f67aa47e200) at ../../../libglusterfs/src/list.h:51
#18 rpcsvc_drc_op_destroy (drc=0x2453780, reply=0x7f67aa47e200) at rpc-drc.c:53
#19 0x00007f67b9b5dbc1 in rb_destroy (tree=0x7f67aa47e164, destroy=0x7f67b990ee80 <rpcsvc_drc_rb_op_destroy>) at ../../contrib/rbtree/rb.c:876
#20 0x00007f67b990eb5f in rpcsvc_remove_drc_client (drc=0x2453780, client=0x250f390) at rpc-drc.c:84
#21 rpcsvc_drc_client_unref (drc=0x2453780, client=0x250f390) at rpc-drc.c:316
#22 0x00007f67b990efc8 in rpcsvc_vacate_drc_entries (req=0x7f67b2753d00) at rpc-drc.c:466
#23 rpcsvc_add_op_to_cache (req=0x7f67b2753d00) at rpc-drc.c:493
#24 rpcsvc_cache_request (req=0x7f67b2753d00) at rpc-drc.c:549
#25 0x00007f67b990758e in rpcsvc_handle_rpc_call (svc=0x2417990, trans=<value optimized out>, msg=0x6e4faa0) at rpcsvc.c:601
#26 0x00007f67b99077e3 in rpcsvc_notify (trans=0x2c9f8b0, mydata=<value optimized out>, event=<value optimized out>, data=0x6e4faa0)
    at rpcsvc.c:725
#27 0x00007f67b9908f98 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:512
#28 0x00007f67b5166fb5 in socket_event_poll_in (this=0x2c9f8b0) at socket.c:2119
#29 0x00007f67b51689fd in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x2c9f8b0, poll_in=1, poll_out=0, 
    poll_err=0) at socket.c:2232
#30 0x00007f67b9b7d2f7 in event_dispatch_epoll_handler (event_pool=0x23f7ef0) at event-epoll.c:384
#31 event_dispatch_epoll (event_pool=0x23f7ef0) at event-epoll.c:445
#32 0x00000000004075e4 in main (argc=11, argv=0x7fffd5f25aa8) at glusterfsd.c:1983


#17 list_del (drc=0x2453780, reply=0x7f67aa47e200) at ../../../libglusterfs/src/list.h:51
51		old->prev->next = old->next;

(gdb) p *old
$2 = {next = 0xbabebabe, prev = 0xcafecafe}

Comment 2 Raghavendra G 2014-05-20 05:22:47 UTC
The issue is that the duplicate request cache is modified in two paths:

1. In call path, while handling rpc calls.
2. While handling disconnects, to remove all cached requests.

1 and 2 do not synchronize while acting on DRC, resulting in memory corruption. Fix is on the way.

Comment 4 santosh pradhan 2014-05-21 08:52:31 UTC
http://review.gluster.org/#/c/7816/

Comment 5 Niels de Vos 2014-06-08 16:12:48 UTC
(In reply to santosh pradhan from comment #4)
> http://review.gluster.org/#/c/7816/

That is a patch for the master branch. This bug should be use for a backport when that patch gets merged.

Comment 6 Niels de Vos 2014-06-08 17:13:43 UTC
I'm moving this to glusterfs-3.5.2. The patch mentioned in comment #4 is not sufficient to make DRC completely stable. More work needs to be done, and we should not delay 3.5.1 for this.

The (temporary) workaround is to disable nfs.drc altogether.

Comment 7 Eric Schemmerling 2014-08-07 17:21:21 UTC
I too am running into this exact problem with 3.5.0 -  I can't really add any new info.  Just wondering if this fix made it into 3.5.2?

Comment 8 santosh pradhan 2014-08-07 18:24:25 UTC
(In reply to Eric Schemmerling from comment #7)
> I too am running into this exact problem with 3.5.0 -  I can't really add
> any new info.  Just wondering if this fix made it into 3.5.2?

I dont think it made into 3.5.2 release. But in 3.5.2 NFS DRC is turned OFF by default. As the issue was hit in DRC code path, the crash issue wont be seen in 3.5.2 for sure.

Thanks,
Santosh

Comment 9 santosh pradhan 2014-08-07 18:26:51 UTC
(In reply to santosh pradhan from comment #8)
> (In reply to Eric Schemmerling from comment #7)
> > I too am running into this exact problem with 3.5.0 -  I can't really add
> > any new info.  Just wondering if this fix made it into 3.5.2?
> 
> I dont think it made into 3.5.2 release. But in 3.5.2 NFS DRC is turned OFF
> by default. As the issue was hit in DRC code path, the crash issue wont be
> seen in 3.5.2 for sure.

Or you can turn the nfs drc OFF manually to avoid the issue.

steps:
1. gluster volume set <volume name> nfs.drc off
2. gluster volume start <volume name> force

Thanks,
Santosh

Comment 10 Niels de Vos 2015-09-15 14:09:48 UTC
There were no patches submitted in time for the glusterfs-3.5.6 release that should resolve this bug. This bug report is moved for tracking to the glusterfs-3.5.7 release, submitting patches/backports is very much appreciated.

Comment 11 Niels de Vos 2016-06-17 16:24:04 UTC
This bug is getting closed because the 3.5 is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.


Note You need to log in before you can comment on or make changes to this bug.