Bug 1005616

Summary: glusterfs client crash (signal received: 6)
Product: [Community] GlusterFS
Reporter: cailiang.song <gluster>
Component: replicate
Assignee: bugs <bugs>
Status: CLOSED DEFERRED
Severity: high
Priority: unspecified
Version: 3.3.1
CC: bugs, gluster-bugs, social
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2014-12-14 19:40:32 UTC
Type: Bug

Description cailiang.song 2013-09-09 01:22:13 UTC
Description of problem:
I installed GlusterFS 3.3.1 on 24 servers, created a DHT+AFR volume, and mounted it with the native client.
Recently, some glusterfs clients have crashed; the log is below.

The OS is 64-bit CentOS 6.2, kernel version: 2.6.32-220.23.1.el6.x86_64 #1 SMP Fri Jun 28 00:56:49 CST 2013 x86_64 x86_64 x86_64 GNU/Linux


pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2013-09-05 00:37:40
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib64/libc.so.6[0x3ac0232900]
/lib64/libc.so.6(gsignal+0x35)[0x3ac0232885]
/lib64/libc.so.6(abort+0x175)[0x3ac0234065]
/lib64/libc.so.6[0x3ac026f7a7]
/lib64/libc.so.6[0x3ac02750c6]
/usr/lib/libglusterfs.so.0(mem_put+0x64)[0x7f3f99c2c684]
/usr/lib/glusterfs/3.3.1/xlator/cluster/replicate.so(afr_local_cleanup+0x60)[0x7f3f95209c30]
/usr/lib/glusterfs/3.3.1/xlator/cluster/replicate.so(afr_lookup_cbk+0x5a1)[0x7f3f952110f1]
/usr/lib/glusterfs/3.3.1/xlator/protocol/client.so(client3_1_lookup_cbk+0x6b0)[0x7f3f9544b550]
/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7f3f999e44e5]
/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f3f999e4ce0]
/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f3f999dfeb8]
/usr/lib/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f3f96295764]
/usr/lib/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f3f96295847]
/usr/lib/libglusterfs.so.0(+0x3e464)[0x7f3f99c2b464]
/usr/sbin/glusterfs(main+0x58a)[0x40736a]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3ac021ecdd]
/usr/sbin/glusterfs[0x4042d9]
---------
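
For anyone triaging many of these client logs, the frame lines above follow the `module(symbol+offset)[address]` layout, with the `(symbol+offset)` part sometimes absent or missing the symbol name. A minimal Python sketch for splitting such lines into fields is shown here; the `parse_frame` helper and its regular expression are illustrative assumptions and not part of GlusterFS.

```python
import re

# Matches frame lines such as:
#   /usr/lib/libglusterfs.so.0(mem_put+0x64)[0x7f3f99c2c684]
#   /usr/lib/libglusterfs.so.0(+0x3e464)[0x7f3f99c2b464]
#   /lib64/libc.so.6[0x3ac0232900]
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"                                   # binary or shared object path
    r"(?:\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"  # optional (symbol+offset)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"                         # absolute address
)


def parse_frame(line):
    """Return a dict describing one backtrace line, or None if it is not a frame."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    offset = match.group("offset")
    return {
        "module": match.group("module"),
        # An empty symbol, as in "(+0x3e464)", is reported as None.
        "symbol": match.group("symbol") or None,
        "offset": int(offset, 16) if offset is not None else None,
        "address": int(match.group("address"), 16),
    }


if __name__ == "__main__":
    frame = parse_frame("/usr/lib/libglusterfs.so.0(mem_put+0x64)[0x7f3f99c2c684]")
    print(frame["module"], frame["symbol"], hex(frame["offset"]))
```

Grouping the parsed frames by `symbol` across logs from several crashed clients is one way to check whether they all abort in the same call path.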


How reproducible:
Unfortunately, I don't know how to reproduce the issue. However, 1-2 of the 120 clients crash every day.

Additional info:
The gdb backtrace is below:

(gdb) where
#0  0x0000003267432885 in raise () from /lib64/libc.so.6
#1  0x0000003267434065 in abort () from /lib64/libc.so.6
#2  0x000000326746f7a7 in __libc_message () from /lib64/libc.so.6
#3  0x00000032674750c6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007fc4f2847684 in mem_put (ptr=0x7fc4b0a4c03c) at mem-pool.c:559
#5  0x00007fc4f281cc9b in dict_destroy (this=0x7fc4f12cc5cc) at dict.c:397
#6  0x00007fc4ede24c30 in afr_local_cleanup (local=0x7fc4ce68ac20, this=<value optimized out>) at afr-common.c:848
#7  0x00007fc4ede2c0f1 in afr_lookup_done (frame=0x18d5ae4, cookie=0x0, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, inode=0x18d5b20, 
    buf=0x7fffcb83ec50, xattr=0x7fc4f12e1818, postparent=0x7fffcb83ebe0) at afr-common.c:1881
#8  afr_lookup_cbk (frame=0x18d5ae4, cookie=0x0, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, inode=0x18d5b20, buf=0x7fffcb83ec50, 
    xattr=0x7fc4f12e1818, postparent=0x7fffcb83ebe0) at afr-common.c:2044
#9  0x00007fc4ee066550 in client3_1_lookup_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7fc4f16f390c) at client3_1-fops.c:2636
#10 0x00007fc4f25ff4e5 in rpc_clnt_handle_reply (clnt=0x3b5c600, pollin=0x6ba00f0) at rpc-clnt.c:786
#11 0x00007fc4f25ffce0 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x3b5c630, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:905
#12 0x00007fc4f25faeb8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:489
#13 0x00007fc4eeeb0764 in socket_event_poll_in (this=0x3b6c060) at socket.c:1677
#14 0x00007fc4eeeb0847 in socket_event_handler (fd=<value optimized out>, idx=265, data=0x3b6c060, poll_in=1, poll_out=0, poll_err=<value optimized out>) at socket.c:1792
#15 0x00007fc4f2846464 in event_dispatch_epoll_handler (event_pool=0x177cdf0) at event.c:785
#16 event_dispatch_epoll (event_pool=0x177cdf0) at event.c:847
#17 0x000000000040736a in main (argc=<value optimized out>, argv=0x7fffcb83efc8) at glusterfsd.c:1689

Comment 1 Lukas Bezdicka 2013-09-09 08:04:19 UTC
I think we had this one; what helped us was switching to 3.4.0.

Comment 2 cailiang.song 2013-12-03 06:08:25 UTC
Another kind of client crash has occurred; the gdb information is below for your reference:

Core was generated by `/usr/sbin/glusterfs --log-level=INFO --volfile-id=gfs6 --volfile-server=bj-nx-c'.
Program terminated with signal 11, Segmentation fault.
#0  afr_frame_return (frame=<value optimized out>) at afr-common.c:983
983	                call_count = --local->call_count;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) where
#0  afr_frame_return (frame=<value optimized out>) at afr-common.c:983
#1  0x00007f8aa1c1ebbc in afr_sh_entry_impunge_parent_setattr_cbk (setattr_frame=0x7f8aa525b248, cookie=<value optimized out>, this=0x1a82e00, op_ret=<value optimized out>, 
    op_errno=<value optimized out>, preop=<value optimized out>, postop=0x0, xdata=0x0) at afr-self-heal-entry.c:970
#2  0x00007f8aa1e5fecb in client3_1_setattr (frame=0x7f8aa54ec634, this=<value optimized out>, data=<value optimized out>) at client3_1-fops.c:5801
#3  0x00007f8aa1e58b41 in client_setattr (frame=0x7f8aa54ec634, this=<value optimized out>, loc=<value optimized out>, stbuf=<value optimized out>, valid=<value optimized out>, 
    xdata=<value optimized out>) at client.c:1915
#4  0x00007f8aa1c1f080 in afr_sh_entry_impunge_setattr (impunge_frame=0x7f8aa5454e10, this=<value optimized out>) at afr-self-heal-entry.c:1017
#5  0x00007f8aa1c1f5c0 in afr_sh_entry_impunge_xattrop_cbk (impunge_frame=0x7f8aa5454e10, cookie=0x1, this=0x1a82e00, op_ret=<value optimized out>, op_errno=22, xattr=<value optimized out>, 
    xdata=0x0) at afr-self-heal-entry.c:1067
#6  0x00007f8aa1e6b34e in client3_1_xattrop_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f8aa54ad5b8) at client3_1-fops.c:1715
#7  0x00000037eba0f4e5 in rpc_clnt_handle_reply (clnt=0x1eaccd0, pollin=0x2fba390) at rpc-clnt.c:786
#8  0x00000037eba0fce0 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x1eacd00, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:905
#9  0x00000037eba0aeb8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:489
#10 0x00007f8aa2cb5764 in socket_event_poll_in (this=0x1ebc730) at socket.c:1677
#11 0x00007f8aa2cb5847 in socket_event_handler (fd=<value optimized out>, idx=127, data=0x1ebc730, poll_in=1, poll_out=0, poll_err=<value optimized out>) at socket.c:1792
#12 0x00000037eb63e464 in event_dispatch_epoll_handler (event_pool=0x19eddf0) at event.c:785
#13 event_dispatch_epoll (event_pool=0x19eddf0) at event.c:847
#14 0x000000000040736a in main (argc=<value optimized out>, argv=0x7fff26cdcd78) at glusterfsd.c:1689

Comment 3 Niels de Vos 2014-08-29 09:27:22 UTC
Comment #2 contains many afr_* calls, setting component to replicate.
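
The observation about afr_* calls can be checked mechanically against a gdb `where` listing. The Python sketch below is only an illustration: the helper names `frame_functions` and `functions_with_prefix` are hypothetical, and the sample is a three-frame excerpt of the backtrace in comment #2.

```python
import re

# A gdb frame line starts with "#N", optionally followed by "0xADDR in",
# then the function name and its argument list.
GDB_FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\w+) \(")

# Excerpt (frames #0, #4 and #7) copied from the backtrace in comment #2.
SAMPLE = """\
#0  afr_frame_return (frame=<value optimized out>) at afr-common.c:983
#4  0x00007f8aa1c1f080 in afr_sh_entry_impunge_setattr (impunge_frame=0x7f8aa5454e10, this=<value optimized out>) at afr-self-heal-entry.c:1017
#7  0x00000037eba0f4e5 in rpc_clnt_handle_reply (clnt=0x1eaccd0, pollin=0x2fba390) at rpc-clnt.c:786
"""


def frame_functions(gdb_output):
    """Return the function name of every frame line, in stack order."""
    names = []
    for line in gdb_output.splitlines():
        match = GDB_FRAME_RE.match(line)
        if match:  # wrapped argument lines do not start with "#N" and are skipped
            names.append(match.group(2))
    return names


def functions_with_prefix(gdb_output, prefix):
    """Return only the frame functions whose name starts with the given prefix."""
    return [name for name in frame_functions(gdb_output) if name.startswith(prefix)]


if __name__ == "__main__":
    print(functions_with_prefix(SAMPLE, "afr_"))
```

Running the same helpers over the full listing from a core file gives a quick view of which translator dominates the stack.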

Comment 4 Niels de Vos 2014-11-27 14:54:33 UTC
The version that this bug was reported against no longer receives updates from the Gluster Community. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will be closed automatically.