Description of problem:
I installed GlusterFS 3.3.1 on 24 servers, created a DHT+AFR volume, and mounted it with the native client. Recently some glusterfs clients have crashed; the log is below. The OS is 64-bit CentOS 6.2, kernel version: 2.6.32-220.23.1.el6.x86_64 #1 SMP Fri Jun 28 00:56:49 CST 2013 x86_64 x86_64 x86_64 GNU/Linux

pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2013-09-05 00:37:40
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib64/libc.so.6[0x3ac0232900]
/lib64/libc.so.6(gsignal+0x35)[0x3ac0232885]
/lib64/libc.so.6(abort+0x175)[0x3ac0234065]
/lib64/libc.so.6[0x3ac026f7a7]
/lib64/libc.so.6[0x3ac02750c6]
/usr/lib/libglusterfs.so.0(mem_put+0x64)[0x7f3f99c2c684]
/usr/lib/glusterfs/3.3.1/xlator/cluster/replicate.so(afr_local_cleanup+0x60)[0x7f3f95209c30]
/usr/lib/glusterfs/3.3.1/xlator/cluster/replicate.so(afr_lookup_cbk+0x5a1)[0x7f3f952110f1]
/usr/lib/glusterfs/3.3.1/xlator/protocol/client.so(client3_1_lookup_cbk+0x6b0)[0x7f3f9544b550]
/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7f3f999e44e5]
/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f3f999e4ce0]
/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f3f999dfeb8]
/usr/lib/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f3f96295764]
/usr/lib/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f3f96295847]
/usr/lib/libglusterfs.so.0(+0x3e464)[0x7f3f99c2b464]
/usr/sbin/glusterfs(main+0x58a)[0x40736a]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3ac021ecdd]
/usr/sbin/glusterfs[0x4042d9]
---------

Version-Release number of selected component (if applicable):

How reproducible:
Unfortunately I don't know how to reproduce the issue, but 1-2 out of our 120 clients crash every day.
Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Below is the gdb backtrace:

(gdb) where
#0  0x0000003267432885 in raise () from /lib64/libc.so.6
#1  0x0000003267434065 in abort () from /lib64/libc.so.6
#2  0x000000326746f7a7 in __libc_message () from /lib64/libc.so.6
#3  0x00000032674750c6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007fc4f2847684 in mem_put (ptr=0x7fc4b0a4c03c) at mem-pool.c:559
#5  0x00007fc4f281cc9b in dict_destroy (this=0x7fc4f12cc5cc) at dict.c:397
#6  0x00007fc4ede24c30 in afr_local_cleanup (local=0x7fc4ce68ac20, this=<value optimized out>) at afr-common.c:848
#7  0x00007fc4ede2c0f1 in afr_lookup_done (frame=0x18d5ae4, cookie=0x0, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, inode=0x18d5b20, buf=0x7fffcb83ec50, xattr=0x7fc4f12e1818, postparent=0x7fffcb83ebe0) at afr-common.c:1881
#8  afr_lookup_cbk (frame=0x18d5ae4, cookie=0x0, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, inode=0x18d5b20, buf=0x7fffcb83ec50, xattr=0x7fc4f12e1818, postparent=0x7fffcb83ebe0) at afr-common.c:2044
#9  0x00007fc4ee066550 in client3_1_lookup_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7fc4f16f390c) at client3_1-fops.c:2636
#10 0x00007fc4f25ff4e5 in rpc_clnt_handle_reply (clnt=0x3b5c600, pollin=0x6ba00f0) at rpc-clnt.c:786
#11 0x00007fc4f25ffce0 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x3b5c630, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:905
#12 0x00007fc4f25faeb8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:489
#13 0x00007fc4eeeb0764 in socket_event_poll_in (this=0x3b6c060) at socket.c:1677
#14 0x00007fc4eeeb0847 in socket_event_handler (fd=<value optimized out>, idx=265, data=0x3b6c060, poll_in=1, poll_out=0, poll_err=<value optimized out>) at socket.c:1792
#15 0x00007fc4f2846464 in event_dispatch_epoll_handler (event_pool=0x177cdf0) at event.c:785
#16 event_dispatch_epoll (event_pool=0x177cdf0) at event.c:847
#17 0x000000000040736a in main (argc=<value optimized out>, argv=0x7fffcb83efc8) at glusterfsd.c:1689
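A note on what the trace suggests: an abort (signal 6) raised from glibc's malloc_printerr underneath mem_put/dict_destroy is the classic signature of heap or object-pool corruption, for example the same pooled object being released twice by two cleanup paths. The sketch below is NOT GlusterFS code; it is a deliberately simplified, hypothetical pool (invented names `pool_obj_t`, `mem_get_sketch`, `mem_put_sketch`) that shows how a guarded "put" can detect a double release instead of letting the allocator state get corrupted and abort later:

```c
#include <stddef.h>

/* Hypothetical fixed-size object pool, loosely in the spirit of a mem-pool.
 * The real GlusterFS mem-pool (mem-pool.c) is different; this only
 * illustrates the double-release failure mode behind a mem_put abort. */
typedef struct {
    int  in_use;      /* 1 while handed out to a caller */
    char data[64];
} pool_obj_t;

#define POOL_SIZE 4
static pool_obj_t pool[POOL_SIZE];

/* Hand out a free slot, or NULL if the pool is exhausted. */
pool_obj_t *mem_get_sketch(void) {
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    }
    return NULL;
}

/* Return a slot to the pool. Returning the same slot twice (or a stray
 * pointer) is detected and reported instead of corrupting pool state --
 * the unguarded equivalent is what ends in glibc's abort. */
int mem_put_sketch(pool_obj_t *obj) {
    if (obj == NULL || !obj->in_use)
        return -1;   /* double put or foreign pointer detected */
    obj->in_use = 0;
    return 0;
}
```

Calling `mem_put_sketch` a second time on the same object returns -1 rather than silently clobbering the free list, which is the kind of invariant check that turns a delayed crash into an immediate, debuggable error.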
I think we hit this one as well; what helped us was switching to 3.4.0.
Another kind of client crash has happened; the gdb information is below for your reference:

Core was generated by `/usr/sbin/glusterfs --log-level=INFO --volfile-id=gfs6 --volfile-server=bj-nx-c'.
Program terminated with signal 11, Segmentation fault.
#0  afr_frame_return (frame=<value optimized out>) at afr-common.c:983
983         call_count = --local->call_count;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6.x86_64 zlib-1.2.3-27.el6.x86_64

(gdb) where
#0  afr_frame_return (frame=<value optimized out>) at afr-common.c:983
#1  0x00007f8aa1c1ebbc in afr_sh_entry_impunge_parent_setattr_cbk (setattr_frame=0x7f8aa525b248, cookie=<value optimized out>, this=0x1a82e00, op_ret=<value optimized out>, op_errno=<value optimized out>, preop=<value optimized out>, postop=0x0, xdata=0x0) at afr-self-heal-entry.c:970
#2  0x00007f8aa1e5fecb in client3_1_setattr (frame=0x7f8aa54ec634, this=<value optimized out>, data=<value optimized out>) at client3_1-fops.c:5801
#3  0x00007f8aa1e58b41 in client_setattr (frame=0x7f8aa54ec634, this=<value optimized out>, loc=<value optimized out>, stbuf=<value optimized out>, valid=<value optimized out>, xdata=<value optimized out>) at client.c:1915
#4  0x00007f8aa1c1f080 in afr_sh_entry_impunge_setattr (impunge_frame=0x7f8aa5454e10, this=<value optimized out>) at afr-self-heal-entry.c:1017
#5  0x00007f8aa1c1f5c0 in afr_sh_entry_impunge_xattrop_cbk (impunge_frame=0x7f8aa5454e10, cookie=0x1, this=0x1a82e00, op_ret=<value optimized out>, op_errno=22, xattr=<value optimized out>, xdata=0x0) at afr-self-heal-entry.c:1067
#6  0x00007f8aa1e6b34e in client3_1_xattrop_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f8aa54ad5b8) at client3_1-fops.c:1715
#7  0x00000037eba0f4e5 in rpc_clnt_handle_reply (clnt=0x1eaccd0, pollin=0x2fba390) at rpc-clnt.c:786
#8  0x00000037eba0fce0 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x1eacd00, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:905
#9  0x00000037eba0aeb8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:489
#10 0x00007f8aa2cb5764 in socket_event_poll_in (this=0x1ebc730) at socket.c:1677
#11 0x00007f8aa2cb5847 in socket_event_handler (fd=<value optimized out>, idx=127, data=0x1ebc730, poll_in=1, poll_out=0, poll_err=<value optimized out>) at socket.c:1792
#12 0x00000037eb63e464 in event_dispatch_epoll_handler (event_pool=0x19eddf0) at event.c:785
#13 event_dispatch_epoll (event_pool=0x19eddf0) at event.c:847
#14 0x000000000040736a in main (argc=<value optimized out>, argv=0x7fff26cdcd78) at glusterfsd.c:1689
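Here the SIGSEGV is on `call_count = --local->call_count;`, which dereferences `frame->local`; a plausible reading is that `local` was NULL (or already torn down by a racing cleanup on another code path) when the self-heal callback unwound. The sketch below is NOT the real GlusterFS types or fix; it uses invented, simplified structs (`afr_local_sketch_t`, `frame_sketch_t`, `frame_return_sketch`) only to show the defensive shape of the decrement:

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for call_frame_t and
 * afr_local_t. The real definitions live in the GlusterFS headers. */
typedef struct {
    int call_count;   /* outstanding child replies still expected */
} afr_local_sketch_t;

typedef struct {
    afr_local_sketch_t *local;   /* per-operation state; may be gone */
} frame_sketch_t;

/* afr-common.c:983 performs "--local->call_count" with no NULL check;
 * if local has already been released, that line is exactly the segfault
 * in frame #0 above. A guarded variant fails loudly instead: */
int frame_return_sketch(frame_sketch_t *frame) {
    if (frame == NULL || frame->local == NULL)
        return -1;                       /* local already cleaned up */
    return --frame->local->call_count;   /* replies still outstanding */
}
```

The return value mirrors the original semantics (number of replies still pending, with 0 meaning "last reply, finish the operation"), while a -1 sentinel flags the already-destroyed case that would otherwise crash.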
Comment #2 contains many afr_* calls, setting component to replicate.
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current release (3.4, 3.5, or 3.6) and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will be closed automatically.