Description of problem:

Found a core in the "/var/log/core" directory of one of the machines in a 2x2 distributed-replicate volume. A lot of self-heal operations had been performed, and it is not clear which operation caused this crash.

Version-Release number of selected component (if applicable):

[root@hicks entries]# gluster --version
glusterfs 3.3.0rhs built on Sep 10 2012 00:49:11
(glusterfs-rdma-3.3.0rhs-28.el6rhs.x86_64)

Backtrace:
==========

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'.
Program terminated with signal 6, Aborted.
#0  0x00007f8b1ddc38a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.5.x86_64 libgcc-4.4.6-4.el6.x86_64 openssl-1.0.0-25.el6_3.1.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x00007f8b1ddc38a5 in raise () from /lib64/libc.so.6
#1  0x00007f8b1ddc5085 in abort () from /lib64/libc.so.6
#2  0x00007f8b1de00fe7 in __libc_message () from /lib64/libc.so.6
#3  0x00007f8b1de06916 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f8b1de0a394 in _int_malloc () from /lib64/libc.so.6
#5  0x00007f8b1de0b141 in malloc () from /lib64/libc.so.6
#6  0x00007f8b1de92738 in __vasprintf_chk () from /lib64/libc.so.6
#7  0x0000003032e19520 in vasprintf (domain=0x40ea58 "glusterfsd-mgmt", file=0x40e9c6 "glusterfsd-mgmt.c", function=0x40f5f0 "is_graph_topology_equal", line=1437, level=GF_LOG_DEBUG, fmt=0x40ee20 "graphs are not equal") at /usr/include/bits/stdio2.h:199
#8  _gf_log (domain=0x40ea58 "glusterfsd-mgmt", file=0x40e9c6 "glusterfsd-mgmt.c", function=0x40f5f0 "is_graph_topology_equal", line=1437, level=GF_LOG_DEBUG, fmt=0x40ee20 "graphs are not equal") at logging.c:565
#9  0x000000000040cb3d in is_graph_topology_equal (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f8b1cddb904) at glusterfsd-mgmt.c:1436
#10 glusterfs_volfile_reconfigure (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f8b1cddb904) at glusterfsd-mgmt.c:1487
#11 mgmt_getspec_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f8b1cddb904) at glusterfsd-mgmt.c:1589
#12 0x000000303360f0c5 in rpc_clnt_handle_reply (clnt=0x2732a10, pollin=0x313e580) at rpc-clnt.c:788
#13 0x000000303360f8c0 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x2732a40, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:907
#14 0x000000303360b018 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:489
#15 0x00007f8b1aa5f954 in socket_event_poll_in (this=0x2737640) at socket.c:1677
#16 0x00007f8b1aa5fa37 in socket_event_handler (fd=<value optimized out>, idx=5, data=0x2737640, poll_in=1, poll_out=0, poll_err=<value optimized out>) at socket.c:1792
#17 0x0000003032e3ed84 in event_dispatch_epoll_handler (event_pool=0x2720630) at event.c:785
#18 event_dispatch_epoll (event_pool=0x2720630) at event.c:847
#19 0x00000000004073ca in main (argc=<value optimized out>, argv=0x7fff018d6de8) at glusterfsd.c:1689
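(Note on reading this backtrace: the abort happens inside glibc's malloc while the logging call in _gf_log() is allocating, which usually means the heap was corrupted earlier by some other code path; the logging frame is merely the first allocation to trip glibc's consistency checks. A minimal, hypothetical C sketch of that failure mode, not glusterfs code:)

/* Deliberately corrupt heap metadata, then allocate again.
 * glibc may then abort via malloc_printerr(), much like the
 * backtrace above where vasprintf()'s malloc() hits the check. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
        char *buf = malloc(16);

        /* Overrun the 16-byte allocation, trampling the metadata
         * of the neighbouring heap chunk (undefined behaviour). */
        memset(buf, 'A', 32);

        /* A later, unrelated allocation, like the one inside
         * _gf_log() in frames #7/#8, can trip the corruption
         * check and kill the process with SIGABRT. */
        char *other = malloc(64);

        free(other);
        free(buf);
        return 0;
}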
Looking at the backtrace, I didn't notice anything serious... Was the memory usage very high at the time? Do you still have the core? I'd like the output of 'thread apply all bt full'...
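(For reference, one way to capture that output from the core; the core-file name under /var/log/core and the log file name "tabf.out" are placeholders, and the debuginfo packages are the ones gdb itself suggested above:)

debuginfo-install glibc-2.12-1.80.el6_3.5.x86_64 libgcc-4.4.6-4.el6.x86_64 openssl-1.0.0-25.el6_3.1.x86_64 zlib-1.2.3-27.el6.x86_64

gdb /usr/sbin/glusterfs /var/log/core/<core-file>
(gdb) set logging file tabf.out
(gdb) set logging on
(gdb) thread apply all bt full
(gdb) set logging off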
I am not sure about the memory usage at the time of the core. I am attaching the output of the 'thread apply all bt full' command. Let me know if you want the complete core file.
Created attachment 619621: 'thread apply all bt full' output
Were any volume set operations running? Was the volume NFS-mounted anywhere? The crashed process appears to be the NFS server.
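(Context for the question: frames #9-#11 of the backtrace, mgmt_getspec_cbk -> glusterfs_volfile_reconfigure -> is_graph_topology_equal, are the path the NFS server takes when glusterd hands it a regenerated volfile, and volfiles are regenerated by volume set operations. A hypothetical example of a command that would drive that path, with <volname> as a placeholder:)

gluster volume set <volname> performance.cache-size 256MB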
Also, can you attach the logs of the crashed glusterfs process?
The VM has been re-provisioned, hence I cannot provide any further logs.
This is no longer seen with the master branch; similar tests have been running in the longevity test-bed for more than 2 weeks without hitting the issue. Marking it as WORKSFORME (with Fixed-in-version 3.3.0.5rhs-36); please feel free to reopen if it is seen again.