Created attachment 1580779 [details]
Gluster Client Log

Description of problem:
During a large write, a 42-second disconnect error appeared in the logs. This happens from time to time and normally recovers, but this time the client glusterfs process crashed about 10 seconds later. The error in the client logs was the following:

[2019-06-11 15:31:42.794126] I [MSGID: 114018] [client.c:2254:client_rpc_notify] 0-somecompany-client-1: disconnected from somecompany-client-1. Client process will keep trying to connect to glusterd until brick's port is available

pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(WRITE)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-06-11 15:31:53
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.6
/lib64/libglusterfs.so.0(+0x25940)[0x7f66fd4ee940]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f66fd4f88a4]
/lib64/libc.so.6(+0x36280)[0x7f66fbb53280]
/usr/lib64/glusterfs/4.1.6/xlator/protocol/client.so(+0x615e3)[0x7f66f60e35e3]
/lib64/libgfrpc.so.0(+0xec20)[0x7f66fd2bbc20]
/lib64/libgfrpc.so.0(+0xefb3)[0x7f66fd2bbfb3]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f66fd2b7e93]
/usr/lib64/glusterfs/4.1.6/rpc-transport/socket.so(+0x7636)[0x7f66f83cb636]
/usr/lib64/glusterfs/4.1.6/rpc-transport/socket.so(+0xa107)[0x7f66f83ce107]
/lib64/libglusterfs.so.0(+0x890c4)[0x7f66fd5520c4]
/lib64/libpthread.so.0(+0x7dd5)[0x7f66fc352dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f66fbc1aead]

Version-Release number of selected component (if applicable):
Gluster 4.1.7
CentOS 7.6.1810 (Core)

How reproducible:
Not really sure, but we believe it has something to do with a very large write (~1-3 GB). During that time, either the I/O or the network was busy, causing the 42-second disconnect.

This was a 3-brick setup with one of the bricks being an arbiter brick. The primary EC2 instance had one of the data bricks and the arbiter brick, and the secondary had just the other data brick. Both had a FUSE client mount connected to the volume. The primary server was the one doing the large write at the time, and the primary's glusterfs client was the one that crashed; afterwards we could not access the files on its mount ("Transport endpoint is not connected"). The secondary's glusterfs client was still able to access the files. "gluster volume status" showed that all the bricks were up and running.

We were able to unmount and remount the client later, but at that point we were unsure whether the services using the mount held stale file handles, so we restarted the servers to make sure everything was okay.
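For reference, the recovery we performed was roughly the following; the mount point and the server/volume names below are placeholders, not our real ones:

# lazy-unmount the dead FUSE mount (mount point is a placeholder)
umount -l /mnt/glusterfs
# remount the volume; "primary-server" and "somecompany" are placeholder names
mount -t glusterfs primary-server:/somecompany /mnt/glusterfs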
Sadly, the coredump was corrupted and was not recoverable (unrelated).

Steps to Reproduce:
1. N/A

Actual results:
Client glusterfs process crashed and did not recover, so we were unable to access the files on the mount.

Expected results:
Client glusterfs process does not crash, so that we are able to access the files on the mount. Or it crashes and there is a way to recover the mount without having to remount.

Additional info:
Servers have been up for a few weeks with similar load, but have had no issues until now.
We would appreciate it if you could provide the output of 'thread apply all bt full' from `$ gdb -c <corefile>`.

Also, there were many stability fixes in the glusterfs-5 and glusterfs-6 series. It would be great if you could upgrade to the latest release.
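For example, something like the following (assuming the core came from the fuse mount process; /usr/sbin/glusterfs is the usual binary, adjust the paths to your system):

$ gdb /usr/sbin/glusterfs -c /path/to/corefile
(gdb) thread apply all bt full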
(In reply to Amar Tumballi from comment #1)
> We would appreciate it if you could provide the output of 'thread apply all
> bt full' from `$ gdb -c <corefile>`.
>
> Also, there were many stability fixes in the glusterfs-5 and glusterfs-6
> series. It would be great if you could upgrade to the latest release.

Sadly, we corrupted our core dump, and restarting the site removed a good portion of our logs, so we don't really have much for debugging. We weren't sure if there was anything in the stack trace that could be used to tell us why it crashed.

We usually upgrade to the latest long-term release unless there is a CVE, or there is a good chance that a critical bug has been fixed in a short-term release but not in the long-term release (which hasn't happened yet).
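(For anyone looking at the trace without a core: the raw offsets can in principle be resolved against the installed binaries with addr2line, assuming the matching glusterfs debuginfo package is installed; without debuginfo it just prints '??'. The path and offset below are taken straight from the trace.)

$ addr2line -f -C -e /usr/lib64/glusterfs/4.1.6/xlator/protocol/client.so 0x615e3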
This bug has been moved to https://github.com/gluster/glusterfs/issues/888 and will be tracked there from now on. Visit the GitHub issue URL for further details.