Description of problem:
The FUSE-mounted Gluster volume became unavailable, reporting "stat: cannot stat ...: Transport endpoint is not connected". The logs showed that the client process crashed (see Additional info).

Version-Release number of selected component (if applicable):
glusterfs.x86_64                 3.8.4-1.fc24  @updates
glusterfs-api.x86_64             3.8.4-1.fc24  @updates
glusterfs-cli.x86_64             3.8.4-1.fc24  @updates
glusterfs-client-xlators.x86_64  3.8.4-1.fc24  @updates
glusterfs-fuse.x86_64            3.8.4-1.fc24  @updates
glusterfs-libs.x86_64            3.8.4-1.fc24  @updates
glusterfs-server.x86_64          3.8.4-1.fc24  @updates

How reproducible:
Quite easily; I would say it is reproducible in about 30% of attempts.

Steps to Reproduce:
1. Set up 6 FUSE mount points from different nodes to a single Gluster volume.
2. Start very heavy read/write traffic through each of the mount points (approx. 1050 Mbit/s of aggregate write traffic and the same amount of aggregate read traffic across all the mount points together).
3. Let the volume slowly fill with data.
4. At least one of the mount points will go down the moment the volume gets full.
(A minimal reproduction sketch is included after the crash log below.)

Actual results:
At least one of the 6 mount points goes down.

Expected results:
The client keeps responding correctly to POSIX calls, just as it does on a normally filled volume; once there is free space again, things simply keep working, as they do on the non-crashed clients.

Additional info:
[2016-10-11 06:36:14.835862] W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-2: remote operation failed [No space left on device]
The message "W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-0: remote operation failed [No space left on device]" repeated 4 times between [2016-10-11 06:36:08.285146] and [2016-10-11 06:36:13.803409]
The message "W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-2: remote operation failed [No space left on device]" repeated 12 times between [2016-10-11 06:36:14.835862] and [2016-10-11 06:36:14.840894]
pending frames:
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(STATFS)
frame : type(1) op(LOOKUP)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(FLUSH)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2016-10-11 06:36:14
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x7e)[0x7efd2ddc31fe]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7efd2ddcc974]
/lib64/libc.so.6(+0x34ed0)[0x7efd2c428ed0]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x68f7)[0x7efd25dd98f7]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x6b5b)[0x7efd25dd9b5b]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x6c37)[0x7efd25dd9c37]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x51ed1)[0x7efd26035ed1]
/usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so(+0x16f97)[0x7efd26281f97]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7efd2db8e970]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x27c)[0x7efd2db8ecec]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7efd2db8b073]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x8ac9)[0x7efd28788ac9]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x8cb8)[0x7efd28788cb8]
/lib64/libglusterfs.so.0(+0x7a42a)[0x7efd2de1642a]
/lib64/libpthread.so.0(+0x75ba)[0x7efd2cc1e5ba]
/lib64/libc.so.6(clone+0x6d)[0x7efd2c4f77cd]
---------
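Reproduction sketch (not the exact commands used in the original environment; the server name, mount path, file names and block sizes are assumptions, only the volume name "ramcache" is taken from the log excerpt above):

# On each of the 6 client nodes, mount the volume over FUSE:
mount -t glusterfs gluster-server1:/ramcache /mnt/ramcache

# Sustained write load until the volume fills up:
while true; do
    dd if=/dev/zero of=/mnt/ramcache/load-$(hostname)-$RANDOM bs=1M count=512
done &

# Sustained read load in parallel:
while true; do
    for f in /mnt/ramcache/load-*; do
        dd if="$f" of=/dev/null bs=1M
    done
    sleep 1
done &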
Raghavendra, have you seen something like this before?
(In reply to Niels de Vos from comment #1)
> Raghavendra, have you seen something like this before?

It's difficult to say, as the backtrace doesn't include any symbols. Logs taken after installing the gluster debuginfo packages, or a backtrace obtained through gdb, would have helped. Is it possible to get them? That said, there are some fixes to write-behind that might have addressed memory corruptions (though I am not sure whether this bug is the same issue):
https://review.gluster.org/16464
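A rough sketch of how to collect that on Fedora (this assumes the crashing FUSE mount process leaves a core dump behind; the core file path below is only a placeholder):

# Install debug symbols (needs the dnf debuginfo-install plugin from dnf-plugins-core):
dnf debuginfo-install glusterfs glusterfs-fuse glusterfs-client-xlators glusterfs-libs

# Open the core dump with the FUSE client binary:
gdb /usr/sbin/glusterfs /path/to/core

# Inside gdb, capture full backtraces of all threads:
(gdb) thread apply all bt full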
(In reply to Raghavendra G from comment #2)
> (In reply to Niels de Vos from comment #1)
> > Raghavendra, have you seen something like this before?
>
> It's difficult to say, as the backtrace doesn't include any symbols. Logs
> taken after installing the gluster debuginfo packages, or a backtrace
> obtained through gdb, would have helped. Is it possible to get them? That
> said, there are some fixes to write-behind that might have addressed memory
> corruptions (though I am not sure whether this bug is the same issue):
> https://review.gluster.org/16464

Looking at the bug title, this patch could be related, as it fixes a memory corruption in the code path where we encounter short writes.
(In reply to Raghavendra G from comment #3)
> (In reply to Raghavendra G from comment #2)
> > (In reply to Niels de Vos from comment #1)
> > > Raghavendra, have you seen something like this before?
> >
> > It's difficult to say, as the backtrace doesn't include any symbols. Logs
> > taken after installing the gluster debuginfo packages, or a backtrace
> > obtained through gdb, would have helped. Is it possible to get them? That
> > said, there are some fixes to write-behind that might have addressed
> > memory corruptions (though I am not sure whether this bug is the same
> > issue):
> > https://review.gluster.org/16464
>
> Looking at the bug title, this patch could be related, as it fixes a memory
> corruption in the code path where we encounter short writes.

Unfortunately, we no longer have a setup with that version and that type of load (we have moved away from Gluster for that use case because we ran into many more problems and instabilities as well), so I cannot provide you with more information. As for whether the above-mentioned patch set solves the problem, I am afraid I cannot easily say without reading a lot of the code there. I trust your expertise and the judgement that comes with your professional experience. :)
Closing, as the setup exhibiting the issue has been decommissioned.