Bug 1383597

Summary: Crash when drive became full
Product: [Community] GlusterFS
Component: write-behind
Version: 3.8
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Reporter: Pavel Černohorský <pavel.cernohorsky>
Assignee: Csaba Henk <csaba>
CC: bugs, mliyazud, ndevos, pavel.cernohorsky, rgowdapp
Keywords: Triaged
Type: Bug
Last Closed: 2017-09-06 12:00:21 UTC

Description Pavel Černohorský 2016-10-11 07:54:02 UTC
Description of problem:
The FUSE-mounted Gluster volume became unavailable, reporting "stat: cannot stat ...: Transport endpoint is not connected". The logs showed that something crashed (see Additional info).

Version-Release number of selected component (if applicable):
glusterfs.x86_64                     3.8.4-1.fc24               @updates        
glusterfs-api.x86_64                 3.8.4-1.fc24               @updates        
glusterfs-cli.x86_64                 3.8.4-1.fc24               @updates        
glusterfs-client-xlators.x86_64      3.8.4-1.fc24               @updates        
glusterfs-fuse.x86_64                3.8.4-1.fc24               @updates        
glusterfs-libs.x86_64                3.8.4-1.fc24               @updates        
glusterfs-server.x86_64              3.8.4-1.fc24               @updates

How reproducible:
Quite easily; I would say it reproduces in about 30% of attempts.

Steps to Reproduce:
1. Set up 6 FUSE mount points from different nodes to a single Gluster volume.
2. Start very heavy read/write traffic through each of the mount points (approx. 1050 Mbit/s of cumulative write traffic and the same amount of cumulative read traffic across all the mount points together); a rough load-generator sketch follows this list.
3. Let the volume slowly fill with data.
4. At least one of the mount points will go down the moment the volume gets full.
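
The following is a minimal, hypothetical Python sketch of the kind of traffic described above. The mount-point path, file size, thread count and chunk size are illustrative assumptions, not values from the original setup; one instance would be run against each of the 6 FUSE mount points.

#!/usr/bin/env python3
"""Rough load-generator sketch (assumed parameters, not the original setup)."""
import errno
import os
import random
import threading
import time

MOUNT = "/mnt/ramcache"       # hypothetical FUSE mount point
FILE_SIZE = 64 * 1024 * 1024  # 64 MiB per file, illustrative
THREADS = 4                   # writer/reader pairs per mount point
CHUNK = 1024 * 1024           # 1 MiB per write/read call

def writer(idx: int) -> None:
    """Keep writing files until the volume fills, then keep hammering it."""
    n = 0
    while True:
        path = os.path.join(MOUNT, f"load-{idx}-{n}.bin")
        try:
            with open(path, "wb") as f:
                for _ in range(FILE_SIZE // CHUNK):
                    f.write(os.urandom(CHUNK))
        except OSError as e:
            if e.errno != errno.ENOSPC:
                raise
            # Volume is full -- the condition under which at least one
            # mount point was observed to go down.
        n += 1

def reader(idx: int) -> None:
    """Continuously read back whatever load files already exist."""
    while True:
        names = [n for n in os.listdir(MOUNT) if n.startswith("load-")]
        if not names:
            time.sleep(0.1)
            continue
        path = os.path.join(MOUNT, random.choice(names))
        try:
            with open(path, "rb") as f:
                while f.read(CHUNK):
                    pass
        except OSError:
            pass  # file may have been removed, or the mount may have died

if __name__ == "__main__":
    for i in range(THREADS):
        threading.Thread(target=writer, args=(i,), daemon=True).start()
        threading.Thread(target=reader, args=(i,), daemon=True).start()
    threading.Event().wait()  # run until interrupted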

Actual results:
At least one of the 6 mount points goes down.

Expected results:
The mount should keep responding correctly to POSIX calls, as it does on a nearly full volume; once free space becomes available again, operations should simply continue working, as they do on the clients that did not crash.
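
As a rough illustration of that expectation (a minimal sketch with a hypothetical mount-point path, not taken from the report): once the volume is full, writes should fail cleanly with ENOSPC while the mount itself keeps answering POSIX calls such as stat(), instead of returning "Transport endpoint is not connected".

import errno
import os

MOUNT = "/mnt/ramcache"  # hypothetical FUSE mount point

try:
    with open(os.path.join(MOUNT, "probe.bin"), "wb") as f:
        f.write(b"x" * (4 * 1024 * 1024))
except OSError as e:
    # Failing with ENOSPC on a full volume is acceptable behaviour.
    assert e.errno == errno.ENOSPC, e

# The mount point must still answer metadata calls instead of
# "Transport endpoint is not connected"...
os.stat(MOUNT)

# ...and once space is freed again, writes should simply start working.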

Additional info:
[2016-10-11 06:36:14.835862] W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-2: remote operation failed [No space left on device]
The message "W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-0: remote operation failed [No space left on device]" repeated 4 times between [2016-10-11 06:36:08.285146] and [2016-10-11 06:36:13.803409]
The message "W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-2: remote operation failed [No space left on device]" repeated 12 times between [2016-10-11 06:36:14.835862] and [2016-10-11 06:36:14.840894]
pending frames:
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(STATFS)
frame : type(1) op(LOOKUP)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(FLUSH)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2016-10-11 06:36:14
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x7e)[0x7efd2ddc31fe]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7efd2ddcc974]
/lib64/libc.so.6(+0x34ed0)[0x7efd2c428ed0]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x68f7)[0x7efd25dd98f7]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x6b5b)[0x7efd25dd9b5b]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x6c37)[0x7efd25dd9c37]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x51ed1)[0x7efd26035ed1]
/usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so(+0x16f97)[0x7efd26281f97]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7efd2db8e970]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x27c)[0x7efd2db8ecec]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7efd2db8b073]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x8ac9)[0x7efd28788ac9]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x8cb8)[0x7efd28788cb8]
/lib64/libglusterfs.so.0(+0x7a42a)[0x7efd2de1642a]
/lib64/libpthread.so.0(+0x75ba)[0x7efd2cc1e5ba]
/lib64/libc.so.6(clone+0x6d)[0x7efd2c4f77cd]
---------

Comment 1 Niels de Vos 2016-10-18 12:49:58 UTC
Raghavendra, have you seen something like this before?

Comment 2 Raghavendra G 2017-02-10 04:57:51 UTC
(In reply to Niels de Vos from comment #1)
> Raghavendra, have you seen something like this before?

It's difficult to say, as the backtrace doesn't contain any symbols. Logs taken after installing the gluster debuginfo packages, or a backtrace captured through gdb, would have helped. Is it possible to get them? However, there are some fixes to write-behind that might have fixed memory corruptions (though I am not sure whether this bug is the same issue):
https://review.gluster.org/16464

Comment 3 Raghavendra G 2017-02-10 04:59:19 UTC
(In reply to Raghavendra G from comment #2)
> (In reply to Niels de Vos from comment #1)
> > Raghavendra, have you seen something like this before?
> 
> It's difficult to say, as the backtrace doesn't contain any symbols. Logs
> taken after installing the gluster debuginfo packages, or a backtrace
> captured through gdb, would have helped. Is it possible to get them?
> However, there are some fixes to write-behind that might have fixed memory
> corruptions (though I am not sure whether this bug is the same issue):
> https://review.gluster.org/16464

Looking at the bug title, this patch could be related as it fixes a memory corruption in the code-path where we encounter short writes.
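
For context, a "short write" is when a write call transfers fewer bytes than requested, which becomes common as a volume fills up; the caller then has to resubmit the remaining bytes. Below is a generic Python illustration of that code path (not GlusterFS code); mis-tracking the offsets and sizes here is the kind of mistake that, in C code such as the write-behind translator, can lead to the memory corruption described above.

import os

def write_all(fd: int, data: bytes) -> None:
    """Write data completely, resubmitting after short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # may transfer fewer bytes than len(view)
        view = view[written:]         # resubmit only the unwritten tail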

Comment 4 Pavel Černohorský 2017-02-10 07:51:58 UTC
(In reply to Raghavendra G from comment #3)
> (In reply to Raghavendra G from comment #2)
> > (In reply to Niels de Vos from comment #1)
> > > Raghavendra, have you seen something like this before?
> > 
> > It's difficult to say, as the backtrace doesn't contain any symbols. Logs
> > taken after installing the gluster debuginfo packages, or a backtrace
> > captured through gdb, would have helped. Is it possible to get them?
> > However, there are some fixes to write-behind that might have fixed memory
> > corruptions (though I am not sure whether this bug is the same issue):
> > https://review.gluster.org/16464
> 
> Looking at the bug title, this patch could be related as it fixes a memory
> corruption in the code-path where we encounter short writes.

Unfortunately, we no longer have a setup with that version and type of load (we have moved away from Gluster for that use case because we ran into many more problems and instabilities as well), so I cannot provide you with more information.

As for whether the above-mentioned patch set solves the problem, I am afraid I cannot easily say without reading a lot of the code there. I trust your expertise and the intuition that comes from your professional experience. :)

Comment 5 Csaba Henk 2017-09-06 12:00:21 UTC
Closing, as the setup exhibiting the issue has been decommissioned.