Description of problem:
The FUSE-mounted Gluster volume became unavailable, reporting "stat: cannot stat ...: Transport endpoint is not connected". The logs showed that the client process crashed (see Additional info).

Version-Release number of selected component (if applicable):
glusterfs.x86_64                 3.8.4-1.fc24  @updates
glusterfs-api.x86_64             3.8.4-1.fc24  @updates
glusterfs-cli.x86_64             3.8.4-1.fc24  @updates
glusterfs-client-xlators.x86_64  3.8.4-1.fc24  @updates
glusterfs-fuse.x86_64            3.8.4-1.fc24  @updates
glusterfs-libs.x86_64            3.8.4-1.fc24  @updates
glusterfs-server.x86_64          3.8.4-1.fc24  @updates

How reproducible:
Quite easily; I would say it is reproducible in about 30% of attempts.

Steps to Reproduce:
1. Set up 6 FUSE mount points from different nodes to a single Gluster volume.
2. Start very heavy read/write traffic through each of the mount points (approx. 1050 Mbit/s of aggregate write traffic and the same amount of aggregate read traffic across all the mount points together).
3. Let the volume slowly fill with data.
4. At least one of the mount points will go down the moment the volume gets full.
(A minimal reproduction sketch is included after the crash log below.)

Actual results:
At least one of the 6 mount points goes down.

Expected results:
The client keeps responding correctly to POSIX calls, just as it does on a normally filled volume; once there is free space again, things simply keep working, as they do on the non-crashed clients.

Additional info:
[2016-10-11 06:36:14.835862] W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-2: remote operation failed [No space left on device]
The message "W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-0: remote operation failed [No space left on device]" repeated 4 times between [2016-10-11 06:36:08.285146] and [2016-10-11 06:36:13.803409]
The message "W [MSGID: 114031] [client-rpc-fops.c:854:client3_3_writev_cbk] 0-ramcache-client-2: remote operation failed [No space left on device]" repeated 12 times between [2016-10-11 06:36:14.835862] and [2016-10-11 06:36:14.840894]
pending frames:
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(STATFS)
frame : type(1) op(LOOKUP)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(FLUSH)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(FLUSH)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2016-10-11 06:36:14
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x7e)[0x7efd2ddc31fe]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7efd2ddcc974]
/lib64/libc.so.6(+0x34ed0)[0x7efd2c428ed0]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x68f7)[0x7efd25dd98f7]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x6b5b)[0x7efd25dd9b5b]
/usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so(+0x6c37)[0x7efd25dd9c37]
/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x51ed1)[0x7efd26035ed1]
/usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so(+0x16f97)[0x7efd26281f97]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7efd2db8e970]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x27c)[0x7efd2db8ecec]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7efd2db8b073]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x8ac9)[0x7efd28788ac9]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x8cb8)[0x7efd28788cb8]
/lib64/libglusterfs.so.0(+0x7a42a)[0x7efd2de1642a]
/lib64/libpthread.so.0(+0x75ba)[0x7efd2cc1e5ba]
/lib64/libc.so.6(clone+0x6d)[0x7efd2c4f77cd]
---------
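Reproduction sketch (not the exact commands used in the original environment; the server name, mount path, file names and block sizes are assumptions, only the volume name "ramcache" is taken from the log excerpt above):

# On each of the 6 client nodes, mount the volume over FUSE:
mount -t glusterfs gluster-server1:/ramcache /mnt/ramcache

# Sustained write load until the volume fills up:
while true; do
    dd if=/dev/zero of=/mnt/ramcache/load-$(hostname)-$RANDOM bs=1M count=512
done &

# Sustained read load in parallel:
while true; do
    for f in /mnt/ramcache/load-*; do
        dd if="$f" of=/dev/null bs=1M
    done
    sleep 1
done &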
Raghavendra, have you seen something like this before?
(In reply to Niels de Vos from comment #1)
> Raghavendra, have you seen something like this before?

It's difficult to say, as the backtrace doesn't include any symbols. Logs taken after installing the gluster debuginfo packages, or a backtrace obtained through gdb, would have helped. Is it possible to get them? That said, there are some fixes to write-behind that might have addressed memory corruptions (though I am not sure whether this bug is the same issue):
https://review.gluster.org/16464
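A rough sketch of how to collect that on Fedora (this assumes the crashing FUSE mount process leaves a core dump behind; the core file path below is only a placeholder):

# Install debug symbols (needs the dnf debuginfo-install plugin from dnf-plugins-core):
dnf debuginfo-install glusterfs glusterfs-fuse glusterfs-client-xlators glusterfs-libs

# Open the core dump with the FUSE client binary:
gdb /usr/sbin/glusterfs /path/to/core

# Inside gdb, capture full backtraces of all threads:
(gdb) thread apply all bt full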
(In reply to Raghavendra G from comment #2)
> (In reply to Niels de Vos from comment #1)
> > Raghavendra, have you seen something like this before?
>
> It's difficult to say, as the backtrace doesn't include any symbols. Logs
> taken after installing the gluster debuginfo packages, or a backtrace
> obtained through gdb, would have helped. Is it possible to get them? That
> said, there are some fixes to write-behind that might have addressed memory
> corruptions (though I am not sure whether this bug is the same issue):
> https://review.gluster.org/16464

Looking at the bug title, this patch could be related, as it fixes a memory corruption in the code path where we encounter short writes.
(In reply to Raghavendra G from comment #3)
> (In reply to Raghavendra G from comment #2)
> > (In reply to Niels de Vos from comment #1)
> > > Raghavendra, have you seen something like this before?
> >
> > It's difficult to say, as the backtrace doesn't include any symbols. Logs
> > taken after installing the gluster debuginfo packages, or a backtrace
> > obtained through gdb, would have helped. Is it possible to get them? That
> > said, there are some fixes to write-behind that might have addressed
> > memory corruptions (though I am not sure whether this bug is the same
> > issue):
> > https://review.gluster.org/16464
>
> Looking at the bug title, this patch could be related, as it fixes a memory
> corruption in the code path where we encounter short writes.

Unfortunately, we no longer have a setup with that version and that type of load (we have moved away from Gluster for that use case because we ran into many more problems and instabilities as well), so I cannot provide you with more information. As for whether the above-mentioned patch set solves the problem, I am afraid I cannot easily say without reading a lot of the code there. I trust your expertise and the judgement that comes with your professional experience. :)
Closing, as the setup exhibiting the issue has been decommissioned.