Created attachment 1580779 [details]
Gluster Client Log

Description of problem:
During a large write, a 42-second disconnect error appeared in the logs. This happens from time to time and normally recovers, but this time the client glusterfs process crashed about 10 seconds later. The error in the client logs was the following:

[2019-06-11 15:31:42.794126] I [MSGID: 114018] [client.c:2254:client_rpc_notify] 0-somecompany-client-1: disconnected from somecompany-client-1. Client process will keep trying to connect to glusterd until brick's port is available

pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(WRITE)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-06-11 15:31:53
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.6
/lib64/libglusterfs.so.0(+0x25940)[0x7f66fd4ee940]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f66fd4f88a4]
/lib64/libc.so.6(+0x36280)[0x7f66fbb53280]
/usr/lib64/glusterfs/4.1.6/xlator/protocol/client.so(+0x615e3)[0x7f66f60e35e3]
/lib64/libgfrpc.so.0(+0xec20)[0x7f66fd2bbc20]
/lib64/libgfrpc.so.0(+0xefb3)[0x7f66fd2bbfb3]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f66fd2b7e93]
/usr/lib64/glusterfs/4.1.6/rpc-transport/socket.so(+0x7636)[0x7f66f83cb636]
/usr/lib64/glusterfs/4.1.6/rpc-transport/socket.so(+0xa107)[0x7f66f83ce107]
/lib64/libglusterfs.so.0(+0x890c4)[0x7f66fd5520c4]
/lib64/libpthread.so.0(+0x7dd5)[0x7f66fc352dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f66fbc1aead]

Version-Release number of selected component (if applicable):
Gluster 4.1.7
CentOS 7.6.1810 (Core)

How reproducible:
Not really sure, but we believe it has something to do with a very large write (~1-3 GB). During that time, either the I/O or the network was busy, causing the 42-second disconnect.

This was a 3-brick setup with one of the bricks being an arbiter brick. The primary EC2 instance had one of the data bricks and the arbiter brick, and the secondary had just the other data brick. Both had a FUSE client mount connected to the volume. The primary server was the one doing the large write at the time, and the primary's glusterfs client was the one that crashed; afterwards we could not access the files on its mount ("Transport endpoint is not connected"). The secondary's glusterfs client was still able to access the files. "gluster volume status" showed that all the bricks were up and running.

We were able to unmount and remount the client later, but at that point we were unsure whether the services using the mount held stale file handles, so we restarted the servers to make sure everything was okay.
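For reference, the recovery we performed was roughly the following; the mount point and the server/volume names below are placeholders, not our real ones:

# lazy-unmount the dead FUSE mount (mount point is a placeholder)
umount -l /mnt/glusterfs
# remount the volume; "primary-server" and "somecompany" are placeholder names
mount -t glusterfs primary-server:/somecompany /mnt/glusterfs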
Sadly, the coredump was corrupted and was not recoverable (unrelated).

Steps to Reproduce:
1. N/A

Actual results:
Client glusterfs process crashed and did not recover, so we were unable to access the files on the mount.

Expected results:
Client glusterfs process does not crash, so that we are able to access the files on the mount. Or it crashes and there is a way to recover the mount without having to remount.

Additional info:
Servers have been up for a few weeks with similar load, but have had no issues until now.
We would appreciate it if you could provide the output of 'thread apply all bt full' from `$ gdb -c <corefile>`.

Also, there were many stability fixes in the glusterfs-5 and glusterfs-6 series. It would be great if you could upgrade to the latest release.
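For example, something like the following (assuming the core came from the fuse mount process; /usr/sbin/glusterfs is the usual binary, adjust the paths to your system):

$ gdb /usr/sbin/glusterfs -c /path/to/corefile
(gdb) thread apply all bt full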
(In reply to Amar Tumballi from comment #1)
> We would appreciate it if you could provide the output of 'thread apply all
> bt full' from `$ gdb -c <corefile>`.
>
> Also, there were many stability fixes in the glusterfs-5 and glusterfs-6
> series. It would be great if you could upgrade to the latest release.

Sadly, we corrupted our core dump, and restarting the site removed a good portion of our logs, so we don't really have much for debugging. We weren't sure if there was anything in the stack trace that could be used to tell us why it crashed.

We usually upgrade to the latest long-term release unless there is a CVE, or there is a good chance that a critical bug has been fixed in a short-term release but not in the long-term release (which hasn't happened yet).
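(For anyone looking at the trace without a core: the raw offsets can in principle be resolved against the installed binaries with addr2line, assuming the matching glusterfs debuginfo package is installed; without debuginfo it just prints '??'. The path and offset below are taken straight from the trace.)

$ addr2line -f -C -e /usr/lib64/glusterfs/4.1.6/xlator/protocol/client.so 0x615e3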
This bug has been moved to https://github.com/gluster/glusterfs/issues/888 and will be tracked there from now on. Visit the GitHub issue URL for further details.