Bug 1626085 - "glusterfs --process-name fuse" crashes and leads to "Transport endpoint is not connected"
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: fuse
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Assignee: Ravishankar N
Reported: 2018-09-06 14:38 UTC by Omar K
Modified: 2019-06-14 10:30 UTC (History)

Fixed In Version: glusterfs-5.4, glusterfs-6.0
Last Closed: 2019-06-14 10:30:30 UTC



Description Omar K 2018-09-06 14:38:38 UTC
Gluster version: 4.1.2
OS: SUSE Linux Enterprise 15
Number of nodes: 4 (node1, node2, node3, node4)
Replica of volume: 4
Name of volume: testvol
Affected node: node1

Description of problem:
"glusterfs fuse" process on node1 died after executing "df -B1 /gluster/volumes/testvol" on node1 where /gluster/volumes/testvol is the mountpoint of the volume.

The exact command is probably not relevant: it is part of an automated test that runs very frequently, usually with success.

Trying to access the mount leads to following error:

$ ls /gluster/volumes/testvol
ls: cannot access '/gluster/volumes/testvol': Transport endpoint is not connected


/etc/fstab:

localhost:/testvol /gluster/volumes/testvol glusterfs defaults,_netdev,noauto,x-systemd.automount,aux-gfid-mount 0 0


The "glusterfs" process for this volume on node1 is missing. On all other nodes of the cluster it is present:

/usr/sbin/glusterfs --aux-gfid-mount --process-name fuse --volfile-server=localhost --volfile-id=/testvol /gluster/volumes/testvol



The log file gluster-volumes-testvol.log contains the following as the last entry. The time stamp matches the time the test failed:

============ START of log ==================

pending frames:
frame : type(1) op(STATFS)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash: 
2018-09-06 03:46:50
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.2
/usr/lib64/libglusterfs.so.0(+0x26ecc)[0x7fce55895ecc]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fce5589fef6]
/lib64/libc.so.6(+0x36160)[0x7fce54c90160]
/lib64/libc.so.6(gsignal+0x110)[0x7fce54c900e0]
/lib64/libc.so.6(abort+0x151)[0x7fce54c916c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fce54c886fa]
/lib64/libc.so.6(+0x2e772)[0x7fce54c88772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fce5501e0b8]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/replicate.so(+0x642c5)[0x7fce4f1b32c5]
/usr/lib64/glusterfs/4.1.2/xlator/protocol/client.so(+0x6657f)[0x7fce4f44d57f]
/usr/lib64/libgfrpc.so.0(+0xe840)[0x7fce55661840]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fce55661b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fce5565dff3]
/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0x73ca)[0x7fce509253ca]
/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0x9d02)[0x7fce50927d02]
/usr/lib64/libglusterfs.so.0(+0x845c7)[0x7fce558f35c7]
/lib64/libpthread.so.0(+0x7559)[0x7fce5501b559]
/lib64/libc.so.6(clone+0x3f)[0x7fce54d5282f]
---------

============== End of log ==============


All other log files contain no relevant information as far as I can see. Let me know if you would like any of them anyway.

Unmounting and then mounting the volume again does NOT work.

Executing "umount /gluster/volumes/testvol" and then "/usr/sbin/glusterfs --aux-gfid-mount --process-name fuse --volfile-server=localhost --volfile-id=/testvol /gluster/volumes/testvol" did work. The mount is up and running again.

Comment 1 Nithya Balachandran 2018-09-19 02:44:53 UTC
Do you have a coredump for the crash? If yes, please send it to us.
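If core dumps are not being captured on your hosts, here is a generic way to check and enable them on a Linux machine (standard Linux/systemd tooling, not specific to this setup; <PID> is a placeholder):

```shell
# Check whether core dumps are allowed for new processes ("0" = disabled)
ulimit -c

# See where the kernel writes core files (often a pipe to systemd-coredump)
cat /proc/sys/kernel/core_pattern

# Allow unlimited core size in the current shell before reproducing
ulimit -c unlimited

# On hosts using systemd-coredump, list and extract captured cores
coredumpctl list glusterfs
coredumpctl dump <PID> -o glusterfs.core
```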

Comment 2 Nithya Balachandran 2018-09-19 02:46:21 UTC
Please provide the following info as well:

gluster volume info

The command used to mount the volume when it crashed. Is it the same as the one in the description?

Comment 3 Christophe Combelles 2018-10-08 10:30:20 UTC
Hi,
we got a similar crash in the glusterfs process, which also leaves the mount point unusable.

Could you point me to the way to get the coredump? Is it created automatically, or should I enable something to get it?


pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2018-10-08 09:14:20
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.2
/lib64/libglusterfs.so.0(+0x25920)[0x7f6f226e3920]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6f226ed874]
/lib64/libc.so.6(+0x362f0)[0x7f6f20d482f0]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/nufa.so(+0x44db8)[0x7f6f1ad72db8]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/nufa.so(+0x22050)[0x7f6f1ad50050]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/nufa.so(+0x23332)[0x7f6f1ad51332]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/nufa.so(+0x42a9c)[0x7f6f1ad70a9c]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/replicate.so(+0x6e8d8)[0x7f6f1b0528d8]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/replicate.so(+0x6f08a)[0x7f6f1b05308a]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/replicate.so(+0x6fba9)[0x7f6f1b053ba9]
/usr/lib64/glusterfs/4.1.2/xlator/protocol/client.so(+0x61f02)[0x7f6f1b2daf02]
/lib64/libgfrpc.so.0(+0xec20)[0x7f6f224b0c20]
/lib64/libgfrpc.so.0(+0xefb3)[0x7f6f224b0fb3]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f6f224ace93]
/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0x7626)[0x7f6f1d5c0626]
/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0xa0f7)[0x7f6f1d5c30f7]
/lib64/libglusterfs.so.0(+0x89094)[0x7f6f22747094]
/lib64/libpthread.so.0(+0x7e25)[0x7f6f21547e25]
/lib64/libc.so.6(clone+0x6d)[0x7f6f20e10bad]
---------

volume info :

Volume Name: vol_nextcloud
Type: Distributed-Replicate
Volume ID: f50d2270-abf8-47d5-97a0-af8eba2f2f0e
Status: Started
Snapshot Count: 31
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: sd-135021:/data/glusterfs/vol_nextcloud/brick1/brick
Brick2: sd-135024:/data/glusterfs/vol_nextcloud/brick1/brick
Brick3: sd-135609:/data/glusterfs/vol_nextcloud/brick1/brick
Brick4: sd-135608:/data/glusterfs/vol_nextcloud/brick1/brick
Options Reconfigured:
features.scrub: Active
features.bitrot: on
features.quota-deem-statfs: off
features.inode-quota: off
features.quota: off
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.nufa: enable
auth.allow: 10.88.80.*
features.barrier: disable

------------------
the volume is mounted (from the CoreOs host) with:

docker exec -ti glusterfs-server mount -t glusterfs -o acl sd-135608:/vol_nextcloud /mnt/gluster_nextcloud

------------------

Comment 4 anthony.tarlano 2018-10-12 22:49:01 UTC
I have also seen this behavior during the last week, on a volume that has been up for more than a year, on clients that mount using FUSE.


pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2018-10-12 21:50:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.11.3
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f42a4152ad0]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f42a415c2f4]
/lib64/libc.so.6(+0x36280)[0x7f42a27b3280]
/lib64/libc.so.6(+0x155d66)[0x7f42a28d2d66]
/usr/lib64/glusterfs/3.11.3/xlator/performance/quick-read.so(+0x3884)[0x7f42958af884]
/usr/lib64/glusterfs/3.11.3/xlator/performance/quick-read.so(+0x3e5a)[0x7f42958afe5a]
/usr/lib64/glusterfs/3.11.3/xlator/performance/io-cache.so(+0x824c)[0x7f4295abd24c]
/usr/lib64/glusterfs/3.11.3/xlator/performance/write-behind.so(+0xd22e)[0x7f42960ec22e]
/usr/lib64/glusterfs/3.11.3/xlator/cluster/distribute.so(+0x3973f)[0x7f429632c73f]
/usr/lib64/glusterfs/3.11.3/xlator/cluster/distribute.so(+0x3a4ba)[0x7f429632d4ba]
/usr/lib64/glusterfs/3.11.3/xlator/cluster/replicate.so(+0x63720)[0x7f42965ea720]
/usr/lib64/glusterfs/3.11.3/xlator/protocol/client.so(+0x22f6d)[0x7f429682af6d]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f42a3f1b990]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1e7)[0x7f42a3f1bc77]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f42a3f17ac3]
/usr/lib64/glusterfs/3.11.3/rpc-transport/socket.so(+0x7526)[0x7f4298b0e526]
/usr/lib64/glusterfs/3.11.3/rpc-transport/socket.so(+0x9acc)[0x7f4298b10acc]
/lib64/libglusterfs.so.0(+0x864a6)[0x7f42a41ae4a6]
/lib64/libpthread.so.0(+0x7dd5)[0x7f42a2fb2dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f42a287bb3d]
---------

Comment 5 Omar K 2018-10-15 21:05:41 UTC
I do not have a core dump. The command used to mount the volume is the one in the description.

Comment 6 Nithya Balachandran 2018-10-16 04:00:55 UTC
Please use https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-backtrace.sh to get the function names for the backtrace in the logs, and send the result to us.
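A sketch of how that script is typically invoked (the debuginfo package name below is illustrative; it must match the exact glusterfs build that produced the trace — check the script's header comments for the authoritative usage):

```shell
# Fetch the script from the glusterfs repository
curl -O https://raw.githubusercontent.com/gluster/glusterfs/master/extras/devel-tools/print-backtrace.sh
chmod +x print-backtrace.sh

# Copy the backtrace lines from the crash log into a file
# (e.g. backtrace.txt), then resolve the raw offsets to function
# names using the matching debuginfo package:
./print-backtrace.sh glusterfs-debuginfo-4.1.2-1.x86_64.rpm backtrace.txt
```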

Comment 7 GCth 2018-11-05 10:08:03 UTC
I have a similar issue, with ~1GB core dumped and the following stack trace:

Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=d.glusterfs.company --'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x7f4f68a036c8) at ../nptl/pthread_mutex_lock.c:67
67      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1facf1f700 (LWP 1245))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x7f4f68a036c8) at ../nptl/pthread_mutex_lock.c:67
#1  0x00007f1fba312152 in __gf_free (free_ptr=0x7f1f99016d20) at mem-pool.c:333
#2  0x00007f1fba2ddd8e in dict_destroy (this=0x7f1f99001598) at dict.c:701
#3  0x00007f1fba2dde95 in dict_unref (this=<optimized out>) at dict.c:753
#4  0x00007f1fafdc5174 in afr_local_cleanup (local=0x7f1f98a79688, this=<optimized out>) at afr-common.c:2091
#5  0x00007f1fafd9ff54 in afr_transaction_done (frame=<optimized out>, this=<optimized out>) at afr-transaction.c:285
#6  0x00007f1fafda65ba in afr_unlock (frame=frame@entry=0x7f1f98f8f778, this=this@entry=0x7f1fb001d690) at afr-lk-common.c:1085
#7  0x00007f1fafd9ddf9 in afr_changelog_post_op_done (frame=frame@entry=0x7f1f98f8f778, this=this@entry=0x7f1fb001d690) at afr-transaction.c:603
#8  0x00007f1fafda0ae5 in afr_changelog_post_op_now (frame=0x7f1f98f8f778, this=0x7f1fb001d690) at afr-transaction.c:1071
#9  0x00007f1fafda0cec in afr_delayed_changelog_wake_up_cbk (data=<optimized out>) at afr-transaction.c:2012
#10 0x00007f1fafda2606 in afr_transaction_start (local=local@entry=0x7f1f98f94558, this=this@entry=0x7f1fb001d690) at afr-transaction.c:2415
#11 0x00007f1fafda2a4a in afr_transaction (frame=frame@entry=0x7f1f98fe6088, this=this@entry=0x7f1fb001d690, type=type@entry=AFR_DATA_TRANSACTION) at afr-transaction.c:2481
#12 0x00007f1fafd8fb70 in afr_do_writev (frame=frame@entry=0x7f1f98f38cd8, this=this@entry=0x7f1fb001d690) at afr-inode-write.c:477
#13 0x00007f1fafd9081d in afr_writev (frame=frame@entry=0x7f1f98f38cd8, this=this@entry=0x7f1fb001d690, fd=fd@entry=0x7f1f98f179a8, vector=0x7f1f98b34700, count=1, offset=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at afr-inode-write.c:555
#14 0x00007f1fafb1fc4f in dht_writev (frame=frame@entry=0x7f1f99047638, this=<optimized out>, fd=0x7f1f98f179a8, vector=vector@entry=0x7f1facf1e540, count=count@entry=1, off=<optimized out>, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at dht-inode-write.c:223
#15 0x00007f1faf8a0b77 in wb_fulfill_head (wb_inode=wb_inode@entry=0x7f1f98816ff0, head=0x7f1f99066f10) at write-behind.c:1156
#16 0x00007f1faf8a0dfb in wb_fulfill (wb_inode=wb_inode@entry=0x7f1f98816ff0, liabilities=liabilities@entry=0x7f1facf1e6a0) at write-behind.c:1233
#17 0x00007f1faf8a21b6 in wb_process_queue (wb_inode=wb_inode@entry=0x7f1f98816ff0) at write-behind.c:1784
#18 0x00007f1faf8a2ab3 in wb_writev (frame=frame@entry=0x7f1f98f7fb68, this=<optimized out>, fd=fd@entry=0x7f1f98f179a8, vector=vector@entry=0x7f1f98e5d440, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at write-behind.c:1887
#19 0x00007f1faf68f4f3 in ra_writev (frame=frame@entry=0x7f1f98e226c8, this=0x7f1fb0022a60, fd=fd@entry=0x7f1f98f179a8, vector=vector@entry=0x7f1f98e5d440, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at read-ahead.c:650
#20 0x00007f1faf47ce69 in rda_writev (frame=frame@entry=0x7f1f98ff8d78, this=0x7f1fb0024530, fd=fd@entry=0x7f1f98f179a8, vector=vector@entry=0x7f1f98e5d440, count=count@entry=1, off=off@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at readdir-ahead.c:704
#21 0x00007f1faf268b83 in ioc_writev (frame=frame@entry=0x7f1f98fcc278, this=0x7f1fb00265a0, fd=fd@entry=0x7f1f98f179a8, vector=vector@entry=0x7f1f98e5d440, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at io-cache.c:1228
#22 0x00007f1faf05796f in qr_writev (frame=frame@entry=0x7f1f98fe6088, this=0x7f1fb0028020, fd=fd@entry=0x7f1f98f179a8, iov=iov@entry=0x7f1f98e5d440, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at quick-read.c:845
#23 0x00007f1fba38c380 in default_writev_resume (frame=0x7f1f990746f8, this=0x7f1fb0029ca0, fd=0x7f1f98f179a8, vector=0x7f1f98e5d440, count=1, off=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at defaults.c:2334
#24 0x00007f1fba30e900 in call_resume_wind (stub=<optimized out>) at call-stub.c:2040
#25 0x00007f1fba30eab5 in call_resume (stub=0x7f1f98f41a88) at call-stub.c:2497
#26 0x00007f1faee4b770 in open_and_resume (this=this@entry=0x7f1fb0029ca0, fd=fd@entry=0x7f1f98f179a8, stub=0x7f1f98f41a88) at open-behind.c:480
#27 0x00007f1faee4b7c3 in ob_writev (frame=frame@entry=0x7f1f990746f8, this=0x7f1fb0029ca0, fd=fd@entry=0x7f1f98f179a8, iov=iov@entry=0x7f1f98ffb318, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at open-behind.c:680
#28 0x00007f1faec2c0bb in mdc_writev (frame=0x7f1f98f38cd8, this=0x7f1fb002b860, fd=0x7f1f98f179a8, vector=0x7f1f98ffb318, count=1, offset=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at md-cache.c:2055
#29 0x00007f1fae9fd775 in ?? () from /usr/lib/x86_64-linux-gnu/glusterfs/5.0/xlator/debug/io-stats.so
#30 0x00007f1fba372d9a in default_writev (frame=0x7f1f99047638, this=<optimized out>, fd=0x7f1f98f179a8, vector=0x7f1f98ffb318, count=1, off=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at defaults.c:3163
#31 0x00007f1fae7e4112 in meta_writev (frame=frame@entry=0x7f1f99047638, this=this@entry=0x7f1fb0004c70, fd=0x7f1f98f179a8, iov=iov@entry=0x7f1f98ffb318, count=count@entry=1, offset=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at meta.c:131
#32 0x00007f1fb78cf902 in fuse_write_resume (state=0x7f1f98ffabe0) at fuse-bridge.c:2485
#33 0x00007f1fb78c54d5 in fuse_resolve_done (state=<optimized out>) at fuse-resolve.c:629
#34 fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:653
#35 0x00007f1fb78c5220 in fuse_resolve (state=0x7f1f98ffabe0) at fuse-resolve.c:620
#36 0x00007f1fb78c551e in fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:650
#37 0x00007f1fb78c4813 in fuse_resolve_continue (state=state@entry=0x7f1f98ffabe0) at fuse-resolve.c:668
#38 0x00007f1fb78c51a1 in fuse_resolve_fd (state=0x7f1f98ffabe0) at fuse-resolve.c:543
#39 fuse_resolve (state=0x7f1f98ffabe0) at fuse-resolve.c:611
#40 0x00007f1fb78c54fe in fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:644
#41 0x00007f1fb78c5540 in fuse_resolve_and_resume (state=0x7f1f98ffabe0, fn=0x7f1fb78cf600 <fuse_write_resume>) at fuse-resolve.c:680
#42 0x00007f1fb78dc96b in fuse_thread_proc (data=0x557f2e638d20) at fuse-bridge.c:5115
#43 0x00007f1fb9566494 in start_thread (arg=0x7f1facf1f700) at pthread_create.c:333
#44 0x00007f1fb8e15acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Comment 8 Nithya Balachandran 2018-11-05 10:57:49 UTC
@Ravi,

Can you please take a look at this ? The stack seems to indicate a problem in AFR.

Comment 9 GCth 2019-02-01 09:54:35 UTC
Is there anything else I can do to help fix this issue?
We had to implement a monitoring-and-restart solution for our glusterfs
clusters, as they crash frequently, causing open files to become unavailable
and dependent applications to stop working correctly.

Comment 10 Ravishankar N 2019-02-01 10:13:05 UTC
(In reply to GCth from comment #9)
> Is there anything else I can do to help fix this issue?
> We had to implement a monitoring-and-restart solution for our glusterfs
> clusters, as they crash frequently, causing open files to become unavailable
> and dependent applications to stop working correctly.

Are all crashes in AFR with the same back trace as in comment #7? What workload are you running on your 4.1 gluster volume? It would be great if you can give a consistent reproducer which we can try on our setup.

Comment 11 GCth 2019-02-04 15:14:57 UTC
Up to frame #17 they are the same; here's another example:

Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=xxxx --'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f8d5e877560 in __gf_free (free_ptr=0x7f8d49a25378) at mem-pool.c:330
330     mem-pool.c: No such file or directory.
[Current thread is 1 (Thread 0x7f8d521bf700 (LWP 2217))]
(gdb) bt
#0  0x00007f8d5e877560 in __gf_free (free_ptr=0x7f8d49a25378) at mem-pool.c:330
#1  0x00007f8d5e842e1e in dict_destroy (this=0x7f8d4994f708) at dict.c:701
#2  0x00007f8d5e842f25 in dict_unref (this=<optimized out>) at dict.c:753
#3  0x00007f8d584330d4 in afr_local_cleanup (local=0x7f8d49a56cc8, this=<optimized out>) at afr-common.c:2091
#4  0x00007f8d5840d584 in afr_transaction_done (frame=<optimized out>, this=<optimized out>) at afr-transaction.c:369
#5  0x00007f8d5841483a in afr_unlock (frame=frame@entry=0x7f8d4995ec08, this=this@entry=0x7f8d54019d40) at afr-lk-common.c:1085
#6  0x00007f8d5840aeca in afr_changelog_post_op_done (frame=frame@entry=0x7f8d4995ec08, this=this@entry=0x7f8d54019d40) at afr-transaction.c:778
#7  0x00007f8d5840e105 in afr_changelog_post_op_do (frame=0x7f8d4995ec08, this=0x7f8d54019d40) at afr-transaction.c:1442
#8  0x00007f8d5840edcf in afr_changelog_post_op_now (frame=0x7f8d4995ec08, this=0x7f8d54019d40) at afr-transaction.c:1512
#9  0x00007f8d5840ef4c in afr_delayed_changelog_wake_up_cbk (data=<optimized out>) at afr-transaction.c:2444
#10 0x00007f8d58410866 in afr_transaction_start (local=local@entry=0x7f8d4cd6ed18, this=this@entry=0x7f8d54019d40) at afr-transaction.c:2847
#11 0x00007f8d58410c89 in afr_transaction (frame=frame@entry=0x7f8d4e643068, this=this@entry=0x7f8d54019d40, type=type@entry=AFR_DATA_TRANSACTION) at afr-transaction.c:2918
#12 0x00007f8d583fcb70 in afr_do_writev (frame=frame@entry=0x7f8d4e245608, this=this@entry=0x7f8d54019d40) at afr-inode-write.c:477
#13 0x00007f8d583fd81d in afr_writev (frame=frame@entry=0x7f8d4e245608, this=this@entry=0x7f8d54019d40, fd=fd@entry=0x7f8d499f3758, vector=0x7f8d4e932b40, count=1, offset=1024, flags=32769, iobref=0x7f8d488cb3b0, xdata=0x0) at afr-inode-write.c:555
#14 0x00007f8d5818cbef in dht_writev (frame=frame@entry=0x7f8d4e29c598, this=<optimized out>, fd=0x7f8d499f3758, vector=vector@entry=0x7f8d521be5c0, count=count@entry=1, off=<optimized out>, flags=32769, iobref=0x7f8d488cb3b0, xdata=0x0) at dht-inode-write.c:223
#15 0x00007f8d53df0b77 in wb_fulfill_head (wb_inode=wb_inode@entry=0x7f8d49a25310, head=0x7f8d49bbcb40) at write-behind.c:1156
#16 0x00007f8d53df0dfb in wb_fulfill (wb_inode=wb_inode@entry=0x7f8d49a25310, liabilities=liabilities@entry=0x7f8d521be720) at write-behind.c:1233
#17 0x00007f8d53df21b6 in wb_process_queue (wb_inode=wb_inode@entry=0x7f8d49a25310) at write-behind.c:1784
#18 0x00007f8d53df233f in wb_fulfill_cbk (frame=frame@entry=0x7f8d49cc15a8, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=1024, op_errno=op_errno@entry=0, prebuf=prebuf@entry=0x7f8d49c7f8c0, postbuf=<optimized out>, xdata=<optimized out>) at write-behind.c:1105
#19 0x00007f8d5818b31e in dht_writev_cbk (frame=0x7f8d498dfa48, cookie=<optimized out>, this=<optimized out>, op_ret=1024, op_errno=0, prebuf=0x7f8d49c7f8c0, postbuf=0x7f8d49c7f958, xdata=0x7f8d4e65a7e8) at dht-inode-write.c:140
#20 0x00007f8d583fc2b7 in afr_writev_unwind (frame=frame@entry=0x7f8d48374db8, this=this@entry=0x7f8d54019d40) at afr-inode-write.c:234
#21 0x00007f8d583fc83e in afr_writev_wind_cbk (frame=0x7f8d4995ec08, cookie=<optimized out>, this=0x7f8d54019d40, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, postbuf=0x7f8d521be9d0, xdata=0x7f8d4e6e30d8) at afr-inode-write.c:388
#22 0x00007f8d586c4865 in client4_0_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f8d4e621578) at client-rpc-fops_v2.c:685
#23 0x00007f8d5e61c130 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f8d54085540, pollin=pollin@entry=0x7f8d4ea47850) at rpc-clnt.c:755
#24 0x00007f8d5e61c48f in rpc_clnt_notify (trans=0x7f8d54085800, mydata=0x7f8d54085570, event=<optimized out>, data=0x7f8d4ea47850) at rpc-clnt.c:923
#25 0x00007f8d5e618893 in rpc_transport_notify (this=this@entry=0x7f8d54085800, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f8d4ea47850) at rpc-transport.c:525
#26 0x00007f8d59401671 in socket_event_poll_in (notify_handled=true, this=0x7f8d54085800) at socket.c:2504
#27 socket_event_handler (fd=<optimized out>, idx=idx@entry=2, gen=4, data=data@entry=0x7f8d54085800, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>) at socket.c:2905
#28 0x00007f8d5e8ab945 in event_dispatch_epoll_handler (event=0x7f8d521bee8c, event_pool=0x56110317e0b0) at event-epoll.c:591
#29 event_dispatch_epoll_worker (data=0x7f8d5406f7e0) at event-epoll.c:668
#30 0x00007f8d5dacb494 in start_thread (arg=0x7f8d521bf700) at pthread_create.c:333
#31 0x00007f8d5d374acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

All the gluster instances look similar to the following setup:

Type: Distributed-Replicate
Volume ID: e9dd963c...
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.10.11.1:/export/data1
Brick2: 10.10.11.2:/export/data1
Brick3: 10.10.11.3:/export/data1
Brick4: 10.10.11.4:/export/data1
Options Reconfigured:
cluster.favorite-child-policy: mtime
cluster.self-heal-daemon: enable
performance.cache-size: 1GB
performance.quick-read: on
performance.stat-prefetch: on
performance.read-ahead: on
performance.readdir-ahead: on
auth.allow: 10.*.*.*
transport.address-family: inet
nfs.disable: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000

I do not have a reproducer. The gluster instances hold 2-5 TB of files, mostly small ones, with lots of directories.
They reach up to 10M inodes used, as reported by df -hi; brick storage is on XFS, as recommended.
The crash of an individual glusterfs process happens once every several days.

Comment 12 GCth 2019-02-04 15:16:09 UTC
One more - it's currently:

glusterfs 5.3

installed from https://download.gluster.org/pub/gluster/glusterfs/5/LATEST/Debian/stretch/amd64/apt

Comment 13 Nithya Balachandran 2019-02-21 13:48:37 UTC
(In reply to GCth from comment #12)
> One more - it's currently:
> 
> glusterfs 5.3
> 
> installed from
> https://download.gluster.org/pub/gluster/glusterfs/5/LATEST/Debian/stretch/amd64/apt

We found an issue with the write-behind functionality in 5.3. Please run
gluster v set <volname> performance.write-behind off
and see if you still see crashes.
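For example (keeping <volname> as a placeholder for the affected volume):

```shell
# Disable write-behind on the affected volume
gluster volume set <volname> performance.write-behind off

# Confirm the option took effect
gluster volume get <volname> performance.write-behind
```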

Comment 14 GCth 2019-03-09 16:54:19 UTC
Update to 5.4 fixed this issue for me.

Comment 15 Nithya Balachandran 2019-03-11 02:12:03 UTC
(In reply to GCth from comment #14)
> Update to 5.4 fixed this issue for me.

Excellent. Thanks for letting us know.

