Gluster version: 4.1.2
OS: SUSE Linux Enterprise 15
Number of nodes: 4 (node1, node2, node3, node4)
Replica of volume: 4
Name of volume: testvol
Affected node: node1

Description of problem:

The "glusterfs fuse" process on node1 died after executing "df -B1 /gluster/volumes/testvol" on node1, where /gluster/volumes/testvol is the mountpoint of the volume. The exact command is probably not relevant, because it is part of an automated test that is executed very frequently (usually with success).

Trying to access the mount leads to the following error:

$ ls /gluster/volumes/testvol
ls: cannot access '/gluster/volumes/testvol': Transport endpoint is not connected

/etc/fstab:

localhost:/testvol /gluster/volumes/testvol glusterfs defaults,_netdev,noauto,x-systemd.automount,aux-gfid-mount 0 0

The "glusterfs" process for this volume on node1 is missing. On all other nodes of the cluster it is present:

/usr/sbin/glusterfs --aux-gfid-mount --process-name fuse --volfile-server=localhost --volfile-id=/testvol /gluster/volumes/testvol

The log file gluster-volumes-testvol.log contains the following as the last entry.
The time stamp matches the time the test failed:

============ START of log ==================
pending frames:
frame : type(1) op(STATFS)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash: 2018-09-06 03:46:50
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.2
/usr/lib64/libglusterfs.so.0(+0x26ecc)[0x7fce55895ecc]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fce5589fef6]
/lib64/libc.so.6(+0x36160)[0x7fce54c90160]
/lib64/libc.so.6(gsignal+0x110)[0x7fce54c900e0]
/lib64/libc.so.6(abort+0x151)[0x7fce54c916c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fce54c886fa]
/lib64/libc.so.6(+0x2e772)[0x7fce54c88772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fce5501e0b8]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/replicate.so(+0x642c5)[0x7fce4f1b32c5]
/usr/lib64/glusterfs/4.1.2/xlator/protocol/client.so(+0x6657f)[0x7fce4f44d57f]
/usr/lib64/libgfrpc.so.0(+0xe840)[0x7fce55661840]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fce55661b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fce5565dff3]
/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0x73ca)[0x7fce509253ca]
/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0x9d02)[0x7fce50927d02]
/usr/lib64/libglusterfs.so.0(+0x845c7)[0x7fce558f35c7]
/lib64/libpthread.so.0(+0x7559)[0x7fce5501b559]
/lib64/libc.so.6(clone+0x3f)[0x7fce54d5282f]
---------
============== End of log ==============

All other log files contain no relevant information as far as I can see. Let me know if you would like any of them anyway.

Unmounting and then mounting the volume again does NOT work. However, executing "umount /gluster/volumes/testvol" and then "/usr/sbin/glusterfs --aux-gfid-mount --process-name fuse --volfile-server=localhost --volfile-id=/testvol /gluster/volumes/testvol" did work. The mount is up and running again.
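For anyone who ends up in the same "Transport endpoint is not connected" state: the manual recovery described above can be sketched as a small script. This is only a sketch of the workaround from this comment, assuming the same volume name, mountpoint, and localhost volfile server as in this report.

```shell
#!/bin/sh
# Sketch of the manual recovery described above. Volume name and
# mountpoint are the ones from this report; adjust as needed.
VOL=testvol
MNT="/gluster/volumes/$VOL"

# Lazy-unmount the stale fuse mount left behind by the dead client...
umount -l "$MNT"

# ...then start a fresh fuse client, matching the process seen on the
# healthy nodes:
/usr/sbin/glusterfs --aux-gfid-mount --process-name fuse \
    --volfile-server=localhost --volfile-id="/$VOL" "$MNT"
```

Note this only restores the mount; it does not address the underlying crash.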
Do you have a coredump for the crash? If yes, please send it to us.
Please provide the following info as well:
- gluster volume info
- The command used to mount the volume when it crashed. Is it the same as the one in the description?
Hi, we got a similar crash of the glusterfs process, which also leaves the mount point unusable. Could you point me to how to get the coredump? Is it created automatically, or do I need to enable something to get it?

pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2018-10-08 09:14:20
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.2
/lib64/libglusterfs.so.0(+0x25920)[0x7f6f226e3920]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6f226ed874]
/lib64/libc.so.6(+0x362f0)[0x7f6f20d482f0]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/nufa.so(+0x44db8)[0x7f6f1ad72db8]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/nufa.so(+0x22050)[0x7f6f1ad50050]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/nufa.so(+0x23332)[0x7f6f1ad51332]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/nufa.so(+0x42a9c)[0x7f6f1ad70a9c]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/replicate.so(+0x6e8d8)[0x7f6f1b0528d8]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/replicate.so(+0x6f08a)[0x7f6f1b05308a]
/usr/lib64/glusterfs/4.1.2/xlator/cluster/replicate.so(+0x6fba9)[0x7f6f1b053ba9]
/usr/lib64/glusterfs/4.1.2/xlator/protocol/client.so(+0x61f02)[0x7f6f1b2daf02]
/lib64/libgfrpc.so.0(+0xec20)[0x7f6f224b0c20]
/lib64/libgfrpc.so.0(+0xefb3)[0x7f6f224b0fb3]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f6f224ace93]
/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0x7626)[0x7f6f1d5c0626]
/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0xa0f7)[0x7f6f1d5c30f7]
/lib64/libglusterfs.so.0(+0x89094)[0x7f6f22747094]
/lib64/libpthread.so.0(+0x7e25)[0x7f6f21547e25]
/lib64/libc.so.6(clone+0x6d)[0x7f6f20e10bad]
---------

volume info:

Volume Name: vol_nextcloud
Type: Distributed-Replicate
Volume ID: f50d2270-abf8-47d5-97a0-af8eba2f2f0e
Status: Started
Snapshot Count: 31
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: sd-135021:/data/glusterfs/vol_nextcloud/brick1/brick
Brick2: sd-135024:/data/glusterfs/vol_nextcloud/brick1/brick
Brick3: sd-135609:/data/glusterfs/vol_nextcloud/brick1/brick
Brick4: sd-135608:/data/glusterfs/vol_nextcloud/brick1/brick
Options Reconfigured:
features.scrub: Active
features.bitrot: on
features.quota-deem-statfs: off
features.inode-quota: off
features.quota: off
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.nufa: enable
auth.allow: 10.88.80.*
features.barrier: disable
------------------

The volume is mounted (from the CoreOS host) with:

docker exec -ti glusterfs-server mount -t glusterfs -o acl sd-135608:/vol_nextcloud /mnt/gluster_nextcloud
------------------
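On the coredump question: core files are usually not written by default, because the soft core-size limit is 0. A sketch of the usual knobs follows; the /var/crash path is an example, not a Gluster default, and on distros using systemd-coredump the core is captured automatically instead.

```shell
# Allow core files for processes started from this shell. The fuse
# client inherits the limit from whatever starts it (mount helper,
# systemd unit), so set it there for a long-lived fix.
ulimit -c unlimited

# Tell the kernel where to write cores (example path; needs root).
# %e = executable name, %p = PID.
mkdir -p /var/crash
echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern

# On systemd-coredump distros, retrieve the core after the next crash:
#   coredumpctl list glusterfs
#   coredumpctl dump <PID> -o /tmp/glusterfs.core
```

After the next crash, the core file (plus the exact glusterfs version and debuginfo packages) is what the developers need.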
I have also seen this behavior during the last week, on a volume that has been up for more than a year, on clients that mount using fuse.

pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2018-10-12 21:50:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.11.3
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f42a4152ad0]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f42a415c2f4]
/lib64/libc.so.6(+0x36280)[0x7f42a27b3280]
/lib64/libc.so.6(+0x155d66)[0x7f42a28d2d66]
/usr/lib64/glusterfs/3.11.3/xlator/performance/quick-read.so(+0x3884)[0x7f42958af884]
/usr/lib64/glusterfs/3.11.3/xlator/performance/quick-read.so(+0x3e5a)[0x7f42958afe5a]
/usr/lib64/glusterfs/3.11.3/xlator/performance/io-cache.so(+0x824c)[0x7f4295abd24c]
/usr/lib64/glusterfs/3.11.3/xlator/performance/write-behind.so(+0xd22e)[0x7f42960ec22e]
/usr/lib64/glusterfs/3.11.3/xlator/cluster/distribute.so(+0x3973f)[0x7f429632c73f]
/usr/lib64/glusterfs/3.11.3/xlator/cluster/distribute.so(+0x3a4ba)[0x7f429632d4ba]
/usr/lib64/glusterfs/3.11.3/xlator/cluster/replicate.so(+0x63720)[0x7f42965ea720]
/usr/lib64/glusterfs/3.11.3/xlator/protocol/client.so(+0x22f6d)[0x7f429682af6d]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f42a3f1b990]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1e7)[0x7f42a3f1bc77]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f42a3f17ac3]
/usr/lib64/glusterfs/3.11.3/rpc-transport/socket.so(+0x7526)[0x7f4298b0e526]
/usr/lib64/glusterfs/3.11.3/rpc-transport/socket.so(+0x9acc)[0x7f4298b10acc]
/lib64/libglusterfs.so.0(+0x864a6)[0x7f42a41ae4a6]
/lib64/libpthread.so.0(+0x7dd5)[0x7f42a2fb2dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f42a287bb3d]
---------
I do not have a core dump. The command used to mount the volume is the one in the description.
Please use https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-backtrace.sh to get the function names for the backtrace in the logs and send it to us.
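For reference, a hypothetical walk-through of using that helper script. The two-argument invocation (debuginfo package, then a file with the raw backtrace lines) and the debuginfo package name are assumptions; check the script's own usage message if it differs.

```shell
# Fetch the helper script referenced above (it resolves the raw
# "lib(+offset)[addr]" frames into function names):
curl -O https://raw.githubusercontent.com/gluster/glusterfs/master/extras/devel-tools/print-backtrace.sh
chmod +x print-backtrace.sh

# Pull just the backtrace frames out of the client log:
grep -E '^(/usr|/lib)' /var/log/glusterfs/gluster-volumes-testvol.log > bt.txt

# Resolve them against the matching debuginfo package
# (package file name is illustrative; it must match your glusterfs build):
./print-backtrace.sh glusterfs-debuginfo-4.1.2-1.x86_64.rpm bt.txt
```

The resolved backtrace is what should be attached to this bug.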
I have a similar issue, with a ~1GB core dumped and the following stack trace:

Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=d.glusterfs.company --'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x7f4f68a036c8) at ../nptl/pthread_mutex_lock.c:67
67      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1facf1f700 (LWP 1245))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x7f4f68a036c8) at ../nptl/pthread_mutex_lock.c:67
#1  0x00007f1fba312152 in __gf_free (free_ptr=0x7f1f99016d20) at mem-pool.c:333
#2  0x00007f1fba2ddd8e in dict_destroy (this=0x7f1f99001598) at dict.c:701
#3  0x00007f1fba2dde95 in dict_unref (this=<optimized out>) at dict.c:753
#4  0x00007f1fafdc5174 in afr_local_cleanup (local=0x7f1f98a79688, this=<optimized out>) at afr-common.c:2091
#5  0x00007f1fafd9ff54 in afr_transaction_done (frame=<optimized out>, this=<optimized out>) at afr-transaction.c:285
#6  0x00007f1fafda65ba in afr_unlock (frame=frame@entry=0x7f1f98f8f778, this=this@entry=0x7f1fb001d690) at afr-lk-common.c:1085
#7  0x00007f1fafd9ddf9 in afr_changelog_post_op_done (frame=frame@entry=0x7f1f98f8f778, this=this@entry=0x7f1fb001d690) at afr-transaction.c:603
#8  0x00007f1fafda0ae5 in afr_changelog_post_op_now (frame=0x7f1f98f8f778, this=0x7f1fb001d690) at afr-transaction.c:1071
#9  0x00007f1fafda0cec in afr_delayed_changelog_wake_up_cbk (data=<optimized out>) at afr-transaction.c:2012
#10 0x00007f1fafda2606 in afr_transaction_start (local=local@entry=0x7f1f98f94558, this=this@entry=0x7f1fb001d690) at afr-transaction.c:2415
#11 0x00007f1fafda2a4a in afr_transaction (frame=frame@entry=0x7f1f98fe6088, this=this@entry=0x7f1fb001d690, type=type@entry=AFR_DATA_TRANSACTION) at afr-transaction.c:2481
#12 0x00007f1fafd8fb70 in afr_do_writev (frame=frame@entry=0x7f1f98f38cd8, this=this@entry=0x7f1fb001d690) at afr-inode-write.c:477
#13 0x00007f1fafd9081d in afr_writev (frame=frame@entry=0x7f1f98f38cd8, this=this@entry=0x7f1fb001d690, fd=fd@entry=0x7f1f98f179a8, vector=0x7f1f98b34700, count=1, offset=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at afr-inode-write.c:555
#14 0x00007f1fafb1fc4f in dht_writev (frame=frame@entry=0x7f1f99047638, this=<optimized out>, fd=0x7f1f98f179a8, vector=vector@entry=0x7f1facf1e540, count=count@entry=1, off=<optimized out>, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at dht-inode-write.c:223
#15 0x00007f1faf8a0b77 in wb_fulfill_head (wb_inode=wb_inode@entry=0x7f1f98816ff0, head=0x7f1f99066f10) at write-behind.c:1156
#16 0x00007f1faf8a0dfb in wb_fulfill (wb_inode=wb_inode@entry=0x7f1f98816ff0, liabilities=liabilities@entry=0x7f1facf1e6a0) at write-behind.c:1233
#17 0x00007f1faf8a21b6 in wb_process_queue (wb_inode=wb_inode@entry=0x7f1f98816ff0) at write-behind.c:1784
#18 0x00007f1faf8a2ab3 in wb_writev (frame=frame@entry=0x7f1f98f7fb68, this=<optimized out>, fd=fd@entry=0x7f1f98f179a8, vector=vector@entry=0x7f1f98e5d440, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at write-behind.c:1887
#19 0x00007f1faf68f4f3 in ra_writev (frame=frame@entry=0x7f1f98e226c8, this=0x7f1fb0022a60, fd=fd@entry=0x7f1f98f179a8, vector=vector@entry=0x7f1f98e5d440, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at read-ahead.c:650
#20 0x00007f1faf47ce69 in rda_writev (frame=frame@entry=0x7f1f98ff8d78, this=0x7f1fb0024530, fd=fd@entry=0x7f1f98f179a8, vector=vector@entry=0x7f1f98e5d440, count=count@entry=1, off=off@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at readdir-ahead.c:704
#21 0x00007f1faf268b83 in ioc_writev (frame=frame@entry=0x7f1f98fcc278, this=0x7f1fb00265a0, fd=fd@entry=0x7f1f98f179a8, vector=vector@entry=0x7f1f98e5d440, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at io-cache.c:1228
#22 0x00007f1faf05796f in qr_writev (frame=frame@entry=0x7f1f98fe6088, this=0x7f1fb0028020, fd=fd@entry=0x7f1f98f179a8, iov=iov@entry=0x7f1f98e5d440, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at quick-read.c:845
#23 0x00007f1fba38c380 in default_writev_resume (frame=0x7f1f990746f8, this=0x7f1fb0029ca0, fd=0x7f1f98f179a8, vector=0x7f1f98e5d440, count=1, off=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at defaults.c:2334
#24 0x00007f1fba30e900 in call_resume_wind (stub=<optimized out>) at call-stub.c:2040
#25 0x00007f1fba30eab5 in call_resume (stub=0x7f1f98f41a88) at call-stub.c:2497
#26 0x00007f1faee4b770 in open_and_resume (this=this@entry=0x7f1fb0029ca0, fd=fd@entry=0x7f1f98f179a8, stub=0x7f1f98f41a88) at open-behind.c:480
#27 0x00007f1faee4b7c3 in ob_writev (frame=frame@entry=0x7f1f990746f8, this=0x7f1fb0029ca0, fd=fd@entry=0x7f1f98f179a8, iov=iov@entry=0x7f1f98ffb318, count=count@entry=1, offset=offset@entry=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at open-behind.c:680
#28 0x00007f1faec2c0bb in mdc_writev (frame=0x7f1f98f38cd8, this=0x7f1fb002b860, fd=0x7f1f98f179a8, vector=0x7f1f98ffb318, count=1, offset=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at md-cache.c:2055
#29 0x00007f1fae9fd775 in ?? () from /usr/lib/x86_64-linux-gnu/glusterfs/5.0/xlator/debug/io-stats.so
#30 0x00007f1fba372d9a in default_writev (frame=0x7f1f99047638, this=<optimized out>, fd=0x7f1f98f179a8, vector=0x7f1f98ffb318, count=1, off=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at defaults.c:3163
#31 0x00007f1fae7e4112 in meta_writev (frame=frame@entry=0x7f1f99047638, this=this@entry=0x7f1fb0004c70, fd=0x7f1f98f179a8, iov=iov@entry=0x7f1f98ffb318, count=count@entry=1, offset=192332, flags=33793, iobref=0x7f1f98a9b690, xdata=0x0) at meta.c:131
#32 0x00007f1fb78cf902 in fuse_write_resume (state=0x7f1f98ffabe0) at fuse-bridge.c:2485
#33 0x00007f1fb78c54d5 in fuse_resolve_done (state=<optimized out>) at fuse-resolve.c:629
#34 fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:653
#35 0x00007f1fb78c5220 in fuse_resolve (state=0x7f1f98ffabe0) at fuse-resolve.c:620
#36 0x00007f1fb78c551e in fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:650
#37 0x00007f1fb78c4813 in fuse_resolve_continue (state=state@entry=0x7f1f98ffabe0) at fuse-resolve.c:668
#38 0x00007f1fb78c51a1 in fuse_resolve_fd (state=0x7f1f98ffabe0) at fuse-resolve.c:543
#39 fuse_resolve (state=0x7f1f98ffabe0) at fuse-resolve.c:611
#40 0x00007f1fb78c54fe in fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:644
#41 0x00007f1fb78c5540 in fuse_resolve_and_resume (state=0x7f1f98ffabe0, fn=0x7f1fb78cf600 <fuse_write_resume>) at fuse-resolve.c:680
#42 0x00007f1fb78dc96b in fuse_thread_proc (data=0x557f2e638d20) at fuse-bridge.c:5115
#43 0x00007f1fb9566494 in start_thread (arg=0x7f1facf1f700) at pthread_create.c:333
#44 0x00007f1fb8e15acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
@Ravi, can you please take a look at this? The stack seems to indicate a problem in AFR.
Is there anything else I can do to help fix this issue? We had to implement a monitoring-and-restart solution for our glusterfs clusters, as they crash frequently, causing open files to become unavailable and dependent applications to stop working correctly.
(In reply to GCth from comment #9)
> Is there anything else I can do to help fixing this issue?
> We had to implement monitoring and restarting solution for our glusterfs
> clusters as they crash frequently, causing open files to be unavailable
> and dependent applications to stop working correctly.

Are all crashes in AFR, with the same back trace as in comment #7? What workload are you running on your 4.1 gluster volume? It would be great if you can give a consistent reproducer which we can try on our setup.
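For others in the same situation, the monitoring-and-restart workaround mentioned in comment #9 can be sketched roughly like this. This is a hypothetical helper, not the reporter's actual script; the mountpoint and the fstab-based remount are assumptions.

```shell
#!/bin/sh
# Hypothetical watchdog: when the fuse client dies, any access to the
# mountpoint fails with "Transport endpoint is not connected", which a
# plain stat detects cheaply. Run from cron or a systemd timer.
MNT=/gluster/volumes/testvol

if ! stat "$MNT" >/dev/null 2>&1; then
    logger "gluster mount $MNT is dead, remounting"
    umount -l "$MNT"     # lazy-unmount the stale fuse mount
    mount "$MNT"         # relies on an /etc/fstab entry for the volume
fi
```

This only masks the crashes; the core dumps are still the useful artifact for fixing the bug itself.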
Up until frame #17 they are the same; here's another example:

Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=xxxx --'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f8d5e877560 in __gf_free (free_ptr=0x7f8d49a25378) at mem-pool.c:330
330     mem-pool.c: No such file or directory.
[Current thread is 1 (Thread 0x7f8d521bf700 (LWP 2217))]
(gdb) bt
#0  0x00007f8d5e877560 in __gf_free (free_ptr=0x7f8d49a25378) at mem-pool.c:330
#1  0x00007f8d5e842e1e in dict_destroy (this=0x7f8d4994f708) at dict.c:701
#2  0x00007f8d5e842f25 in dict_unref (this=<optimized out>) at dict.c:753
#3  0x00007f8d584330d4 in afr_local_cleanup (local=0x7f8d49a56cc8, this=<optimized out>) at afr-common.c:2091
#4  0x00007f8d5840d584 in afr_transaction_done (frame=<optimized out>, this=<optimized out>) at afr-transaction.c:369
#5  0x00007f8d5841483a in afr_unlock (frame=frame@entry=0x7f8d4995ec08, this=this@entry=0x7f8d54019d40) at afr-lk-common.c:1085
#6  0x00007f8d5840aeca in afr_changelog_post_op_done (frame=frame@entry=0x7f8d4995ec08, this=this@entry=0x7f8d54019d40) at afr-transaction.c:778
#7  0x00007f8d5840e105 in afr_changelog_post_op_do (frame=0x7f8d4995ec08, this=0x7f8d54019d40) at afr-transaction.c:1442
#8  0x00007f8d5840edcf in afr_changelog_post_op_now (frame=0x7f8d4995ec08, this=0x7f8d54019d40) at afr-transaction.c:1512
#9  0x00007f8d5840ef4c in afr_delayed_changelog_wake_up_cbk (data=<optimized out>) at afr-transaction.c:2444
#10 0x00007f8d58410866 in afr_transaction_start (local=local@entry=0x7f8d4cd6ed18, this=this@entry=0x7f8d54019d40) at afr-transaction.c:2847
#11 0x00007f8d58410c89 in afr_transaction (frame=frame@entry=0x7f8d4e643068, this=this@entry=0x7f8d54019d40, type=type@entry=AFR_DATA_TRANSACTION) at afr-transaction.c:2918
#12 0x00007f8d583fcb70 in afr_do_writev (frame=frame@entry=0x7f8d4e245608, this=this@entry=0x7f8d54019d40) at afr-inode-write.c:477
#13 0x00007f8d583fd81d in afr_writev (frame=frame@entry=0x7f8d4e245608, this=this@entry=0x7f8d54019d40, fd=fd@entry=0x7f8d499f3758, vector=0x7f8d4e932b40, count=1, offset=1024, flags=32769, iobref=0x7f8d488cb3b0, xdata=0x0) at afr-inode-write.c:555
#14 0x00007f8d5818cbef in dht_writev (frame=frame@entry=0x7f8d4e29c598, this=<optimized out>, fd=0x7f8d499f3758, vector=vector@entry=0x7f8d521be5c0, count=count@entry=1, off=<optimized out>, flags=32769, iobref=0x7f8d488cb3b0, xdata=0x0) at dht-inode-write.c:223
#15 0x00007f8d53df0b77 in wb_fulfill_head (wb_inode=wb_inode@entry=0x7f8d49a25310, head=0x7f8d49bbcb40) at write-behind.c:1156
#16 0x00007f8d53df0dfb in wb_fulfill (wb_inode=wb_inode@entry=0x7f8d49a25310, liabilities=liabilities@entry=0x7f8d521be720) at write-behind.c:1233
#17 0x00007f8d53df21b6 in wb_process_queue (wb_inode=wb_inode@entry=0x7f8d49a25310) at write-behind.c:1784
#18 0x00007f8d53df233f in wb_fulfill_cbk (frame=frame@entry=0x7f8d49cc15a8, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=1024, op_errno=op_errno@entry=0, prebuf=prebuf@entry=0x7f8d49c7f8c0, postbuf=<optimized out>, xdata=<optimized out>) at write-behind.c:1105
#19 0x00007f8d5818b31e in dht_writev_cbk (frame=0x7f8d498dfa48, cookie=<optimized out>, this=<optimized out>, op_ret=1024, op_errno=0, prebuf=0x7f8d49c7f8c0, postbuf=0x7f8d49c7f958, xdata=0x7f8d4e65a7e8) at dht-inode-write.c:140
#20 0x00007f8d583fc2b7 in afr_writev_unwind (frame=frame@entry=0x7f8d48374db8, this=this@entry=0x7f8d54019d40) at afr-inode-write.c:234
#21 0x00007f8d583fc83e in afr_writev_wind_cbk (frame=0x7f8d4995ec08, cookie=<optimized out>, this=0x7f8d54019d40, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, postbuf=0x7f8d521be9d0, xdata=0x7f8d4e6e30d8) at afr-inode-write.c:388
#22 0x00007f8d586c4865 in client4_0_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f8d4e621578) at client-rpc-fops_v2.c:685
#23 0x00007f8d5e61c130 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f8d54085540, pollin=pollin@entry=0x7f8d4ea47850) at rpc-clnt.c:755
#24 0x00007f8d5e61c48f in rpc_clnt_notify (trans=0x7f8d54085800, mydata=0x7f8d54085570, event=<optimized out>, data=0x7f8d4ea47850) at rpc-clnt.c:923
#25 0x00007f8d5e618893 in rpc_transport_notify (this=this@entry=0x7f8d54085800, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f8d4ea47850) at rpc-transport.c:525
#26 0x00007f8d59401671 in socket_event_poll_in (notify_handled=true, this=0x7f8d54085800) at socket.c:2504
#27 socket_event_handler (fd=<optimized out>, idx=idx@entry=2, gen=4, data=data@entry=0x7f8d54085800, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>) at socket.c:2905
#28 0x00007f8d5e8ab945 in event_dispatch_epoll_handler (event=0x7f8d521bee8c, event_pool=0x56110317e0b0) at event-epoll.c:591
#29 event_dispatch_epoll_worker (data=0x7f8d5406f7e0) at event-epoll.c:668
#30 0x00007f8d5dacb494 in start_thread (arg=0x7f8d521bf700) at pthread_create.c:333
#31 0x00007f8d5d374acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

All the gluster instances look similar to the following setup:

Type: Distributed-Replicate
Volume ID: e9dd963c...
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.10.11.1:/export/data1
Brick2: 10.10.11.2:/export/data1
Brick3: 10.10.11.3:/export/data1
Brick4: 10.10.11.4:/export/data1
Options Reconfigured:
cluster.favorite-child-policy: mtime
cluster.self-heal-daemon: enable
performance.cache-size: 1GB
performance.quick-read: on
performance.stat-prefetch: on
performance.read-ahead: on
performance.readdir-ahead: on
auth.allow: 10.*.*.*
transport.address-family: inet
nfs.disable: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000

I do not have a reproducer; each gluster instance holds 2-5TB of files, mostly small ones, with lots of directories. They reach up to 10M inodes used, as reported by df -hi; brick storage is on XFS, as recommended. A crash of an individual glusterfs process happens once every several days.
One more - it's currently:

glusterfs 5.3

installed from https://download.gluster.org/pub/gluster/glusterfs/5/LATEST/Debian/stretch/amd64/apt
(In reply to GCth from comment #12)
> One more - it's currently:
>
> glusterfs 5.3
>
> installed from
> https://download.gluster.org/pub/gluster/glusterfs/5/LATEST/Debian/stretch/
> amd64/apt

We found an issue with the write-behind functionality in 5.3. Please run

gluster v set <volname> performance.write-behind off

and see if you still see crashes.
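Spelled out for the volume named earlier in this thread (substitute your own volume name; this is just the suggested mitigation above, and the option can be re-enabled later with `on`):

```shell
# Disable the write-behind translator on the affected volume; the
# change takes effect on the live volume, no remount needed:
gluster volume set vol_nextcloud performance.write-behind off

# Confirm the new value:
gluster volume get vol_nextcloud performance.write-behind
```

Expect some write-performance impact while write-behind is off; it is a diagnostic workaround, not a permanent recommendation.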
Update to 5.4 fixed this issue for me.
(In reply to GCth from comment #14)
> Update to 5.4 fixed this issue for me.

Excellent. Thanks for letting us know.