Description of problem:
glusterfsd crashes in socket.so

Version-Release number of selected component (if applicable):
5.1

How reproducible:
Run the volume and wait for a crash on one of the nodes.

Actual results:
Without a clear cause, the transport endpoint disappears. A core file is written. glusterd is still running, but "gluster volume status" shows no running daemon on the node. The volume remains usable.

Expected results:
No crashes and no need to manually restart glusterfsd after a crash.

Additional info:
This is a data set on a two-node cluster that is in the process of being transferred to glusterfs. We started with a single node and added the new one recently. A third will be added once we can declare this gluster cluster stable.

gdb core file analysis (frames are unresolved; see the debuginfo note after the volume info below):

Core was generated by `/usr/sbin/glusterfsd -s 10.10.0.177 --volfile-id jf-vol0.10.10.0.177.local.mnt-'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f31692ce62b in ?? () from /usr/lib64/glusterfs/5.1/rpc-transport/socket.so
(gdb) bt
#0  0x00007f31692ce62b in ?? () from /usr/lib64/glusterfs/5.1/rpc-transport/socket.so
#1  0x00007f316e21aaeb in ?? () from /usr/lib64/libglusterfs.so.0
#2  0x00007f316d00b504 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f316c8f319f in clone () from /lib64/libc.so.6

Actual command line options were:
-s 10.10.0.177 --volfile-id jf-vol0.10.10.0.177.local.mnt-glfs-brick -p /var/run/gluster/vols/jf-vol0/10.10.0.177-local.mnt-glfs-brick.pid -S /var/run/gluster/ccdac309d72f1df7.socket --brick-name /local.mnt/glfs/brick -l /var/log/glusterfs/bricks/local.mnt-glfs-brick.log --xlator-option *-posix.glusterd-uuid=ab5f12ae-c203-4299-b5eb-9a7df6abfc1b --process-name brick --brick-port 49152 --xlator-option jf-vol0-server.listen-port=49152

glusterd.log:
[2018-11-28 23:40:01.859118] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-28 23:40:01.859219] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-28 23:50:01.593857] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-28 23:50:01.593949] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-29 00:00:01.159538] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-29 00:00:09.723224] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49152
[2018-11-29 00:00:09.748419] I [MSGID: 106005] [glusterd-handler.c:6194:__glusterd_brick_rpc_notify] 0-management: Brick 10.10.0.177:/local.mnt/glfs/brick has disconnected from glusterd.
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 36 times between [2018-11-29 00:00:01.159538] and [2018-11-29 00:00:28.759673] [2018-11-29 00:00:29.281398] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 339 times between [2018-11-29 00:00:29.281398] and [2018-11-29 00:02:28.804429] [2018-11-29 00:02:29.293664] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 339 times between [2018-11-29 00:02:29.293664] and [2018-11-29 00:04:28.849724] [2018-11-29 00:04:29.306508] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 339 times between [2018-11-29 00:04:29.306508] and [2018-11-29 00:06:28.893840] volume info: Volume Name: jf-vol0 Type: Replicate Volume ID: d6c72c52-24c5-4302-81ed-257507c27c1a Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: 10.10.0.177:/local.mnt/glfs/brick Brick2: 10.10.0.208:/local.mnt/glfs/brick Options Reconfigured: client.event-threads: 3 server.event-threads: 3 cluster.self-heal-daemon: enable diagnostics.client-log-level: WARNING diagnostics.brick-log-level: CRITICAL diagnostics.brick-sys-log-level: CRITICAL disperse.shd-wait-qlength: 2048 cluster.shd-max-threads: 4 performance.cache-size: 4GB performance.cache-max-file-size: 4MB performance.client-io-threads: off nfs.disable: on transport.address-family: inet features.cache-invalidation: on features.cache-invalidation-timeout: 60 performance.stat-prefetch: on performance.cache-invalidation: on performance.md-cache-timeout: 60 network.inode-lru-limit: 50000 cluster.lookup-optimize: on cluster.readdir-optimize: on cluster.force-migration: off
This seems to be a glusterfsd (brick) crash?
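If it is indeed the brick process (glusterfsd) that died while glusterd stayed up, the brick can usually be respawned without restarting anything else. A hedged example using the volume name from the report above:

# gluster volume start jf-vol0 force    (restarts any bricks that are not running)
# gluster volume status jf-vol0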
The crashes might be related to this possible memory leak: https://bugzilla.redhat.com/show_bug.cgi?id=1657202, although these look like two separate processes (brick and client?).
I'm also getting a somewhat similar error on gluster 5.0, with multiple crashes on different clients. Sometimes it takes a couple of days to crash, or it can happen within hours. The mount error message is "transport endpoint not connected" and it is fixed by unmounting and mounting again (see the remount sketch below). Here is the information from one of the clients, with the volume mounted using glusterfuse.

gluster setup:

Volume Name: tank
Type: Distribute
Volume ID: 9582685f-07fa-41fd-b9fc-ebab3a6989cf
Status: Started
Snapshot Count: 0
Number of Bricks: 8
Transport-type: tcp
Bricks:
Brick1: node-01:/tank/volume1/brick
Brick2: node-02:/tank/volume1/brick
Brick3: node-03:/tank/volume1/brick
Brick4: node-04:/tank/volume1/brick
Brick5: node-01:/tank/volume2/brick
Brick6: node-02:/tank/volume2/brick
Brick7: node-03:/tank/volume2/brick
Brick8: node-04:/tank/volume2/brick

installed packages:

glusterfs.x86_64                 5.0-1.el7   @centos-gluster5
glusterfs-api.x86_64             5.0-1.el7   @centos-gluster5
glusterfs-cli.x86_64             5.0-1.el7   @centos-gluster5
glusterfs-client-xlators.x86_64  5.0-1.el7   @centos-gluster5
glusterfs-fuse.x86_64            5.0-1.el7   @centos-gluster5
glusterfs-libs.x86_64            5.0-1.el7   @centos-gluster5
glusterfs-server.x86_64          5.0-1.el7   @centos-gluster5

gdb core file:

#0  0x00007ff2c18f0cd9 in wb_fulfill_cbk () from /usr/lib64/glusterfs/5.0/xlator/performance/write-behind.so
Missing separate debuginfos, use: debuginfo-install glusterfs-server-5.0-1.el7.x86_64
(gdb) bt
#0  0x00007ff2c18f0cd9 in wb_fulfill_cbk () from /usr/lib64/glusterfs/5.0/xlator/performance/write-behind.so
#1  0x00007ff2c1b725f9 in dht_writev_cbk () from /usr/lib64/glusterfs/5.0/xlator/cluster/distribute.so
#2  0x00007ff2c1e142e5 in client4_0_writev_cbk () from /usr/lib64/glusterfs/5.0/xlator/protocol/client.so
#3  0x00007ff2cf71cc70 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#4  0x00007ff2cf71d043 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#5  0x00007ff2cf718f23 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#6  0x00007ff2c430737b in socket_event_handler () from /usr/lib64/glusterfs/5.0/rpc-transport/socket.so
#7  0x00007ff2cf9b45a9 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#8  0x00007ff2ce7b3e25 in start_thread (arg=0x7ff2ab7fe700) at pthread_create.c:308
#9  0x00007ff2ce07cbad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

gluster log:

[2018-12-13 10:08:15.916548] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 1597 times between [2018-12-13 10:08:15.916548] and [2018-12-13 10:08:30.786295]
[2018-12-13 10:17:56.635788] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 2572 times between [2018-12-13 10:17:56.635788] and [2018-12-13 10:18:04.789341]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-12-13 10:18:09
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.0
/lib64/libglusterfs.so.0(+0x26570)[0x7ff2cf950570]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7ff2cf95aae4]
/lib64/libc.so.6(+0x362f0)[0x7ff2cdfb42f0]
/usr/lib64/glusterfs/5.0/xlator/performance/write-behind.so(+0x9cd9)[0x7ff2c18f0cd9]
/usr/lib64/glusterfs/5.0/xlator/cluster/distribute.so(+0x745f9)[0x7ff2c1b725f9]
/usr/lib64/glusterfs/5.0/xlator/protocol/client.so(+0x5e2e5)[0x7ff2c1e142e5]
/lib64/libgfrpc.so.0(+0xec70)[0x7ff2cf71cc70]
/lib64/libgfrpc.so.0(+0xf043)[0x7ff2cf71d043]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7ff2cf718f23]
/usr/lib64/glusterfs/5.0/rpc-transport/socket.so(+0xa37b)[0x7ff2c430737b]
/lib64/libglusterfs.so.0(+0x8a5a9)[0x7ff2cf9b45a9]
/lib64/libpthread.so.0(+0x7e25)[0x7ff2ce7b3e25]
/lib64/libc.so.6(clone+0x6d)[0x7ff2ce07cbad]
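For reference, the "unmount and mount again" workaround mentioned at the top of this comment would look roughly like the following; the mount point /mnt/tank is an assumption, and a lazy unmount (-l) is often needed while the mount is stuck on "transport endpoint is not connected":

# umount -l /mnt/tank                           (mount point is an assumption)
# mount -t glusterfs node-01:/tank /mnt/tank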
Another crash, this time running 5.2. The produced core file shows no valid pointers:

Core was generated by `/usr/sbin/glusterfsd -s 10.10.0.177 --volfile-id jf-vol0.10.10.0.177.local.mnt-'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fbbab97b17c in ?? ()
(gdb) bt
#0  0x00007fbbab97b17c in ?? ()
#1  0x00007fbbab981492 in ?? ()
#2  0x00000000ffffffff in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()
Fixed with https://review.gluster.org/#/q/I911b0e0b2060f7f41ded0b05db11af6f9b7c09c5 (in glusterfs-5.4 and beyond, and glusterfs-6.1 and beyond).
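For anyone still seeing this, a quick way to check whether a node is already on a release that carries the fix (package query shown for the RPM-based installs from the earlier comments):

# gluster --version
# rpm -q glusterfs-server glusterfs-fuse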