Description of problem:
glusterfsd crashes in socket.so

Version-Release number of selected component (if applicable):
5.1

How reproducible:
Run the volume and wait for a crash on one of the nodes.

Actual results:
Without a clear cause, the transport endpoint disappears. A core file is written. glusterd is still running, but "gluster volume status" shows no running daemon on the node. The volume remains usable.

Expected results:
No crashes and no need to manually restart glusterfsd after a crash.

Additional info:
This is a data set on a two-node cluster that is in the process of being transferred to glusterfs. We started with a single node and added the new one recently. A third will be added once we can declare this gluster cluster stable.

gdb core file analysis (frames are unresolved; see the debuginfo note after the volume info below):

Core was generated by `/usr/sbin/glusterfsd -s 10.10.0.177 --volfile-id jf-vol0.10.10.0.177.local.mnt-'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f31692ce62b in ?? () from /usr/lib64/glusterfs/5.1/rpc-transport/socket.so
(gdb) bt
#0  0x00007f31692ce62b in ?? () from /usr/lib64/glusterfs/5.1/rpc-transport/socket.so
#1  0x00007f316e21aaeb in ?? () from /usr/lib64/libglusterfs.so.0
#2  0x00007f316d00b504 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f316c8f319f in clone () from /lib64/libc.so.6

Actual command line options were:
-s 10.10.0.177 --volfile-id jf-vol0.10.10.0.177.local.mnt-glfs-brick -p /var/run/gluster/vols/jf-vol0/10.10.0.177-local.mnt-glfs-brick.pid -S /var/run/gluster/ccdac309d72f1df7.socket --brick-name /local.mnt/glfs/brick -l /var/log/glusterfs/bricks/local.mnt-glfs-brick.log --xlator-option *-posix.glusterd-uuid=ab5f12ae-c203-4299-b5eb-9a7df6abfc1b --process-name brick --brick-port 49152 --xlator-option jf-vol0-server.listen-port=49152

glusterd.log:
[2018-11-28 23:40:01.859118] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-28 23:40:01.859219] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-28 23:50:01.593857] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-28 23:50:01.593949] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-29 00:00:01.159538] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2018-11-29 00:00:09.723224] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49152
[2018-11-29 00:00:09.748419] I [MSGID: 106005] [glusterd-handler.c:6194:__glusterd_brick_rpc_notify] 0-management: Brick 10.10.0.177:/local.mnt/glfs/brick has disconnected from glusterd.
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 36 times between [2018-11-29 00:00:01.159538] and [2018-11-29 00:00:28.759673] [2018-11-29 00:00:29.281398] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 339 times between [2018-11-29 00:00:29.281398] and [2018-11-29 00:02:28.804429] [2018-11-29 00:02:29.293664] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 339 times between [2018-11-29 00:02:29.293664] and [2018-11-29 00:04:28.849724] [2018-11-29 00:04:29.306508] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 339 times between [2018-11-29 00:04:29.306508] and [2018-11-29 00:06:28.893840] volume info: Volume Name: jf-vol0 Type: Replicate Volume ID: d6c72c52-24c5-4302-81ed-257507c27c1a Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: 10.10.0.177:/local.mnt/glfs/brick Brick2: 10.10.0.208:/local.mnt/glfs/brick Options Reconfigured: client.event-threads: 3 server.event-threads: 3 cluster.self-heal-daemon: enable diagnostics.client-log-level: WARNING diagnostics.brick-log-level: CRITICAL diagnostics.brick-sys-log-level: CRITICAL disperse.shd-wait-qlength: 2048 cluster.shd-max-threads: 4 performance.cache-size: 4GB performance.cache-max-file-size: 4MB performance.client-io-threads: off nfs.disable: on transport.address-family: inet features.cache-invalidation: on features.cache-invalidation-timeout: 60 performance.stat-prefetch: on performance.cache-invalidation: on performance.md-cache-timeout: 60 network.inode-lru-limit: 50000 cluster.lookup-optimize: on cluster.readdir-optimize: on cluster.force-migration: off
This seems to be a glusterfsd (brick) crash?
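If it is indeed the brick process (glusterfsd) that died while glusterd stayed up, the brick can usually be respawned without restarting anything else. A hedged example using the volume name from the report above:

# gluster volume start jf-vol0 force    (restarts any bricks that are not running)
# gluster volume status jf-vol0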
The crashes might be related to this possible memory leak: https://bugzilla.redhat.com/show_bug.cgi?id=1657202, although these look like two separate processes (brick and client?).
I'm also getting a somewhat similar error on gluster 5.0, with multiple crashes on different clients. Sometimes it takes a couple of days to crash, or it can happen within hours. The mount error message is "transport endpoint not connected" and it is fixed by unmounting and mounting again (see the remount sketch below). Here is the information from one of the clients, with the volume mounted using glusterfuse.

gluster setup:

Volume Name: tank
Type: Distribute
Volume ID: 9582685f-07fa-41fd-b9fc-ebab3a6989cf
Status: Started
Snapshot Count: 0
Number of Bricks: 8
Transport-type: tcp
Bricks:
Brick1: node-01:/tank/volume1/brick
Brick2: node-02:/tank/volume1/brick
Brick3: node-03:/tank/volume1/brick
Brick4: node-04:/tank/volume1/brick
Brick5: node-01:/tank/volume2/brick
Brick6: node-02:/tank/volume2/brick
Brick7: node-03:/tank/volume2/brick
Brick8: node-04:/tank/volume2/brick

installed packages:

glusterfs.x86_64                 5.0-1.el7   @centos-gluster5
glusterfs-api.x86_64             5.0-1.el7   @centos-gluster5
glusterfs-cli.x86_64             5.0-1.el7   @centos-gluster5
glusterfs-client-xlators.x86_64  5.0-1.el7   @centos-gluster5
glusterfs-fuse.x86_64            5.0-1.el7   @centos-gluster5
glusterfs-libs.x86_64            5.0-1.el7   @centos-gluster5
glusterfs-server.x86_64          5.0-1.el7   @centos-gluster5

gdb core file:

#0  0x00007ff2c18f0cd9 in wb_fulfill_cbk () from /usr/lib64/glusterfs/5.0/xlator/performance/write-behind.so
Missing separate debuginfos, use: debuginfo-install glusterfs-server-5.0-1.el7.x86_64
(gdb) bt
#0  0x00007ff2c18f0cd9 in wb_fulfill_cbk () from /usr/lib64/glusterfs/5.0/xlator/performance/write-behind.so
#1  0x00007ff2c1b725f9 in dht_writev_cbk () from /usr/lib64/glusterfs/5.0/xlator/cluster/distribute.so
#2  0x00007ff2c1e142e5 in client4_0_writev_cbk () from /usr/lib64/glusterfs/5.0/xlator/protocol/client.so
#3  0x00007ff2cf71cc70 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#4  0x00007ff2cf71d043 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#5  0x00007ff2cf718f23 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#6  0x00007ff2c430737b in socket_event_handler () from /usr/lib64/glusterfs/5.0/rpc-transport/socket.so
#7  0x00007ff2cf9b45a9 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#8  0x00007ff2ce7b3e25 in start_thread (arg=0x7ff2ab7fe700) at pthread_create.c:308
#9  0x00007ff2ce07cbad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

gluster log:

[2018-12-13 10:08:15.916548] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 1597 times between [2018-12-13 10:08:15.916548] and [2018-12-13 10:08:30.786295]
[2018-12-13 10:17:56.635788] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 2572 times between [2018-12-13 10:17:56.635788] and [2018-12-13 10:18:04.789341]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-12-13 10:18:09
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.0
/lib64/libglusterfs.so.0(+0x26570)[0x7ff2cf950570]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7ff2cf95aae4]
/lib64/libc.so.6(+0x362f0)[0x7ff2cdfb42f0]
/usr/lib64/glusterfs/5.0/xlator/performance/write-behind.so(+0x9cd9)[0x7ff2c18f0cd9]
/usr/lib64/glusterfs/5.0/xlator/cluster/distribute.so(+0x745f9)[0x7ff2c1b725f9]
/usr/lib64/glusterfs/5.0/xlator/protocol/client.so(+0x5e2e5)[0x7ff2c1e142e5]
/lib64/libgfrpc.so.0(+0xec70)[0x7ff2cf71cc70]
/lib64/libgfrpc.so.0(+0xf043)[0x7ff2cf71d043]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7ff2cf718f23]
/usr/lib64/glusterfs/5.0/rpc-transport/socket.so(+0xa37b)[0x7ff2c430737b]
/lib64/libglusterfs.so.0(+0x8a5a9)[0x7ff2cf9b45a9]
/lib64/libpthread.so.0(+0x7e25)[0x7ff2ce7b3e25]
/lib64/libc.so.6(clone+0x6d)[0x7ff2ce07cbad]
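For reference, the "unmount and mount again" workaround mentioned at the top of this comment would look roughly like the following; the mount point /mnt/tank is an assumption, and a lazy unmount (-l) is often needed while the mount is stuck on "transport endpoint is not connected":

# umount -l /mnt/tank                           (mount point is an assumption)
# mount -t glusterfs node-01:/tank /mnt/tank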
Another crash, this time running 5.2. The produced core file shows no valid pointers:

Core was generated by `/usr/sbin/glusterfsd -s 10.10.0.177 --volfile-id jf-vol0.10.10.0.177.local.mnt-'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fbbab97b17c in ?? ()
(gdb) bt
#0  0x00007fbbab97b17c in ?? ()
#1  0x00007fbbab981492 in ?? ()
#2  0x00000000ffffffff in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()
Fixed with https://review.gluster.org/#/q/I911b0e0b2060f7f41ded0b05db11af6f9b7c09c5 (in glusterfs-5.4 and beyond, and glusterfs-6.1 and beyond).
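For anyone still seeing this, a quick way to check whether a node is already on a release that carries the fix (package query shown for the RPM-based installs from the earlier comments):

# gluster --version
# rpm -q glusterfs-server glusterfs-fuse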