Bug 1739884 - glusterfsd process crashes with SIGSEGV
Summary: glusterfsd process crashes with SIGSEGV
Keywords:
Status: CLOSED DUPLICATE of bug 1749625
Alias: None
Product: GlusterFS
Classification: Community
Component: transport
Version: 6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact:
URL:
Whiteboard:
Depends On: 1782495
Blocks:
 
Reported: 2019-08-11 15:55 UTC by Chad Feller
Modified: 2020-01-27 11:16 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-27 11:16:42 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
gdb backtrace from gluster00 crash #1 (79.68 KB, text/plain) - 2019-08-14 21:35 UTC, Chad Feller
gdb backtrace from gluster01 crash #1 (75.73 KB, text/plain) - 2019-08-14 21:39 UTC, Chad Feller
gdb backtrace from gluster00 crash #2 (65.22 KB, text/plain) - 2019-08-14 21:41 UTC, Chad Feller
gdb backtrace from gluster01 crash #2 (56.38 KB, text/plain) - 2019-08-14 21:43 UTC, Chad Feller
lvm config (2.07 KB, text/plain) - 2019-08-14 21:46 UTC, Chad Feller
gluster volume info (1.00 KB, text/plain) - 2019-08-14 21:46 UTC, Chad Feller
gluster volume status client-list (541 bytes, text/plain) - 2019-08-14 21:47 UTC, Chad Feller
gluster volume status detail (4.20 KB, text/plain) - 2019-08-14 21:49 UTC, Chad Feller
gluster00: gdb glusterfsd core.6563 (2.02 KB, text/plain) - 2019-08-21 19:58 UTC, Chad Feller
gluster00: gdb glusterfsd core.31704 (1.93 KB, text/plain) - 2019-08-21 19:59 UTC, Chad Feller
gluster01: gdb glusterfsd core.6560 (2.04 KB, text/plain) - 2019-08-21 20:00 UTC, Chad Feller
gluster01: gdb glusterfsd core.23686 (1.86 KB, text/plain) - 2019-08-21 20:01 UTC, Chad Feller
gdb backtrace from gluster00 crash #1 (full) (45.96 KB, text/plain) - 2019-08-21 21:09 UTC, Chad Feller
gdb backtrace from gluster01 crash #1 (full) (38.85 KB, text/plain) - 2019-08-21 21:10 UTC, Chad Feller
gdb backtrace from gluster00 crash #2 (full) (29.84 KB, text/plain) - 2019-08-21 21:10 UTC, Chad Feller
gdb backtrace from gluster01 crash #2 (full) (23.46 KB, text/plain) - 2019-08-21 21:11 UTC, Chad Feller

Description Chad Feller 2019-08-11 15:55:35 UTC
Description of problem:
A glusterfsd process crashed with SIGSEGV during a normal workload.

Version-Release number of selected component (if applicable):
glusterfs-6.4-1.el7.x86_64
glusterfs-server-6.4-1.el7.x86_64

How reproducible:
Seldom, but twice in less than 24 hours. 

This Gluster setup had been running reliably for several weeks, but crashed twice for the same reason in less than 24 hours.  
(I captured the data on the first crash, but it crashed a second time before I created the bug report on the first one.)  

Steps to Reproduce:
?


Additional info:
This is a two-node cluster; both nodes are behind the same switch, connected via fiber SFPs at 10GbE.

First crash:

###########################
# journalctl -u glusterd
-- Logs begin at Thu 2019-08-08 05:48:10 PDT, end at Thu 2019-08-08 17:40:01 PDT. --
Aug 08 05:48:31 gluster00 systemd[1]: Starting GlusterFS, a clustered file-system server...
Aug 08 05:48:31 gluster00 systemd[1]: Started GlusterFS, a clustered file-system server.
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: pending frames:
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: frame : type(1) op(XATTROP)
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: frame : type(1) op(INODELK)
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: frame : type(1) op(XATTROP)
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: frame : type(1) op(FXATTROP)
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: frame : type(1) op(TRUNCATE)
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: patchset: git://git.gluster.org/glusterfs.git
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: signal received: 11
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: time of crash:
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: 2019-08-09 00:05:00
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: configuration details:
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: argp 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: backtrace 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: dlfcn 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: libpthread 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: llistxattr 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: setfsid 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: spinlock 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: epoll.h 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: xattr.h 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: st_atim.tv_nsec 1
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: package-string: glusterfs 6.4
Aug 08 17:05:00 gluster00 export-brick0-srv[6563]: ---------

###########################
# log from first node
[2019-08-09 00:04:58.959057] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-gv0-server: disconnecting connection from CTX_ID:783468e6-b1c9-461d-861a-469c4aba45a6-GRAPH_ID:0-PID:30955-HOST:gluster01-PC_NAME:gv0-client-0-RECON_NO:-0
[2019-08-09 00:04:58.959250] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-gv0-server: Shutting down connection CTX_ID:783468e6-b1c9-461d-861a-469c4aba45a6-GRAPH_ID:0-PID:30955-HOST:gluster01-PC_NAME:gv0-client-0-RECON_NO:-0
[2019-08-09 00:04:58.992241] I [addr.c:54:compare_addr_and_update] 0-/export/brick0/srv: allowed = "*", received addr = "192.168.0.21"                                                                            
[2019-08-09 00:04:58.992294] I [login.c:110:gf_auth] 0-auth/login: allowed user names: ad85e1b1-89f4-44ba-b098-b941f0b0a0bb                                                                                       
[2019-08-09 00:04:58.992304] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gv0-server: accepted client from CTX_ID:f0c57ea3-fd1d-433a-985c-e6e3dfa014f1-GRAPH_ID:0-PID:30974-HOST:gluster01-PC_NAME:gv0-client-0-RECON_NO:-0 (version: 6.4) with subvol /export/brick0/srv
[2019-08-09 00:05:00.953515] E [MSGID: 101064] [event-epoll.c:618:event_dispatch_epoll_handler] 0-epoll: generation mismatch on idx=5, gen=9316, slot->gen=9317, slot->fd=19                                      
[2019-08-09 00:05:00.970841] E [socket.c:1303:socket_event_poll_err] (-->/lib64/libglusterfs.so.0(+0x8b4d6) [0x7fb437d884d6] -->/usr/lib64/glusterfs/6.4/rpc-transport/socket.so(+0xa48a) [0x7fb42c0e848a] -->/usr/lib64/glusterfs/6.4/rpc-transport/socket.so(+0x81fc) [0x7fb42c0e61fc] ) 0-socket: invalid argument: this->private [Invalid argument]                                                                              
pending frames:
frame : type(1) op(XATTROP)
frame : type(1) op(INODELK)
frame : type(1) op(XATTROP)
frame : type(1) op(FXATTROP)
frame : type(1) op(TRUNCATE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-08-09 00:05:00
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.4
/lib64/libglusterfs.so.0(+0x26e00)[0x7fb437d23e00]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fb437d2e804]
/lib64/libc.so.6(+0x36340)[0x7fb436363340]
/usr/lib64/glusterfs/6.4/rpc-transport/socket.so(+0xa4cc)[0x7fb42c0e84cc]
/lib64/libglusterfs.so.0(+0x8b4d6)[0x7fb437d884d6]
/lib64/libpthread.so.0(+0x7dd5)[0x7fb436b63dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fb43642b02d]
---------

# log from second node (crashed ~5 minutes later)
[2019-08-09 00:09:34.882722] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-gv0-server: disconnecting connection from CTX_ID:3e3f26ce-f682-4631-a058-11fd08414c81-GRAPH_ID:0-PID:31527-HOST:gluster01-PC_NAME:gv0-client-1-RECON_NO:-0
[2019-08-09 00:09:34.882878] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-gv0-server: Shutting down connection CTX_ID:3e3f26ce-f682-4631-a058-11fd08414c81-GRAPH_ID:0-PID:31527-HOST:gluster01-PC_NAME:gv0-client-1-RECON_NO:-0
[2019-08-09 00:09:39.916899] I [addr.c:54:compare_addr_and_update] 0-/export/brick0/srv: allowed = "*", received addr = "192.168.0.21"                                                                            
[2019-08-09 00:09:39.916929] I [login.c:110:gf_auth] 0-auth/login: allowed user names: ad85e1b1-89f4-44ba-b098-b941f0b0a0bb                                                                                       
[2019-08-09 00:09:39.916946] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gv0-server: accepted client from CTX_ID:1401c14a-b2f7-421a-89a1-9acfdaffeda0-GRAPH_ID:0-PID:31660-HOST:gluster01-PC_NAME:gv0-client-1-RECON_NO:-0 (version: 6.4) with subvol /export/brick0/srv
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(OPEN)
frame : type(1) op(READ)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-08-09 00:10:34
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.4
/lib64/libglusterfs.so.0(+0x26e00)[0x7f50f2b20e00]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f50f2b2b804]
/lib64/libc.so.6(+0x36340)[0x7f50f1160340]
/usr/lib64/glusterfs/6.4/rpc-transport/socket.so(+0xa4cc)[0x7f50e6ee54cc]
/lib64/libglusterfs.so.0(+0x8b4d6)[0x7f50f2b854d6]
/lib64/libpthread.so.0(+0x7dd5)[0x7f50f1960dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f50f122802d]
---------

###########################
# backtrace in core dump on first node:
(gdb) bt
#0  0x00007fb42c0e84cc in socket_event_handler () from /usr/lib64/glusterfs/6.4/rpc-transport/socket.so
#1  0x00007fb437d884d6 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007fb436b63dd5 in start_thread (arg=0x7fb413fff700) at pthread_create.c:307
#3  0x00007fb43642b02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

###########################
# backtrace in core dump on second node:
(gdb) bt
#0  0x00007f50e6ee54cc in socket_event_handler () from /usr/lib64/glusterfs/6.4/rpc-transport/socket.so
#1  0x00007f50f2b854d6 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007f50f1960dd5 in start_thread (arg=0x7f50dd365700) at pthread_create.c:307
#3  0x00007f50f122802d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

Comment 1 Chad Feller 2019-08-11 16:09:31 UTC
Less than 24 hours later came a second, nearly identical crash, except that this time the glusterfsd process on both nodes crashed within a few seconds of each other. I had just added four new bricks and was about 8 hours into a rebalance when it happened:

###########################
# log from first node:
[2019-08-09 17:06:48.926301] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-gv0-server: disconnecting connection from CTX_ID:b36e17ac-62e5-4171-bfdc-1707aa5ec5a9-GRAPH_ID:0-PID:31299-HOST:gluster01-PC_NAME:gv0-client-2-RECON_NO:-0
[2019-08-09 17:06:48.926480] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-gv0-server: Shutting down connection CTX_ID:b36e17ac-62e5-4171-bfdc-1707aa5ec5a9-GRAPH_ID:0-PID:31299-HOST:gluster01-PC_NAME:gv0-client-2-RECON_NO:-0
[2019-08-09 17:06:48.956140] I [dict.c:541:dict_get] (-->/usr/lib64/glusterfs/6.4/xlator/features/worm.so(+0x7241) [0x7f5f80443241] -->/usr/lib64/glusterfs/6.4/xlator/features/locks.so(+0x1c219) [0x7f5f80669219] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f5f900860b4] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2019-08-09 17:06:49.038307] I [dict.c:541:dict_get] (-->/lib64/libglusterfs.so.0(default_fremovexattr+0xe7) [0x7f5f901251f7] -->/usr/lib64/glusterfs/6.4/xlator/features/locks.so(+0x1d5dd) [0x7f5f8066a5dd] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f5f900860b4] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2019-08-09 17:06:49.099221] I [dict.c:541:dict_get] (-->/usr/lib64/glusterfs/6.4/xlator/features/worm.so(+0x7241) [0x7f5f80443241] -->/usr/lib64/glusterfs/6.4/xlator/features/locks.so(+0x1c219) [0x7f5f80669219] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f5f900860b4] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
pending frames:
frame : type(1) op(INODELK)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-08-09 17:06:49
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.4
/lib64/libglusterfs.so.0(+0x26e00)[0x7f5f90091e00]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f5f9009c804]
/lib64/libc.so.6(+0x36340)[0x7f5f8e6d1340]
/usr/lib64/glusterfs/6.4/rpc-transport/socket.so(+0xa4cc)[0x7f5f844564cc]
/lib64/libglusterfs.so.0(+0x8b4d6)[0x7f5f900f64d6]
/lib64/libpthread.so.0(+0x7dd5)[0x7f5f8eed1dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f5f8e79902d]
---------

###########################
# log from second node:
[2019-08-09 17:06:48.926010] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-gv0-server: disconnecting connection from CTX_ID:b36e17ac-62e5-4171-bfdc-1707aa5ec5a9-GRAPH_ID:0-PID:31299-HOST:gluster01-PC_NAME:gv0-client-3-RECON_NO:-0
[2019-08-09 17:06:48.926231] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-gv0-server: Shutting down connection CTX_ID:b36e17ac-62e5-4171-bfdc-1707aa5ec5a9-GRAPH_ID:0-PID:31299-HOST:gluster01-PC_NAME:gv0-client-3-RECON_NO:-0
[2019-08-09 17:06:48.955802] I [dict.c:541:dict_get] (-->/usr/lib64/glusterfs/6.4/xlator/features/worm.so(+0x7241) [0x7f47b8c64241] -->/usr/lib64/glusterfs/6.4/xlator/features/locks.so(+0x1c219) [0x7f47b8e8a219] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f47c88a70b4] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2019-08-09 17:06:49.038024] I [dict.c:541:dict_get] (-->/lib64/libglusterfs.so.0(default_fremovexattr+0xe7) [0x7f47c89461f7] -->/usr/lib64/glusterfs/6.4/xlator/features/locks.so(+0x1d5dd) [0x7f47b8e8b5dd] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f47c88a70b4] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2019-08-09 17:06:49.098930] I [dict.c:541:dict_get] (-->/usr/lib64/glusterfs/6.4/xlator/features/worm.so(+0x7241) [0x7f47b8c64241] -->/usr/lib64/glusterfs/6.4/xlator/features/locks.so(+0x1c219) [0x7f47b8e8a219] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f47c88a70b4] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2019-08-09 17:06:50.042220] I [MSGID: 115072] [server-rpc-fops_v2.c:1681:server4_setattr_cbk] 0-gv0-server: 349203: SETATTR /munin/var/lib (418e15b7-70d5-4f11-86ac-c1d8aeb6aa12), client: CTX_ID:d6150421-3ac2-4dbc-a665-d619f13dbb15-GRAPH_ID:0-PID:538-HOST:munin-PC_NAME:gv0-client-3-RECON_NO:-0, error-xlator: gv0-access-control [Operation not permitted]
[2019-08-09 17:06:50.962416] I [MSGID: 115072] [server-rpc-fops_v2.c:1681:server4_setattr_cbk] 0-gv0-server: 349225: SETATTR /munin/var/lib (418e15b7-70d5-4f11-86ac-c1d8aeb6aa12), client: CTX_ID:d6150421-3ac2-4dbc-a665-d619f13dbb15-GRAPH_ID:0-PID:538-HOST:munin-PC_NAME:gv0-client-3-RECON_NO:-0, error-xlator: gv0-access-control [Operation not permitted]
[2019-08-09 17:06:52.073700] I [dict.c:541:dict_get] (-->/lib64/libglusterfs.so.0(default_fremovexattr+0xe7) [0x7f47c89461f7] -->/usr/lib64/glusterfs/6.4/xlator/features/locks.so(+0x1d5dd) [0x7f47b8e8b5dd] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f47c88a70b4] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2019-08-09 17:06:52.222678] I [dict.c:541:dict_get] (-->/usr/lib64/glusterfs/6.4/xlator/features/worm.so(+0x7241) [0x7f47b8c64241] -->/usr/lib64/glusterfs/6.4/xlator/features/locks.so(+0x1c219) [0x7f47b8e8a219] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f47c88a70b4] ) 0-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-08-09 17:06:52
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.4
/lib64/libglusterfs.so.0(+0x26e00)[0x7f47c88b2e00]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f47c88bd804]
/lib64/libc.so.6(+0x36340)[0x7f47c6ef2340]
/usr/lib64/glusterfs/6.4/rpc-transport/socket.so(+0xa4cc)[0x7f47bcc774cc]
/lib64/libglusterfs.so.0(+0x8b4d6)[0x7f47c89174d6]
/lib64/libpthread.so.0(+0x7dd5)[0x7f47c76f2dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f47c6fba02d]
---------


###########################
# backtrace in core dump on first node:
(gdb) bt
#0  0x00007f5f844564cc in socket_event_handler () from /usr/lib64/glusterfs/6.4/rpc-transport/socket.so
#1  0x00007f5f900f64d6 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007f5f8eed1dd5 in start_thread (arg=0x7f5f66ffd700) at pthread_create.c:307
#3  0x00007f5f8e79902d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 


###########################
# backtrace in core dump on second node:
(gdb) bt
#0  0x00007f47bcc774cc in socket_event_handler () from /usr/lib64/glusterfs/6.4/rpc-transport/socket.so
#1  0x00007f47c89174d6 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007f47c76f2dd5 in start_thread (arg=0x7f47b1053700) at pthread_create.c:307
#3  0x00007f47c6fba02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

Comment 2 Amar Tumballi 2019-08-13 05:37:36 UTC
Thanks for the report. If you still have the core, please post the output of '(gdb) thread apply all bt full'. That gives more details on the crash.
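For example (the binary and core-file paths are placeholders; adjust them to wherever your system writes cores):

  gdb /usr/sbin/glusterfsd /path/to/core.<pid>
  (gdb) thread apply all bt full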

Also, if possible, it would be great if you could let us know the access pattern, so that we can try to reproduce the crash.

Comment 3 Chad Feller 2019-08-14 21:31:04 UTC
Hi, I still have the core dumps, so I will include the output of your commands as attachments.  

I'm also going to attach the cluster configuration. Native clients are mounted with the 'reader-thread-count=4' option.

The only additional notes are that we had switched from copper 1GbE to fiber 10GbE about a week before the first crash. At the same time we also added 18 additional disks per node, which would eventually comprise two additional bricks per node.

I've been using custom Ansible playbooks to manage not only the system but Gluster as well. When I used Ansible (https://docs.ansible.com/ansible/latest/modules/gluster_volume_module.html) to add both additional brick pairs at the same time, it paired them incorrectly (bug?).

Before the incorrect addition, my cluster configuration was as follows:

  Brick1: gluster00:/export/brick0/srv
  Brick2: gluster01:/export/brick0/srv

After Ansible incorrectly paired the bricks, the layout was something like this, IIRC:

  Brick1: gluster00:/export/brick0/srv
  Brick2: gluster01:/export/brick0/srv
  Brick3: gluster00:/export/brick1/srv
  Brick4: gluster00:/export/brick2/srv
  Brick5: gluster01:/export/brick1/srv
  Brick6: gluster01:/export/brick2/srv

After adding the bricks, I issued a rebalance command (not realizing the incorrect pairing), but about a minute into the rebalance I noticed something was amiss. Once I realized what had happened, I issued:

  gluster volume remove-brick gv0 gluster01:/export/brick2/srv gluster01:/export/brick1/srv gluster00:/export/brick2/srv gluster00:/export/brick1/srv start

After the remove completed, I did a commit to confirm the remove-brick command.  After the commit I was back to the original configuration:

  Brick1: gluster00:/export/brick0/srv
  Brick2: gluster01:/export/brick0/srv

While the data was intact, my directory permissions and file ownership were wiped out due to a bug that may have been related to Bug #1716848. After correcting the directory permissions and ownership, the cluster ran fine for several hours, and I had planned to reattempt the brick add (via Ansible), but one brick pair at a time so I didn't end up with mismatched brick pairs again. At the end of the day, however, before I was able to re-add the brick pair, Gluster crashed with the first core dump. It was still in the two-brick setup, as I had not yet re-attempted the brick add.
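(For reference, re-adding a single replica pair by hand rather than through Ansible would be something along these lines -- a sketch only, with paths matching the layout above:

  gluster volume add-brick gv0 gluster00:/export/brick1/srv gluster01:/export/brick1/srv

On a replica-2 volume, add-brick pairs consecutive bricks into replica sets, which is exactly the ordering the Ansible run got wrong.)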

(Note: I reformatted the bricks before attempting to re-use them.)

I rebooted the cluster, and upon coming back up, the self-heal daemon resynced everything. After examining the volume, I was happy with everything, so I went ahead and added a brick pair via Ansible. It worked, and everything was paired correctly. I then added the next pair to Ansible and ran the playbook again. Again, everything paired correctly. At this point I had the correct brick setup:

  Brick1: gluster00:/export/brick0/srv
  Brick2: gluster01:/export/brick0/srv
  Brick3: gluster00:/export/brick1/srv
  Brick4: gluster01:/export/brick1/srv
  Brick5: gluster00:/export/brick2/srv
  Brick6: gluster01:/export/brick2/srv

From here I issued a rebalance command and watched. Everything was working fine for about 10 hours, which is when the second crash happened. That is, the second crash happened in the middle of a rebalance. After everything came back up, the self-heal daemon did its thing. I examined the volume, saw no issues, and went ahead and started the rebalance again. This time the rebalance ran to completion (it took between one and two days). I've had zero crashes since then.
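For completeness, the standard rebalance commands (presumably what was run above):

  gluster volume rebalance gv0 start    # kick off the rebalance
  gluster volume rebalance gv0 status   # watch its progress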

I'm not sure whether any particular access pattern caused this, but the timing around the administrative work is interesting, which is why I covered it in such detail above.

I should also note that I'm using both the munin-gluster (https://github.com/burner1024/munin-gluster) and gluster-prometheus (https://github.com/gluster/gluster-prometheus) plugins on the nodes for monitoring (although Munin is legacy at this point and will go away once Prometheus is fully built out).

Comment 4 Chad Feller 2019-08-14 21:35:44 UTC
Created attachment 1603901 [details]
gdb backtrace from gluster00 crash #1

gluster00 crash #1 core dump

Comment 5 Chad Feller 2019-08-14 21:37:52 UTC
Comment on attachment 1603901 [details]
gdb backtrace from gluster00 crash #1

Output of '(gdb) thread apply all bt full' from crash #1 on gluster00

Comment 6 Chad Feller 2019-08-14 21:39:47 UTC
Created attachment 1603902 [details]
gdb backtrace from gluster01 crash #1

Output of '(gdb) thread apply all bt full' from crash #1 on gluster01

Comment 7 Chad Feller 2019-08-14 21:41:51 UTC
Created attachment 1603903 [details]
gdb backtrace from gluster00 crash #2

Output of '(gdb) thread apply all bt full' from crash #2 on gluster00

Comment 8 Chad Feller 2019-08-14 21:43:22 UTC
Created attachment 1603904 [details]
gdb backtrace from gluster01 crash #2

Output of '(gdb) thread apply all bt full' from crash #2 on gluster01

Comment 9 Chad Feller 2019-08-14 21:46:09 UTC
Created attachment 1603905 [details]
lvm config

LVM configuration

Comment 10 Chad Feller 2019-08-14 21:46:42 UTC
Created attachment 1603906 [details]
gluster volume info

output of 'gluster volume info'

Comment 11 Chad Feller 2019-08-14 21:47:36 UTC
Created attachment 1603907 [details]
gluster volume status client-list

output of 'gluster volume status gv0 client-list'

Comment 12 Chad Feller 2019-08-14 21:49:29 UTC
Created attachment 1603909 [details]
gluster volume status detail

output of 'gluster volume status detail'

Comment 13 Susant Kumar Palai 2019-08-21 09:42:22 UTC
From the current report, it is not clear which thread crashed and for what reason.

Please do the following.

# Install the glusterfs-debuginfo package of the same version as your glusterfs server.

# Type the following command into a terminal:
   "gdb glusterfsd $core-file", and copy the text from the terminal, which should include the crashed thread's bt. It looks something like this:

  
Core was generated by `/usr/local/sbin/glusterfs --process-name fuse --volfile-server=vm3 --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f12be8c5d9b in dht_create (frame=0x7f12a4001e48, this=0x0, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, params=0x7f12ac0020c8) at dht-common.c:8677
8677	    conf = this->private;
[Current thread is 1 (Thread 0x7f12ce3ff700 (LWP 30689))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.26-28.fc27.x86_64 openssl-libs-1.1.0h-3.fc27.x86_64 pcre2-10.31-7.fc27.x86_64 sssd-client-1.16.2-4.fc27.x86_64
(gdb) bt
#0  0x00007f12be8c5d9b in dht_create (frame=0x7f12a4001e48, this=0x0, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, params=0x7f12ac0020c8) at dht-common.c:8677
#1  0x00007f12be633d8a in gf_utime_create (frame=0x7f12a4006cb8, this=0x7f12b0010b70, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at utime-autogen-fops.c:172
#2  0x00007f12ce11b86c in default_create (frame=0x7f12a4006cb8, this=0x7f12b0012890, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at defaults.c:2601
#3  0x00007f12be201846 in ra_create (frame=0x7f12a4001608, this=0x7f12b0014540, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at read-ahead.c:194
#4  0x00007f12ce11b86c in default_create (frame=0x7f12a4001608, this=0x7f12b0016120, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at defaults.c:2601
#5  0x00007f12bddd896f in ioc_create (frame=0x7f12a4008ff8, this=0x7f12b0018240, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at io-cache.c:919
#6  0x00007f12ce11b86c in default_create (frame=0x7f12a4008ff8, this=0x7f12b0019e30, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at defaults.c:2601
#7  0x00007f12ce11b86c in default_create (frame=0x7f12a4008ff8, this=0x7f12b001ba30, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at defaults.c:2601
#8  0x00007f12bd798698 in mdc_create (frame=0x7f12a4000cb8, this=0x7f12b001d610, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at md-cache.c:1969
#9  0x00007f12ce1102ed in default_create_resume (frame=0x7f12b8008328, this=0x7f12b001f210, loc=0x7f12b8009130, flags=34881, mode=33188, umask=0, fd=0x7f12b8001948, xdata=0x7f12ac0020c8) at defaults.c:1873
#10 0x00007f12ce05fbf8 in call_resume_wind (stub=0x7f12b80090e8) at call-stub.c:2033
#11 0x00007f12ce071bee in call_resume (stub=0x7f12b80090e8) at call-stub.c:2555
#12 0x00007f12bd5809c6 in iot_worker (data=0x7f12b002efc0) at io-threads.c:232
#13 0x00007f12ccbf550b in start_thread () from /lib64/libpthread.so.0
#14 0x00007f12cc4a516f in clone () from /lib64/libc.so.6


Then capture another report with "thread apply all bt".

Once we check that info, we will figure out the next steps.

Regards,
Susant

Comment 14 Chad Feller 2019-08-21 15:50:43 UTC
I had run

  debuginfo-install glusterfs-server

But it looks like it only installed the debuginfo packages for the glusterfs dependencies, not for glusterfs itself. Is this a Gluster packaging bug?

I'll rerun the backtrace with the glusterfs-debuginfo package installed.

Comment 15 Chad Feller 2019-08-21 19:55:45 UTC
I wasn't able to pull the glusterfs-debuginfo package via yum, even with --enablerepo=centos-storage-debuginfo. The baseurl in the CentOS-Storage-common.repo file looks correct, so I'm not sure if there is a problem with the repodata or something else... Anyway, I manually downloaded the glusterfs-debuginfo package, installed it, and am attaching the output of:

gdb glusterfsd $core-file

as well as the

thread apply all bt

command.
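Roughly, the steps were (the debuginfo RPM filename here is from memory and may not be exact):

  yum install --enablerepo=centos-storage-debuginfo glusterfs-debuginfo   # could not find the package
  rpm -ivh glusterfs-debuginfo-6.4-1.el7.x86_64.rpm                       # manually downloaded copy
  gdb glusterfsd core.<pid>
  (gdb) thread apply all bt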

Comment 16 Chad Feller 2019-08-21 19:58:00 UTC
Created attachment 1606731 [details]
gluster00: gdb glusterfsd core.6563

Comment 17 Chad Feller 2019-08-21 19:59:33 UTC
Created attachment 1606732 [details]
gluster00: gdb glusterfsd core.31704

Comment 18 Chad Feller 2019-08-21 20:00:19 UTC
Created attachment 1606733 [details]
gluster01: gdb glusterfsd core.6560

Comment 19 Chad Feller 2019-08-21 20:01:06 UTC
Created attachment 1606734 [details]
gluster01: gdb glusterfsd core.23686

Comment 20 Chad Feller 2019-08-21 21:09:22 UTC
Created attachment 1606755 [details]
gdb backtrace from gluster00 crash #1 (full)

Comment 21 Chad Feller 2019-08-21 21:10:13 UTC
Created attachment 1606757 [details]
gdb backtrace from gluster01 crash #1 (full)

Comment 22 Chad Feller 2019-08-21 21:10:50 UTC
Created attachment 1606759 [details]
gdb backtrace from gluster00 crash #2 (full)

Comment 23 Chad Feller 2019-08-21 21:11:25 UTC
Created attachment 1606760 [details]
gdb backtrace from gluster01 crash #2 (full)

Comment 24 Susant Kumar Palai 2019-08-22 08:34:09 UTC
Thanks for the report.

Here is the crashed thread:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfsd -s gluster00 --volfile-id gv0.gluster00.export-brick1-srv'.
Program terminated with signal 11, Segmentation fault.
#0  socket_is_connected (this=0x7f5f500af3c0) at socket.c:2619
2619	    if (priv->use_ssl) {
(gdb) quit


From this it looks like "priv" is NULL. Can you print the value of priv for clarity? (Command to type after opening the core with gdb: "p priv".)
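To make the failure concrete, the function at the crash site presumably looks something like the sketch below. This is reconstructed from the backtrace and gdb output, not the exact upstream source or the eventual fix (see comment 31); the guard shown is purely illustrative:

  /* sketch of socket_is_connected() in rpc/rpc-transport/socket/src/socket.c */
  static int
  socket_is_connected(rpc_transport_t *this)
  {
      socket_private_t *priv = this->private;

      /* Illustrative guard: priv is NULL in the cores, so the dereference
       * on the line below (socket.c:2619) is what raised SIGSEGV. */
      if (!priv)
          return 0;  /* torn-down transport: report it as not connected */

      if (priv->use_ssl) {
          /* ... SSL-specific connectivity check elided ... */
      }

      return priv->connected == 1;
  }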


Assigning this to Mohit to take it further.

Susant

Comment 25 Chad Feller 2019-08-22 17:10:47 UTC
Hi,

I checked all four core dumps, and they each have the same value for priv:

  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib64/libthread_db.so.1".
  Core was generated by `/usr/sbin/glusterfsd -s gluster00 --volfile-id gv0.gluster00.export-brick0-srv'.
  Program terminated with signal 11, Segmentation fault.
  #0  socket_is_connected (this=0x7fb40c0ef460) at socket.c:2619
  2619	      if (priv->use_ssl) {
  (gdb) p priv
  $1 = (socket_private_t *) 0x0
  (gdb)

Comment 26 David Sainty 2019-09-15 22:49:45 UTC
Hi, we can also reproduce this crash, caused by a security scanner probing the brick.

#0  socket_is_connected (this=0x7f2d5000a490) at socket.c:2619
2619        if (priv->use_ssl) {
 
(gdb) print ((rpc_transport_t *)data)->private
$4 = (void *) 0x0

Comment 27 Mohit Agrawal 2019-09-16 02:56:28 UTC
Hi,

Can you please share a dump of the /var/log/glusterfs directory from the node
 on which glusterfsd crashed?

Thanks,
Mohit Agrawal

Comment 28 David Sainty 2019-09-16 03:18:15 UTC
(In reply to Mohit Agrawal from comment #27)
> Hi,
> 
>  Can you please share a dump of the /var/log/glusterfs directory from the
>  node on which glusterfsd crashed?

I'll first try sharing snippets, just so I can eyeball them for details that shouldn't be made public. This is the brick crash, showing error messages that indicate the service was receiving security-probe-related garbage.

[2019-09-11 02:39:52.615174] E [socket.c:2252:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1929576541) received from 10.234.96.36:7933
[2019-09-11 02:39:52.618865] E [socket.c:2252:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1996685661) received from 10.234.96.36:7936
[2019-09-11 02:39:52.623556] E [socket.c:2252:__socket_read_frag] 0-rpc: wrong MSG-TYPE (8192) received from 10.234.96.36:7939
[2019-09-11 02:39:52.627840] E [socket.c:2252:__socket_read_frag] 0-rpc: wrong MSG-TYPE (8192) received from 10.234.96.36:7942
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-09-11 02:39:52
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.4
/lib64/libglusterfs.so.0(+0x26e00)[0x7f2d6c6bde00]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f2d6c6c8804]
/lib64/libc.so.6(+0x36340)[0x7f2d6acfd340]
/usr/lib64/glusterfs/6.4/rpc-transport/socket.so(+0xa4cc)[0x7f2d60a824cc]
/lib64/libglusterfs.so.0(+0x8b4d6)[0x7f2d6c7224d6]
/lib64/libpthread.so.0(+0x7dd5)[0x7f2d6b4fddd5]
/lib64/libc.so.6(clone+0x6d)[0x7f2d6adc502d]
---------
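Given the wrong MSG-TYPE messages above, it appears that simply writing non-RPC bytes to a brick port exercises the same code path. An untested sketch (host and port are examples only; brick ports are normally allocated from 49152 upward):

  head -c 512 /dev/urandom | nc <brick-host> 49152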

Comment 29 Mohit Agrawal 2019-11-29 13:06:07 UTC
Hi,

 Is it possible for you to backport the patch (https://review.gluster.org/#/c/glusterfs/+/23775/) on top of your current version and share the result?



Thanks,
Mohit Agrawal

Comment 30 chad@irisonboard.com 2019-12-02 19:04:29 UTC
Hi,

I'm not willing to try a patch on a production system. Since I initially created this report, we've retired our old NAS systems and are running 100% off of Gluster. Furthermore, I've upgraded the current production cluster to 6.6 to get a fix for another bug I was working around.

I could apply the patch to a test VM, but the configuration, although similar, would not be the same as what I'm using in production. Also, I use the pre-built binary RPMs, so I'm not sure if you'd want me to build glusterfs from source if I went that route. If I did build from source, I'd need to match the build flags and everything else that comes with the CentOS RPMs, no?

Comment 31 Xavi Hernandez 2019-12-11 18:12:34 UTC
We have identified the problem. There's a patch [1] to fix it. It will be available in the next release.

[1] https://review.gluster.org/c/glusterfs/+/23861

Comment 32 Xavi Hernandez 2020-01-27 11:16:42 UTC
The patch is already present in v6.7.

*** This bug has been marked as a duplicate of bug 1749625 ***

