Bug 1662557

Summary: glusterfs process crashes, causing "Transport endpoint is not connected" errors on the mount.
Product: [Community] GlusterFS
Component: replicate
Version: 5
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Hardware: x86_64
OS: Linux
Reporter: Rob de Wit <rob.dewit>
Assignee: bugs <bugs>
CC: atumball, bugs, nbalacha, pasik, ravishankar, rob.dewit
Fixed In Version: glusterfs-6.x
Last Closed: 2019-06-17 11:37:19 UTC
Type: Bug
Attachments: DEBUG log

Description Rob de Wit 2018-12-29 21:57:16 UTC
Created attachment 1517403 [details]
DEBUG log

Description of problem:
Sometimes, even after only a couple of minutes, the glusterfs process crashes. This leaves all processes that were using the volume stuck, so to remount the volume I have to kill all of those running processes :-(
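
When this happens, recovery looks roughly like the following (the mountpoint path and volume address are assumptions; they are not spelled out in this report):

  fuser -km /mnt/jf-vol0     # kill everything still holding the dead mount
  umount -l /mnt/jf-vol0     # lazy-unmount the defunct fuse mount
  mount -t glusterfs 10.10.0.177:/jf-vol0 /mnt/jf-vol0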

Version-Release number of selected component (if applicable): 
5.2

How reproducible: 
I have not found out what triggers the crash, but it seems to be related to the kind of load the client gets. Two people working interactively on the volume hit numerous crashes in a day. Some fairly heavy automated usage on another client host crashes it only once in a while, although that seems to have increased over the last few days.

Actual results:
glusterfs crashes

Expected results:
the mountpoint stays up until I 'umount' it.

Additional info:

* 1 volume with millions of small files
* 3-way cluster, all running 5.2
* all three nodes also mount the volume
* another host also mounts the volume
* some of the work-load of the interactive users is doing 'find' over large numbers of files and directories.
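
For illustration, that kind of load looks roughly like this (hypothetical; the users' actual commands are not recorded here):

  find /mnt/jf-vol0 -type f -name '*.dat' | wc -l   # walks millions of dentries, metadata-heavy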


Volume info:

Volume Name: jf-vol0
Type: Replicate
Volume ID: d6c72c52-24c5-4302-81ed-257507c27c1a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.10.0.177:/local.mnt/glfs/brick
Brick2: 10.10.0.208:/local.mnt/glfs/brick
Brick3: 10.10.0.25:/local.mnt/glfs/brick
Options Reconfigured:
performance.readdir-ahead: off
cluster.force-migration: off
cluster.readdir-optimize: on
cluster.lookup-optimize: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 60
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 60
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
performance.cache-max-file-size: 4MB
performance.cache-size: 4GB
cluster.shd-max-threads: 4
disperse.shd-wait-qlength: 2048
diagnostics.brick-sys-log-level: CRITICAL
diagnostics.brick-log-level: CRITICAL
diagnostics.client-log-level: WARNING
cluster.self-heal-daemon: enable
server.event-threads: 3
client.event-threads: 3
server.statedump-path: /local.mnt/glfs/

I set the brick log-levels to CRITICAL and the client log-level to WARNING because that seemed to dramatically improve performance. However, I did set the client back to DEBUG for a short while until it crashed again.
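
For reference, those levels are toggled with the standard volume-set commands (a sketch; the exact invocations aren't recorded in this report):

  gluster volume set jf-vol0 diagnostics.brick-log-level CRITICAL
  gluster volume set jf-vol0 diagnostics.client-log-level DEBUG

Tail of that log (full log in attachment):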

[2018-12-29 15:08:44.476156] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-0' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.476542] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-0: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.476814] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-1: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.477011] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-2: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.478211] D [write-behind.c:750:__wb_fulfill_request] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x77e9)[0x7fe0a03407e9] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7b90)[0x7fe0a0340b90] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe9ff)[0x7fe0a03479ff] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/distribute.so(+0x73e86)[0x7fe0a05c3e86] ))))) 0-jf-vol0-write-behind: (unique=15185, fop=WRITE, gfid=6c5088a1-d6d7-4d2f-b445-9999ab7a7a8c, gen=0): request fulfilled. removing the request from liability queue? = yes
[2018-12-29 15:08:44.478367] D [write-behind.c:419:__wb_request_unref] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x5042)[0x7fe0a033e042] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7838)[0x7fe0a0340838] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7b90)[0x7fe0a0340b90] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe9ff)[0x7fe0a03479ff] ))))) 0-jf-vol0-write-behind: (unique = 15185, fop=WRITE, gfid=6c5088a1-d6d7-4d2f-b445-9999ab7a7a8c, gen=0): destroying request, removing from all queues
[2018-12-29 15:08:44.478517] D [write-behind.c:1764:wb_process_queue] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9adc)[0x7fe0a0342adc] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xea07)[0x7fe0a0347a07] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/distribute.so(+0x73e86)[0x7fe0a05c3e86] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so(+0x2721e)[0x7fe0a082d21e] ))))) 0-jf-vol0-write-behind: processing queues
[2018-12-29 15:08:44.478634] D [write-behind.c:750:__wb_fulfill_request] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x77e9)[0x7fe0a03407e9] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7b90)[0x7fe0a0340b90] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe9ff)[0x7fe0a03479ff] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/distribute.so(+0x73e86)[0x7fe0a05c3e86] ))))) 0-jf-vol0-write-behind: (unique=15187, fop=WRITE, gfid=6c5088a1-d6d7-4d2f-b445-9999ab7a7a8c, gen=1): request fulfilled. removing the request from liability queue? = yes
[2018-12-29 15:08:44.478784] D [write-behind.c:419:__wb_request_unref] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x5042)[0x7fe0a033e042] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7838)[0x7fe0a0340838] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7b90)[0x7fe0a0340b90] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe9ff)[0x7fe0a03479ff] ))))) 0-jf-vol0-write-behind: (unique = 15187, fop=WRITE, gfid=6c5088a1-d6d7-4d2f-b445-9999ab7a7a8c, gen=1): destroying request, removing from all queues
[2018-12-29 15:08:44.478826] D [MSGID: 0] [write-behind.c:1710:__wb_pick_winds] 0-jf-vol0-write-behind: (unique=15188, fop=FLUSH, gfid=6c5088a1-d6d7-4d2f-b445-9999ab7a7a8c, gen=2): picking the request for winding
[2018-12-29 15:08:44.478892] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.dirty' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.478916] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-2' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.478931] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-1' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.478946] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-0' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.478986] D [write-behind.c:1764:wb_process_queue] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9adc)[0x7fe0a0342adc] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xea07)[0x7fe0a0347a07] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/distribute.so(+0x73e86)[0x7fe0a05c3e86] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so(+0x2721e)[0x7fe0a082d21e] ))))) 0-jf-vol0-write-behind: processing queues
[2018-12-29 15:08:44.479030] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.dirty' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.479066] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-2' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.479096] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-1' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.479122] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-0' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.479189] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.dirty' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.479219] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-2' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.479244] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-1' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.479269] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-0' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.479531] D [write-behind.c:419:__wb_request_unref] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x5042)[0x7fe0a033e042] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x5347)[0x7fe0a033e347] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x99a4)[0x7fe0a03429a4] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9a9c)[0x7fe0a0342a9c] ))))) 0-jf-vol0-write-behind: (unique = 15188, fop=FLUSH, gfid=6c5088a1-d6d7-4d2f-b445-9999ab7a7a8c, gen=2): destroying request, removing from all queues
[2018-12-29 15:08:44.479636] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-0: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.479912] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-1: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.480019] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-2: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.481063] D [write-behind.c:1764:wb_process_queue] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9adc)[0x7fe0a0342adc] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe537)[0x7fe0a0347537] (--> /usr/lib64/glusterfs/5.2/xlator/performance/read-ahead.so(+0x7126)[0x7fe0a0130126] (--> /usr/lib64/glusterfs/5.2/xlator/performance/io-cache.so(+0x7b98)[0x7fe09bdf0b98] ))))) 0-jf-vol0-write-behind: processing queues
[2018-12-29 15:08:44.481097] D [MSGID: 0] [write-behind.c:1710:__wb_pick_winds] 0-jf-vol0-write-behind: (unique=15191, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=0): picking the request for winding
[2018-12-29 15:08:44.481116] D [MSGID: 0] [write-behind.c:1299:__wb_pick_unwinds] 0-jf-vol0-write-behind: (unique=15191, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=0): added req to liability queue. inode-generation-number=1
[2018-12-29 15:08:44.481170] D [MSGID: 0] [stack.h:499:copy_frame] 0-stack: groups is null (ngrps: 0) [Invalid argument]
[2018-12-29 15:08:44.481639] D [write-behind.c:1764:wb_process_queue] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9adc)[0x7fe0a0342adc] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe537)[0x7fe0a0347537] (--> /usr/lib64/glusterfs/5.2/xlator/performance/read-ahead.so(+0x7126)[0x7fe0a0130126] (--> /usr/lib64/glusterfs/5.2/xlator/performance/io-cache.so(+0x7b98)[0x7fe09bdf0b98] ))))) 0-jf-vol0-write-behind: processing queues
[2018-12-29 15:08:44.481674] D [MSGID: 0] [write-behind.c:1666:__wb_pick_winds] 0-jf-vol0-write-behind: (unique=15193, fop=WRITE, gen=1, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c): ordering.go is not set, hence not winding
[2018-12-29 15:08:44.481692] D [MSGID: 0] [write-behind.c:1299:__wb_pick_unwinds] 0-jf-vol0-write-behind: (unique=15193, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=1): added req to liability queue. inode-generation-number=2
[2018-12-29 15:08:44.482006] D [write-behind.c:750:__wb_fulfill_request] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x77e9)[0x7fe0a03407e9] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7b90)[0x7fe0a0340b90] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe9ff)[0x7fe0a03479ff] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/distribute.so(+0x73e86)[0x7fe0a05c3e86] ))))) 0-jf-vol0-write-behind: (unique=15191, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=0): request fulfilled. removing the request from liability queue? = yes
[2018-12-29 15:08:44.482184] D [write-behind.c:419:__wb_request_unref] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x5042)[0x7fe0a033e042] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7838)[0x7fe0a0340838] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7b90)[0x7fe0a0340b90] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe9ff)[0x7fe0a03479ff] ))))) 0-jf-vol0-write-behind: (unique = 15191, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=0): destroying request, removing from all queues
[2018-12-29 15:08:44.482434] D [write-behind.c:1764:wb_process_queue] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9adc)[0x7fe0a0342adc] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe537)[0x7fe0a0347537] (--> /usr/lib64/glusterfs/5.2/xlator/performance/read-ahead.so(+0x7126)[0x7fe0a0130126] (--> /usr/lib64/glusterfs/5.2/xlator/performance/io-cache.so(+0x7b98)[0x7fe09bdf0b98] ))))) 0-jf-vol0-write-behind: processing queues
[2018-12-29 15:08:44.482450] D [write-behind.c:1764:wb_process_queue] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9adc)[0x7fe0a0342adc] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe537)[0x7fe0a0347537] (--> /usr/lib64/glusterfs/5.2/xlator/performance/read-ahead.so(+0x7126)[0x7fe0a0130126] (--> /usr/lib64/glusterfs/5.2/xlator/performance/io-cache.so(+0x7b98)[0x7fe09bdf0b98] ))))) 0-jf-vol0-write-behind: processing queues
[2018-12-29 15:08:44.482614] D [write-behind.c:750:__wb_fulfill_request] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x77e9)[0x7fe0a03407e9] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9118)[0x7fe0a0342118] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9a5d)[0x7fe0a0342a5d] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xea07)[0x7fe0a0347a07] ))))) 0-jf-vol0-write-behind: (unique=15195, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=2): request fulfilled. removing the request from liability queue? = no
[2018-12-29 15:08:44.482649] D [MSGID: 0] [write-behind.c:1710:__wb_pick_winds] 0-jf-vol0-write-behind: (unique=15193, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=1): picking the request for winding
[2018-12-29 15:08:44.482868] D [write-behind.c:419:__wb_request_unref] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x5042)[0x7fe0a033e042] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x5347)[0x7fe0a033e347] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x871e)[0x7fe0a034171e] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9a91)[0x7fe0a0342a91] ))))) 0-jf-vol0-write-behind: (unique = 15195, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=2): destroying request, removing from all queues
[2018-12-29 15:08:44.482937] D [MSGID: 0] [stack.h:499:copy_frame] 0-stack: groups is null (ngrps: 0) [Invalid argument]
[2018-12-29 15:08:44.483764] D [write-behind.c:750:__wb_fulfill_request] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x77e9)[0x7fe0a03407e9] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7b90)[0x7fe0a0340b90] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe9ff)[0x7fe0a03479ff] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/distribute.so(+0x73e86)[0x7fe0a05c3e86] ))))) 0-jf-vol0-write-behind: (unique=15193, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=1): request fulfilled. removing the request from liability queue? = yes
[2018-12-29 15:08:44.483950] D [write-behind.c:419:__wb_request_unref] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x5042)[0x7fe0a033e042] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7838)[0x7fe0a0340838] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x7b90)[0x7fe0a0340b90] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xe9ff)[0x7fe0a03479ff] ))))) 0-jf-vol0-write-behind: (unique = 15193, fop=WRITE, gfid=f3e395b1-c592-4e57-b97e-2800d06e757c, gen=1): destroying request, removing from all queues
[2018-12-29 15:08:44.484123] D [write-behind.c:1764:wb_process_queue] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13f)[0x7fe0a6ae412f] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0x9adc)[0x7fe0a0342adc] (--> /usr/lib64/glusterfs/5.2/xlator/performance/write-behind.so(+0xea07)[0x7fe0a0347a07] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/distribute.so(+0x73e86)[0x7fe0a05c3e86] (--> /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so(+0x2721e)[0x7fe0a082d21e] ))))) 0-jf-vol0-write-behind: processing queues
[2018-12-29 15:08:44.985883] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.dirty' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.985944] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-2' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.985963] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-1' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.985977] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-0' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986055] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.dirty' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986075] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-2' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986090] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-1' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986103] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-0' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986168] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.dirty' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986189] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-2' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986204] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-1' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986218] D [MSGID: 101016] [glusterfs3.h:746:dict_to_xdr] 0-dict: key 'trusted.afr.jf-vol0-client-0' would not be sent on wire in the future [Invalid argument]
[2018-12-29 15:08:44.986844] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-0: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.986919] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-1: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.986934] D [MSGID: 0] [client-rpc-fops_v2.c:1619:client4_0_fxattrop_cbk] 0-jf-vol0-client-2: resetting op_ret to 0 from 0
[2018-12-29 15:08:44.987647] D [logging.c:1805:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2018-12-29 15:08:44.987683] D [logging.c:1808:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git:/FILE/glusterfs.git
signal received: 11
time of crash: 
2018-12-29 15:08:44
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.2
/usr/lib64/libglusterfs.so.0(+0x26537)[0x7fe0a6ae1537]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x2fe)[0x7fe0a6aeb24e]
/lib64/libc.so.6(+0x35d10)[0x7fe0a5145d10]
/lib64/libpthread.so.0(pthread_mutex_lock+0x0)[0x7fe0a5936e30]
/usr/lib64/libglusterfs.so.0(__gf_free+0x145)[0x7fe0a6b0c795]
/usr/lib64/libglusterfs.so.0(+0x1a1ee)[0x7fe0a6ad51ee]
/usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so(+0x60124)[0x7fe0a0866124]
/usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so(+0x39ee1)[0x7fe0a083fee1]
/usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so(+0x3d7f1)[0x7fe0a08437f1]
/usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so(+0x3e562)[0x7fe0a0844562]
/usr/lib64/glusterfs/5.2/xlator/protocol/client.so(+0x731d0)[0x7fe0a0b101d0]
/usr/lib64/libgfrpc.so.0(+0xe534)[0x7fe0a68ae534]
/usr/lib64/libgfrpc.so.0(+0xee77)[0x7fe0a68aee77]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fe0a68aaf13]
/usr/lib64/glusterfs/5.2/rpc-transport/socket.so(+0xaa23)[0x7fe0a19c2a23]
/usr/lib64/libglusterfs.so.0(+0x88aeb)[0x7fe0a6b43aeb]
/lib64/libpthread.so.0(+0x7504)[0x7fe0a5934504]
/lib64/libc.so.6(clone+0x3f)[0x7fe0a521c19f]
---------


gdb output:

Core was generated by `/usr/sbin/glusterfs --use-readdirp=off --attribute-timeout=600 --entry-timeout='.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fe0a5936e30 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fe0a5936e30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fe0a6b0c795 in __gf_free () from /usr/lib64/libglusterfs.so.0
#2  0x00007fe0a6ad51ee in ?? () from /usr/lib64/libglusterfs.so.0
#3  0x00007fe0a0866124 in ?? () from /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so
#4  0x00007fe0a083fee1 in ?? () from /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so
#5  0x00007fe0a08437f1 in ?? () from /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so
#6  0x00007fe0a0844562 in ?? () from /usr/lib64/glusterfs/5.2/xlator/cluster/replicate.so
#7  0x00007fe0a0b101d0 in ?? () from /usr/lib64/glusterfs/5.2/xlator/protocol/client.so
#8  0x00007fe0a68ae534 in ?? () from /usr/lib64/libgfrpc.so.0
#9  0x00007fe0a68aee77 in ?? () from /usr/lib64/libgfrpc.so.0
#10 0x00007fe0a68aaf13 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#11 0x00007fe0a19c2a23 in ?? () from /usr/lib64/glusterfs/5.2/rpc-transport/socket.so
#12 0x00007fe0a6b43aeb in ?? () from /usr/lib64/libglusterfs.so.0
#13 0x00007fe0a5934504 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fe0a521c19f in clone () from /lib64/libc.so.6

Comment 1 Rob de Wit 2018-12-31 09:59:38 UTC
To keep things going, I've added a script that detects the failure and remounts the volume. It seems that when the mount crashes on one client, it also crashes on the other clients around the same time.

My script runs every 5 minutes, and this is what it logs:

Dec 30 16:05:01 HOST-1 logger: Forcing remount of jf-vol0
Dec 31 00:15:06 HOST-2 logger: Forcing remount of jf-vol0
Dec 31 00:20:06 HOST-1 logger: Forcing remount of jf-vol0
Dec 31 00:20:06 HOST-2 logger: Forcing remount of jf-vol0
Dec 31 00:20:07 HOST-4 logger: Forcing remount of jf-vol0
Dec 31 00:25:01 HOST-1 logger: Forcing remount of jf-vol0
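
A minimal sketch of the kind of check the script does (illustrative only; the real script, mountpoint path, and mount options are not part of this report):

  #!/bin/sh
  # Run from cron every 5 minutes: a crashed fuse client leaves the
  # mountpoint returning "Transport endpoint is not connected".
  MNT=/mnt/jf-vol0
  if ! stat "$MNT" >/dev/null 2>&1; then
      logger "Forcing remount of jf-vol0"
      umount -l "$MNT"
      mount "$MNT"    # assumes an fstab entry for the gluster volume
  fi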

So it looks to me like this is caused by something in the volume or in the cluster communication, not just something local to the client.

Comment 2 Nithya Balachandran 2019-01-03 11:12:44 UTC
Can you try installing the debuginfo packages for the gluster version you are running and rerun bt on the core dump?
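
On an RPM-based system that is typically (exact package names can differ per distribution):

  debuginfo-install glusterfs glusterfs-libs glusterfs-fuse
  gdb /usr/sbin/glusterfs /path/to/core
  (gdb) bt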

Comment 3 Rob de Wit 2019-01-03 11:41:23 UTC
(gdb) bt
#0  0x00007fe0a5936e30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fe0a6b0c795 in __gf_free (free_ptr=0x7fe0843ac610) at mem-pool.c:333
#2  0x00007fe0a6ad51ee in dict_destroy (this=0x7fe0843abe78) at dict.c:701
#3  0x00007fe0a6ad5315 in dict_unref (this=<optimized out>) at dict.c:753
#4  0x00007fe0a0866124 in afr_local_cleanup (local=0x7fe0843ade18, this=<optimized out>) at afr-common.c:2091
#5  0x00007fe0a083fee1 in afr_transaction_done (frame=<optimized out>, this=<optimized out>) at afr-transaction.c:369
#6  0x00007fe0a08437f1 in afr_unlock_common_cbk (frame=frame@entry=0x7fe0843ac7b8, this=this@entry=0x7fe09c0110c0, op_ret=op_ret@entry=0, xdata=<optimized out>, 
    op_errno=<optimized out>, cookie=<optimized out>) at afr-lk-common.c:243
#7  0x00007fe0a0844562 in afr_unlock_inodelk_cbk (frame=0x7fe0843ac7b8, cookie=<optimized out>, this=0x7fe09c0110c0, op_ret=0, op_errno=<optimized out>, 
    xdata=<optimized out>) at afr-lk-common.c:281
#8  0x00007fe0a0b101d0 in client4_0_finodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=<optimized out>)
    at client-rpc-fops_v2.c:1398
#9  0x00007fe0a68ae534 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fe09c053bd0, pollin=pollin@entry=0x7fe09c115750) at rpc-clnt.c:755
#10 0x00007fe0a68aee77 in rpc_clnt_notify (trans=0x7fe09c053e90, mydata=0x7fe09c053c00, event=<optimized out>, data=0x7fe09c115750) at rpc-clnt.c:923
#11 0x00007fe0a68aaf13 in rpc_transport_notify (this=this@entry=0x7fe09c053e90, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fe09c115750)
    at rpc-transport.c:525
#12 0x00007fe0a19c2a23 in socket_event_poll_in (notify_handled=true, this=0x7fe09c053e90) at socket.c:2504
#13 socket_event_handler (fd=-1676585136, idx=1, gen=4, data=0x7fe09c053e90, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0) at socket.c:2905
#14 0x00007fe0a6b43aeb in event_dispatch_epoll_handler (event=0x7fe0a1531ed0, event_pool=0x17f40b0) at event-epoll.c:591
#15 event_dispatch_epoll_worker (data=0x1830840) at event-epoll.c:668
#16 0x00007fe0a5934504 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fe0a521c19f in clone () from /lib64/libc.so.6

Somehow the version field in this bug report has been reset to 3.12, but this is actually version 5.2.

Comment 4 Nithya Balachandran 2019-01-03 12:16:10 UTC
Assigning this to AFR team.

Comment 5 Ravishankar N 2019-01-03 12:20:11 UTC
Quick question: Is the back trace identical to what is shared in comment #3 for all crashes?

Comment 6 Ravishankar N 2019-01-03 12:23:26 UTC
Also, please attach the core file to the bug.

Comment 7 Rob de Wit 2019-01-03 12:32:51 UTC
Good question! It turns out this is not always the case. I checked some other coredumps:

coredump 1 - same backtrace
coredump 2 - untraceable
coredump 3 - A different backtrace:
(gdb) bt
#0  0x00007f2f2a1d32c0 in ?? () from /lib64/libuuid.so.1
#1  0x00007f2f2a1d24e0 in uuid_compare () from /lib64/libuuid.so.1
#2  0x00007f2f2aa57691 in gf_uuid_compare (u2=0x7f2f115e67f8 "\270x\274\226Z\301F\006\256\221\230\005\031\321N\342\001", 
    u1=0x7f2eff90 <Address 0x7f2eff90 out of bounds>) at compat-uuid.h:25
#3  __inode_find (table=table@entry=0x7f2f20063b80, gfid=gfid@entry=0x7f2f115e67f8 "\270x\274\226Z\301F\006\256\221\230\005\031\321N\342\001") at inode.c:892
#4  0x00007f2f2aa57d79 in inode_find (table=table@entry=0x7f2f20063b80, gfid=gfid@entry=0x7f2f115e67f8 "\270x\274\226Z\301F\006\256\221\230\005\031\321N\342\001")
    at inode.c:917
#5  0x00007f2f24a1ae72 in unserialize_rsp_direntp_v2 (this=0x7f2f2000e980, fd=<optimized out>, rsp=rsp@entry=0x7f2f1e164a70, entries=0x7f2f1e164aa0)
    at client-helpers.c:338
#6  0x00007f2f24a59005 in client_post_readdirp_v2 (this=<optimized out>, rsp=0x7f2f1e164a70, fd=<optimized out>, entries=<optimized out>, xdata=0x7f2f1e164a68)
    at client-common.c:3533
#7  0x00007f2f24a6b226 in client4_0_readdirp_cbk (req=<optimized out>, iov=0x7f2f0b99d508, count=<optimized out>, myframe=0x7f2ef4a691f8) at client-rpc-fops_v2.c:2333
#8  0x00007f2f2a814534 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f2f2004f530, pollin=pollin@entry=0x7f2f114a8290) at rpc-clnt.c:755
#9  0x00007f2f2a814e77 in rpc_clnt_notify (trans=0x7f2f2004f860, mydata=0x7f2f2004f560, event=<optimized out>, data=0x7f2f114a8290) at rpc-clnt.c:923
#10 0x00007f2f2a810f13 in rpc_transport_notify (this=this@entry=0x7f2f2004f860, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f2f114a8290)
    at rpc-transport.c:525
#11 0x00007f2f25928a23 in socket_event_poll_in (notify_handled=true, this=0x7f2f2004f860) at socket.c:2504
#12 socket_event_handler (fd=290095760, idx=2, gen=4, data=0x7f2f2004f860, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0) at socket.c:2905
#13 0x00007f2f2aaa9aeb in event_dispatch_epoll_handler (event=0x7f2f1e164ed0, event_pool=0x7510b0) at event-epoll.c:591
#14 event_dispatch_epoll_worker (data=0x7f2f2004f310) at event-epoll.c:668
#15 0x00007f2f2989a504 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f2f2918219f in clone () from /lib64/libc.so.6

coredump 4 - yet another backtrace:
(gdb) bt
#0  0x00007ff2249a58a4 in _int_free () from /lib64/libc.so.6
#1  0x00007ff2249aac9e in free () from /lib64/libc.so.6
#2  0x00007ff22631d6af in __gf_free (free_ptr=<optimized out>) at mem-pool.c:356
#3  0x00007ff223bf1410 in free_fuse_state (state=0x7ff1f4760430) at fuse-helpers.c:81
#4  0x00007ff223bf70a9 in fuse_err_cbk (frame=0x7ff1f471b1d8, cookie=<optimized out>, this=0x18dddb0, op_ret=0, op_errno=0, xdata=<optimized out>)
    at fuse-bridge.c:1434
#5  0x00007ff21aebc29d in io_stats_flush_cbk (frame=0x7ff206a0b088, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, xdata=0x0) at io-stats.c:2286
#6  0x00007ff226385b29 in default_flush_cbk (frame=0x7ff1f4737f58, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, xdata=0x0) at defaults.c:1159
#7  0x00007ff21b926f77 in ra_flush_cbk (frame=0x7ff1f4737238, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, xdata=0x0) at read-ahead.c:539
#8  0x00007ff21bb390dd in wb_flush_helper (frame=0x7ff2071e7488, this=<optimized out>, fd=<optimized out>, xdata=0x0) at write-behind.c:1987
#9  0x00007ff22631a055 in call_resume_keep_stub (stub=0x7ff1f4744da8) at call-stub.c:2563
#10 0x00007ff21bb3c999 in wb_do_winds (wb_inode=wb_inode@entry=0x7ff1f4742730, tasks=tasks@entry=0x7ff220d42640) at write-behind.c:1737
#11 0x00007ff21bb3ca9c in wb_process_queue (wb_inode=wb_inode@entry=0x7ff1f4742730) at write-behind.c:1778
#12 0x00007ff21bb41a07 in wb_fulfill_cbk (frame=frame@entry=0x7ff21d48e7c8, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=123, 
    op_errno=op_errno@entry=0, prebuf=prebuf@entry=0x7ff21d4ac610, postbuf=postbuf@entry=0x7ff21d4ac6a8, xdata=xdata@entry=0x7ff21d490168) at write-behind.c:1105
#13 0x00007ff21bdbde86 in dht_writev_cbk (frame=frame@entry=0x7ff21c08d7c8, cookie=<optimized out>, this=<optimized out>, op_ret=123, op_errno=0, 
    prebuf=prebuf@entry=0x7ff21d4ac610, postbuf=postbuf@entry=0x7ff21d4ac6a8, xdata=0x7ff21d490168) at dht-inode-write.c:140
#14 0x00007ff22003e21e in afr_writev_unwind (frame=frame@entry=0x7ff21d4a3ee8, this=this@entry=0x7ff21c0110c0) at afr-inode-write.c:234
#15 0x00007ff22003e7e6 in afr_writev_wind_cbk (this=0x7ff21c0110c0, frame=0x7ff21d49ab08, cookie=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    prebuf=<optimized out>, postbuf=<optimized out>, xdata=<optimized out>) at afr-inode-write.c:388
#16 afr_writev_wind_cbk (frame=0x7ff21d49ab08, cookie=<optimized out>, this=0x7ff21c0110c0, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, 
    postbuf=0x7ff220d42980, xdata=0x7ff21d49ae58) at afr-inode-write.c:354
#17 0x00007ff220313748 in client4_0_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7ff21d483a58) at client-rpc-fops_v2.c:685
#18 0x00007ff2260bf534 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7ff21c04f530, pollin=pollin@entry=0x7ff21d49e650) at rpc-clnt.c:755
#19 0x00007ff2260bfe77 in rpc_clnt_notify (trans=0x7ff21c04f860, mydata=0x7ff21c04f560, event=<optimized out>, data=0x7ff21d49e650) at rpc-clnt.c:923
#20 0x00007ff2260bbf13 in rpc_transport_notify (this=this@entry=0x7ff21c04f860, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7ff21d49e650)
    at rpc-transport.c:525
#21 0x00007ff2211d3a23 in socket_event_poll_in (notify_handled=true, this=0x7ff21c04f860) at socket.c:2504
#22 socket_event_handler (fd=491382352, idx=2, gen=4, data=0x7ff21c04f860, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0) at socket.c:2905
#23 0x00007ff226354aeb in event_dispatch_epoll_handler (event=0x7ff220d42ed0, event_pool=0x18d70b0) at event-epoll.c:591
#24 event_dispatch_epoll_worker (data=0x1913840) at event-epoll.c:668
#25 0x00007ff225145504 in start_thread () from /lib64/libpthread.so.0
#26 0x00007ff224a2d19f in clone () from /lib64/libc.so.6

coredump 5 - Another one:
(gdb) bt
#0  0x00007fad93d3ce30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fad94eea73e in gf_log_set_log_buf_size (buf_size=buf_size@entry=0) at logging.c:273
#2  0x00007fad94eea8df in gf_log_disable_suppression_before_exit (ctx=0x1334010) at logging.c:444
#3  0x00007fad94ef0f94 in gf_print_trace (signum=11, ctx=0x1334010) at common-utils.c:922
#4  <signal handler called>
#5  0x00007fad94f0fd52 in fd_destroy (bound=true, fd=0x7fad64f216c8) at fd.c:478
#6  fd_unref (fd=0x7fad64f216c8) at fd.c:529
#7  0x00007fad8eeba0e8 in client_local_wipe (local=local@entry=0x7fad8a924358) at client-helpers.c:124
#8  0x00007fad8ef161e0 in client4_0_finodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=<optimized out>)
    at client-rpc-fops_v2.c:1398
#9  0x00007fad94cb4534 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fad8804f530, pollin=pollin@entry=0x7fad8a917950) at rpc-clnt.c:755
#10 0x00007fad94cb4e77 in rpc_clnt_notify (trans=0x7fad8804f860, mydata=0x7fad8804f560, event=<optimized out>, data=0x7fad8a917950) at rpc-clnt.c:923
#11 0x00007fad94cb0f13 in rpc_transport_notify (this=this@entry=0x7fad8804f860, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fad8a917950)
    at rpc-transport.c:525
#12 0x00007fad8fdc8a23 in socket_event_poll_in (notify_handled=true, this=0x7fad8804f860) at socket.c:2504
#13 socket_event_handler (fd=-1970177712, idx=2, gen=4, data=0x7fad8804f860, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0) at socket.c:2905
#14 0x00007fad94f49aeb in event_dispatch_epoll_handler (event=0x7fad8f937ed0, event_pool=0x136b0b0) at event-epoll.c:591
#15 event_dispatch_epoll_worker (data=0x13a7840) at event-epoll.c:668
#16 0x00007fad93d3a504 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fad9362219f in clone () from /lib64/libc.so.6

coredump 6 - And another:
(gdb) bt
#0  0x00007f3c6caace30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f3c6dc82795 in __gf_free (free_ptr=0x7f3c39e43cb0) at mem-pool.c:333
#2  0x00007f3c6dc65d90 in __inode_ctx_free (inode=inode@entry=0x7f3c39e435a8) at inode.c:322
#3  0x00007f3c6dc66e12 in __inode_destroy (inode=0x7f3c39e435a8) at inode.c:338
#4  inode_table_prune (table=table@entry=0x7f3c58010950) at inode.c:1535
#5  0x00007f3c6dc671ec in inode_unref (inode=0x7f3c39e435a8) at inode.c:542
#6  0x00007f3c679dbf97 in afr_local_cleanup (local=0x7f3c39e1f3e8, this=<optimized out>) at afr-common.c:1995
#7  0x00007f3c679b5ee1 in afr_transaction_done (frame=<optimized out>, this=<optimized out>) at afr-transaction.c:369
#8  0x00007f3c679b97f1 in afr_unlock_common_cbk (frame=frame@entry=0x7f3c3a11d168, this=this@entry=0x7f3c600110c0, op_ret=op_ret@entry=0, xdata=0x0, 
    op_errno=<optimized out>, cookie=<optimized out>) at afr-lk-common.c:243
#9  0x00007f3c679b98ae in afr_unlock_entrylk_cbk (frame=0x7f3c3a11d168, cookie=<optimized out>, this=0x7f3c600110c0, op_ret=0, op_errno=<optimized out>, 
    xdata=<optimized out>) at afr-lk-common.c:366
#10 0x00007f3c67c857bd in client4_0_entrylk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=<optimized out>) at client-rpc-fops_v2.c:1446
#11 0x00007f3c6da24534 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f3c60058e20, pollin=pollin@entry=0x7f3c5a5898e0) at rpc-clnt.c:755
#12 0x00007f3c6da24e77 in rpc_clnt_notify (trans=0x7f3c600590e0, mydata=0x7f3c60058e50, event=<optimized out>, data=0x7f3c5a5898e0) at rpc-clnt.c:923
#13 0x00007f3c6da20f13 in rpc_transport_notify (this=this@entry=0x7f3c600590e0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f3c5a5898e0)
    at rpc-transport.c:525
#14 0x00007f3c68b38a23 in socket_event_poll_in (notify_handled=true, this=0x7f3c600590e0) at socket.c:2504
#15 socket_event_handler (fd=1515755744, idx=4, gen=1, data=0x7f3c600590e0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0) at socket.c:2905
#16 0x00007f3c6dcb9aeb in event_dispatch_epoll_handler (event=0x7f3c65c04ed0, event_pool=0x81a0b0) at event-epoll.c:591
#17 event_dispatch_epoll_worker (data=0x7f3c60043ad0) at event-epoll.c:668
#18 0x00007f3c6caaa504 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f3c6c39219f in clone () from /lib64/libc.so.6

coredump 7 - ...
(gdb) bt
#0  0x00007f916b526b88 in list_add (head=0x7f91389ba228, new=0x7f91389b9d78) at ../../../../libglusterfs/src/list.h:31
#1  wb_set_invalidate (wb_inode=0x7f91389b9d10, set=<optimized out>) at write-behind.c:246
#2  wb_fulfill_cbk (frame=frame@entry=0x7f91617a2208, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=811, op_errno=op_errno@entry=0, 
    prebuf=prebuf@entry=0x7f91617ade00, postbuf=postbuf@entry=0x7f91617ade98, xdata=xdata@entry=0x7f9160484c38) at write-behind.c:1095
#3  0x00007f916b7a2e86 in dht_writev_cbk (frame=frame@entry=0x7f91617b8838, cookie=<optimized out>, this=<optimized out>, op_ret=811, op_errno=0, 
    prebuf=prebuf@entry=0x7f91617ade00, postbuf=postbuf@entry=0x7f91617ade98, xdata=0x7f9160484c38) at dht-inode-write.c:140
#4  0x00007f916ba0c21e in afr_writev_unwind (frame=frame@entry=0x7f916100d918, this=this@entry=0x7f91640110c0) at afr-inode-write.c:234
#5  0x00007f916ba0c7e6 in afr_writev_wind_cbk (this=0x7f91640110c0, frame=0x7f91604865e8, cookie=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    prebuf=<optimized out>, postbuf=<optimized out>, xdata=<optimized out>) at afr-inode-write.c:388
#6  afr_writev_wind_cbk (frame=0x7f91604865e8, cookie=<optimized out>, this=0x7f91640110c0, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, 
    postbuf=0x7f916946c980, xdata=0x7f91614414b8) at afr-inode-write.c:354
#7  0x00007f916bce1748 in client4_0_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f915cfa2f98) at client-rpc-fops_v2.c:685
#8  0x00007f9171a8d534 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f9164050110, pollin=pollin@entry=0x7f9160481290) at rpc-clnt.c:755
#9  0x00007f9171a8de77 in rpc_clnt_notify (trans=0x7f91640503d0, mydata=0x7f9164050140, event=<optimized out>, data=0x7f9160481290) at rpc-clnt.c:923
#10 0x00007f9171a89f13 in rpc_transport_notify (this=this@entry=0x7f91640503d0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f9160481290)
    at rpc-transport.c:525
#11 0x00007f916cba1a23 in socket_event_poll_in (notify_handled=true, this=0x7f91640503d0) at socket.c:2504
#12 socket_event_handler (fd=1615336080, idx=2, gen=4, data=0x7f91640503d0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0) at socket.c:2905
#13 0x00007f9171d22aeb in event_dispatch_epoll_handler (event=0x7f916946ced0, event_pool=0x24db0b0) at event-epoll.c:591
#14 event_dispatch_epoll_worker (data=0x7f9164048dc0) at event-epoll.c:668
#15 0x00007f9170b13504 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f91703fb19f in clone () from /lib64/libc.so.6

coredump 8 -
(gdb) bt
#0  0x00007f24560fbe30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f24572a973e in gf_log_set_log_buf_size (buf_size=buf_size@entry=0) at logging.c:273
#2  0x00007f24572a98df in gf_log_disable_suppression_before_exit (ctx=0x840010) at logging.c:444
#3  0x00007f24572aff94 in gf_print_trace (signum=11, ctx=0x840010) at common-utils.c:922
#4  <signal handler called>
#5  0x00007f24572ced52 in fd_destroy (bound=true, fd=0x7f24380d3f98) at fd.c:478
#6  fd_unref (fd=0x7f24380d3f98) at fd.c:529
#7  0x00007f24512790e8 in client_local_wipe (local=local@entry=0x7f243c0ad548) at client-helpers.c:124
#8  0x00007f24512d51e0 in client4_0_finodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=<optimized out>)
    at client-rpc-fops_v2.c:1398
#9  0x00007f2457073534 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f244c050110, pollin=pollin@entry=0x7f2444165d30) at rpc-clnt.c:755
#10 0x00007f2457073e77 in rpc_clnt_notify (trans=0x7f244c0503d0, mydata=0x7f244c050140, event=<optimized out>, data=0x7f2444165d30) at rpc-clnt.c:923
#11 0x00007f245706ff13 in rpc_transport_notify (this=this@entry=0x7f244c0503d0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f2444165d30)
    at rpc-transport.c:525
#12 0x00007f2452187a23 in socket_event_poll_in (notify_handled=true, this=0x7f244c0503d0) at socket.c:2504
#13 socket_event_handler (fd=1142316336, idx=2, gen=4, data=0x7f244c0503d0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0) at socket.c:2905
#14 0x00007f2457308aeb in event_dispatch_epoll_handler (event=0x7f244b1b7ed0, event_pool=0x8770b0) at event-epoll.c:591
#15 event_dispatch_epoll_worker (data=0x7f244c043ad0) at event-epoll.c:668
#16 0x00007f24560f9504 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f24559e119f in clone () from /lib64/libc.so.6



If you really need the info: I have some 20 more coredumps, and I suspect they all have different traces...

Comment 8 Rob de Wit 2019-01-03 12:41:21 UTC
Original core file of the 1st backtrace: https://www.dropbox.com/s/a8feic6hvho413o/core?dl=0

Comment 9 Amar Tumballi 2019-06-17 11:30:56 UTC
Hi robdewit,

We fixed some of these crashes by glusterfs-5.5 (in the 5.3 -> 5.5 range); please upgrade and let us know.

Comment 10 Rob de Wit 2019-06-17 11:37:19 UTC
Hi Amar,

We've been running the 6.1 release for some time now and there have been no crashes since then.

Thanks!