Bug 1767264 - glusterfs client process coredump
Summary: glusterfs client process coredump
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On: 1785208
Blocks:
 
Reported: 2019-10-31 02:37 UTC by zhou lin
Modified: 2020-01-10 05:37 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1785208
Environment:
Last Closed: 2020-01-10 05:37:21 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)
glusterfs process trace level log (14.40 MB, application/zip)
2019-10-31 02:37 UTC, zhou lin
no flags Details
glusterfs 7 client crash logs (7.90 KB, text/plain)
2019-12-16 16:12 UTC, Arie Skliarouk
no flags Details


Links
System ID Priority Status Summary Last Updated
Gluster.org Gerrit 23983 None Merged timer: fix event destruction race 2020-01-10 05:37:20 UTC

Description zhou lin 2019-10-31 02:37:58 UTC
Created attachment 1630869 [details]
glusterfs process trace level log

Description of problem:

glusterfs client process coredump

Version-Release number of selected component (if applicable):

glusterfs 7
How reproducible:


Steps to Reproduce:
1. Start I/O with the ior tool.
2. Trigger a statedump of the glusterfs process every 60 seconds.
3. The glusterfs process dumps core.
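Step 2 can be sketched as a small shell helper. This is a hedged reconstruction, relying on the usual gluster behaviour that SIGUSR1 makes a gluster process write a statedump; the pid lookup and loop bounds are illustrative, not taken from the report (the original test ran indefinitely every 60 seconds).

```shell
# Hypothetical reproduction helper for step 2: SIGUSR1 makes a gluster
# process write a statedump. The pid, interval and count arguments are
# illustrative only.
trigger_statedumps() {
    pid=$1 interval=$2 count=$3
    i=0
    # Stop early if the target process goes away (e.g. it dumped core).
    while [ "$i" -lt "$count" ] && kill -USR1 "$pid" 2>/dev/null; do
        i=$((i + 1))
        sleep "$interval"
    done
    echo "sent $i statedump signals"
}

# Example: statedump the fuse client every 60 seconds, 10 times.
# trigger_statedumps "$(pidof glusterfs)" 60 10
```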

Actual results:


Expected results:


Additional info:
1> gdb info

[New LWP 6471]
[New LWP 6472]
[New LWP 6464]
[New LWP 6494]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --acl --process-name fuse --volfile-server=mn-0.local --vol'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f71904b157f in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f718efd1700 (LWP 6465))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-33.wf30.x86_64 libgcc-8.3.1-2.wf30.x86_64 libtirpc-1.1.4-0.wf30.x86_64 libuuid-2.33.2-2.wf30.x86_64 zlib-1.2.11-15.wf30.x86_64
(gdb) bt
#0  0x00007f71904b157f in raise () from /lib64/libc.so.6
#1  0x00007f719049b895 in abort () from /lib64/libc.so.6
#2  0x00007f71904f49d7 in __libc_message () from /lib64/libc.so.6
#3  0x00007f71904f4a9a in __libc_fatal () from /lib64/libc.so.6
#4  0x00007f719064f98d in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#5  0x00007f7190729911 in gf_timer_proc (data=0x55aa65bc1f20) at timer.c:140
#6  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#7  0x00007f7190576703 in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 12 (Thread 0x7f718a894700 (LWP 6494)):
#0  0x00007f719056d15f in readv () from /lib64/libc.so.6
#1  0x00007f7190746759 in sys_readv (fd=<optimized out>, iov=iov@entry=0x7f718a893150, iovcnt=iovcnt@entry=2) at syscall.c:328
#2  0x00007f718eff4a93 in fuse_thread_proc (data=0x55aa65bb0eb0) at fuse-bridge.c:5957
#3  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#4  0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f718ffed480 (LWP 6464)):
#0  0x00007f719064aa6d in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007f719077b4c7 in event_dispatch_epoll (event_pool=0x55aa65ba6760) at event-epoll.c:840
#2  0x000055aa656fc5d0 in main (argc=<optimized out>, argv=<optimized out>) at glusterfsd.c:2919

Thread 10 (Thread 0x7f7183fff700 (LWP 6472)):
#0  0x00007f7190652b4c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f719064bda8 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f719071d6df in gf_log_glusterlog (ctx=0x55aa65b70260, domain=domain@entry=0x7f718b63800f "stack-trace", file=file@entry=0x7f718b63a027 "afr-transaction.c", function=function@entry=0x7f718b63a820 <__FUNCTION__.21496> "afr_changelog_do", line=line@entry=1959, level=level@entry=GF_LOG_TRACE, errnum=0, msgid=0, appmsgstr=0x7f7183ffd698, callstr=0x0, tv=..., graph_id=0, fmt=gf_logformat_withmsgid) at logging.c:1364
#3  0x00007f719071dadd in gf_log_print_plain_fmt (ctx=ctx@entry=0x55aa65b70260, domain=domain@entry=0x7f718b63800f "stack-trace", file=file@entry=0x7f718b63a027 "afr-transaction.c", function=function@entry=0x7f718b63a820 <__FUNCTION__.21496> "afr_changelog_do", line=line@entry=1959, level=level@entry=GF_LOG_TRACE, errnum=<optimized out>, msgid=<optimized out>, appmsgstr=<optimized out>, callstr=<optimized out>, tv=..., graph_id=<optimized out>, fmt=<optimized out>) at logging.c:1586
#4  0x00007f719071c23a in _gf_msg_internal (graph_id=0, callstr=<optimized out>, appmsgstr=0x7f7183ffd698, msgid=0, errnum=0, level=GF_LOG_TRACE, line=1959, function=0x7f718b63a820 <__FUNCTION__.21496> "afr_changelog_do", file=0x7f718b63a027 "afr-transaction.c", domain=0x7f718b63800f "stack-trace") at logging.c:1926
#5  _gf_msg (domain=domain@entry=0x7f718b63800f "stack-trace", file=file@entry=0x7f718b63a027 "afr-transaction.c", function=function@entry=0x7f718b63a820 <__FUNCTION__.21496> "afr_changelog_do", line=line@entry=1959, level=level@entry=GF_LOG_TRACE, errnum=errnum@entry=0, trace=0, msgid=0, fmt=0x7f718b6380c8 "stack-address: %p, winding from %s to %s") at logging.c:2004
#6  0x00007f718b5fb109 in afr_changelog_do (frame=frame@entry=0x7f7184061498, this=this@entry=0x7f717c0115a0, xattr=xattr@entry=0x7f717c0c2a78, changelog_resume=changelog_resume@entry=0x7f718b5f9090 <afr_changelog_post_op_done>, op=op@entry=AFR_TRANSACTION_POST_OP) at afr-transaction.c:1956
#7  0x00007f718b5fca44 in afr_changelog_post_op_do (frame=0x7f7184061498, this=0x7f717c0115a0) at afr-transaction.c:1562
#8  0x00007f718b5fd3ff in afr_changelog_post_op_now (frame=0x7f7184061498, this=0x7f717c0115a0) at afr-transaction.c:1593
#9  0x00007f718b5fd580 in afr_delayed_changelog_wake_up_cbk (data=data@entry=0x7f716c008748) at afr-transaction.c:2502
#10 0x00007f718b621645 in afr_delayed_changelog_wake_resume (this=this@entry=0x7f717c0115a0, inode=0x7f717c0bb8e8, stub=0x7f717c0c46f8) at afr-common.c:3673
#11 0x00007f718b626a9c in afr_flush (frame=frame@entry=0x7f717c0d3948, this=this@entry=0x7f717c0115a0, fd=fd@entry=0x7f7174013e58, xdata=xdata@entry=0x0) at afr-common.c:3701
#12 0x00007f71907a7de6 in default_flush (frame=frame@entry=0x7f717c0d3948, this=this@entry=0x7f717c013cd0, fd=fd@entry=0x7f7174013e58, xdata=xdata@entry=0x0) at defaults.c:2531
#13 0x00007f71907a7de6 in default_flush (frame=frame@entry=0x7f717c0d3948, this=<optimized out>, fd=fd@entry=0x7f7174013e58, xdata=xdata@entry=0x0) at defaults.c:2531
#14 0x00007f718b4efbea in wb_flush_helper (frame=0x7f71740418b8, this=0x7f717c017730, fd=0x7f7174013e58, xdata=0x0) at write-behind.c:1996
#15 0x00007f7190741435 in call_resume_keep_stub (stub=0x7f7174010fb8) at call-stub.c:2621
#16 0x00007f718b4f3845 in wb_do_winds (wb_inode=wb_inode@entry=0x7f716c0032e0, tasks=tasks@entry=0x7f7183ffdae0) at write-behind.c:1744
#17 0x00007f718b4f397d in wb_process_queue (wb_inode=wb_inode@entry=0x7f716c0032e0) at write-behind.c:1781
#18 0x00007f718b4f3b14 in wb_fulfill_cbk (frame=frame@entry=0x7f717c0bf358, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=1000, op_errno=op_errno@entry=0, prebuf=prebuf@entry=0x7f716c015118, postbuf=<optimized out>, xdata=<optimized out>) at write-behind.c:1108
#19 0x00007f718bf9bb92 in gf_utime_writev_cbk (frame=0x7f717c08ca08, cookie=<optimized out>, this=<optimized out>, op_ret=1000, op_errno=0, prebuf=0x7f716c015118, postbuf=0x7f716c0151b8, xdata=0x7f718404fe98) at utime-autogen-fops.c:63
#20 0x00007f718b5ea41d in afr_writev_unwind (frame=frame@entry=0x7f716c0185b8, this=this@entry=0x7f717c0115a0) at afr-inode-write.c:228
#21 0x00007f718b5ea9d4 in afr_writev_wind_cbk (cookie=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, postbuf=0x7f7183ffdd80, xdata=0x7f717c0d44d8, this=0x7f717c0115a0, frame=0x7f7184061498) at afr-inode-write.c:382
#22 afr_writev_wind_cbk (frame=0x7f7184061498, cookie=<optimized out>, this=0x7f717c0115a0, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, postbuf=0x7f7183ffdd80, xdata=0x7f717c0d44d8) at afr-inode-write.c:348
#23 0x00007f718b6af308 in client4_0_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f718406a7d8) at client-rpc-fops_v2.c:684
#24 0x00007f71906c1e9e in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f717c05a460, pollin=pollin@entry=0x7f717c062ab0) at rpc-clnt.c:770
#25 0x00007f71906c2283 in rpc_clnt_notify (trans=0x7f717c05a720, mydata=0x7f717c05a490, event=<optimized out>, data=0x7f717c062ab0) at rpc-clnt.c:938
#26 0x00007f71906bee23 in rpc_transport_notify (this=this@entry=0x7f717c05a720, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f717c062ab0) at rpc-transport.c:545
#27 0x00007f718bfaee38 in socket_event_poll_in_async (xl=<optimized out>, async=0x7f717c062bd8) at socket.c:2601
#28 0x00007f718bfb7b1c in gf_async (cbk=0x7f718bfaee10 <socket_event_poll_in_async>, xl=<optimized out>, async=<optimized out>) at ../../../../libglusterfs/src/glusterfs/async.h:189
#29 socket_event_poll_in (notify_handled=true, this=0x7f717c05a720) at socket.c:2642
#30 socket_event_handler (event_thread_died=0 '\000', poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x7f717c05a720, gen=4, idx=3, fd=2080746960) at socket.c:3033
#31 socket_event_handler (fd=fd@entry=7, idx=idx@entry=3, gen=gen@entry=4, data=data@entry=0x7f717c05a720, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000') at socket.c:2960
#32 0x00007f719077be0b in event_dispatch_epoll_handler (event=0x7f7183ffe154, event_pool=0x55aa65ba6760) at event-epoll.c:642
#33 event_dispatch_epoll_worker (data=0x55aa65bf02d0) at event-epoll.c:755
#34 0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#35 0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f718beed700 (LWP 6471)):
#0  0x00007f7190652b4c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f719064bda8 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f719071d6df in gf_log_glusterlog (ctx=0x55aa65b70260, domain=domain@entry=0x7f718b6cc03f "stack-trace", file=file@entry=0x7f718b6d10b0 "client-rpc-fops_v2.c", function=function@entry=0x7f718b6d1b80 <__FUNCTION__.23859> "client4_0_create_cbk", line=line@entry=2119, level=level@entry=GF_LOG_TRACE, errnum=0, msgid=0, appmsgstr=0x7f718beeba08, callstr=0x0, tv=..., graph_id=0, fmt=gf_logformat_withmsgid) at logging.c:1364
#3  0x00007f719071dadd in gf_log_print_plain_fmt (ctx=ctx@entry=0x55aa65b70260, domain=domain@entry=0x7f718b6cc03f "stack-trace", file=file@entry=0x7f718b6d10b0 "client-rpc-fops_v2.c", function=function@entry=0x7f718b6d1b80 <__FUNCTION__.23859> "client4_0_create_cbk", line=line@entry=2119, level=level@entry=GF_LOG_TRACE, errnum=<optimized out>, msgid=<optimized out>, appmsgstr=<optimized out>, callstr=<optimized out>, tv=..., graph_id=<optimized out>, fmt=<optimized out>) at logging.c:1586
#4  0x00007f719071c23a in _gf_msg_internal (graph_id=0, callstr=<optimized out>, appmsgstr=0x7f718beeba08, msgid=0, errnum=0, level=GF_LOG_TRACE, line=2119, function=0x7f718b6d1b80 <__FUNCTION__.23859> "client4_0_create_cbk", file=0x7f718b6d10b0 "client-rpc-fops_v2.c", domain=0x7f718b6cc03f "stack-trace") at logging.c:1926
#5  _gf_msg (domain=domain@entry=0x7f718b6cc03f "stack-trace", file=file@entry=0x7f718b6d10b0 "client-rpc-fops_v2.c", function=function@entry=0x7f718b6d1b80 <__FUNCTION__.23859> "client4_0_create_cbk", line=line@entry=2119, level=level@entry=GF_LOG_TRACE, errnum=errnum@entry=0, trace=0, msgid=0, fmt=0x7f718b6cc8e8 "stack-address: %p, %s returned %d") at logging.c:2004
#6  0x00007f718b6b110a in client4_0_create_cbk (req=0x7f7184021a38, iov=<optimized out>, count=<optimized out>, myframe=0x7f718403fb18) at client-rpc-fops_v2.c:2117
#7  0x00007f71906c1e9e in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f717c05dd60, pollin=pollin@entry=0x7f718401e760) at rpc-clnt.c:770
#8  0x00007f71906c2283 in rpc_clnt_notify (trans=0x7f717c05e020, mydata=0x7f717c05dd90, event=<optimized out>, data=0x7f718401e760) at rpc-clnt.c:938
#9  0x00007f71906bee23 in rpc_transport_notify (this=this@entry=0x7f717c05e020, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f718401e760) at rpc-transport.c:545
#10 0x00007f718bfaee38 in socket_event_poll_in_async (xl=<optimized out>, async=0x7f718401e888) at socket.c:2601
#11 0x00007f718bfb7b1c in gf_async (cbk=0x7f718bfaee10 <socket_event_poll_in_async>, xl=<optimized out>, async=<optimized out>) at ../../../../libglusterfs/src/glusterfs/async.h:189
#12 socket_event_poll_in (notify_handled=true, this=0x7f717c05e020) at socket.c:2642
#13 socket_event_handler (event_thread_died=0 '\000', poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x7f717c05e020, gen=1, idx=4, fd=2080761552) at socket.c:3033
#14 socket_event_handler (fd=fd@entry=10, idx=idx@entry=4, gen=gen@entry=1, data=data@entry=0x7f717c05e020, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000') at socket.c:2960
#15 0x00007f719077be0b in event_dispatch_epoll_handler (event=0x7f718beec154, event_pool=0x55aa65ba6760) at event-epoll.c:642
#16 event_dispatch_epoll_worker (data=0x55aa65bf0130) at event-epoll.c:755
#17 0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#18 0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f718a093700 (LWP 6495)):
#0  0x00007f719064f6ec in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f718efdbf73 in timed_response_loop (data=<optimized out>) at fuse-bridge.c:4913
#2  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f718d7ce700 (LWP 6468)):
#0  0x00007f719064fa3b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7190759006 in syncenv_task (proc=proc@entry=0x55aa65bc2440) at syncop.c:517
#2  0x00007f7190759c50 in syncenv_processor (thdata=0x55aa65bc2440) at syncop.c:584
#3  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#4  0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f718c7cc700 (LWP 6470)):
#0  0x00007f719056dd5f in select () from /lib64/libc.so.6
#1  0x00007f71907937cd in runner (arg=0x55aa65bc65f0) at ../../contrib/timer-wheel/timer-wheel.c:186
#2  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f718cfcd700 (LWP 6469)):
#0  0x00007f719064fa3b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7190759006 in syncenv_task (proc=proc@entry=0x55aa65bc2820) at syncop.c:517
#2  0x00007f7190759c50 in syncenv_processor (thdata=0x55aa65bc2820) at syncop.c:584
#3  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#4  0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f718dfcf700 (LWP 6467)):
#0  0x00007f7190542878 in nanosleep () from /lib64/libc.so.6
#1  0x00007f719054277e in sleep () from /lib64/libc.so.6
#2  0x00007f71907449ad in pool_sweeper (arg=<optimized out>) at mem-pool.c:446
#3  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#4  0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f7189892700 (LWP 6497)):
#0  0x00007f719064f6ec in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f718efdb15b in notify_kernel_loop (data=<optimized out>) at fuse-bridge.c:4828
#2  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f718e7d0700 (LWP 6466)):
#0  0x00007f71904b234c in sigtimedwait () from /lib64/libc.so.6
#1  0x00007f7190653bbc in sigwait () from /lib64/libpthread.so.0
#2  0x000055aa656fcbe3 in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2416
#3  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#4  0x00007f7190576703 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f718efd1700 (LWP 6465)):
#0  0x00007f71904b157f in raise () from /lib64/libc.so.6
#1  0x00007f719049b895 in abort () from /lib64/libc.so.6
#2  0x00007f71904f49d7 in __libc_message () from /lib64/libc.so.6
#3  0x00007f71904f4a9a in __libc_fatal () from /lib64/libc.so.6
#4  0x00007f719064f98d in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#5  0x00007f7190729911 in gf_timer_proc (data=0x55aa65bc1f20) at timer.c:140
#6  0x00007f719064958b in start_thread () from /lib64/libpthread.so.0
#7  0x00007f7190576703 in clone () from /lib64/libc.so.6
(gdb)

2> trace log of glusterfs process
[2019-10-31 01:56:19.727981] T [socket.c:2993:socket_event_handler] 0-ccs-client-1: client (sock:7) in:1, out:0, err:0
[2019-10-31 01:56:19.727992] T [socket.c:3019:socket_event_handler] 0-ccs-client-1: Client socket (7) is already connected
[2019-10-31 01:56:19.728000] T [socket.c:574:__socket_ssl_readv] 0-ccs-client-1: ***** reading over non-SSL
[2019-10-31 01:56:19.728010] T [socket.c:574:__socket_ssl_readv] 0-ccs-client-1: ***** reading over non-SSL
[2019-10-31 01:56:19.728037] T [rpc-clnt.c:663:rpc_clnt_reply_init] 0-ccs-client-1: received rpc message (RPC XID: 0x1888d Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 13) from rpc-transport (ccs-client-1)
[2019-10-31 01:56:19.728058] T [MSGID: 0] [client-rpc-fops_v2.c:686:client4_0_writev_cbk] 0-stack-trace: stack-address: 0x7f716c011d78, ccs-client-1 returned 1000
[2019-10-31 01:56:19.728091] T [MSGID: 0] [afr-inode-write.c:230:afr_writev_unwind] 0-stack-trace: stack-address: 0x7f716c030bd8, ccs-replicate-0 returned 1000
[2019-10-31 01:56:19.728104] T [MSGID: 0] [utime-autogen-fops.c:63:gf_utime_writev_cbk] 0-stack-trace: stack-address: 0x7f716c030bd8, ccs-utime returned 1000
[2019-10-31 01:56:19.728204] D [write-behind.c:754:__wb_fulfill_request] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x12f)[0x7f719071cccf] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0x87cb)[0x7f718b4f17cb] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0x8b50)[0x7f718b4f1b50] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0xaca8)[0x7f718b4f3ca8] (--> /usr/lib64/glusterfs/7.0/xlator/features/utime.so(+0x3b92)[0x7f718bf9bb92] ))))) 0-ccs-write-behind: (unique=88403, fop=WRITE, gfid=dff7c9b9-460e-48a0-b814-90f1344e5383, gen=0): request fulfilled. removing the request from liability queue? = yes
[2019-10-31 01:56:19.728338] D [write-behind.c:419:__wb_request_unref] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x12f)[0x7f719071cccf] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0x367a)[0x7f718b4ec67a] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0x8816)[0x7f718b4f1816] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0x8b50)[0x7f718b4f1b50] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0xaca8)[0x7f718b4f3ca8] ))))) 0-ccs-write-behind: (unique = 88403, fop=WRITE, gfid=dff7c9b9-460e-48a0-b814-90f1344e5383, gen=0): destroying request, removing from all queues
[2019-10-31 01:56:19.728445] D [write-behind.c:1765:wb_process_queue] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x12f)[0x7f719071cccf] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0xa9cc)[0x7f718b4f39cc] (--> /usr/lib64/glusterfs/7.0/xlator/performance/write-behind.so(+0xab14)[0x7f718b4f3b14] (--> /usr/lib64/glusterfs/7.0/xlator/features/utime.so(+0x3b92)[0x7f718bf9bb92] (--> /usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x2741d)[0x7f718b5ea41d] ))))) 0-ccs-write-behind: processing queues
[2019-10-31 01:56:19.728528] D [MSGID: 0] [write-behind.c:1717:__wb_pick_winds] 0-ccs-write-behind: (unique=88405, fop=FLUSH, gfid=dff7c9b9-460e-48a0-b814-90f1344e5383, gen=1): picking the request for winding
[2019-10-31 01:56:19.728547] T [MSGID: 0] [write-behind.c:1997:wb_flush_helper] 0-stack-trace: stack-address: 0x7f717c0617b8, winding from ccs-write-behind to ccs-utime
[2019-10-31 01:56:19.728560] T [MSGID: 0] [defaults.c:2533:default_flush] 0-stack-trace: stack-address: 0x7f717c0617b8, winding from ccs-utime to ccs-dht
[2019-10-31 01:56:19.728572] T [MSGID: 0] [defaults.c:2533:default_flush] 0-stack-trace: stack-address: 0x7f717c0617b8, winding from ccs-dht to ccs-replicate-0
[2019-10-31 01:56:19.728604] T [MSGID: 0] [afr-transaction.c:1959:afr_changelog_do] 0-stack-trace: stack-address: 0x7f716c011d78, winding from ccs-replicate-0 to ccs-client-0
[2019-10-31 01:56:19.728620] D [MSGID: 101016] [glusterfs3.h:781:dict_to_xdr] 0-dict: key 'trusted.afr.dirty' would not be sent on wire in the future [Invalid argument]
[2019-10-31 01:56:19.728635] D [MSGID: 101016] [glusterfs3.h:781:dict_to_xdr] 0-dict: key 'trusted.afr.ccs-client-2' would not be sent on wire in the future [Invalid argument]
[2019-10-31 01:56:19.728667] D [logging.c:1690:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2019-10-31 01:56:19.728661] D [MSGID: 101016] [glusterfs3.h:781:dict_to_xdr] 0-dict: key 'trusted.afr.ccs-client-1' would not be sent on wire in the future [Invalid argument]
[2019-10-31 01:56:19.728742] D [MSGID: 101016] [glusterfs3.h:781:dict_to_xdr] 0-dict: key 'trusted.afr.ccs-client-0' would not be sent on wire in the future [Invalid argument]
[2019-10-31 01:56:19.728766] T [rpc-clnt.c:1451:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 336, payload: 252, rpc hdr: 84
[2019-10-31 01:56:19.728487] T [rpc-clnt.c:1738:rpc_clnt_submit] 0-rpc-clnt: submitted request (unique: 88406, XID: 0x1891b, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 23) to rpc-transport (ccs-client-0)
[2019-10-31 01:56:19.728788] T [MSGID: 0] [afr-dir-write.c:422:afr_create_wind] 0-stack-trace: stack-address: 0x7f717803afb8, winding from ccs-replicate-0 to ccs-client-1
[2019-10-31 01:56:19.728807] T [rpc-clnt.c:1451:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 180, payload: 92, rpc hdr: 88
[2019-10-31 01:56:19.728821] D [logging.c:1693:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
pending frames:
frame : type(1) op(CREATE)
frame : type(1) op(CREATE)
frame : type(1) op(FLUSH)
frame : type(1) op(FLUSH)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
[2019-10-31 01:56:19.728833] T [rpc-clnt.c:1738:rpc_clnt_submit] 0-rpc-clnt: submitted request (unique: 88406, XID: 0x1888e, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 23) to rpc-transport (ccs-client-1)
[2019-10-31 01:56:19.729249] T [MSGID: 0] [afr-dir-write.c:422:afr_create_wind] 0-stack-trace: stack-address: 0x7f717803afb8, winding from ccs-replicate-0 to ccs-client-2
[2019-10-31 01:56:19.729273] T [rpc-clnt.c:1451:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 180, payload: 92, rpc hdr: 88
[2019-10-31 01:56:19.729299] T [rpc-clnt.c:1738:rpc_clnt_submit] 0-rpc-clnt: submitted request (unique: 88406, XID: 0x188cd, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 23) to rpc-transport (ccs-client-2)
[2019-10-31 01:56:19.729315] T [socket.c:3037:socket_event_handler] 0-ccs-client-1: (sock:7) socket_event_poll_in returned 0
[2019-10-31 01:56:19.729326] T [socket.c:2993:socket_event_handler] 0-ccs-client-0: client (sock:10) in:1, out:0, err:0
[2019-10-31 01:56:19.729334] T [socket.c:3019:socket_event_handler] 0-ccs-client-0: Client socket (10) is already connected
[2019-10-31 01:56:19.729341] T [socket.c:574:__socket_ssl_readv] 0-ccs-client-0: ***** reading over non-SSL
[2019-10-31 01:56:19.729405] T [socket.c:574:__socket_ssl_readv] 0-ccs-client-0: ***** reading over non-SSL
[2019-10-31 01:56:19.728901] T [rpc-clnt.c:1738:rpc_clnt_submit] 0-rpc-clnt: submitted request (unique: 88404, XID: 0x1891c, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 34) to rpc-transport (ccs-client-0)
time of crash:
2019-10-31 01:56:19
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 7.0
[2019-10-31 01:56:19.729478] T [rpc-clnt.c:663:rpc_clnt_reply_init] 0-ccs-client-0: received rpc message (RPC XID: 0x1891b Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 23) from rpc-transport (ccs-client-0)
/lib64/libglusterfs.so.0(+0x27c80)[0x7f719071ac80]
/lib64/libglusterfs.so.0(gf_print_trace+0x323)[0x7f7190725483]
/lib64/libc.so.6(+0x38600)[0x7f71904b1600]
/lib64/libc.so.6(gsignal+0x10f)[0x7f71904b157f]
/lib64/libc.so.6(abort+0x127)[0x7f719049b895]
/lib64/libc.so.6(+0x7b9d7)[0x7f71904f49d7]
/lib64/libc.so.6(__libc_fatal+0x2a)[0x7f71904f4a9a]
/lib64/libpthread.so.0(pthread_cond_timedwait+0x1ad)[0x7f719064f98d]
/lib64/libglusterfs.so.0(+0x36911)[0x7f7190729911]
/lib64/libpthread.so.0(+0x858b)[0x7f719064958b]
/lib64/libc.so.6(clone+0x43)[0x7f7190576703]

Comment 1 zhou lin 2019-11-01 06:40:16 UTC
I think this problem was probably introduced by the following patch; after I reverted it, the problem was gone:
SHA-1: 7cc20c1eb09c041b331c4add6c24212fbdcda3b4

* core: improve timer accuracy

Also fixed some issues on test ec-1468261.t.

Change-Id: If156f86af986d9eed13cdd1f15c5a7214cd11706
Updates: bz#1193929
Signed-off-by: Xavier Hernandez <jahernan@redhat.com>

Comment 2 Nithya Balachandran 2019-11-04 03:32:09 UTC
Hi Xavi,

Can you please take a look at this?

Regards,
Nithya

Comment 3 Xavi Hernandez 2019-11-20 12:39:17 UTC
Sorry for the delay.

Could you install the debug info for glibc and analyze the core again? It seems it failed because something was in a bad state (it's an abort, not a segmentation fault).

With the debug info we'll see exactly what is failing.

Thanks

Comment 4 zhou lin 2019-12-09 09:21:38 UTC
Hi, sorry for the late response.
What info do you need? The gdb info follows:


Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --acl --process-name fuse --volfile-server=mn-0.local --vol'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007ffff7c5e57f in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7ffff677e700 (LWP 1919))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-40.wf30.x86_64 libgcc-8.3.1-2.wf30.x86_64 libtirpc-1.1.4-0.wf30.x86_64 libuuid-2.33.2-2.wos1.wf30.x86_64 nss-mdns-0.14.1-3.wos1.wf30.x86_64 openssl-libs-1.1.1d-1.wf30.x86_64 zlib-1.2.11-15.wf30.x86_64
(gdb) bt
#0  0x00007ffff7c5e57f in raise () from /lib64/libc.so.6
#1  0x00007ffff7c48895 in abort () from /lib64/libc.so.6
#2  0x00007ffff7ca19d7 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff7ca1a9a in __libc_fatal () from /lib64/libc.so.6
#4  0x00007ffff7dfc8ad in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#5  0x00007ffff7ed5911 in gf_timer_proc (data=0x5555555e7f70) at timer.c:140
#6  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6
(gdb) frame 5
(gdb) print *(gf_timer_registry_t *)data
$2 = {active = {next = 0x7fffcc000b60, prev = 0x7fffec026ea0}, lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, 
      __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}, cond = {
    __data = {{__wseq = 3919, __wseq32 = {__low = 3919, __high = 0}}, {__g1_start = 3915, __g1_start32 = {__low = 3915, __high = 0}}, __g_refs = {0, 2}, 
      __g_size = {0, 0}, __g1_orig_size = 4, __wrefs = 10, __g_signals = {0, 0}}, 
    __size = "O\017\000\000\000\000\000\000K\017\000\000\000\000\000\000\000\000\000\000\002", '\000' <repeats 11 times>, "\004\000\000\000\n\000\000\000\000\000\000\000\000\000\000", __align = 3919}, th = 140737328441088, fin = 0 '\000'}
(gdb)


(gdb) thread apply all bt

Thread 12 (Thread 0x7ffff3680700 (LWP 1929)):
#0  0x00007ffff7d24717 in epoll_wait () from /lib64/libc.so.6
#1  0x00007ffff7f27d0c in event_dispatch_epoll_worker (data=0x555555616950) at event-epoll.c:745
#2  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7ffff4f7b700 (LWP 1923)):
#0  0x00007ffff7dfc95b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7f05006 in syncenv_task (proc=proc@entry=0x5555555e8490) at syncop.c:517
#2  0x00007ffff7f05c50 in syncenv_processor (thdata=0x5555555e8490) at syncop.c:584
#3  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7ffff577c700 (LWP 1922)):
#0  0x00007ffff7cf0718 in nanosleep () from /lib64/libc.so.6
#1  0x00007ffff7cf061e in sleep () from /lib64/libc.so.6
#2  0x00007ffff7ef09ad in pool_sweeper (arg=<optimized out>) at mem-pool.c:446
#3  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7ffff2e7f700 (LWP 1930)):
#0  0x00007ffff7d24717 in epoll_wait () from /lib64/libc.so.6
#1  0x00007ffff7f27d0c in event_dispatch_epoll_worker (data=0x555555633330) at event-epoll.c:745
#2  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7ffff1019700 (LWP 1970)):
#0  0x00007ffff7dfc60c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff6788f73 in timed_response_loop (data=<optimized out>) at fuse-bridge.c:4913
#2  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7ffff181a700 (LWP 1969)):
#0  0x00007ffff7d1adaf in readv () from /lib64/libc.so.6
#1  0x00007ffff7ef2759 in sys_readv (fd=<optimized out>, iov=iov@entry=0x7ffff1819150, iovcnt=iovcnt@entry=2) at syscall.c:328
#2  0x00007ffff67a1a93 in fuse_thread_proc (data=0x5555555d6ef0) at fuse-bridge.c:5957
#3  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7ffff477a700 (LWP 1924)):
#0  0x00007ffff7dfc95b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7f05006 in syncenv_task (proc=proc@entry=0x5555555e8870) at syncop.c:517
#2  0x00007ffff7f05c50 in syncenv_processor (thdata=0x5555555e8870) at syncop.c:584
#3  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7ffff5f7d700 (LWP 1921)):
#0  0x00007ffff7c5f34c in sigtimedwait () from /lib64/libc.so.6
#1  0x00007ffff7e009bc in sigwait () from /lib64/libpthread.so.0
#2  0x000055555555cbe3 in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2416
#3  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7ffff0818700 (LWP 1971)):
#0  0x00007ffff7dfc60c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff678815b in notify_kernel_loop (data=<optimized out>) at fuse-bridge.c:4828
#2  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7ffff779a480 (LWP 1915)):
#0  0x00007ffff7df798d in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x00007ffff7f274c7 in event_dispatch_epoll (event_pool=0x5555555cc760) at event-epoll.c:840
#2  0x000055555555c5d0 in main (argc=<optimized out>, argv=<optimized out>) at glusterfsd.c:2919

Thread 2 (Thread 0x7ffff3f79700 (LWP 1925)):
#0  0x00007ffff7d1b9af in select () from /lib64/libc.so.6
#1  0x00007ffff7f3f7cd in runner (arg=0x5555555ec640) at ../../contrib/timer-wheel/timer-wheel.c:186
#2  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff677e700 (LWP 1919)):
#0  0x00007ffff7c5e57f in raise () from /lib64/libc.so.6
#1  0x00007ffff7c48895 in abort () from /lib64/libc.so.6
#2  0x00007ffff7ca19d7 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff7ca1a9a in __libc_fatal () from /lib64/libc.so.6
#4  0x00007ffff7dfc8ad in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#5  0x00007ffff7ed5911 in gf_timer_proc (data=0x5555555e7f70) at timer.c:140
#6  0x00007ffff7df64a7 in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff7d243e3 in clone () from /lib64/libc.so.6
(gdb) 

Comment 5 zhou lin 2019-12-09 09:22:21 UTC
After I reverted that patch, the issue disappeared.

Comment 6 zhou lin 2019-12-09 09:27:39 UTC
(gdb) frame 5
#5  0x00007ffff7ed5911 in gf_timer_proc (data=0x5555555e7f70) at timer.c:140
140	             * system more responsive in most cases, but doesn't
(gdb) print TS(now)
No symbol "TS" in current context.
(gdb) print now
$1 = {tv_sec = 706, tv_nsec = 348142970}
(gdb) print event->at
$2 = {tv_sec = 7596555718327886697, tv_nsec = -1149262114398638491}
(gdb)

Comment 7 Xavi Hernandez 2019-12-09 10:39:32 UTC
Can you show the address of 'event' and its complete contents?

This would also be useful:

(gdb) x/256xb (char *)event - 128

Comment 8 zhou lin 2019-12-10 01:06:17 UTC
warning: Loadable section ".note.gnu.property" outside of ELF segments
Core was generated by `/usr/sbin/glusterfs --acl --process-name fuse --volfile-server=mn-0.local --vol'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	  return ret;
[Current thread is 1 (Thread 0x7ffff677e700 (LWP 1919))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c48895 in __GI_abort () at abort.c:79
#2  0x00007ffff7ca19d7 in __libc_message (action=action@entry=(do_abort | do_backtrace), fmt=fmt@entry=0x7ffff7daee96 "%s") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff7ca1a9a in __GI___libc_fatal (message=message@entry=0x7ffff7e03000 "The futex facility returned an unexpected error code.\n")
    at ../sysdeps/posix/libc_fatal.c:191
#4  0x00007ffff7dfc8ad in futex_fatal_error () at ../sysdeps/nptl/futex-internal.h:200
#5  futex_reltimed_wait_cancelable (private=<optimized out>, reltime=<optimized out>, expected=<optimized out>, futex_word=<optimized out>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:159
#6  __pthread_cond_wait_common (abstime=0x7fffec001ce0, mutex=0x5555555e7f80, cond=0x5555555e7fa8) at pthread_cond_wait.c:533
#7  __pthread_cond_timedwait (cond=cond@entry=0x5555555e7fa8, mutex=mutex@entry=0x5555555e7f80, abstime=abstime@entry=0x7fffec001ce0) at pthread_cond_wait.c:667
#8  0x00007ffff7ed5911 in gf_timer_proc (data=0x5555555e7f70) at timer.c:140
#9  0x00007ffff7df64a7 in start_thread (arg=<optimized out>) at pthread_create.c:479
#10 0x00007ffff7d243e3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

Comment 9 zhou lin 2019-12-10 01:07:04 UTC
(gdb) frame 8
#8  0x00007ffff7ed5911 in gf_timer_proc (data=0x5555555e7f70) at timer.c:140
(gdb) x/256xb (char*)event -128
0x7fffec001c50:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001c58:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001c60:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001c68:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001c70:	0x49	0x4f	0x52	0x46	0x49	0x4c	0x45	0x5f
0x7fffec001c78:	0x31	0x37	0x5f	0x37	0x00	0x0d	0xf0	0xad
0x7fffec001c80:	0xba	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001c88:	0x75	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001c90:	0x30	0x8e	0x06	0xec	0xff	0x7f	0x00	0x00
0x7fffec001c98:	0x1e	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001ca0:	0x00	0x16	0x05	0xe4	0xff	0x7f	0x00	0x00
0x7fffec001ca8:	0xbe	0xba	0xfe	0xca	0x00	0x00	0x00	0x00
0x7fffec001cb0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001cb8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001cc0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001cc8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001cd0:	0x74	0x72	0x75	0x73	0x74	0x65	0x64	0x2e
0x7fffec001cd8:	0x61	0x66	0x72	0x2e	0x73	0x65	0x72	0x76
0x7fffec001ce0:	0x69	0x63	0x65	0x73	0x2d	0x63	0x6c	0x69
0x7fffec001ce8:	0x65	0x6e	0x74	0x2d	0x32	0x00	0x0d	0xf0
0x7fffec001cf0:	0xad	0xba	0x58	0xf2	0xff	0x7f	0x00	0x00
0x7fffec001cf8:	0x21	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001d00:	0xb0	0x96	0x02	0xec	0xff	0x7f	0x00	0x00
0x7fffec001d08:	0xd0	0xa1	0x06	0xec	0xff	0x7f	0x00	0x00
0x7fffec001d10:	0x20	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001d18:	0x94	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001d20:	0xbe	0xba	0xfe	0xca	0x00	0x00	0x00	0x00
0x7fffec001d28:	0xa0	0x7e	0x02	0xec	0xff	0x7f	0x00	0x00
0x7fffec001d30:	0xc0	0x0b	0x00	0xec	0xff	0x7f	0x00	0x00
0x7fffec001d38:	0x07	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x7fffec001d40:	0xa0	0x4f	0x08	0xe4	0xff	0x7f	0x00	0x00
0x7fffec001d48:	0xd0	0xb4	0x00	0xec	0xff	0x7f	0x00	0x00

Comment 10 zhou lin 2019-12-10 01:30:32 UTC
(gdb) print *(gf_timer_t *)event
$2 = {{list = {next = 0x2e64657473757274, prev = 0x767265732e726661}, {next = 0x2e64657473757274, prev = 0x767265732e726661}}, at = {
    tv_sec = 7596555718327886697, tv_nsec = -1149262114398638491}, callbk = 0x7ffff258baad <afr_changelog_thin_arbiter_post_op+285>, data = 0x21, 
  xl = 0x7fffec0296b0, fired = 208}
(gdb) print struct timespec(event->at)
A syntax error in expression, near `event->at)'.
(gdb) print (struct timespec)(event->at)
$3 = {tv_sec = 7596555718327886697, tv_nsec = -1149262114398638491}
(gdb) quit

Comment 11 zhou lin 2019-12-10 03:07:59 UTC
It seems this event is corrupted.

Comment 12 Xavi Hernandez 2019-12-12 12:27:28 UTC
Yes, the event object is invalid. In fact, it's not an event object anymore: it has been destroyed and the same address reused for another object.

Could you share the coredump so that I can analyze it? (I will also need the exact packages installed so that I can get the correct symbols while debugging.)

In the steps to reproduce the issue, you say that you do a statedump every 60 seconds. Is that necessary to trigger the crash, or does it also happen without that? If it's needed, does the crash happen just when a statedump is generated?

Comment 13 Arie Skliarouk 2019-12-16 16:12:36 UTC
Created attachment 1645626 [details]
glusterfs 7 client crash logs

Not sure whether my problem has the same root cause as the one under discussion.
Here is our gluster setup. We don't need speed, but we do need HA.
The bricks on gl1 and gl2 sit on SSDs; the bricks on gl3 sit on magnetic storage. The idea is that gl3 serves mostly as a backup (one full copy) while gl1 and gl2 each keep half a copy (using the smallish arbiter bricks).

Volume Name: nafsha
Type: Distributed-Replicate
Volume ID: fdbf91e6-0607-46c8-b538-103180b5ab19
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: gl1:/mnt/nafshab1/BRICK
Brick2: gl3:/mnt/nafshab1/BRICK
Brick3: gl2:/mnt/nafshab1/BRICK (arbiter)
Brick4: gl2:/mnt/nafshab2/BRICK
Brick5: gl3:/mnt/nafshab2/BRICK
Brick6: gl1:/mnt/nafshab2/BRICK (arbiter)
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off

The problem is that when we start an rsync of many small files onto a glusterfs client (on a 4th machine), the client reproducibly crashes after several minutes of operation. See the attached file for details of the different resulting crashes.

Comment 14 Xavi Hernandez 2019-12-17 07:14:20 UTC
Hi Arie,

based on the logs, it could be the same problem, but there's not enough information there to be sure. Is it possible to share the coredump so that I can analyze it? I also need your operating system version and gluster packages to be able to correctly analyze the coredump.

Comment 15 Arie Skliarouk 2019-12-17 08:52:56 UTC
We use glusterfs packages version 7.0-ubuntu1~bionic1 on ubuntu 18.04.3 amd64 (add-apt-repository ppa:gluster/glusterfs-7).

The core is 447MB, compressed - 50MB, uploaded here:
http://skliarie.meshanet.com/tmp/glusterfs7-client.core.gz

Comment 16 Xavi Hernandez 2019-12-18 19:12:42 UTC
Thanks. I've checked the coredump and it is the same failure. I'm analyzing it.

Comment 17 Xavi Hernandez 2019-12-19 12:06:19 UTC
I think I've found the problem. Could you build gluster with the patch [1] and see if it works for you?

[1] https://review.gluster.org/c/glusterfs/+/23908

Comment 18 Arie Skliarouk 2019-12-26 12:27:49 UTC
I would rather install a deb package that includes that patch.

Comment 19 Arie Skliarouk 2019-12-30 09:22:15 UTC
FYI:
The fix made its way into glusterfs-client 7.1, available on the ppa:gluster/glusterfs-7
https://github.com/gluster/glusterfs/compare/v7.1...master

Installed, testing..

Comment 20 Worker Ant 2020-01-09 11:11:10 UTC
REVIEW: https://review.gluster.org/23983 (timer: fix event destruction race) posted (#1) for review on release-7 by Xavi Hernandez

Comment 21 Xavi Hernandez 2020-01-09 11:12:23 UTC
The official 7.1 release doesn't include the patch. I've just backported it so that it will be present in the next release.

Comment 22 Worker Ant 2020-01-10 05:37:21 UTC
REVIEW: https://review.gluster.org/23983 (timer: fix event destruction race) merged (#2) on release-7 by Rinku Kothiya

