Description of problem:
=======================
glusterd crashed while doing an in-service upgrade from 7.4+3.3.1_async to 7.5+3.4.0.

Version-Release number of selected component (if applicable):
3.12.2-4.el7rhgs.x86_64

How reproducible:
Started an in-service upgrade on a six-node setup; glusterd crashed while node 4 was being upgraded.

Steps to Reproduce:
===================
1) Have a six-node 7.4+3.3.1_async cluster.
2) Create 30 volumes (x2, x3, Distributed-Disperse) and enable brick-mux.
3) Mount the volumes on clients and start IO from a few mount points.
4) Start the in-service upgrade.

Actual results:
===============
On the node where the upgrade is in progress, glusterd crashed.

Expected results:
=================
No crashes.

Output snips:
================

* The following message was seen on the console:

  Cleanup : libgcc-4.8.5-16.el7.x86_64 752/752
/var/tmp/rpm-tmp.BkBr28: line 26: 2146 Segmentation fault (core dumped) glusterd --xlator-option *.upgrade=on -N
rhel-75repo/productid | 1.6 kB 00:00:00
  Verifying : device-mapper-persistent-data-0.7.3-3.el7.x86_64 1/752

* glusterd.log
===============
->/lib64/libglusterfs.so.0(gf_timer_call_after+0x229) [0x7f79dd92d4a9] ) 0-timer: Either ctx is NULL or ctx cleanup started [Invalid argument]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-02-14 06:32:24
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f79dd91f3f0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79dd929334]
/lib64/libc.so.6(+0x36280)[0x7f79dbf7f280]
/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7f79d1e18126]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x8c0cd)[0x7f79d24130cd]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x8c765)[0x7f79d2413765]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x75aca)[0x7f79d23fcaca]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f79dd6e7d90]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1e7)[0x7f79dd6e8077]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f79dd6e3ea3]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x74c6)[0x7f79cf5f44c6]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9a6c)[0x7f79cf5f6a6c]
/lib64/libglusterfs.so.0(+0x88344)[0x7f79dd97c344]
/lib64/libpthread.so.0(+0x7dd5)[0x7f79dc77edd5]
/lib64/libc.so.6(clone+0x6d)[0x7f79dc047b3d]
---------

* Full BT
=============
(gdb) bt full
#0  0x00007f79d1e18126 in rcu_read_unlock_bp () from /lib64/liburcu-bp.so.1
No symbol table info available.
#1  0x00007f79d24130cd in glusterd_mgmt_handshake (this=this@entry=0x5571dc67b0b0, peerctx=peerctx@entry=0x5571dcde0410) at glusterd-handshake.c:2067
        frame = <optimized out>
        req = {hndsk = {hndsk_len = 57, hndsk_val = 0x7f79b8009a90 ""}}
        peerinfo = <optimized out>
        req_dict = 0x7f79b8009440
        ret = 0
        __FUNCTION__ = "glusterd_mgmt_handshake"
#2  0x00007f79d2413765 in __glusterd_peer_dump_version_cbk (req=req@entry=0x7f79b8005170, iov=iov@entry=0x7f79b80051b0, count=count@entry=1, myframe=myframe@entry=0x7f79b8004200) at glusterd-handshake.c:2222
        ret = 368
        rsp = {gfs_id = 0, op_ret = 0, op_errno = 0, prog = 0x7f79b80065b0}
        this = 0x5571dc67b0b0
        trav = 0x0
        next = 0x0
        frame = 0x7f79b8004200
        peerinfo = 0x5571dcdd6fe0
        peerctx = 0x5571dcde0410
        conf = 0x7f79ddc76050
        msg = '\000' <repeats 1023 times>
        __FUNCTION__ = "__glusterd_peer_dump_version_cbk"
#3  0x00007f79d23fcaca in glusterd_big_locked_cbk (req=0x7f79b8005170, iov=0x7f79b80051b0, count=1, myframe=0x7f79b8004200, fn=0x7f79d2413610 <__glusterd_peer_dump_version_cbk>) at glusterd-rpc-ops.c:223
        priv = 0x7f79ddc76050
        ret = -1
#4  0x00007f79dd6e7d90 in rpc_clnt_handle_reply (clnt=clnt@entry=0x5571dcde1350, pollin=pollin@entry=0x7f79b8007e80) at rpc-clnt.c:778
        conn = 0x5571dcde1380
        saved_frame = <optimized out>
        ret = 0
        req = 0x7f79b8005170
        xid = 2
        __FUNCTION__ = "rpc_clnt_handle_reply"
#5  0x00007f79dd6e8077 in rpc_clnt_notify (trans=<optimized out>, mydata=0x5571dcde1380, event=<optimized out>, data=0x7f79b8007e80) at rpc-clnt.c:971
        conn = 0x5571dcde1380
        clnt = 0x5571dcde1350
        ret = -1
        req_info = 0x0
        pollin = 0x7f79b8007e80
        clnt_mydata = 0x0
        old_THIS = 0x5571dc67b0b0
        __FUNCTION__ = "rpc_clnt_notify"
#6  0x00007f79dd6e3ea3 in rpc_transport_notify (this=this@entry=0x5571dcde1580, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f79b8007e80) at rpc-transport.c:538
        ret = -1
        __FUNCTION__ = "rpc_transport_notify"
#7  0x00007f79cf5f44c6 in socket_event_poll_in (this=this@entry=0x5571dcde1580, notify_handled=<optimized out>) at socket.c:2315
        ret = <optimized out>
        pollin = <optimized out>
        priv = 0x5571dcde1fb0
        ctx = <optimized out>
#8  0x00007f79cf5f6a6c in socket_event_handler (fd=14, idx=5, gen=1, data=0x5571dcde1580, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
        this = 0x5571dcde1580
        priv = 0x5571dcde1fb0
        ret = 0
        ctx = 0x5571dc633010
        socket_closed = _gf_false
        notify_handled = _gf_false
        __FUNCTION__ = "socket_event_handler"
#9  0x00007f79dd97c344 in event_dispatch_epoll_handler (event=0x7f79cd43ee80, event_pool=0x5571dc66aa30) at event-epoll.c:583
        handler = 0x7f79cf5f68c0 <socket_event_handler>
        gen = 1
        slot = 0x5571dc6ad6d8
        data = 0x5571dcde1580
        ret = -1
        fd = 14
        ev_data = 0x7f79cd43ee84
        idx = 5
        handled_error_previously = _gf_false
#10 event_dispatch_epoll_worker (data=0x5571dcde3f80) at event-epoll.c:659
        event = {events = 1, data = {ptr = 0x100000005, fd = 5, u32 = 5, u64 = 4294967301}}
        ret = <optimized out>
        ev_data = 0x5571dcde3f80
        event_pool = 0x5571dc66aa30
        myindex = 1
        timetodie = 0
        __FUNCTION__ = "event_dispatch_epoll_worker"
#11 0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#12 0x00007f79dc047b3d in clone () from /lib64/libc.so.6
No symbol table info available.
* Seeing the below warnings in the glusterd logs:

[2018-02-13 11:39:13.218043] I [MSGID: 106502] [glusterd-handler.c:2834:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-02-13 11:39:13.246752] I [MSGID: 106493] [glusterd-rpc-ops.c:693:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f4985910-8e61-42f7-951f-f614b32dac46
[2018-02-13 11:39:14.509711] I [MSGID: 106493] [glusterd-rpc-ops.c:693:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f4985910-8e61-42f7-951f-f614b32dac46
[2018-02-14 05:35:49.442684] W [MSGID: 101095] [xlator.c:147:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.8.4/xlator/features/ganesha.so: cannot open shared object file: No such file or directory
[2018-02-14 05:37:33.807570] W [socket.c:595:__socket_rwv] 0-management: readv on 10.70.42.173:24007 failed (No data available)
[2018-02-14 05:37:33.812968] I [MSGID: 106004] [glusterd-handler.c:6276:__glusterd_peer_rpc_notify] 0-management: Peer <dhcp42-173.lab.eng.blr.redhat.com> (<07f51608-cc82-45db-9681-f78beac254ad>), in state <Peer in Cluster>, has disconnected from glusterd.
[2018-02-14 05:37:33.817654] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x226aa) [0x7f0d1098a6aa] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2c458) [0x7f0d10994458] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdccba) [0x7f0d10a44cba] ) 0-management: Lock for vol distrep10 not held
[2018-02-14 05:37:33.818250] W [MSGID: 106118] [glusterd-handler.c:6301:__glusterd_peer_rpc_notify] 0-management: Lock not released for distrep10
[2018-02-14 05:37:33.818487] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x226aa) [0x7f0d1098a6aa] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2c458) [0x7f0d10994458] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdccba) [0x7f0d10a44cba] ) 0-management: Lock for vol distrep11 not held
[2018-02-14 05:37:33.818566] W [MSGID: 106118] [glusterd-handler.c:6301:__glusterd_peer_rpc_notify] 0-management: Lock not released for distrep11
[2018-02-14 05:37:33.818755] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x226aa) [0x7f0d1098a6aa] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2c458) [0x7f0d10994458] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdccba) [0x7f0d10a44cba] ) 0-management: Lock for vol distrep12 not held
[2018-02-14 05:37:33.818816] W [MSGID: 106118] [glusterd-handler.c:6301:__glusterd_peer_rpc_notify] 0-management: Lock not released for distrep12
[2018-02-14 05:37:33.819061] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x226aa) [0x7f0d1098a6aa] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2c458) [0x7f0d10994458] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdccba) [0x7f0d10a44cba] ) 0-management: Lock for vol distrep13 not held
[2018-02-14 05:37:33.819126] W [MSGID: 106118] [glusterd-handler.c:6301:__glusterd_peer_rpc_notify] 0-management: Lock not released for distrep13
* Thread apply all bt
=====================
(gdb) t a a bt

Thread 8 (Thread 0x7f79d3f1d700 (LWP 2149)):
#0  0x00007f79dc00e4fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f79dc00e394 in sleep () from /lib64/libc.so.6
#2  0x00007f79dd94791d in pool_sweeper (arg=<optimized out>) at mem-pool.c:481
#3  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f79d371c700 (LWP 2150)):
#0  0x00007f79dc782cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f79dd95a548 in syncenv_task (proc=proc@entry=0x5571dc672900) at syncop.c:603
#2  0x00007f79dd95b410 in syncenv_processor (thdata=0x5571dc672900) at syncop.c:695
#3  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f79d2f1b700 (LWP 2151)):
#0  0x00007f79dc782cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f79dd95a548 in syncenv_task (proc=proc@entry=0x5571dc672cc0) at syncop.c:603
#2  0x00007f79dd95b410 in syncenv_processor (thdata=0x5571dc672cc0) at syncop.c:695
#3  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f79cdc40700 (LWP 2162)):
#0  0x00007f79dc782945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f79d246996b in hooks_worker (args=<optimized out>) at glusterd-hooks.c:529
#2  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f79d4f1f700 (LWP 2147)):
#0  0x00007f79dc785eed in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f79dd92d126 in gf_timer_proc (data=0x5571dc6720e0) at timer.c:165
#2  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f79dde01780 (LWP 2146)):
#0  0x00007f79dc77ff47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f79dd97c978 in event_dispatch_epoll (event_pool=0x5571dc66aa30) at event-epoll.c:746
#2  0x00005571daf9a097 in main (argc=4, argv=<optimized out>) at glusterfsd.c:2570

Thread 2 (Thread 0x7f79d471e700 (LWP 2148)):
#0  0x00007f79dbfc61d0 in _IO_default_setbuf () from /lib64/libc.so.6
#1  0x00007f79dbfc2769 in __GI__IO_file_setbuf () from /lib64/libc.so.6
#2  0x00007f79dbfc6b52 in _IO_cleanup () from /lib64/libc.so.6
#3  0x00007f79dbf82b1b in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007f79dbf82bb7 in exit () from /lib64/libc.so.6
#5  0x00005571daf9d30b in cleanup_and_exit (signum=15) at glusterfsd.c:1444
#6  0x00005571daf9d405 in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2165
#7  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f79cd43f700 (LWP 2163)):
#0  0x00007f79d1e18126 in rcu_read_unlock_bp () from /lib64/liburcu-bp.so.1
#1  0x00007f79d24130cd in glusterd_mgmt_handshake (this=this@entry=0x5571dc67b0b0, peerctx=peerctx@entry=0x5571dcde0410) at glusterd-handshake.c:2067
#2  0x00007f79d2413765 in __glusterd_peer_dump_version_cbk (req=req@entry=0x7f79b8005170, iov=iov@entry=0x7f79b80051b0, count=count@entry=1, myframe=myframe@entry=0x7f79b8004200) at glusterd-handshake.c:2222
#3  0x00007f79d23fcaca in glusterd_big_locked_cbk (req=0x7f79b8005170, iov=0x7f79b80051b0, count=1, myframe=0x7f79b8004200, fn=0x7f79d2413610 <__glusterd_peer_dump_version_cbk>) at glusterd-rpc-ops.c:223
#4  0x00007f79dd6e7d90 in rpc_clnt_handle_reply (clnt=clnt@entry=0x5571dcde1350, pollin=pollin@entry=0x7f79b8007e80) at rpc-clnt.c:778
#5  0x00007f79dd6e8077 in rpc_clnt_notify (trans=<optimized out>, mydata=0x5571dcde1380, event=<optimized out>, data=0x7f79b8007e80) at rpc-clnt.c:971
#6  0x00007f79dd6e3ea3 in rpc_transport_notify (this=this@entry=0x5571dcde1580, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f79b8007e80) at rpc-transport.c:538
#7  0x00007f79cf5f44c6 in socket_event_poll_in (this=this@entry=0x5571dcde1580, notify_handled=<optimized out>) at socket.c:2315
#8  0x00007f79cf5f6a6c in socket_event_handler (fd=14, idx=5, gen=1, data=0x5571dcde1580, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#9  0x00007f79dd97c344 in event_dispatch_epoll_handler (event=0x7f79cd43ee80, event_pool=0x5571dc66aa30) at event-epoll.c:583
#10 event_dispatch_epoll_worker (data=0x5571dcde3f80) at event-epoll.c:659
#11 0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f79dc047b3d in clone () from /lib64/libc.so.6
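Note on the thread dump above: thread 2 is already inside exit()'s handlers (cleanup_and_exit() on signal 15, presumably because the upgrade-mode glusterd run from the %post scriptlet was being shut down), while thread 1 is still inside a liburcu-bp read-side critical section servicing the peer dump-version reply and dies in rcu_read_unlock_bp(). The snippet below is only a minimal sketch of that suspected race, not glusterd code; the thread name rpc_reply_worker, the sleep interval, and the build line are made up for illustration. It shows a reader thread still calling rcu_read_lock()/rcu_read_unlock() (which map to the _bp variants seen in frame #0) while the main thread tears the process down with exit(), which is the pattern the two backtraces suggest.

/* Hypothetical illustration of the race implied by threads 1 and 2.
 * Build (assumption): gcc race.c -o race -lurcu-bp -lpthread
 */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <urcu-bp.h>   /* rcu_read_lock()/rcu_read_unlock() map to rcu_read_lock_bp()/rcu_read_unlock_bp() */

/* Stand-in for the epoll worker that runs the dump-version callback. */
static void *rpc_reply_worker(void *arg)
{
        (void)arg;
        for (;;) {
                rcu_read_lock();      /* read-side critical section, as around the peerinfo lookup */
                /* ... peer lookup / handshake setup would happen here ... */
                rcu_read_unlock();    /* frame #0 of the crash: rcu_read_unlock_bp() */
        }
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, rpc_reply_worker, NULL);

        /* Stand-in for cleanup_and_exit() on SIGTERM: the process exits
         * while the worker is still live, so exit handlers and library
         * teardown run concurrently with the reader's critical sections.
         * Whether this actually faults is timing-dependent; the point is
         * only that the reader is never quiesced before exit(). */
        usleep(10 * 1000);
        exit(EXIT_SUCCESS);
}

If this is indeed the window, serializing cleanup_and_exit() with in-flight RPC notifications (or quiescing the epoll workers before exit) would presumably close it; that is a guess from the dump, not a confirmed root cause.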