Bug 1545045 - [Glusterd] glusterd crashed while doing in-service upgrade from 7.4+3.3.1_async to 7.5+3.4.0
Keywords:
Status: CLOSED DUPLICATE of bug 1238067
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Atin Mukherjee
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-02-14 06:58 UTC by Prasad Desala
Modified: 2018-02-14 12:10 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-14 12:10:29 UTC
Embargoed:



Description Prasad Desala 2018-02-14 06:58:09 UTC
Description of problem:
=======================
glusterd crashed while doing an in-service upgrade from 7.4+3.3.1_async to 7.5+3.4.0.

Version-Release number of selected component (if applicable):
3.12.2-4.el7rhgs.x86_64

How reproducible:
Started an in-service upgrade on a six-node setup; glusterd crashed during the upgrade of node 4.

Steps to Reproduce:
===================
1) Have a six-node cluster running 7.4+3.3.1_async.
2) Create 30 volumes (x2, x3, and Distributed-Disperse) and enable brick-mux.
3) Mount the volumes on clients and start I/O from a few mount points.
4) Start the in-service upgrade.

Actual results:
===============
glusterd crashed on the node where the upgrade was in progress.

Expected results:
=================
No crashes

Output snips:
================
* The following message was seen on the console:

  Cleanup    : libgcc-4.8.5-16.el7.x86_64                                                                                                                                             752/752 
/var/tmp/rpm-tmp.BkBr28: line 26:  2146 Segmentation fault      (core dumped) glusterd --xlator-option *.upgrade=on -N
rhel-75repo/productid                                                  | 1.6 kB  00:00:00     
  Verifying  : device-mapper-persistent-data-0.7.3-3.el7.x86_64                         1/752 

* glusterd.log
===============

->/lib64/libglusterfs.so.0(gf_timer_call_after+0x229) [0x7f79dd92d4a9] ) 0-timer: Either ctx is NULL or ctx cleanup started [Invalid argument]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-02-14 06:32:24
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f79dd91f3f0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79dd929334]
/lib64/libc.so.6(+0x36280)[0x7f79dbf7f280]
/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7f79d1e18126]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x8c0cd)[0x7f79d24130cd]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x8c765)[0x7f79d2413765]
/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x75aca)[0x7f79d23fcaca]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f79dd6e7d90]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1e7)[0x7f79dd6e8077]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f79dd6e3ea3]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x74c6)[0x7f79cf5f44c6]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9a6c)[0x7f79cf5f6a6c]
/lib64/libglusterfs.so.0(+0x88344)[0x7f79dd97c344]
/lib64/libpthread.so.0(+0x7dd5)[0x7f79dc77edd5]
/lib64/libc.so.6(clone+0x6d)[0x7f79dc047b3d]
---------
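
Frame #0 is rcu_read_unlock_bp() from liburcu-bp, reached from glusterd_mgmt_handshake(): glusterd wraps its peer lookups in userspace-RCU read-side critical sections. For reference, a minimal, self-contained sketch of the read-side API involved (illustrative data structures, not glusterd code) looks like this:

/* Sketch of the liburcu-bp read-side API seen in frame #0 of the
 * crash. Illustrative only; build with: gcc rcu_sketch.c -lurcu-bp */
#include <stdio.h>
#include <urcu-bp.h>

struct peer { int generation; };      /* stand-in for a peerinfo */

static struct peer *shared_peer;

int
main (void)
{
        struct peer p = { .generation = 1 };

        rcu_assign_pointer (shared_peer, &p);

        rcu_read_lock ();                 /* enter read-side section */
        struct peer *rp = rcu_dereference (shared_peer);
        if (rp)
                printf ("peer generation %d\n", rp->generation);
        rcu_read_unlock ();               /* the call that faults in the
                                           * backtrace once per-thread
                                           * URCU state is torn down */
        return 0;
}

With liburcu-bp, threads register themselves lazily rather than explicitly, so the read-side calls are typically where broken per-thread state first surfaces.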

* Full BT
=============
(gdb) bt full
#0  0x00007f79d1e18126 in rcu_read_unlock_bp () from /lib64/liburcu-bp.so.1
No symbol table info available.
#1  0x00007f79d24130cd in glusterd_mgmt_handshake (this=this@entry=0x5571dc67b0b0, peerctx=peerctx@entry=0x5571dcde0410) at glusterd-handshake.c:2067
        frame = <optimized out>
        req = {hndsk = {hndsk_len = 57, hndsk_val = 0x7f79b8009a90 ""}}
        peerinfo = <optimized out>
        req_dict = 0x7f79b8009440
        ret = 0
        __FUNCTION__ = "glusterd_mgmt_handshake"
#2  0x00007f79d2413765 in __glusterd_peer_dump_version_cbk (req=req@entry=0x7f79b8005170, iov=iov@entry=0x7f79b80051b0, count=count@entry=1, myframe=myframe@entry=0x7f79b8004200)
    at glusterd-handshake.c:2222
        ret = 368
        rsp = {gfs_id = 0, op_ret = 0, op_errno = 0, prog = 0x7f79b80065b0}
        this = 0x5571dc67b0b0
        trav = 0x0
        next = 0x0
        frame = 0x7f79b8004200
        peerinfo = 0x5571dcdd6fe0
        peerctx = 0x5571dcde0410
        conf = 0x7f79ddc76050
        msg = '\000' <repeats 1023 times>
        __FUNCTION__ = "__glusterd_peer_dump_version_cbk"
#3  0x00007f79d23fcaca in glusterd_big_locked_cbk (req=0x7f79b8005170, iov=0x7f79b80051b0, count=1, myframe=0x7f79b8004200, fn=0x7f79d2413610 <__glusterd_peer_dump_version_cbk>)
    at glusterd-rpc-ops.c:223
        priv = 0x7f79ddc76050
        ret = -1
#4  0x00007f79dd6e7d90 in rpc_clnt_handle_reply (clnt=clnt@entry=0x5571dcde1350, pollin=pollin@entry=0x7f79b8007e80) at rpc-clnt.c:778
        conn = 0x5571dcde1380
        saved_frame = <optimized out>
        ret = 0
        req = 0x7f79b8005170
        xid = 2
        __FUNCTION__ = "rpc_clnt_handle_reply"
#5  0x00007f79dd6e8077 in rpc_clnt_notify (trans=<optimized out>, mydata=0x5571dcde1380, event=<optimized out>, data=0x7f79b8007e80) at rpc-clnt.c:971
        conn = 0x5571dcde1380
        clnt = 0x5571dcde1350
        ret = -1
        req_info = 0x0
        pollin = 0x7f79b8007e80
        clnt_mydata = 0x0
        old_THIS = 0x5571dc67b0b0
        __FUNCTION__ = "rpc_clnt_notify"
#6  0x00007f79dd6e3ea3 in rpc_transport_notify (this=this@entry=0x5571dcde1580, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f79b8007e80) at rpc-transport.c:538
        ret = -1
        __FUNCTION__ = "rpc_transport_notify"
#7  0x00007f79cf5f44c6 in socket_event_poll_in (this=this@entry=0x5571dcde1580, notify_handled=<optimized out>) at socket.c:2315
        ret = <optimized out>
        pollin = <optimized out>
        priv = 0x5571dcde1fb0
        ctx = <optimized out>
#8  0x00007f79cf5f6a6c in socket_event_handler (fd=14, idx=5, gen=1, data=0x5571dcde1580, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
        this = 0x5571dcde1580
        priv = 0x5571dcde1fb0
        ret = 0
        ctx = 0x5571dc633010
        socket_closed = _gf_false
        notify_handled = _gf_false
        __FUNCTION__ = "socket_event_handler"
#9  0x00007f79dd97c344 in event_dispatch_epoll_handler (event=0x7f79cd43ee80, event_pool=0x5571dc66aa30) at event-epoll.c:583
        handler = 0x7f79cf5f68c0 <socket_event_handler>
        gen = 1
        slot = 0x5571dc6ad6d8
        data = 0x5571dcde1580
        ret = -1
        fd = 14
        ev_data = 0x7f79cd43ee84
        idx = 5
        handled_error_previously = _gf_false
#10 event_dispatch_epoll_worker (data=0x5571dcde3f80) at event-epoll.c:659
        event = {events = 1, data = {ptr = 0x100000005, fd = 5, u32 = 5, u64 = 4294967301}}
        ret = <optimized out>
        ev_data = 0x5571dcde3f80
        event_pool = 0x5571dc66aa30
        myindex = 1
        timetodie = 0
        __FUNCTION__ = "event_dispatch_epoll_worker"
#11 0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#12 0x00007f79dc047b3d in clone () from /lib64/libc.so.6
No symbol table info available.

The following warnings are seen in the glusterd logs:

[2018-02-13 11:39:13.218043] I [MSGID: 106502] [glusterd-handler.c:2834:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-02-13 11:39:13.246752] I [MSGID: 106493] [glusterd-rpc-ops.c:693:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f4985910-8e61-42f7-951f-f614b32dac46
[2018-02-13 11:39:14.509711] I [MSGID: 106493] [glusterd-rpc-ops.c:693:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f4985910-8e61-42f7-951f-f614b32dac46
[2018-02-14 05:35:49.442684] W [MSGID: 101095] [xlator.c:147:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.8.4/xlator/features/ganesha.so: cannot open shared object file: No such file or directory
[2018-02-14 05:37:33.807570] W [socket.c:595:__socket_rwv] 0-management: readv on 10.70.42.173:24007 failed (No data available)
[2018-02-14 05:37:33.812968] I [MSGID: 106004] [glusterd-handler.c:6276:__glusterd_peer_rpc_notify] 0-management: Peer <dhcp42-173.lab.eng.blr.redhat.com> (<07f51608-cc82-45db-9681-f78beac254ad>), in state <Peer in Cluster>, has disconnected from glusterd.
[2018-02-14 05:37:33.817654] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x226aa) [0x7f0d1098a6aa] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2c458) [0x7f0d10994458] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdccba) [0x7f0d10a44cba] ) 0-management: Lock for vol distrep10 not held
[2018-02-14 05:37:33.818250] W [MSGID: 106118] [glusterd-handler.c:6301:__glusterd_peer_rpc_notify] 0-management: Lock not released for distrep10
[2018-02-14 05:37:33.818487] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x226aa) [0x7f0d1098a6aa] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2c458) [0x7f0d10994458] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdccba) [0x7f0d10a44cba] ) 0-management: Lock for vol distrep11 not held
[2018-02-14 05:37:33.818566] W [MSGID: 106118] [glusterd-handler.c:6301:__glusterd_peer_rpc_notify] 0-management: Lock not released for distrep11
[2018-02-14 05:37:33.818755] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x226aa) [0x7f0d1098a6aa] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2c458) [0x7f0d10994458] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdccba) [0x7f0d10a44cba] ) 0-management: Lock for vol distrep12 not held
[2018-02-14 05:37:33.818816] W [MSGID: 106118] [glusterd-handler.c:6301:__glusterd_peer_rpc_notify] 0-management: Lock not released for distrep12
[2018-02-14 05:37:33.819061] W [glusterd-locks.c:675:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x226aa) [0x7f0d1098a6aa] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x2c458) [0x7f0d10994458] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdccba) [0x7f0d10a44cba] ) 0-management: Lock for vol distrep13 not held
[2018-02-14 05:37:33.819126] W [MSGID: 106118] [glusterd-handler.c:6301:__glusterd_peer_rpc_notify] 0-management: Lock not released for distrep13
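
The "Lock for vol <name> not held" warnings line up with the disconnect notification just above them: on peer disconnect, __glusterd_peer_rpc_notify tries to release any mgmt_v3 lock the peer may have held on each volume, and the unlock path warns when no lock entry exists. A simplified, self-contained stand-in for that check (hypothetical structures, not the upstream dict-based code):

/* Simplified stand-in for the glusterd_mgmt_v3_unlock() "not held"
 * check; the real code looks the volume up in a lock dictionary and
 * warns when no owner entry exists. Hypothetical structures. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct vol_lock { const char *volname; int held; };

static struct vol_lock locks[] = {
        { "distrep10", 0 },
        { "distrep11", 0 },
};

static int
mgmt_v3_unlock_sketch (const char *volname)
{
        for (size_t i = 0; i < sizeof (locks) / sizeof (locks[0]); i++) {
                if (!strcmp (locks[i].volname, volname) && locks[i].held) {
                        locks[i].held = 0;    /* release the lock */
                        return 0;
                }
        }
        fprintf (stderr, "W: Lock for vol %s not held\n", volname);
        return -1;  /* caller then logs "Lock not released for <vol>" */
}

int
main (void)
{
        /* peer disconnect: try to unlock every volume it may have held */
        mgmt_v3_unlock_sketch ("distrep10");
        mgmt_v3_unlock_sketch ("distrep11");
        return 0;
}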

Comment 2 Prasad Desala 2018-02-14 07:33:36 UTC
(gdb) t a a bt

Thread 8 (Thread 0x7f79d3f1d700 (LWP 2149)):
#0  0x00007f79dc00e4fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f79dc00e394 in sleep () from /lib64/libc.so.6
#2  0x00007f79dd94791d in pool_sweeper (arg=<optimized out>) at mem-pool.c:481
#3  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f79d371c700 (LWP 2150)):
#0  0x00007f79dc782cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f79dd95a548 in syncenv_task (proc=proc@entry=0x5571dc672900) at syncop.c:603
#2  0x00007f79dd95b410 in syncenv_processor (thdata=0x5571dc672900) at syncop.c:695
#3  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f79d2f1b700 (LWP 2151)):
#0  0x00007f79dc782cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f79dd95a548 in syncenv_task (proc=proc@entry=0x5571dc672cc0) at syncop.c:603
#2  0x00007f79dd95b410 in syncenv_processor (thdata=0x5571dc672cc0) at syncop.c:695
#3  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f79cdc40700 (LWP 2162)):
#0  0x00007f79dc782945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f79d246996b in hooks_worker (args=<optimized out>) at glusterd-hooks.c:529
#2  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f79d4f1f700 (LWP 2147)):
#0  0x00007f79dc785eed in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f79dd92d126 in gf_timer_proc (data=0x5571dc6720e0) at timer.c:165
#2  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f79dde01780 (LWP 2146)):
#0  0x00007f79dc77ff47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f79dd97c978 in event_dispatch_epoll (event_pool=0x5571dc66aa30) at event-epoll.c:746
#2  0x00005571daf9a097 in main (argc=4, argv=<optimized out>) at glusterfsd.c:2570

Thread 2 (Thread 0x7f79d471e700 (LWP 2148)):
#0  0x00007f79dbfc61d0 in _IO_default_setbuf () from /lib64/libc.so.6
#1  0x00007f79dbfc2769 in __GI__IO_file_setbuf () from /lib64/libc.so.6
#2  0x00007f79dbfc6b52 in _IO_cleanup () from /lib64/libc.so.6
#3  0x00007f79dbf82b1b in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007f79dbf82bb7 in exit () from /lib64/libc.so.6
#5  0x00005571daf9d30b in cleanup_and_exit (signum=15) at glusterfsd.c:1444
#6  0x00005571daf9d405 in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2165
#7  0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f79dc047b3d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f79cd43f700 (LWP 2163)):
#0  0x00007f79d1e18126 in rcu_read_unlock_bp () from /lib64/liburcu-bp.so.1
#1  0x00007f79d24130cd in glusterd_mgmt_handshake (this=this@entry=0x5571dc67b0b0, peerctx=peerctx@entry=0x5571dcde0410) at glusterd-handshake.c:2067
#2  0x00007f79d2413765 in __glusterd_peer_dump_version_cbk (req=req@entry=0x7f79b8005170, iov=iov@entry=0x7f79b80051b0, count=count@entry=1, myframe=myframe@entry=0x7f79b8004200)
    at glusterd-handshake.c:2222
#3  0x00007f79d23fcaca in glusterd_big_locked_cbk (req=0x7f79b8005170, iov=0x7f79b80051b0, count=1, myframe=0x7f79b8004200, fn=0x7f79d2413610 <__glusterd_peer_dump_version_cbk>)
    at glusterd-rpc-ops.c:223
#4  0x00007f79dd6e7d90 in rpc_clnt_handle_reply (clnt=clnt@entry=0x5571dcde1350, pollin=pollin@entry=0x7f79b8007e80) at rpc-clnt.c:778
#5  0x00007f79dd6e8077 in rpc_clnt_notify (trans=<optimized out>, mydata=0x5571dcde1380, event=<optimized out>, data=0x7f79b8007e80) at rpc-clnt.c:971
#6  0x00007f79dd6e3ea3 in rpc_transport_notify (this=this@entry=0x5571dcde1580, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f79b8007e80) at rpc-transport.c:538
#7  0x00007f79cf5f44c6 in socket_event_poll_in (this=this@entry=0x5571dcde1580, notify_handled=<optimized out>) at socket.c:2315
#8  0x00007f79cf5f6a6c in socket_event_handler (fd=14, idx=5, gen=1, data=0x5571dcde1580, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#9  0x00007f79dd97c344 in event_dispatch_epoll_handler (event=0x7f79cd43ee80, event_pool=0x5571dc66aa30) at event-epoll.c:583
#10 event_dispatch_epoll_worker (data=0x5571dcde3f80) at event-epoll.c:659
#11 0x00007f79dc77edd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f79dc047b3d in clone () from /lib64/libc.so.6
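
The thread listing shows both sides of the suspected race: Thread 2 received SIGTERM (cleanup_and_exit, signum=15) and is inside exit()'s handlers, while Thread 1, an epoll worker, is still in the peer dump-version callback calling into liburcu-bp. Assuming the same teardown race as the duplicate bug 1238067 (exit() tearing down liburcu-bp per-thread state while a reader thread is active), a standalone sketch of that shape is below; it is illustrative only and may not crash deterministically:

/* Standalone sketch of the suspected shutdown race: a worker thread
 * sits in RCU read-side sections while the main thread calls exit(),
 * mirroring Thread 1 vs. Thread 2 above. Illustrative only.
 * Build: gcc race_sketch.c -lurcu-bp -lpthread */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <urcu-bp.h>

static void *
worker (void *arg)
{
        (void) arg;
        for (;;) {
                rcu_read_lock ();
                usleep (1000);      /* simulate RPC-callback work */
                rcu_read_unlock (); /* can fault if exit() teardown
                                     * has already run in main() */
        }
        return NULL;
}

int
main (void)
{
        pthread_t t;

        pthread_create (&t, NULL, worker, NULL);
        usleep (10 * 1000);
        exit (0);  /* like cleanup_and_exit() on SIGTERM in Thread 2 */
}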

