Bug 1294487

Summary: glusterfsd crash while bouncing the bricks
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: spandura
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED ERRATA
QA Contact: spandura
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: amukherj, asrivast, bsrirama, byarlaga, nlevinki, pkarampu, sankarshan, spandura, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.1.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.7.5-14
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-01 06:06:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description spandura 2015-12-28 12:23:30 UTC
Description of problem:
=========================
On a tiered volume, I was bringing down bricks and bringing them back online when I observed the following brick process crash.

[2015-12-28 11:13:54.432880] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f9d81ddca66] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f9d81ba79ce] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f9d81ba7ade] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f9d81ba949c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f9d81ba9ca8] ))))) 0-glusterfs: forced unwinding frame type(Gluster Portmap) op(SIGNIN(4)) called at 2015-12-28 11:13:50.630275 (xid=0x4)
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-12-28 11:13:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f9d81ddb002]
/lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f9d81df748d]
/lib64/libc.so.6(+0x35670)[0x7f9d804c9670]
/usr/sbin/glusterfsd(emancipate+0x8)[0x7f9d822adb68]
/usr/sbin/glusterfsd(+0xe7df)[0x7f9d822b37df]
/lib64/libgfrpc.so.0(saved_frames_unwind+0x205)[0x7f9d81ba79f5]
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f9d81ba7ade]
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f9d81ba949c]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f9d81ba9ca8]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f9d81ba5913]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xc352)[0x7f9d76a73352]
/lib64/libglusterfs.so.0(+0x878ca)[0x7f9d81e3c8ca]
/lib64/libpthread.so.0(+0x7dc5)[0x7f9d80c43dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f9d8058a1cd]


Version-Release number of selected component (if applicable):
===============================================================
glusterfs-server-3.7.5-13.el7rhgs.x86_64

How reproducible:
===================
Just observed once.

Steps to Reproduce:
=======================
Not sure of the exact test steps, as this was observed while running a test script (testing data self-heal while bricks go offline and come back online).

Actual results:
==============
Core was generated by `/usr/sbin/glusterfsd -s rhsauto019.lab.eng.blr.redhat.com --volfile-id testvol.'.
Program terminated with signal 11, Segmentation fault.
#0  emancipate (ctx=ctx@entry=0x0, ret=-1) at glusterfsd.c:1329
1329	        if (ctx->daemon_pipe[1] != -1) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libacl-2.2.51-12.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libuuid-2.23.2-26.el7.x86_64 openssl-libs-1.0.1e-42.el7_1.9.x86_64 pcre-8.32-15.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) bt full
#0  emancipate (ctx=ctx@entry=0x0, ret=-1) at glusterfsd.c:1329
No locals.
#1  0x00007f9d822b37df in mgmt_pmap_signin_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f9d7f8e706c) at glusterfsd-mgmt.c:2261
        rsp = {op_ret = -1, op_errno = 22}
        frame = 0x7f9d7f8e706c
        ret = <optimized out>
        emancipate_ret = <optimized out>
        pmap_req = {brick = 0x0, port = 0}
        cmd_args = <optimized out>
        ctx = <optimized out>
        brick_name = '\000' <repeats 4095 times>
        __FUNCTION__ = "mgmt_pmap_signin_cbk"
#2  0x00007f9d81ba79f5 in saved_frames_unwind (saved_frames=saved_frames@entry=0x7f9d68000920) at rpc-clnt.c:366
        trav = 0x7f9d83047b1c
        tmp = 0x7f9d68000928
        timestr = "2015-12-28 11:13:50.630275", '\000' <repeats 997 times>
        iov = {iov_base = 0x0, iov_len = 0}
        __FUNCTION__ = "saved_frames_unwind"
#3  0x00007f9d81ba7ade in saved_frames_destroy (frames=0x7f9d68000920) at rpc-clnt.c:383
No locals.
#4  0x00007f9d81ba949c in rpc_clnt_connection_cleanup (conn=conn@entry=0x7f9d83046460) at rpc-clnt.c:536
        saved_frames = 0x7f9d68000920
        clnt = 0x7f9d83046430
#5  0x00007f9d81ba9ca8 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f9d83046460, event=RPC_TRANSPORT_DISCONNECT, data=0x7f9d83047f90) at rpc-clnt.c:856
        conn = 0x7f9d83046460
        clnt = 0x7f9d83046430
        ret = -1
        req_info = 0x0
        pollin = 0x0
        clnt_mydata = 0x0
        old_THIS = 0x7f9d820801e0 <global_xlator>
        __FUNCTION__ = "rpc_clnt_notify"
#6  0x00007f9d81ba5913 in rpc_transport_notify (this=this@entry=0x7f9d83047f90, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f9d83047f90)
    at rpc-transport.c:545
        ret = -1
        __FUNCTION__ = "rpc_transport_notify"
#7  0x00007f9d76a73352 in socket_event_poll_err (this=0x7f9d83047f90) at socket.c:1151
        priv = 0x7f9d83048c30
        ret = -1
#8  socket_event_handler (fd=fd@entry=9, idx=idx@entry=1, data=0x7f9d83047f90, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:2356
        this = 0x7f9d83047f90
        priv = 0x7f9d83048c30
        ret = -1
        __FUNCTION__ = "socket_event_handler"
#9  0x00007f9d81e3c8ca in event_dispatch_epoll_handler (event=0x7f9d74d97e80, event_pool=0x7f9d82ffdc90) at event-epoll.c:575
        handler = 0x7f9d76a73240 <socket_event_handler>
        gen = 4
        slot = 0x7f9d8303a190
        data = <optimized out>
        ret = -1
        fd = 9
        ev_data = 0x7f9d74d97e84
        idx = 1
#10 event_dispatch_epoll_worker (data=0x7f9d8304aa00) at event-epoll.c:678
        event = {events = 25, data = {ptr = 0x400000001, fd = 1, u32 = 1, u64 = 17179869185}}
        ret = <optimized out>
        ev_data = 0x7f9d8304aa00
        event_pool = 0x7f9d82ffdc90
        myindex = 1
        timetodie = 0
        __FUNCTION__ = "event_dispatch_epoll_worker"
#11 0x00007f9d80c43dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#12 0x00007f9d8058a1cd in clone () from /lib64/libc.so.6
No symbol table info available.

Comment 2 Atin Mukherjee 2015-12-28 12:28:42 UTC
I took a look at Shweta's setup, and it seems the downstream build is missing patch [1], which is what causes this crash. We need to cherry-pick it.

[1] http://review.gluster.org/#/c/12311/
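
To spell out the failure mode from the backtrace above: mgmt_pmap_signin_cbk() got op_ret = -1 and ended up calling emancipate() with ctx == NULL, so the dereference at glusterfsd.c:1329 segfaults. The snippet below is only a minimal, self-contained sketch of the kind of NULL guard that avoids this class of crash; ctx_t and emancipate_sketch are stand-ins, not gluster code, and the actual change in [1] may be structured differently.

#include <stdio.h>

/* Stand-in for the real glusterfs_ctx_t; only the field involved in the
 * crash (daemon_pipe) is modelled here. */
typedef struct {
        int daemon_pipe[2];
} ctx_t;

/* Guarded version of the pattern: bail out when ctx is NULL instead of
 * dereferencing it, which is what frame #0 (emancipate at glusterfsd.c:1329)
 * did when the signin callback came back with op_ret = -1. */
static void
emancipate_sketch (ctx_t *ctx, int ret)
{
        if (!ctx)
                return;                 /* callback fired without a context */

        if (ctx->daemon_pipe[1] != -1)
                printf ("would report ret=%d on fd %d\n",
                        ret, ctx->daemon_pipe[1]);
}

int
main (void)
{
        emancipate_sketch (NULL, -1);   /* survives the case that segfaulted */
        return 0;
}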

Comment 3 Atin Mukherjee 2015-12-28 14:54:52 UTC
Although I don't have permission to set the blocker flag, I propose this as a blocker, since the crash is in the portmap sign-in path and the same code flow is hit on every brick (re)start.

Comment 7 Byreddy 2016-01-05 05:18:28 UTC
Shweta,

Do you have any inputs on verifying this bug before it is moved to the verified state?

Comment 8 spandura 2016-01-05 07:01:00 UTC
The bug can easily be recreated by running the automation suite. I will run the test on the new build and, if it passes, I will move the bug to the verified state.

Comment 9 Byreddy 2016-01-07 05:31:06 UTC
Changing the QA contact based on Comment 8.

Comment 10 spandura 2016-01-22 09:30:42 UTC
Verified the bug with build glusterfs-server-3.7.5-14.el7rhgs.x86_64. I am not able to recreate the issue; the bug is fixed. Moving the bug to the verified state.

Comment 12 errata-xmlrpc 2016-03-01 06:06:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html