Description of problem:
=========================
On a tiered volume, was bringing down bricks and bringing them back online. Observed the following brick process crash.

[2015-12-28 11:13:54.432880] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f9d81ddca66] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f9d81ba79ce] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f9d81ba7ade] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f9d81ba949c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f9d81ba9ca8] ))))) 0-glusterfs: forced unwinding frame type(Gluster Portmap) op(SIGNIN(4)) called at 2015-12-28 11:13:50.630275 (xid=0x4)
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-12-28 11:13:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f9d81ddb002]
/lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f9d81df748d]
/lib64/libc.so.6(+0x35670)[0x7f9d804c9670]
/usr/sbin/glusterfsd(emancipate+0x8)[0x7f9d822adb68]
/usr/sbin/glusterfsd(+0xe7df)[0x7f9d822b37df]
/lib64/libgfrpc.so.0(saved_frames_unwind+0x205)[0x7f9d81ba79f5]
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f9d81ba7ade]
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f9d81ba949c]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f9d81ba9ca8]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f9d81ba5913]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xc352)[0x7f9d76a73352]
/lib64/libglusterfs.so.0(+0x878ca)[0x7f9d81e3c8ca]
/lib64/libpthread.so.0(+0x7dc5)[0x7f9d80c43dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f9d8058a1cd]

Version-Release number of selected component (if applicable):
===============================================================
glusterfs-server-3.7.5-13.el7rhgs.x86_64

How reproducible:
===================
Just observed once.

Steps to Reproduce:
=======================
Not sure of the exact test steps, as this was observed while running a test script (testing data self-heal while bricks go offline and come back online).

Actual results:
==============
Core was generated by `/usr/sbin/glusterfsd -s rhsauto019.lab.eng.blr.redhat.com --volfile-id testvol.'.
Program terminated with signal 11, Segmentation fault.
#0  emancipate (ctx=ctx@entry=0x0, ret=-1) at glusterfsd.c:1329
1329            if (ctx->daemon_pipe[1] != -1) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libacl-2.2.51-12.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libuuid-2.23.2-26.el7.x86_64 openssl-libs-1.0.1e-42.el7_1.9.x86_64 pcre-8.32-15.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64

(gdb) bt full
#0  emancipate (ctx=ctx@entry=0x0, ret=-1) at glusterfsd.c:1329
No locals.
#1  0x00007f9d822b37df in mgmt_pmap_signin_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f9d7f8e706c) at glusterfsd-mgmt.c:2261
        rsp = {op_ret = -1, op_errno = 22}
        frame = 0x7f9d7f8e706c
        ret = <optimized out>
        emancipate_ret = <optimized out>
        pmap_req = {brick = 0x0, port = 0}
        cmd_args = <optimized out>
        ctx = <optimized out>
        brick_name = '\000' <repeats 4095 times>
        __FUNCTION__ = "mgmt_pmap_signin_cbk"
#2  0x00007f9d81ba79f5 in saved_frames_unwind (saved_frames=saved_frames@entry=0x7f9d68000920) at rpc-clnt.c:366
        trav = 0x7f9d83047b1c
        tmp = 0x7f9d68000928
        timestr = "2015-12-28 11:13:50.630275", '\000' <repeats 997 times>
        iov = {iov_base = 0x0, iov_len = 0}
        __FUNCTION__ = "saved_frames_unwind"
#3  0x00007f9d81ba7ade in saved_frames_destroy (frames=0x7f9d68000920) at rpc-clnt.c:383
No locals.
#4  0x00007f9d81ba949c in rpc_clnt_connection_cleanup (conn=conn@entry=0x7f9d83046460) at rpc-clnt.c:536
        saved_frames = 0x7f9d68000920
        clnt = 0x7f9d83046430
#5  0x00007f9d81ba9ca8 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f9d83046460, event=RPC_TRANSPORT_DISCONNECT, data=0x7f9d83047f90) at rpc-clnt.c:856
        conn = 0x7f9d83046460
        clnt = 0x7f9d83046430
        ret = -1
        req_info = 0x0
        pollin = 0x0
        clnt_mydata = 0x0
        old_THIS = 0x7f9d820801e0 <global_xlator>
        __FUNCTION__ = "rpc_clnt_notify"
#6  0x00007f9d81ba5913 in rpc_transport_notify (this=this@entry=0x7f9d83047f90, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f9d83047f90) at rpc-transport.c:545
        ret = -1
        __FUNCTION__ = "rpc_transport_notify"
#7  0x00007f9d76a73352 in socket_event_poll_err (this=0x7f9d83047f90) at socket.c:1151
        priv = 0x7f9d83048c30
        ret = -1
#8  socket_event_handler (fd=fd@entry=9, idx=idx@entry=1, data=0x7f9d83047f90, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:2356
        this = 0x7f9d83047f90
        priv = 0x7f9d83048c30
        ret = -1
        __FUNCTION__ = "socket_event_handler"
#9  0x00007f9d81e3c8ca in event_dispatch_epoll_handler (event=0x7f9d74d97e80, event_pool=0x7f9d82ffdc90) at event-epoll.c:575
        handler = 0x7f9d76a73240 <socket_event_handler>
        gen = 4
        slot = 0x7f9d8303a190
        data = <optimized out>
        ret = -1
        fd = 9
        ev_data = 0x7f9d74d97e84
        idx = 1
#10 event_dispatch_epoll_worker (data=0x7f9d8304aa00) at event-epoll.c:678
        event = {events = 25, data = {ptr = 0x400000001, fd = 1, u32 = 1, u64 = 17179869185}}
        ret = <optimized out>
        ev_data = 0x7f9d8304aa00
        event_pool = 0x7f9d82ffdc90
        myindex = 1
        timetodie = 0
        __FUNCTION__ = "event_dispatch_epoll_worker"
#11 0x00007f9d80c43dc5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#12 0x00007f9d8058a1cd in clone () from /lib64/libc.so.6
No symbol table info available.
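For readers following the backtrace, here is a minimal, self-contained C sketch of the failure pattern; the names and structures are simplified stand-ins for illustration, not the glusterfs source. On disconnect, the saved in-flight SIGNIN frame is replayed through its callback with an error reply (frames #2-#5), the sign-in callback runs with a NULL context (frame #1), and emancipate() dereferences it without a check (frame #0), producing the SIGSEGV.

#include <stdio.h>

typedef struct {
        int daemon_pipe[2];     /* fds used to signal the parent process */
} ctx_t;

typedef void (*rpc_cbk_t) (int op_ret, ctx_t *ctx);

/* Mirrors the crashing line at glusterfsd.c:1329: no NULL check. */
static void
emancipate_sketch (ctx_t *ctx, int ret)
{
        if (ctx->daemon_pipe[1] != -1)          /* SIGSEGV when ctx == NULL */
                printf ("would signal parent with ret=%d\n", ret);
}

/* Stand-in for mgmt_pmap_signin_cbk(): invoked with op_ret = -1 during the
 * forced unwind, possibly before a usable context exists. */
static void
pmap_signin_cbk_sketch (int op_ret, ctx_t *ctx)
{
        emancipate_sketch (ctx, op_ret);
}

/* Stand-in for saved_frames_unwind(): replay a pending callback with -1. */
static void
saved_frames_unwind_sketch (rpc_cbk_t cbk, ctx_t *ctx)
{
        cbk (-1, ctx);
}

int
main (void)
{
        /* Brick restart -> disconnect while SIGNIN is still pending; the
         * context is NULL here, so this crashes just like frame #0. */
        saved_frames_unwind_sketch (pmap_signin_cbk_sketch, NULL);
        return 0;
}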
I took a look at Shweta's setup, and it seems the downstream build doesn't have patch [1]; its absence is what causes this crash. We need to cherry-pick it.

[1] http://review.gluster.org/#/c/12311/
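I have not reviewed the contents of [1], so the following is only a hypothetical sketch of the kind of NULL guard that would keep the forced unwind from reaching the dereference at glusterfsd.c:1329. It is not the actual patch, and the names are the same illustrative stand-ins used in the earlier sketch.

#include <stdio.h>

typedef struct {
        int daemon_pipe[2];
} ctx_t;

/* Guarded variant of the emancipate() stand-in: tolerate a NULL context. */
static void
emancipate_guarded (ctx_t *ctx, int ret)
{
        if (ctx == NULL)                 /* error/unwind path: nothing to do */
                return;
        if (ctx->daemon_pipe[1] != -1)
                printf ("signalling parent, ret=%d\n", ret);
}

/* Guarded sign-in callback: bail out early when the frame was force-unwound
 * and no usable context is available. */
static int
pmap_signin_cbk_guarded (int op_ret, ctx_t *ctx)
{
        if (op_ret == -1 && ctx == NULL) {
                /* Connection already torn down; nothing left to signal. */
                return -1;
        }
        emancipate_guarded (ctx, op_ret);
        return 0;
}

int
main (void)
{
        /* Same scenario as the earlier sketch, but it now returns cleanly. */
        return pmap_signin_cbk_guarded (-1, NULL) == -1 ? 0 : 1;
}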
Although I don't have permission to set the blocker flag, I propose this as a blocker: the crash is in the portmap sign-in path, and the same code flow is hit at every brick (re)start.
Shweta, do you have any inputs on verifying this bug before it is moved to the verified state?
The bug can be easily recreated while running the automation suite. I will run the test on the new build, and if it passes I will move the bug to the verified state.
Changing the QA contact based on Comment 8.
Verified the bug with build glusterfs-server-3.7.5-14.el7rhgs.x86_64. I am not able to recreate the issue; the bug is fixed. Moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html