Bug 815027 - SIGSEGV in wb_sync_cbk during disconnect
Summary: SIGSEGV in wb_sync_cbk during disconnect
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 1.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Amar Tumballi
QA Contact:
URL:
Whiteboard:
Depends On: 767359
Blocks: 811632
 
Reported: 2012-04-22 07:16 UTC by Scott Haines
Modified: 2015-01-22 15:29 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 767359
Environment:
Last Closed: 2012-05-01 11:17:25 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2012:0538 (SHIPPED_LIVE): Red Hat Storage Software Appliance 3.2 bug fix update. Last Updated: 2012-05-01 15:15:12 UTC

Description Scott Haines 2012-04-22 07:16:12 UTC
+++ This bug was initially created as a clone of Bug #767359 +++

Description of problem:

Program received signal SIGSEGV, Segmentation fault.
0x00007fb751bb46c5 in wb_sync_cbk (frame=0x7fb7547c3338, cookie=0x7fb754a3cdbc, 
    this=0x22eeb40, op_ret=-1, op_errno=116, prebuf=0x0, postbuf=0x0)
    at write-behind.c:375
375	        file = local->file;
(gdb) bt
#0  0x00007fb751bb46c5 in wb_sync_cbk (frame=0x7fb7547c3338, cookie=0x7fb754a3cdbc, 
    this=0x22eeb40, op_ret=-1, op_errno=116, prebuf=0x0, postbuf=0x0)
    at write-behind.c:375
#1  0x00007fb751de5df9 in client3_1_writev (frame=0x7fb754a3cdbc, this=0x22ed860, 
    data=0x7fffd4332e10) at client3_1-fops.c:3587
#2  0x00007fb751dcfd0f in client_writev (frame=0x7fb754a3cdbc, this=0x22ed860, 
    fd=0x7fb75082219c, vector=0x22f8380, count=1, off=7995392, iobref=0x22f7cc0)
    at client.c:820
#3  0x00007fb751bb5123 in wb_sync (frame=0x7fb7547c35b0, file=0x22f8d40, 
    winds=0x7fffd43330a0) at write-behind.c:548
#4  0x00007fb751bbb6e9 in wb_do_ops (frame=0x7fb7547c35b0, file=0x22f8d40, 
    winds=0x7fffd43330a0, unwinds=0x7fffd4333090, other_requests=0x7fffd4333080)
    at write-behind.c:1859
#5  0x00007fb751bbbf5d in wb_process_queue (frame=0x7fb7547c35b0, file=0x22f8d40)
    at write-behind.c:2048
#6  0x00007fb751bb4854 in wb_sync_cbk (frame=0x7fb7547c35b0, cookie=0x7fb754a3c2fc, 
    this=0x22eeb40, op_ret=-1, op_errno=107, prebuf=0x7fffd4333210, 
    postbuf=0x7fffd43331a0) at write-behind.c:405
#7  0x00007fb751dda0e4 in client3_1_writev_cbk (req=0x7fb750eea02c, 
    iov=0x7fffd43333e0, count=1, myframe=0x7fb754a3c2fc) at client3_1-fops.c:692
#8  0x00007fb755bd89e1 in saved_frames_unwind (saved_frames=0x22e90f0)
    at rpc-clnt.c:385
#9  0x00007fb755bd8a90 in saved_frames_destroy (frames=0x22e90f0) at rpc-clnt.c:403
#10 0x00007fb755bd9005 in rpc_clnt_connection_cleanup (conn=0x22f6d40)
    at rpc-clnt.c:559
#11 0x00007fb755bd9ae6 in rpc_clnt_notify (trans=0x22f6e60, mydata=0x22f6d40, 
    event=RPC_TRANSPORT_DISCONNECT, data=0x22f6e60) at rpc-clnt.c:863
#12 0x00007fb755bd5d5c in rpc_transport_notify (this=0x22f6e60, 
    event=RPC_TRANSPORT_DISCONNECT, data=0x22f6e60) at rpc-transport.c:498
#13 0x00007fb752c1d213 in socket_event_poll_err (this=0x22f6e60) at socket.c:694
#14 0x00007fb752c21849 in socket_event_handler (fd=7, idx=1, data=0x22f6e60, 
    poll_in=1, poll_out=0, poll_err=16) at socket.c:1797
#15 0x00007fb755e2a6c4 in event_dispatch_epoll_handler (event_pool=0x22e3d90, 
    events=0x22e8040, i=0) at event.c:794
#16 0x00007fb755e2a8e7 in event_dispatch_epoll (event_pool=0x22e3d90) at event.c:856
#17 0x00007fb755e2ac72 in event_dispatch (event_pool=0x22e3d90) at event.c:956
#18 0x0000000000407a5e in main ()


How reproducible:

Mount a regular single-brick volume.  Start I/O (I used iozone).  Simulate a brick failure by stopping the glusterfsd process.  More than half the time, the client will crash as above.


Actual results:

See above.

Expected results:

I/O stoppage and/or errors on the client, followed by normal operation after the glusterfsd process is resumed and allows reconnection.

Additional info:

This was first observed using the SSL patch, but I wanted to reproduce it on mainline before reporting and I was able to do so on the first try.

--- Additional comment from jdarcy on 2011-12-14 00:31:16 EST ---

*** Bug 767367 has been marked as a duplicate of this bug. ***

--- Additional comment from jdarcy on 2011-12-14 00:38:00 EST ---

I think I found the real problem, and it's not in write-behind at all.  It's in rpc-clnt instead.  What happens is that client3_1_writev calls client_submit_vec_request, which in turn calls rpc_clnt_submit.  If the last of these fails due to a broken connection, we end up unwinding twice - once near the end of rpc_clnt_submit, and again near the end of client3_1_writev.  With only one of these unwinds enabled, I was able to get through a half dozen disconnect/reconnect cycles whereas previously I could hardly even get through one.
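
To make that failure mode concrete, here is a standalone sketch of the double-unwind pattern (the types and function names below are simplified stand-ins, not the actual GlusterFS code; rpc_submit() loosely models rpc_clnt_submit() and writev_caller() loosely models client3_1_writev()):

/*
 * Standalone sketch of the double-unwind pattern described above; all
 * names and types here are simplified stand-ins.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
        void *local;            /* per-fop state, released on unwind */
} frame_t;

/* "Unwind" here stands for completing the fop and releasing the frame. */
static void
unwind (frame_t *frame, int op_ret, int op_errno)
{
        printf ("unwind: op_ret=%d op_errno=%d\n", op_ret, op_errno);
        free (frame->local);
        free (frame);
}

/*
 * Models the submit path on a broken connection: it fails the fop by
 * unwinding the frame itself, then reports the failure to its caller.
 */
static int
rpc_submit (frame_t *frame, int connected)
{
        if (!connected) {
                unwind (frame, -1, ENOTCONN);   /* first unwind */
                return -1;
        }
        /* ... otherwise queue the request on the wire ... */
        return 0;
}

/*
 * Models the caller.  Before the fix it unwound again on a negative
 * return, so a disconnect released the same frame twice and a later
 * callback dereferenced freed memory.
 */
static void
writev_caller (int connected, int fixed)
{
        frame_t *frame = calloc (1, sizeof (*frame));

        frame->local = malloc (64);

        if (rpc_submit (frame, connected) < 0 && !fixed)
                unwind (frame, -1, ENOTCONN);   /* second unwind: the bug */
}

int
main (void)
{
        /* Simulate a disconnected brick; pass fixed=0 to hit the bug. */
        writev_caller (0, 1);
        return 0;
}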

--- Additional comment from amarts on 2011-12-19 01:23:23 EST ---

The patch Jeff sent should solve the issue... Assigning the bug to him instead of me.

--- Additional comment from aavati on 2011-12-20 00:40:38 EST ---

CHANGE: http://review.gluster.com/784 (Fix local==NULL crash in wb_sync_cbk during disconnect.) merged in master by Vijay Bellur (vijay)

--- Additional comment from aavati on 2012-02-20 03:48:49 EST ---

CHANGE: http://review.gluster.com/2770 (protocol/client: writev(): don't unwind if rpc_submit failed) merged in release-3.2 by Vijay Bellur (vijay)

--- Additional comment from rgowdapp on 2012-03-07 22:58:38 EST ---

*** Bug 768348 has been marked as a duplicate of this bug. ***

--- Additional comment from aavati on 2012-03-18 04:52:11 EDT ---

CHANGE: http://review.gluster.com/2896 (rpc: don't unwind the fop in caller if client_submit_request fails) merged in master by Anand Avati (avati)

--- Additional comment from aavati on 2012-03-18 04:52:45 EDT ---

CHANGE: http://review.gluster.com/2897 (protocol/client: replace STACK_UNWIND_STRICT macro with CLIENT_STACK_UNWIND, which does appropriate cleanup before unwinding.) merged in master by Anand Avati (avati)
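
For context, the cleanup-before-unwind idea behind such a macro can be sketched standalone as below (simplified stand-in types and a hypothetical CLIENT_UNWIND macro, not the real CLIENT_STACK_UNWIND definition): the per-call local is detached before the unwind and released only afterwards, so nothing downstream can see or free it twice.

/*
 * Standalone sketch of "clean up before unwinding"; the types, macro,
 * and names are simplified stand-ins.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
        int dummy;              /* placeholder per-call client state */
} client_local_t;

typedef struct {
        client_local_t *local;
} frame_t;

static void
stack_unwind (frame_t *frame, int op_ret, int op_errno)
{
        /* Stands in for invoking the parent translator's callback. */
        printf ("callback: op_ret=%d op_errno=%d\n", op_ret, op_errno);
        (void) frame;
}

/*
 * Detach the per-call local before the unwind and release it only
 * afterwards, so neither the callback chain nor a disconnect-path
 * cleanup can touch it a second time.
 */
#define CLIENT_UNWIND(frame, op_ret, op_errno) do {             \
        client_local_t *__local = (frame)->local;               \
        (frame)->local = NULL;                                  \
        stack_unwind (frame, op_ret, op_errno);                 \
        free (__local);                                         \
} while (0)

int
main (void)
{
        frame_t frame = { 0 };

        frame.local = calloc (1, sizeof (client_local_t));
        CLIENT_UNWIND (&frame, -1, ENOTCONN);   /* 107 in the backtrace above */
        return 0;
}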

--- Additional comment from jdarcy on 2012-04-11 11:56:08 EDT ---

I don't have a specific regression test for this, but I often run other (manual) tests that exercise the modified code paths, and I have been unable to reproduce this in quite some time.

Comment 3 errata-xmlrpc 2012-05-01 11:17:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0538.html

