+++ This bug was initially created as a clone of Bug #767359 +++

Description of problem:

Program received signal SIGSEGV, Segmentation fault.
0x00007fb751bb46c5 in wb_sync_cbk (frame=0x7fb7547c3338, cookie=0x7fb754a3cdbc, this=0x22eeb40, op_ret=-1, op_errno=116, prebuf=0x0, postbuf=0x0) at write-behind.c:375
375             file = local->file;
(gdb) bt
#0  0x00007fb751bb46c5 in wb_sync_cbk (frame=0x7fb7547c3338, cookie=0x7fb754a3cdbc, this=0x22eeb40, op_ret=-1, op_errno=116, prebuf=0x0, postbuf=0x0) at write-behind.c:375
#1  0x00007fb751de5df9 in client3_1_writev (frame=0x7fb754a3cdbc, this=0x22ed860, data=0x7fffd4332e10) at client3_1-fops.c:3587
#2  0x00007fb751dcfd0f in client_writev (frame=0x7fb754a3cdbc, this=0x22ed860, fd=0x7fb75082219c, vector=0x22f8380, count=1, off=7995392, iobref=0x22f7cc0) at client.c:820
#3  0x00007fb751bb5123 in wb_sync (frame=0x7fb7547c35b0, file=0x22f8d40, winds=0x7fffd43330a0) at write-behind.c:548
#4  0x00007fb751bbb6e9 in wb_do_ops (frame=0x7fb7547c35b0, file=0x22f8d40, winds=0x7fffd43330a0, unwinds=0x7fffd4333090, other_requests=0x7fffd4333080) at write-behind.c:1859
#5  0x00007fb751bbbf5d in wb_process_queue (frame=0x7fb7547c35b0, file=0x22f8d40) at write-behind.c:2048
#6  0x00007fb751bb4854 in wb_sync_cbk (frame=0x7fb7547c35b0, cookie=0x7fb754a3c2fc, this=0x22eeb40, op_ret=-1, op_errno=107, prebuf=0x7fffd4333210, postbuf=0x7fffd43331a0) at write-behind.c:405
#7  0x00007fb751dda0e4 in client3_1_writev_cbk (req=0x7fb750eea02c, iov=0x7fffd43333e0, count=1, myframe=0x7fb754a3c2fc) at client3_1-fops.c:692
#8  0x00007fb755bd89e1 in saved_frames_unwind (saved_frames=0x22e90f0) at rpc-clnt.c:385
#9  0x00007fb755bd8a90 in saved_frames_destroy (frames=0x22e90f0) at rpc-clnt.c:403
#10 0x00007fb755bd9005 in rpc_clnt_connection_cleanup (conn=0x22f6d40) at rpc-clnt.c:559
#11 0x00007fb755bd9ae6 in rpc_clnt_notify (trans=0x22f6e60, mydata=0x22f6d40, event=RPC_TRANSPORT_DISCONNECT, data=0x22f6e60) at rpc-clnt.c:863
#12 0x00007fb755bd5d5c in rpc_transport_notify (this=0x22f6e60, event=RPC_TRANSPORT_DISCONNECT, data=0x22f6e60) at rpc-transport.c:498
#13 0x00007fb752c1d213 in socket_event_poll_err (this=0x22f6e60) at socket.c:694
#14 0x00007fb752c21849 in socket_event_handler (fd=7, idx=1, data=0x22f6e60, poll_in=1, poll_out=0, poll_err=16) at socket.c:1797
#15 0x00007fb755e2a6c4 in event_dispatch_epoll_handler (event_pool=0x22e3d90, events=0x22e8040, i=0) at event.c:794
#16 0x00007fb755e2a8e7 in event_dispatch_epoll (event_pool=0x22e3d90) at event.c:856
#17 0x00007fb755e2ac72 in event_dispatch (event_pool=0x22e3d90) at event.c:956
#18 0x0000000000407a5e in main ()

How reproducible:
Mount a regular single-brick volume, start I/O (I used iozone), then simulate a brick failure by stopping the glusterfsd process. More than half the time, the client will crash as above.

Actual results:
See above.

Expected results:
I/O stoppage and/or errors on the client, followed by normal operation after the glusterfsd process is resumed and allows reconnection.

Additional info:
This was first observed using the SSL patch, but I wanted to reproduce it on mainline before reporting, and I was able to do so on the first try.

--- Additional comment from jdarcy on 2011-12-14 00:31:16 EST ---

*** Bug 767367 has been marked as a duplicate of this bug. ***

--- Additional comment from jdarcy on 2011-12-14 00:38:00 EST ---

I think I found the real problem, and it's not in write-behind at all; it's in rpc-clnt. What happens is that client3_1_writev calls client_submit_vec_request, which in turn calls rpc_clnt_submit. If the last of these fails due to a broken connection, we end up unwinding twice: once near the end of rpc_clnt_submit, and again near the end of client3_1_writev. With only one of these unwinds enabled, I was able to get through a half dozen disconnect/reconnect cycles, whereas previously I could hardly even get through one.

--- Additional comment from amarts on 2011-12-19 01:23:23 EST ---

The patch Jeff sent should solve the issue. Assigning the bug to him instead of me.

--- Additional comment from aavati on 2011-12-20 00:40:38 EST ---

CHANGE: http://review.gluster.com/784 (Fix local==NULL crash in wb_sync_cbk during disconnect.) merged in master by Vijay Bellur (vijay)

--- Additional comment from aavati on 2012-02-20 03:48:49 EST ---

CHANGE: http://review.gluster.com/2770 (protocol/client: writev(): don't unwind if rpc_submit failed) merged in release-3.2 by Vijay Bellur (vijay)

--- Additional comment from rgowdapp on 2012-03-07 22:58:38 EST ---

*** Bug 768348 has been marked as a duplicate of this bug. ***

--- Additional comment from aavati on 2012-03-18 04:52:11 EDT ---

CHANGE: http://review.gluster.com/2896 (rpc: don't unwind the fop in caller if client_submit_request fails) merged in master by Anand Avati (avati)

--- Additional comment from aavati on 2012-03-18 04:52:45 EDT ---

CHANGE: http://review.gluster.com/2897 (protocol/client: replace STACK_UNWIND_STRICT macro with CLIENT_STACK_UNWIND, which does appropraite cleanup before unwinding.) merged in master by Anand Avati (avati)

--- Additional comment from jdarcy on 2012-04-11 11:56:08 EDT ---

I don't have a specific regression test for this, but I often run other (manual) tests that exercise the modified code paths, and I have been unable to reproduce this in quite some time.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0538.html