Bug 767359 - SIGSEGV in wb_sync_cbk during disconnect
Summary: SIGSEGV in wb_sync_cbk during disconnect
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jeff Darcy
QA Contact:
URL:
Whiteboard:
Duplicates: 767367 768348 828509
Depends On:
Blocks: 811632 815027 817967
 
Reported: 2011-12-13 21:30 UTC by Jeff Darcy
Modified: 2013-07-24 18:03 UTC
CC: 4 users

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 815027 (view as bug list)
Environment:
Last Closed: 2013-07-24 18:03:40 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Jeff Darcy 2011-12-13 21:30:12 UTC
Description of problem:

Program received signal SIGSEGV, Segmentation fault.
0x00007fb751bb46c5 in wb_sync_cbk (frame=0x7fb7547c3338, cookie=0x7fb754a3cdbc, 
    this=0x22eeb40, op_ret=-1, op_errno=116, prebuf=0x0, postbuf=0x0)
    at write-behind.c:375
375	        file = local->file;
(gdb) bt
#0  0x00007fb751bb46c5 in wb_sync_cbk (frame=0x7fb7547c3338, cookie=0x7fb754a3cdbc, 
    this=0x22eeb40, op_ret=-1, op_errno=116, prebuf=0x0, postbuf=0x0)
    at write-behind.c:375
#1  0x00007fb751de5df9 in client3_1_writev (frame=0x7fb754a3cdbc, this=0x22ed860, 
    data=0x7fffd4332e10) at client3_1-fops.c:3587
#2  0x00007fb751dcfd0f in client_writev (frame=0x7fb754a3cdbc, this=0x22ed860, 
    fd=0x7fb75082219c, vector=0x22f8380, count=1, off=7995392, iobref=0x22f7cc0)
    at client.c:820
#3  0x00007fb751bb5123 in wb_sync (frame=0x7fb7547c35b0, file=0x22f8d40, 
    winds=0x7fffd43330a0) at write-behind.c:548
#4  0x00007fb751bbb6e9 in wb_do_ops (frame=0x7fb7547c35b0, file=0x22f8d40, 
    winds=0x7fffd43330a0, unwinds=0x7fffd4333090, other_requests=0x7fffd4333080)
    at write-behind.c:1859
#5  0x00007fb751bbbf5d in wb_process_queue (frame=0x7fb7547c35b0, file=0x22f8d40)
    at write-behind.c:2048
#6  0x00007fb751bb4854 in wb_sync_cbk (frame=0x7fb7547c35b0, cookie=0x7fb754a3c2fc, 
    this=0x22eeb40, op_ret=-1, op_errno=107, prebuf=0x7fffd4333210, 
    postbuf=0x7fffd43331a0) at write-behind.c:405
#7  0x00007fb751dda0e4 in client3_1_writev_cbk (req=0x7fb750eea02c, 
    iov=0x7fffd43333e0, count=1, myframe=0x7fb754a3c2fc) at client3_1-fops.c:692
#8  0x00007fb755bd89e1 in saved_frames_unwind (saved_frames=0x22e90f0)
    at rpc-clnt.c:385
#9  0x00007fb755bd8a90 in saved_frames_destroy (frames=0x22e90f0) at rpc-clnt.c:403
#10 0x00007fb755bd9005 in rpc_clnt_connection_cleanup (conn=0x22f6d40)
    at rpc-clnt.c:559
#11 0x00007fb755bd9ae6 in rpc_clnt_notify (trans=0x22f6e60, mydata=0x22f6d40, 
    event=RPC_TRANSPORT_DISCONNECT, data=0x22f6e60) at rpc-clnt.c:863
#12 0x00007fb755bd5d5c in rpc_transport_notify (this=0x22f6e60, 
    event=RPC_TRANSPORT_DISCONNECT, data=0x22f6e60) at rpc-transport.c:498
#13 0x00007fb752c1d213 in socket_event_poll_err (this=0x22f6e60) at socket.c:694
#14 0x00007fb752c21849 in socket_event_handler (fd=7, idx=1, data=0x22f6e60, 
    poll_in=1, poll_out=0, poll_err=16) at socket.c:1797
#15 0x00007fb755e2a6c4 in event_dispatch_epoll_handler (event_pool=0x22e3d90, 
    events=0x22e8040, i=0) at event.c:794
#16 0x00007fb755e2a8e7 in event_dispatch_epoll (event_pool=0x22e3d90) at event.c:856
#17 0x00007fb755e2ac72 in event_dispatch (event_pool=0x22e3d90) at event.c:956
#18 0x0000000000407a5e in main ()


How reproducible:

Mount a regular single-brick volume.  Start I/O (I used iozone).  Simulate a brick failure by stopping the glusterfsd process.  More than half the time, the client will crash as above.


Actual results:

See above.

Expected results:

I/O stoppage and/or errors on the client, followed by normal operation after the glusterfsd process is resumed and allows reconnection.

Additional info:

This was first observed using the SSL patch, but I wanted to reproduce it on mainline before reporting and I was able to do so on the first try.

Comment 1 Jeff Darcy 2011-12-14 05:31:16 UTC
*** Bug 767367 has been marked as a duplicate of this bug. ***

Comment 2 Jeff Darcy 2011-12-14 05:38:00 UTC
I think I found the real problem, and it's not in write-behind at all.  It's in rpc-clnt instead.  What happens is that client3_1_writev calls client_submit_vec_request, which in turn calls rpc_clnt_submit.  If the last of these fails due to a broken connection, we end up unwinding twice - once near the end of rpc_clnt_submit, and again near the end of client3_1_writev.  With only one of these unwinds enabled, I was able to get through a half dozen disconnect/reconnect cycles whereas previously I could hardly even get through one.
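
For illustration, the sequence above can be modeled with a standalone sketch of the double-unwind pattern.  Everything below is a hypothetical stand-in for the call-frame machinery (sync_cbk, unwind, submit, writev_fop are not the real GlusterFS functions), and running it segfaults by design, mirroring the backtrace in the description.

/* Standalone sketch of the double-unwind hazard: the submit path
 * unwinds the frame on a broken connection, and the caller then
 * unwinds the same frame again, so the callback runs a second time
 * with local == NULL. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct wb_local { const char *file; };
struct frame    { struct wb_local *local; };

/* stands in for wb_sync_cbk(): runs when a frame unwinds */
static void
sync_cbk (struct frame *frame, int op_ret, int op_errno)
{
        struct wb_local *local = frame->local;

        printf ("sync_cbk: file=%s ret=%d errno=%d\n",
                local->file, op_ret, op_errno);   /* SIGSEGV if local == NULL */
}

/* stands in for STACK_UNWIND: deliver the callback, then drop the local */
static void
unwind (struct frame *frame, int op_ret, int op_errno)
{
        sync_cbk (frame, op_ret, op_errno);
        free (frame->local);
        frame->local = NULL;
}

/* stands in for rpc_clnt_submit(): on a broken connection it already
 * unwinds the frame itself */
static int
submit (struct frame *frame, int connected)
{
        if (!connected) {
                unwind (frame, -1, ENOTCONN);     /* first unwind */
                return -1;
        }
        return 0;
}

/* stands in for client3_1_writev(): the buggy version unwinds again
 * when submit reports failure */
static void
writev_fop (struct frame *frame, int connected)
{
        if (submit (frame, connected) < 0) {
                /* BUG: second unwind of the same frame; sync_cbk now
                 * sees local == NULL and crashes.  The fix is to unwind
                 * in exactly one place, i.e. simply return here. */
                unwind (frame, -1, ENOTCONN);
        }
}

int
main (void)
{
        struct frame frame = { .local = malloc (sizeof (struct wb_local)) };

        frame.local->file = "testfile";
        writev_fop (&frame, 0);                   /* simulate a disconnect */
        return 0;
}

The point of the sketch is that each frame must be unwound by exactly one owner; whichever layer reports the failure has to be the only layer that unwinds it.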

Comment 3 Amar Tumballi 2011-12-19 06:23:23 UTC
The patch Jeff sent should solve the issue... Assigning the bug to him instead of me.

Comment 4 Anand Avati 2011-12-20 05:40:38 UTC
CHANGE: http://review.gluster.com/784 (Fix local==NULL crash in wb_sync_cbk during disconnect.) merged in master by Vijay Bellur (vijay)
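
The change title points at a defensive guard at the crash site.  A minimal sketch of that kind of check, reusing the hypothetical sync_cbk stand-in from the sketch in comment 2 rather than the actual patch (see the review link for that), would be:

/* guarded variant: bail out early instead of dereferencing a local
 * that an earlier unwind has already wiped */
static void
sync_cbk (struct frame *frame, int op_ret, int op_errno)
{
        struct wb_local *local = frame->local;

        if (local == NULL) {
                fprintf (stderr, "sync_cbk: local is NULL, nothing to sync\n");
                return;
        }

        printf ("sync_cbk: file=%s ret=%d errno=%d\n",
                local->file, op_ret, op_errno);
}

Such a guard is a belt-and-braces measure at the crash site; the unwind-once changes tracked in the later comments address the root cause in the rpc/client path.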

Comment 5 Anand Avati 2012-02-20 08:48:49 UTC
CHANGE: http://review.gluster.com/2770 (protocol/client: writev(): don't unwind if rpc_submit failed) merged in release-3.2 by Vijay Bellur (vijay)

Comment 6 Raghavendra G 2012-03-08 03:58:38 UTC
*** Bug 768348 has been marked as a duplicate of this bug. ***

Comment 7 Anand Avati 2012-03-18 08:52:11 UTC
CHANGE: http://review.gluster.com/2896 (rpc: don't unwind the fop in caller if client_submit_request fails) merged in master by Anand Avati (avati)

Comment 8 Anand Avati 2012-03-18 08:52:45 UTC
CHANGE: http://review.gluster.com/2897 (protocol/client: replace STACK_UNWIND_STRICT macro with CLIENT_STACK_UNWIND, which does appropriate cleanup before unwinding.) merged in master by Anand Avati (avati)
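
The idea behind such a wrapper (detach the translator's per-call local from the frame before unwinding, release it exactly once afterwards) can be shown with another standalone sketch.  The names here (CLIENT_UNWIND_SKETCH, clnt_local, upper_cbk) are hypothetical and not the real CLIENT_STACK_UNWIND definition; in the real stack the callback invoked by the unwind runs on the parent frame with its own local, so detaching and freeing the client's private local at this point is safe.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct clnt_local { int in_flight; };                 /* per-call client state */
struct cframe     { struct clnt_local *local; };

/* stands in for the parent frame's callback (e.g. wb_sync_cbk), which
 * operates on its own local, not on the client's */
static void
upper_cbk (int op_ret, int op_errno)
{
        printf ("upper_cbk: ret=%d errno=%d\n", op_ret, op_errno);
}

/* detach the client's local before unwinding and free it once after,
 * so any later look at frame->local sees NULL rather than freed memory */
#define CLIENT_UNWIND_SKETCH(frame, op_ret, op_errno) do {            \
                struct clnt_local *__local = (frame)->local;          \
                (frame)->local = NULL;                                \
                upper_cbk ((op_ret), (op_errno));                     \
                free (__local);                                       \
        } while (0)

int
main (void)
{
        struct cframe frame = { .local = calloc (1, sizeof (struct clnt_local)) };

        CLIENT_UNWIND_SKETCH (&frame, -1, ENOTCONN);
        if (frame.local == NULL)
                printf ("client local already detached and wiped\n");
        return 0;
}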

Comment 9 Jeff Darcy 2012-04-11 15:56:08 UTC
I don't have a specific regression test for this, but I often run other (manual) tests that exercise the modified code paths, and I have been unable to reproduce this in quite some time.

Comment 10 Amar Tumballi 2012-06-05 09:53:22 UTC
*** Bug 828509 has been marked as a duplicate of this bug. ***

