Bug 765473 (GLUSTER-3741)

Summary: [glusterfs-3.2.5qa1] glusterfs client process crashed
Product: [Community] GlusterFS
Reporter: M S Vishwanath Bhat <vbhat>
Component: rdma
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: pre-release
CC: amarts, gluster-bugs, jdarcy, rabhat
Target Milestone: ---
Keywords: Reopened, Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 849125 (view as bug list)
Environment:
Last Closed: 2013-07-24 13:46:08 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Bug Depends On:
Bug Blocks: 849125, 895528
Attachments:
  glusterfs client log (flags: none)

Description M S Vishwanath Bhat 2011-10-19 10:21:02 EDT
Created a volume with the rdma transport type and ran the sanity scripts. The client process crashed with the following backtrace:

#0  0x00007f4911d266ca in wb_sync_cbk (frame=0x7f491416a848, cookie=0x7f49143d51d4, this=0xcfcd20, op_ret=-1, op_errno=116, prebuf=0x0, postbuf=0x0) at write-behind.c:375
#1  0x00007f4911f5ec4c in dht_writev_cbk (frame=0x7f49143d51d4, cookie=0x7f49143d5278, this=0xd08da0, op_ret=-1, op_errno=116, prebuf=0x0, postbuf=0x0) at dht-common.c:2670
#2  0x00007f491219e49c in client3_1_writev (frame=0x7f49143d5278, this=0xd07f00, data=0x7fff24c43f80) at client3_1-fops.c:3623
#3  0x00007f4912187f09 in client_writev (frame=0x7f49143d5278, this=0xd07f00, fd=0x7f490844b0d4, vector=0x26b5870, count=1, off=9891840, iobref=0x26b5920) at client.c:817
#4  0x00007f4911f5f0f9 in dht_writev (frame=0x7f49143d51d4, this=0xd08da0, fd=0x7f490844b0d4, vector=0x26b5870, count=1, off=9891840, iobref=0x26b5920) at dht-common.c:2706
#5  0x00007f4911d27128 in wb_sync (frame=0x7f491416a378, file=0x7f4905dadf10, winds=0x7fff24c442d0) at write-behind.c:548
#6  0x00007f4911d2d29e in wb_do_ops (frame=0x7f491416a378, file=0x7f4905dadf10, winds=0x7fff24c442d0, unwinds=0x7fff24c442c0, other_requests=0x7fff24c442b0) at write-behind.c:1859
#7  0x00007f4911d2db12 in wb_process_queue (frame=0x7f491416a378, file=0x7f4905dadf10) at write-behind.c:2048
#8  0x00007f4911d26859 in wb_sync_cbk (frame=0x7f491416a378, cookie=0x7f49143d508c, this=0xd0a010, op_ret=-1, op_errno=107, prebuf=0x7fff24c444d0, postbuf=0x7fff24c44460) at write-behind.c:405
#9  0x00007f4911f5ec4c in dht_writev_cbk (frame=0x7f49143d508c, cookie=0x7f49143d5130, this=0xd08da0, op_ret=-1, op_errno=107, prebuf=0x7fff24c444d0, postbuf=0x7fff24c44460) at dht-common.c:2670
#10 0x00007f491219297f in client3_1_writev_cbk (req=0x7f4911085094, iov=0x0, count=0, myframe=0x7f49143d5130) at client3_1-fops.c:685
#11 0x00007f49150f78ea in rpc_clnt_submit (rpc=0xd1acd0, prog=0x7f49123b5260, procnum=13, cbkfn=0x7f4912192639 <client3_1_writev_cbk>, proghdr=0x7fff24c44820, proghdrcount=1, progpayload=0x26a5f30, progpayloadcount=1, iobref=0x26b57b0,
    frame=0x7f49143d5130, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1450
#12 0x00007f4912190148 in client_submit_vec_request (this=0xd07f00, req=0x7fff24c448c0, frame=0x7f49143d5130, prog=0x7f49123b5260, procnum=13, cbk=0x7f4912192639 <client3_1_writev_cbk>, payload=0x26a5f30, payloadcnt=1,
    iobref=0x26b5070, sfunc=0x7f4914ed8a6b <xdr_from_writev_req>) at client3_1-fops.c:95
#13 0x00007f491219e30f in client3_1_writev (frame=0x7f49143d5130, this=0xd07f00, data=0x7fff24c44980) at client3_1-fops.c:3613
#14 0x00007f4912187f09 in client_writev (frame=0x7f49143d5130, this=0xd07f00, fd=0x7f490844b0d4, vector=0x26a5f30, count=1, off=7335936, iobref=0x26b5070) at client.c:817
#15 0x00007f4911f5f0f9 in dht_writev (frame=0x7f49143d508c, this=0xd08da0, fd=0x7f490844b0d4, vector=0x26a5f30, count=1, off=7335936, iobref=0x26b5070) at dht-common.c:2706
#16 0x00007f4911d27128 in wb_sync (frame=0x7f491416a5e0, file=0x7f4905dadf10, winds=0x7fff24c44cd0) at write-behind.c:548
#17 0x00007f4911d2d29e in wb_do_ops (frame=0x7f491416a5e0, file=0x7f4905dadf10, winds=0x7fff24c44cd0, unwinds=0x7fff24c44cc0, other_requests=0x7fff24c44cb0) at write-behind.c:1859
#18 0x00007f4911d2db12 in wb_process_queue (frame=0x7f491416a5e0, file=0x7f4905dadf10) at write-behind.c:2048
#19 0x00007f4911d26859 in wb_sync_cbk (frame=0x7f491416a5e0, cookie=0x7f49143d4f44, this=0xd0a010, op_ret=-1, op_errno=107, prebuf=0x7fff24c44ed0, postbuf=0x7fff24c44e60) at write-behind.c:405
#20 0x00007f4911f5ec4c in dht_writev_cbk (frame=0x7f49143d4f44, cookie=0x7f49143d4fe8, this=0xd08da0, op_ret=-1, op_errno=107, prebuf=0x7fff24c44ed0, postbuf=0x7fff24c44e60) at dht-common.c:2670
#21 0x00007f491219297f in client3_1_writev_cbk (req=0x7f4911084e50, iov=0x0, count=0, myframe=0x7f49143d4fe8) at client3_1-fops.c:685
#22 0x00007f49150f78ea in rpc_clnt_submit (rpc=0xd1acd0, prog=0x7f49123b5260, procnum=13, cbkfn=0x7f4912192639 <client3_1_writev_cbk>, proghdr=0x7fff24c45220, proghdrcount=1, progpayload=0x26a5680, progpayloadcount=1, iobref=0x26a5e70,
    frame=0x7f49143d4fe8, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1450
#23 0x00007f4912190148 in client_submit_vec_request (this=0xd07f00, req=0x7fff24c452c0, frame=0x7f49143d4fe8, prog=0x7f49123b5260, procnum=13, cbk=0x7f4912192639 <client3_1_writev_cbk>, payload=0x26a5680, payloadcnt=1,
    iobref=0x26a5730, sfunc=0x7f4914ed8a6b <xdr_from_writev_req>) at client3_1-fops.c:95
#24 0x00007f491219e30f in client3_1_writev (frame=0x7f49143d4fe8, this=0xd07f00, data=0x7fff24c45380) at client3_1-fops.c:3613
#25 0x00007f4912187f09 in client_writev (frame=0x7f49143d4fe8, this=0xd07f00, fd=0x7f490844b0d4, vector=0x26a5680, count=1, off=2670592, iobref=0x26a5730) at client.c:817
#26 0x00007f4911f5f0f9 in dht_writev (frame=0x7f49143d4f44, this=0xd08da0, fd=0x7f490844b0d4, vector=0x26a5680, count=1, off=2670592, iobref=0x26a5730) at dht-common.c:2706
#27 0x00007f4911d27128 in wb_sync (frame=0x7f4914169fdc, file=0x7f4905dadf10, winds=0x7fff24c456d0) at write-behind.c:548
#28 0x00007f4911d2d29e in wb_do_ops (frame=0x7f4914169fdc, file=0x7f4905dadf10, winds=0x7fff24c456d0, unwinds=0x7fff24c456c0, other_requests=0x7fff24c456b0) at write-behind.c:1859
#29 0x00007f4911d2db12 in wb_process_queue (frame=0x7f4914169fdc, file=0x7f4905dadf10) at write-behind.c:2048
#30 0x00007f4911d26859 in wb_sync_cbk (frame=0x7f4914169fdc, cookie=0x7f49143d4dfc, this=0xd0a010, op_ret=-1, op_errno=107, prebuf=0x7fff24c458d0, postbuf=0x7fff24c45860) at write-behind.c:405
#31 0x00007f4911f5ec4c in dht_writev_cbk (frame=0x7f49143d4dfc, cookie=0x7f49143d4ea0, this=0xd08da0, op_ret=-1, op_errno=107, prebuf=0x7fff24c458d0, postbuf=0x7fff24c45860) at dht-common.c:2670
#32 0x00007f491219297f in client3_1_writev_cbk (req=0x7f4911084c0c, iov=0x0, count=0, myframe=0x7f49143d4ea0) at client3_1-fops.c:685
#33 0x00007f49150f78ea in rpc_clnt_submit (rpc=0xd1acd0, prog=0x7f49123b5260, procnum=13, cbkfn=0x7f4912192639 <client3_1_writev_cbk>, proghdr=0x7fff24c45c20, proghdrcount=1, progpayload=0x26a1d50, progpayloadcount=1, iobref=0x26a1f30,
    frame=0x7f49143d4ea0, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1450
#34 0x00007f4912190148 in client_submit_vec_request (this=0xd07f00, req=0x7fff24c45cc0, frame=0x7f49143d4ea0, prog=0x7f49123b5260, procnum=13, cbk=0x7f4912192639 <client3_1_writev_cbk>, payload=0x26a1d50, payloadcnt=1,
    iobref=0x26a1e00, sfunc=0x7f4914ed8a6b <xdr_from_writev_req>) at client3_1-fops.c:95
#35 0x00007f491219e30f in client3_1_writev (frame=0x7f49143d4ea0, this=0xd07f00, data=0x7fff24c45d80) at client3_1-fops.c:3613
#36 0x00007f4912187f09 in client_writev (frame=0x7f49143d4ea0, this=0xd07f00, fd=0x7f490844b0d4, vector=0x26a1d50, count=1, off=5947392, iobref=0x26a1e00) at client.c:817
#37 0x00007f4911f5f0f9 in dht_writev (frame=0x7f49143d4dfc, this=0xd08da0, fd=0x7f490844b0d4, vector=0x26a1d50, count=1, off=5947392, iobref=0x26a1e00) at dht-common.c:2706
#38 0x00007f4911d27128 in wb_sync (frame=0x7f4914169d74, file=0x7f4905dadf10, winds=0x7fff24c460d0) at write-behind.c:548
#39 0x00007f4911d2d29e in wb_do_ops (frame=0x7f4914169d74, file=0x7f4905dadf10, winds=0x7fff24c460d0, unwinds=0x7fff24c460c0, other_requests=0x7fff24c460b0) at write-behind.c:1859
#40 0x00007f4911d2db12 in wb_process_queue (frame=0x7f4914169d74, file=0x7f4905dadf10) at write-behind.c:2048
#41 0x00007f4911d26859 in wb_sync_cbk (frame=0x7f4914169d74, cookie=0x7f49143d4cb4, this=0xd0a010, op_ret=-1, op_errno=107, prebuf=0x7fff24c462d0, postbuf=0x7fff24c46260) at write-behind.c:405
#42 0x00007f4911f5ec4c in dht_writev_cbk (frame=0x7f49143d4cb4, cookie=0x7f49143d4d58, this=0xd08da0, op_ret=-1, op_errno=107, prebuf=0x7fff24c462d0, postbuf=0x7fff24c46260) at dht-common.c:2670
#43 0x00007f491219297f in client3_1_writev_cbk (req=0x7f49110849c8, iov=0x0, count=0, myframe=0x7f49143d4d58) at client3_1-fops.c:685
#44 0x00007f49150f78ea in rpc_clnt_submit (rpc=0xd1acd0, prog=0x7f49123b5260, procnum=13, cbkfn=0x7f4912192639 <client3_1_writev_cbk>, proghdr=0x7fff24c46620, proghdrcount=1, progpayload=0x269ddc0, progpayloadcount=1, iobref=0x26a1c90,
    frame=0x7f49143d4d58, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1450
#45 0x00007f4912190148 in client_submit_vec_request (this=0xd07f00, req=0x7fff24c466c0, frame=0x7f49143d4d58, prog=0x7f49123b5260, procnum=13, cbk=0x7f4912192639 <client3_1_writev_cbk>, payload=0x269ddc0, payloadcnt=1,
    iobref=0x269de70, sfunc=0x7f4914ed8a6b <xdr_from_writev_req>) at client3_1-fops.c:95
#46 0x00007f491219e30f in client3_1_writev (frame=0x7f49143d4d58, this=0xd07f00, data=0x7fff24c46780) at client3_1-fops.c:3613
#47 0x00007f4912187f09 in client_writev (frame=0x7f49143d4d58, this=0xd07f00, fd=0x7f490844b0d4, vector=0x269ddc0, count=1, off=3321856, iobref=0x269de70) at client.c:817


There are more frames in the core file.

I have attached the client log and archived the core file.
Comment 1 Raghavendra G 2011-10-24 00:30:25 EDT
It is a case of stack overflow. Since the transport is not connected, the write frame is unwound synchronously from protocol/client, resulting in an indirect recursion of wb_process_queue. The number of queued write requests is large enough for this recursion to overflow the stack.
Comment 2 Amar Tumballi 2012-10-11 06:03:10 EDT
http://review.gluster.org/3874 should help to fix this; it needs a rebase.
Comment 3 Amar Tumballi 2012-10-11 06:10:08 EDT
*** Bug 797742 has been marked as a duplicate of this bug. ***
Comment 4 Vijay Bellur 2013-02-18 02:33:52 EST
CHANGE: http://review.gluster.org/4515 (performance/write-behind: mark fd bad if any written behind writes fail.) merged in master by Anand Avati (avati@redhat.com)
Comment 5 Raghavendra G 2013-02-22 02:28:37 EST
*** Bug 890472 has been marked as a duplicate of this bug. ***
Comment 6 Vijay Bellur 2013-02-22 15:12:45 EST
CHANGE: http://review.gluster.org/4559 (tests: move common funtion definitions to include.rc) merged in master by Anand Avati (avati@redhat.com)
Comment 7 Vijay Bellur 2013-02-28 18:22:14 EST
CHANGE: http://review.gluster.org/4560 (performance/write-behind: Add test case for fd being marked bad after write failures.) merged in master by Anand Avati (avati@redhat.com)
Comment 8 Vijay Bellur 2013-03-07 00:13:23 EST
CHANGE: http://review.gluster.org/4631 (tests: move common funtion definitions to include.rc) merged in release-3.4 by Anand Avati (avati@redhat.com)
Comment 9 Vijay Bellur 2013-03-07 00:18:15 EST
CHANGE: http://review.gluster.org/4630 (performance/write-behind: mark fd bad if any written behind writes fail) merged in release-3.4 by Anand Avati (avati@redhat.com)
Comment 10 Vijay Bellur 2013-03-07 01:37:37 EST
CHANGE: http://review.gluster.org/4632 (performance/write-behind: Add test case for fd being marked bad after write failures.) merged in release-3.4 by Vijay Bellur (vbellur@redhat.com)