Description of problem:
======================
The mount process went into an infinite indirect recursion with about 49K stack frames in the rpc_clnt_start_ping() codepath before it was aborted. I ran into this issue while trying to recreate https://bugzilla.redhat.com/show_bug.cgi?id=1065332

Excerpt from examination of the core in gdb:
===========================================
(gdb) down
#49092 0x00007fce7033ba32 in client3_3_readv (frame=0x7fce65212aec, this=0x1533fc0, data=0x7fce6a9c6f60) at client-rpc-fops.c:4109
4109            ret = client_submit_request (this, &req, frame, conf->fops,
(gdb) down
#49091 0x00007fce7031d44c in client_submit_request (this=0x1533fc0, req=0x7fce6a9c6e60, frame=0x7fce65212aec, prog=0x7fce70555ba0 <clnt3_3_fop_prog>, procnum=12, cbkfn=0x7fce70336059 <client3_3_readv_cbk>, iobref=0x0, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x7fce6a9c6e50, rsp_payload_count=1, rsp_iobref=0x7fce5d457f20, xdrproc=0x7fce79ddcd62 <xdr_gfs3_read_req>) at client.c:242
242             ret = rpc_clnt_submit (conf->rpc, prog, procnum, cbkfn, &iov, count,
(gdb) down
#49090 0x00007fce7a004575 in rpc_clnt_submit (rpc=0x155f430, prog=0x7fce70555ba0 <clnt3_3_fop_prog>, procnum=12, cbkfn=0x7fce70336059 <client3_3_readv_cbk>, proghdr=0x7fce6a9c6da0, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7fce5d44d160, frame=0x7fce65212aec, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x7fce6a9c6e50, rsp_payload_count=1, rsp_iobref=0x7fce5d457f20) at rpc-clnt.c:1582
1582            rpc_clnt_start_ping (rpc);
(gdb) down
#49089 0x00007fce7a008558 in rpc_clnt_start_ping (rpc_ptr=0x155f430) at rpc-clnt-ping.c:285
285             rpc_clnt_ping(rpc);
(gdb) down
#49088 0x00007fce7a00824b in rpc_clnt_ping (rpc=0x155f430) at rpc-clnt-ping.c:198
198             ret = rpc_clnt_submit (rpc, &clnt_ping_prog,
(gdb) down
#49087 0x00007fce7a004575 in rpc_clnt_submit (rpc=0x155f430, prog=0x7fce7a20faa0 <clnt_ping_prog>, procnum=2, cbkfn=0x7fce7a007ebe <rpc_clnt_ping_cbk>, proghdr=0x0, proghdrcount=0, progpayload=0x0, progpayloadcount=0, iobref=0x7fce5d445610, frame=0x7fce5d4d9384, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1582
1582            rpc_clnt_start_ping (rpc);
(gdb) down
#49086 0x00007fce7a008558 in rpc_clnt_start_ping (rpc_ptr=0x155f430) at rpc-clnt-ping.c:285
285             rpc_clnt_ping(rpc);
(gdb) down
#49085 0x00007fce7a00824b in rpc_clnt_ping (rpc=0x155f430) at rpc-clnt-ping.c:198
198             ret = rpc_clnt_submit (rpc, &clnt_ping_prog,
(gdb) down
#49084 0x00007fce7a004575 in rpc_clnt_submit (rpc=0x155f430, prog=0x7fce7a20faa0 <clnt_ping_prog>, procnum=2, cbkfn=0x7fce7a007ebe <rpc_clnt_ping_cbk>, proghdr=0x0, proghdrcount=0, progpayload=0x0, progpayloadcount=0, iobref=0x7fce5d3fc960, frame=0x7fce5d4d9384, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1582
1582            rpc_clnt_start_ping (rpc);
(gdb) down
#49083 0x00007fce7a008558 in rpc_clnt_start_ping (rpc_ptr=0x155f430) at rpc-clnt-ping.c:285
285             rpc_clnt_ping(rpc);
(gdb) down
#49082 0x00007fce7a00824b in rpc_clnt_ping (rpc=0x155f430) at rpc-clnt-ping.c:198
198             ret = rpc_clnt_submit (rpc, &clnt_ping_prog,
(gdb) down
#49081 0x00007fce7a004575 in rpc_clnt_submit (rpc=0x155f430, prog=0x7fce7a20faa0 <clnt_ping_prog>, procnum=2, cbkfn=0x7fce7a007ebe <rpc_clnt_ping_cbk>, proghdr=0x0, proghdrcount=0, progpayload=0x0, progpayloadcount=0, iobref=0x7fce5d488b40, frame=0x7fce5d4d9384, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1582
1582            rpc_clnt_start_ping (rpc);
.... and so on

Attached the generated core file.

Version-Release number of selected component (if applicable):
Saw this in master with head at commit 5723645715ae25c2de1bbdde1cf793e194e69eff on my Fedora 19 VM.

How reproducible:
Happened 2 out of 3 times.
Steps to Reproduce:
Hit this issue while recreating https://bugzilla.redhat.com/show_bug.cgi?id=1065332, mostly while creating data on the FUSE mount point through a kernel untar.

Actual results:

Expected results:

Additional info:
Root Cause Analysis:
===================
Pranith and I found the following while debugging this issue. Here is the sequence of events that leads to this bug:

1. The fuse thread submits a readv request to the brick from the client xlator via a call to rpc_clnt_submit().
2. The client xlator, after sending the request over the wire, calls rpc_clnt_start_ping().
3. Here conn->ping_started is set to 1 and rpc_clnt_ping() is called.
4. rpc_clnt_ping() in turn calls rpc_clnt_submit() again to submit the ping packet.
5. After the ping packet is sent over the network, and before rpc_clnt_submit() in step 4 calls rpc_clnt_start_ping() again, the epoll thread processes the ping response in rpc_clnt_ping_cbk() and resets conn->ping_started to 0.
6. rpc_clnt_start_ping() in the fuse thread therefore sees conn->ping_started == 0 and sends another ping request. Each time it re-checks conn->ping_started, the epoll thread has already reset it to 0, so the cycle continues until the stack overflows.
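The sequence above can be modeled in a few lines of C. This is a single-threaded sketch, not GlusterFS code: every identifier here (submit_model, run_model, DEPTH_LIMIT, ...) is a hypothetical stand-in, and the epoll thread "winning the race" is simulated by deterministically resetting the flag right after each ping is sent.

```c
/* Single-threaded model of the rpc_clnt_start_ping() mutual recursion.
 * All identifiers are hypothetical stand-ins, not GlusterFS code. */

#define DEPTH_LIMIT 1000      /* artificial stand-in for the real stack limit */

static int ping_started;      /* models conn->ping_started */
static int reset_in_cbk;      /* 1 = old (buggy) behaviour, 0 = fixed */
static int depth, max_depth;

static void start_ping_model(void);

/* Models rpc_clnt_submit(): sends a request, then (as at rpc-clnt.c:1582
 * in the backtrace) unconditionally calls rpc_clnt_start_ping(). */
static void submit_model(int is_ping)
{
    depth++;
    if (depth > max_depth)
        max_depth = depth;
    /* The request goes over the wire here.  For a ping, simulate the epoll
     * thread winning the race: the response arrives and the buggy callback
     * resets ping_started before this thread proceeds. */
    if (is_ping && reset_in_cbk)
        ping_started = 0;
    if (depth < DEPTH_LIMIT)
        start_ping_model();
    depth--;
}

/* Models rpc_clnt_start_ping() -> rpc_clnt_ping() -> rpc_clnt_submit(). */
static void start_ping_model(void)
{
    if (ping_started)
        return;               /* a ping is already in flight */
    ping_started = 1;
    submit_model(1);
}

/* Submit one fop (e.g. the readv from the backtrace) and report the
 * deepest recursion reached. */
int run_model(int buggy)
{
    reset_in_cbk = buggy;
    ping_started = 0;
    depth = max_depth = 0;
    submit_model(0);
    return max_depth;
}
```

With the buggy behaviour the recursion only stops at the artificial depth limit (in the real process, at stack overflow after ~49K frames); without the reset, the flag stays set, the second rpc_clnt_start_ping() returns immediately, and the recursion bottoms out after a single ping.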
REVIEW: http://review.gluster.org/8257 (rpc: Do not reset @ping_started to 0 in ping callback) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/8257 (rpc: Do not reset @ping_started to 0 in ping callback) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/8257 committed in master by Vijay Bellur (vbellur)
------
commit b9f1d1a120b4469d51d2a96eecc7ce83516593ba
Author: Krutika Dhananjay <kdhananj>
Date:   Tue Jul 8 08:18:27 2014 +0530

    rpc: Do not reset @ping_started to 0 in ping callback

    This is to avoid indefinite recursion of the following kind, that
    could lead to a stack overflow:

    rpc_clnt_start_ping() -> rpc_clnt_ping() -> rpc_clnt_submit() ->
    rpc_clnt_start_ping() -> rpc_clnt_ping() -> rpc_clnt_submit() ...
    and so on, since it is possible that before rpc_clnt_start_ping()
    is called a second time by the thread executing this codepath, the
    response to the previous ping request could ALWAYS come in and cause
    the epoll thread to reset conn->ping_started to 0.

    This patch also fixes the issue of excessive ping traffic, which was
    due to the client sending one ping rpc for every fop in the worst case.

    Also removed dead code in glusterd.

    Change-Id: I7c5e6ae3b1c9d23407c0a12a319bdcb43ba7a359
    BUG: 1116243
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/8257
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
REVIEW: http://review.gluster.org/8301 (rpc,glusterd: Set ret to 0 after call to rpc_clnt_submit()) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/8301 (rpc,glusterd: Set ret to 0 after call to rpc_clnt_submit()) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/8301 committed in master by Vijay Bellur (vbellur)
------
commit 3e7c3b25129df098511970c79faf561be08ef87b
Author: Krutika Dhananjay <kdhananj>
Date:   Sat Jul 12 22:31:45 2014 +0530

    rpc,glusterd: Set ret to 0 after call to rpc_clnt_submit()

    This is to guard against a double STACK_DESTROY in the event that
    rpc_clnt_submit() failed and returned -1 anytime after sending the
    request over the wire, in which case the stack could be destroyed
    twice: once in the callback function of the request and once in the
    error codepath of some of the callers of glusterd_submit_request().

    This bug was introduced in http://review.gluster.org/#/c/8257/

    Change-Id: I18a345db6eafc62ba312743201ced7d3f8697622
    BUG: 1116243
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/8301
    Reviewed-by: Harshavardhana <harsha>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
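The double-destroy window that this follow-up patch closes can be sketched as follows. This is a hedged model with hypothetical names (frame_model, submit_model, caller_model), not glusterd code: stack_destroy() stands in for STACK_DESTROY, and `sent` models a request that went over the wire before submit returned -1, so the reply callback will also tear down the frame.

```c
/* Model of the double-STACK_DESTROY guard described in the commit message.
 * All identifiers are hypothetical stand-ins, not GlusterFS code. */
struct frame_model {
    int destroyed;            /* times stack_destroy() ran on this frame */
};

static void stack_destroy(struct frame_model *frame)
{
    frame->destroyed++;
}

/* Models rpc_clnt_submit(): it may return -1 even though the request
 * already went over the wire, in which case the reply callback (modeled
 * here as an immediate destroy) will also clean up the frame. */
static int submit_model(struct frame_model *frame, int sent, int rc)
{
    if (sent)
        stack_destroy(frame); /* callback path owns the frame */
    return rc;
}

/* Models a caller of glusterd_submit_request().  Returns how many times
 * the frame was destroyed; anything above 1 is the bug. */
int caller_model(int sent, int rc, int apply_fix)
{
    struct frame_model frame = { 0 };
    int ret = submit_model(&frame, sent, rc);
    if (apply_fix)
        ret = 0;              /* the patch: callback owns the frame now */
    if (ret < 0)
        stack_destroy(&frame);/* error codepath of the caller */
    return frame.destroyed;
}
```

With the fix applied, a late -1 from submit no longer triggers the caller's error-path destroy, so the frame is torn down exactly once, by the callback.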
A beta release for GlusterFS 3.6.0 has been made available [1]. Please verify whether this release resolves this bug report for you. If the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users