Bug 1116243 - Glusterfs client goes into an infinite recursion in rpc_clnt_start_ping() codepath
Summary: Glusterfs client goes into an infinite recursion in rpc_clnt_start_ping() codepath
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Krutika Dhananjay
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1116253
 
Reported: 2014-07-04 05:15 UTC by Krutika Dhananjay
Modified: 2014-11-11 08:36 UTC
CC: 2 users

Fixed In Version: glusterfs-3.6.0beta1
Clone Of:
: 1116253
Environment:
Last Closed: 2014-11-11 08:36:34 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments

Description Krutika Dhananjay 2014-07-04 05:15:06 UTC
Description of problem:
======================

The mount process went into an infinite indirect recursion with about 49K stack frames in the rpc_clnt_start_ping() codepath before it was aborted. Ran into this issue while I was trying to recreate https://bugzilla.redhat.com/show_bug.cgi?id=1065332

Excerpt from examination of the core in gdb:
===========================================

(gdb) down
#49092 0x00007fce7033ba32 in client3_3_readv (frame=0x7fce65212aec, this=0x1533fc0, 
    data=0x7fce6a9c6f60) at client-rpc-fops.c:4109
4109            ret = client_submit_request (this, &req, frame, conf->fops,
(gdb) down
#49091 0x00007fce7031d44c in client_submit_request (this=0x1533fc0, req=0x7fce6a9c6e60, 
    frame=0x7fce65212aec, prog=0x7fce70555ba0 <clnt3_3_fop_prog>, procnum=12, 
    cbkfn=0x7fce70336059 <client3_3_readv_cbk>, iobref=0x0, rsphdr=0x0, rsphdr_count=0, 
    rsp_payload=0x7fce6a9c6e50, rsp_payload_count=1, rsp_iobref=0x7fce5d457f20, 
    xdrproc=0x7fce79ddcd62 <xdr_gfs3_read_req>) at client.c:242
242             ret = rpc_clnt_submit (conf->rpc, prog, procnum, cbkfn, &iov, count,
(gdb) down
#49090 0x00007fce7a004575 in rpc_clnt_submit (rpc=0x155f430, 
    prog=0x7fce70555ba0 <clnt3_3_fop_prog>, procnum=12, 
    cbkfn=0x7fce70336059 <client3_3_readv_cbk>, proghdr=0x7fce6a9c6da0, proghdrcount=1, 
    progpayload=0x0, progpayloadcount=0, iobref=0x7fce5d44d160, frame=0x7fce65212aec, 
    rsphdr=0x0, rsphdr_count=0, rsp_payload=0x7fce6a9c6e50, rsp_payload_count=1, 
    rsp_iobref=0x7fce5d457f20) at rpc-clnt.c:1582
1582            rpc_clnt_start_ping (rpc);
(gdb) down
#49089 0x00007fce7a008558 in rpc_clnt_start_ping (rpc_ptr=0x155f430) at rpc-clnt-ping.c:285
285             rpc_clnt_ping(rpc);
(gdb) down
#49088 0x00007fce7a00824b in rpc_clnt_ping (rpc=0x155f430) at rpc-clnt-ping.c:198
198             ret = rpc_clnt_submit (rpc, &clnt_ping_prog,
(gdb) down
#49087 0x00007fce7a004575 in rpc_clnt_submit (rpc=0x155f430, 
    prog=0x7fce7a20faa0 <clnt_ping_prog>, procnum=2, cbkfn=0x7fce7a007ebe <rpc_clnt_ping_cbk>, 
    proghdr=0x0, proghdrcount=0, progpayload=0x0, progpayloadcount=0, iobref=0x7fce5d445610, 
    frame=0x7fce5d4d9384, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, 
    rsp_iobref=0x0) at rpc-clnt.c:1582
1582            rpc_clnt_start_ping (rpc);
(gdb) down
#49086 0x00007fce7a008558 in rpc_clnt_start_ping (rpc_ptr=0x155f430) at rpc-clnt-ping.c:285
285             rpc_clnt_ping(rpc);
(gdb) down
#49085 0x00007fce7a00824b in rpc_clnt_ping (rpc=0x155f430) at rpc-clnt-ping.c:198
198             ret = rpc_clnt_submit (rpc, &clnt_ping_prog,
(gdb) down
#49084 0x00007fce7a004575 in rpc_clnt_submit (rpc=0x155f430, 
    prog=0x7fce7a20faa0 <clnt_ping_prog>, procnum=2, cbkfn=0x7fce7a007ebe <rpc_clnt_ping_cbk>, 
    proghdr=0x0, proghdrcount=0, progpayload=0x0, progpayloadcount=0, iobref=0x7fce5d3fc960, 
    frame=0x7fce5d4d9384, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, 
    rsp_iobref=0x0) at rpc-clnt.c:1582
1582            rpc_clnt_start_ping (rpc);
(gdb) down
#49083 0x00007fce7a008558 in rpc_clnt_start_ping (rpc_ptr=0x155f430) at rpc-clnt-ping.c:285
285             rpc_clnt_ping(rpc);
(gdb) down
#49082 0x00007fce7a00824b in rpc_clnt_ping (rpc=0x155f430) at rpc-clnt-ping.c:198
198             ret = rpc_clnt_submit (rpc, &clnt_ping_prog,
(gdb) down
#49081 0x00007fce7a004575 in rpc_clnt_submit (rpc=0x155f430, 
    prog=0x7fce7a20faa0 <clnt_ping_prog>, procnum=2, cbkfn=0x7fce7a007ebe <rpc_clnt_ping_cbk>, 
    proghdr=0x0, proghdrcount=0, progpayload=0x0, progpayloadcount=0, iobref=0x7fce5d488b40, 
    frame=0x7fce5d4d9384, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, 
    rsp_iobref=0x0) at rpc-clnt.c:1582
1582            rpc_clnt_start_ping (rpc);

.... and so on

Attached the generated core file.

Version-Release number of selected component (if applicable):
Saw this in master with head at commit 5723645715ae25c2de1bbdde1cf793e194e69eff on my Fedora 19 VM.

How reproducible:
Happened 2 out of 3 times.

Steps to Reproduce:
Hit this issue while recreating https://bugzilla.redhat.com/show_bug.cgi?id=1065332, mostly while creating data on the fuse mount point via a kernel source untar.

Actual results:

Expected results:


Additional info:

Comment 1 Krutika Dhananjay 2014-07-04 05:44:26 UTC
Root Cause Analysis:
===================

Pranith and I found the following while debugging this issue:

Here is the sequence of events that leads to this bug:

1. The fuse thread submits a readv request destined for the brick; it reaches the client xlator, which hands it to rpc_clnt_submit().
2. The client xlator, after sending the request over the wire, calls rpc_clnt_start_ping().
3. Here conn->ping_started is set to 1 and rpc_clnt_ping() is called.
4. rpc_clnt_ping() in turn calls rpc_clnt_submit() again to submit ping packets.
5. After the ping packet is sent over the network, and before rpc_clnt_submit() from step 4 calls rpc_clnt_start_ping() again, the epoll thread processes the ping response in rpc_clnt_ping_cbk() and resets conn->ping_started to 0.
6. This makes rpc_clnt_start_ping() in the fuse thread send yet another ping request; each time it checks whether a ping should be started (by examining conn->ping_started), the epoll thread has already reset it to 0, so the cycle repeats until the stack overflows (see the sketch below).
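
To make the shape of the recursion concrete, here is a minimal, single-threaded C sketch of the cycle described above. It is not the actual GlusterFS source: the structs carry only the one relevant field, the epoll thread's reset of conn->ping_started is simulated inline by a hypothetical simulate_epoll_ping_cbk() helper, and the recursion depth is capped so the demo terminates instead of overflowing the stack.

#include <stdio.h>
#include <stdlib.h>

/* Reduced stand-ins for the GlusterFS structures; only the field that
 * matters to this bug is kept. */
struct rpc_clnt_connection { int ping_started; };
struct rpc_clnt { struct rpc_clnt_connection conn; };

static int depth;                       /* recursion depth, demo only */

static void rpc_clnt_start_ping(struct rpc_clnt *rpc);

/* Stands in for the epoll thread running rpc_clnt_ping_cbk() and
 * resetting ping_started to 0. In the real bug this happens on another
 * thread; here it is simulated inline to keep the demo single-threaded. */
static void simulate_epoll_ping_cbk(struct rpc_clnt *rpc)
{
        rpc->conn.ping_started = 0;
}

static void rpc_clnt_submit(struct rpc_clnt *rpc)
{
        if (++depth > 20) {             /* cap what was a ~49K-frame stack */
                printf("recursed %d times, stopping the demo\n", depth);
                exit(0);
        }
        /* ... the request would be written to the socket here ... */
        simulate_epoll_ping_cbk(rpc);   /* reply races in before the check */
        rpc_clnt_start_ping(rpc);       /* called after every submission */
}

static void rpc_clnt_ping(struct rpc_clnt *rpc)
{
        rpc_clnt_submit(rpc);           /* the ping is itself an RPC */
}

static void rpc_clnt_start_ping(struct rpc_clnt *rpc)
{
        if (rpc->conn.ping_started)
                return;                 /* never true here: cbk reset it */
        rpc->conn.ping_started = 1;
        rpc_clnt_ping(rpc);
}

int main(void)
{
        struct rpc_clnt rpc = { .conn = { .ping_started = 0 } };
        rpc_clnt_submit(&rpc);          /* a fop submission kicks it off */
        return 0;
}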

Comment 2 Anand Avati 2014-07-08 05:19:55 UTC
REVIEW: http://review.gluster.org/8257 (rpc: Do not reset @ping_started to 0 in ping callback) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 3 Anand Avati 2014-07-09 08:08:40 UTC
REVIEW: http://review.gluster.org/8257 (rpc: Do not reset @ping_started to 0 in ping callback) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 4 Anand Avati 2014-07-09 13:57:25 UTC
COMMIT: http://review.gluster.org/8257 committed in master by Vijay Bellur (vbellur) 
------
commit b9f1d1a120b4469d51d2a96eecc7ce83516593ba
Author: Krutika Dhananjay <kdhananj>
Date:   Tue Jul 8 08:18:27 2014 +0530

    rpc: Do not reset @ping_started to 0 in ping callback
    
    This is to avoid indefinite recursion of the following kind, that could
    lead to a stack overflow:
    
    rpc_clnt_start_ping() -> rpc_clnt_ping() -> rpc_clnt_submit() ->
    rpc_clnt_start_ping() -> rpc_clnt_ping() -> rpc_clnt_submit() ...
    and so on,
    
    since it is possible that before rpc_clnt_start_ping() is called a
    second time by the thread executing this codepath, the response to
    previous ping request could ALWAYS come by and cause epoll thread to
    reset conn->ping_started to 0.
    
    This patch also fixes the issue of excessive ping traffic, which was
    due to the client sending one ping rpc for every fop in the worst case.
    
    Also removed dead code in glusterd.
    
    Change-Id: I7c5e6ae3b1c9d23407c0a12a319bdcb43ba7a359
    BUG: 1116243
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/8257
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
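
For readers skimming the patch, here is a hedged, self-contained C sketch of the idea in the commit title, not the literal change: the ping callback stops resetting conn->ping_started, so a fop submission that races with the ping reply can no longer start another ping from the same stack. The arm_ping_timer() helper and the timer-driven re-arm are illustrative assumptions about how the next ping gets scheduled; the real code uses GlusterFS's own timer and locking machinery.

#include <stdio.h>

/* Simplified stand-ins; not the GlusterFS types or timer API. */
struct rpc_clnt_connection { int ping_started; };
struct rpc_clnt { struct rpc_clnt_connection conn; };

static void rpc_clnt_start_ping(struct rpc_clnt *rpc);

/* Hypothetical helper: register a callback to fire after the ping
 * timeout, instead of calling it from the current stack. */
static void arm_ping_timer(struct rpc_clnt *rpc,
                           void (*fire)(struct rpc_clnt *))
{
        (void)rpc;
        (void)fire;
        printf("next ping deferred to the timer thread\n");
}

/* Ping reply handler. Pre-fix it did the equivalent of
 *         rpc->conn.ping_started = 0;
 * which let a racing fop-submission stack start another ping right
 * away. Post-fix it leaves ping_started alone and defers the next
 * ping to a timer. */
static void rpc_clnt_ping_cbk(struct rpc_clnt *rpc)
{
        arm_ping_timer(rpc, rpc_clnt_start_ping);
}

static void rpc_clnt_start_ping(struct rpc_clnt *rpc)
{
        if (rpc->conn.ping_started)
                return;                 /* a ping cycle is already armed */
        rpc->conn.ping_started = 1;
        /* ... the ping would be sent via rpc_clnt_submit() here ... */
}

int main(void)
{
        struct rpc_clnt rpc = { .conn = { .ping_started = 1 } };
        rpc_clnt_ping_cbk(&rpc);        /* does not recurse into submit */
        return 0;
}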

Comment 5 Anand Avati 2014-07-12 17:11:49 UTC
REVIEW: http://review.gluster.org/8301 (rpc,glusterd: Set ret to 0 after call to rpc_clnt_submit()) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 6 Anand Avati 2014-07-18 06:10:27 UTC
REVIEW: http://review.gluster.org/8301 (rpc,glusterd: Set ret to 0 after call to rpc_clnt_submit()) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 7 Anand Avati 2014-07-27 20:36:52 UTC
COMMIT: http://review.gluster.org/8301 committed in master by Vijay Bellur (vbellur) 
------
commit 3e7c3b25129df098511970c79faf561be08ef87b
Author: Krutika Dhananjay <kdhananj>
Date:   Sat Jul 12 22:31:45 2014 +0530

    rpc,glusterd: Set ret to 0 after call to rpc_clnt_submit()
    
    This is to guard against a double STACK_DESTROY in the event that
    rpc_clnt_submit() failed and returned -1 anytime after sending the
    request over the wire, in which case the stack could be destroyed
    twice: once in the callback function of the request and once in
    the error codepath of some of the callers of glusterd_submit_request().
    This bug was introduced in http://review.gluster.org/#/c/8257/
    
    Change-Id: I18a345db6eafc62ba312743201ced7d3f8697622
    BUG: 1116243
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/8301
    Reviewed-by: Harshavardhana <harsha>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
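
The ownership rule behind this follow-up fix can be shown with a small, self-contained C sketch. The types and helpers below (frame_t, frame_destroy(), submit(), send_request()) are illustrative stand-ins, not GlusterFS APIs: once rpc_clnt_submit() has been called, the reply callback owns the frame, so the caller forces ret to 0 and its error path never destroys the same frame a second time.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a call frame; not the GlusterFS call_frame_t. */
typedef struct { int destroyed; } frame_t;

/* Stand-in for STACK_DESTROY: flags (rather than actually triggers)
 * the use-after-free a real double destroy would cause. */
static void frame_destroy(frame_t *f)
{
        if (f->destroyed) {
                fprintf(stderr, "BUG: frame destroyed twice\n");
                abort();
        }
        f->destroyed = 1;
}

/* Simulates rpc_clnt_submit() failing *after* the request has gone
 * over the wire: the reply callback will still run and destroy the
 * frame, yet the caller sees -1. */
static int submit(frame_t *f)
{
        frame_destroy(f);               /* callback-side destroy */
        return -1;
}

static int send_request(frame_t *f)
{
        int ret = submit(f);

        /* The fix, in spirit: after submit() the frame belongs to the
         * callback path, so force ret to 0 and keep the error path
         * below from destroying it a second time. */
        ret = 0;

        if (ret)
                frame_destroy(f);       /* double destroy without the fix */
        return ret;
}

int main(void)
{
        frame_t frame = { 0 };
        return send_request(&frame);
}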

Comment 8 Niels de Vos 2014-09-22 12:44:26 UTC
A beta release for GlusterFS 3.6.0 has been released [1]. Please verify if the release solves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/

Comment 9 Niels de Vos 2014-11-11 08:36:34 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users

