Bug 1331554 - rpcbind closes connection when rpc fragment sent using multiple send() calls
Summary: rpcbind closes connection when rpc fragment sent using multiple send() calls
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libtirpc
Version: 7.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 7.5
Assignee: Steve Dickson
QA Contact: Zhi Li
URL:
Whiteboard:
Depends On:
Blocks: 1420851 1469559 1477664 1546181 1546815 1577173
 
Reported: 2016-04-28 19:17 UTC by Frank Sorenson
Modified: 2021-09-09 11:50 UTC
CC List: 10 users

Fixed In Version: libtirpc-0.2.4-0.16.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1641875 (view as bug list)
Environment:
Last Closed: 2019-08-06 12:40:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer (3.31 KB, text/x-csrc)
2016-04-28 19:17 UTC, Frank Sorenson
no flags Details
tcpdump with rpc call sent using a single send() call (2.36 KB, application/octet-stream)
2016-04-28 19:20 UTC, Frank Sorenson
no flags Details
tcpdump with rpc call divided between two send() calls (4 bytes and 60 bytes) (1.22 KB, application/octet-stream)
2016-04-28 19:21 UTC, Frank Sorenson
no flags Details
tcpdump with rpc call divided between two send() calls (8 bytes and 56 bytes) (760 bytes, application/octet-stream)
2016-04-28 19:22 UTC, Frank Sorenson
no flags Details
Patch - test_short_rpc.c false EOF (1.56 KB, patch)
2018-10-19 01:47 UTC, Ian Kent
no flags Details | Diff
Patch - fix fragmented read EAGAIN (1.25 KB, patch)
2018-10-22 02:48 UTC, Ian Kent
no flags Details | Diff
Patch - fix fragmented read EAGAIN (with conditional expectdata check) (1.31 KB, patch)
2018-10-22 02:59 UTC, Ian Kent
no flags Details | Diff
Patch - fix EOF detection on non-blocking socket (2.74 KB, patch)
2018-10-25 02:37 UTC, Ian Kent
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2061 0 None None None 2019-08-06 12:40:20 UTC

Description Frank Sorenson 2016-04-28 19:17:06 UTC
Created attachment 1152043 [details]
reproducer

Description of problem:

When sending an rpc fragment, if the byte stream is divided among multiple send() calls, rpcbind will close the connection.


Version-Release number of selected component (if applicable):

libtirpc-0.2.1-10.el6.x86_64
rpcbind-0.2.0-11.el6.x86_64


How reproducible:

reproducer attached


Steps to Reproduce:

# gcc test_short_rpc.c -o test_short_rpc
# ./test_short_rpc localhost 4

use any hostname and a number of bytes to send in the first send() call. The remainder of the bytes in the message (64 bytes total) will be sent in the second send() call.
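
To illustrate, the core of the test is splitting a single TCP RPC record across two send() calls. A minimal sketch (this is not the attached test_short_rpc.c; the function and parameter names are placeholders, and the call body is assumed to be pre-encoded by the caller):

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* Sketch only: build one RPC record (4-byte record mark + call body)
 * and deliver it with two send() calls.  'first_chunk' is the number of
 * bytes handed to the first send(). */
static int send_split_rpc(int sock, const unsigned char *call,
                          size_t call_len, size_t first_chunk)
{
    unsigned char rec[4 + call_len];
    uint32_t mark = htonl(0x80000000u | (uint32_t)call_len); /* last-fragment bit + length */

    memcpy(rec, &mark, 4);
    memcpy(rec + 4, call, call_len);

    if (send(sock, rec, first_chunk, 0) != (ssize_t)first_chunk)
        return -1;
    /* rpcbind should reassemble the two sends into one record, but with
     * the bug it closes the connection instead. */
    if (send(sock, rec + first_chunk, sizeof(rec) - first_chunk, 0) !=
        (ssize_t)(sizeof(rec) - first_chunk))
        return -1;
    return 0;
}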


Actual results:

# /root/test_short_rpc localhost 4
1: sent 4 bytes... sent 60 bytes...  received 0 bytes
2: caught SIGPIPE


Expected results:

# /root/test_short_rpc localhost 4
1: sent 4 bytes... sent 60 bytes...  received 32 bytes
2: sent 4 bytes... sent 60 bytes...  received 32 bytes
3: sent 4 bytes... sent 60 bytes...  received 32 bytes
4: sent 4 bytes... sent 60 bytes...  received 32 bytes
5: sent 4 bytes... sent 60 bytes...  received 32 bytes
6: sent 4 bytes... sent 60 bytes...  received 32 bytes
...


Additional info:

if the number of bytes is < 4, the calls will succeed:
# /root/test_short_rpc localhost 2
1: sent 2 bytes... sent 62 bytes...  received 32 bytes
2: sent 2 bytes... sent 62 bytes...  received 32 bytes
3: sent 2 bytes... sent 62 bytes...  received 32 bytes
4: sent 2 bytes... sent 62 bytes...  received 32 bytes

if the number of bytes == 4, rpcbind will return '0' bytes and close the connection:

# /root/test_short_rpc localhost 4
1: sent 4 bytes... sent 60 bytes...  received 0 bytes
2: caught SIGPIPE


for any other number of bytes, rpcbind will disconnect during the second send() call:
# /root/test_short_rpc localhost 8
1: sent 8 bytes... sent 56 bytes...  Error 104: Connection reset by peer - calling recv


I'll attach some tcpdumps as well.

Comment 1 Frank Sorenson 2016-04-28 19:20:03 UTC
Created attachment 1152046 [details]
tcpdump with rpc call sent using a single send() call

Comment 2 Frank Sorenson 2016-04-28 19:21:42 UTC
Created attachment 1152047 [details]
tcpdump with rpc call divided between two send() calls (4 bytes and 60 bytes)

Comment 3 Frank Sorenson 2016-04-28 19:22:36 UTC
Created attachment 1152048 [details]
tcpdump with rpc call divided between two send() calls (8 bytes and 56 bytes)

Comment 5 Steve Dickson 2016-06-03 18:52:49 UTC
I've figured out what is happening and why it's happening and how to
fix it... but I'm not sure the fix is the right thing to do.

Basically the socket the TCP connection is accepted on (call the 
rendezvous socket) becomes non-blocking for (in my opinion) 
an arbitrary reason. 

The reproducer causes the read of that rendezvous socket to return
EAGAIN. The problem is there is no logic to recover from that error
so the read just fails. The reproducer gets the SIGPIPE when the
connected socket is closed (I think).

The fix is not to make the rendezvous socket non-blocking... But 
I don't understand why the socket becomes non-blocking in the 
first place.

The only reason the socket is becoming non-blocking is because, in 
rendezvous_request(), the maxrec is non-zero, which it will
always be since the default value is non-zero.

So I need to go to upstream to see if anybody knows why this
non-blocking code exists... The non-blocking is self-induced, meaning 
a user cannot set it, so maybe it was a good idea gone bad?

I wonder if the code still exists in later versions of libtirpc...

Comment 6 Karl Abbott 2016-06-15 12:04:16 UTC
Steve,

Any new findings here from your work with upstream?

Cheers,
Karl

Comment 7 Dave Wysochanski 2016-06-20 13:58:46 UTC
Recommending for 6.9 filesystem priority list based on:
- reproducer
- problem understood by developer (patch status unclear)

Comment 8 Efraim Marquez-Arreaza 2016-06-24 14:22:05 UTC
@Steve any update from the upstream work?

Thanks,
Efraim

Comment 9 Steve Dickson 2016-07-07 17:59:47 UTC
(In reply to Efraim Marquez-Arreaza from comment #8)
> @Steve any update from the upstream work?
> 
> Thanks,
> Efraim

Well, I sent out the email a while ago
    https://sourceforge.net/p/libtirpc/mailman/message/35139459/

with no response. I need to get my head around this again
and work up some patches...

Comment 10 Karl Abbott 2016-08-01 20:27:50 UTC
Steve,

Just checking in -- any new progress to report?

Cheers,
Karl

Comment 12 Steve Dickson 2016-08-17 20:11:09 UTC
(In reply to Karl Abbott from comment #10)
> Steve,
> 
> Just checking in -- any new progress to report?
No... unfortunately I never got back to it... 

We might have to move this to 7.4... will that
be a big deal?

Comment 15 Karl Abbott 2017-03-15 20:29:48 UTC
Sorry for the delay in my response here but if we can get this into 6.10 that would be great.

Otherwise, if 6.10 is not an option, let's definitely proceed with trying to get this into 7.4.

Cheers,
Karl

Comment 24 Steve Dickson 2018-05-06 14:59:24 UTC
Question... what is this reproducer simulating? The norm is to
send rpc messages in one send not two...

The fix is:
diff -up libtirpc-0.2.4/src/svc_vc.c.orig libtirpc-0.2.4/src/svc_vc.c
--- libtirpc-0.2.4/src/svc_vc.c.orig	2018-05-05 13:28:22.847009922 -0400
+++ libtirpc-0.2.4/src/svc_vc.c	2018-05-05 13:31:55.617366203 -0400
@@ -493,19 +493,6 @@ read_vc(xprtp, buf, len)
 
 	cfp = (struct cf_conn *)xprt->xp_p1;
 
-	if (cfp->nonblock) {
-		len = read(sock, buf, (size_t)len);
-		if (len < 0) {
-			if (errno == EAGAIN)
-				len = 0;
-			else
-				goto fatal_err;
-		}
-		if (len != 0)
-			gettimeofday(&cfp->last_recv_time, NULL);
-		return len;
-	}
-
 	do {
 		pollfd.fd = sock;
 		pollfd.events = POLLIN;

Basically this has TCP reads use poll() instead of non-blocking reads,
but before I propose this patch to upstream I would like a little more
ammo as to why sending RPC messages in multiple sends is necessary.

Comment 25 Frank Sorenson 2018-06-22 14:50:20 UTC
I'm not entirely sure what the customer's use case is/was, but it appears to have been an app (perhaps internal?) that built and sent the rpc in sections, rather than saving it all for a single send().

This is similar to the kernel issue resolved in this upstream kernel commit:
commit 1f691b07c5dc51b2055834f58c0f351defd97f27
Author: J. Bruce Fields <bfields>
Date:   2013-06-26 10:55:40 -0400

    svcrpc: don't error out on small tcp fragment
    
    Though clients we care about mostly don't do this, it is possible for
    rpc requests to be sent in multiple fragments.  Here we have a sanity
    check to ensure that the final received rpc isn't too small--except that
    the number we're actually checking is the length of just the final
    fragment, not of the whole rpc.  So a perfectly legal rpc that's
    unluckily fragmented could cause the server to close the connection
    here.
    
    Cc: stable.org
    Signed-off-by: J. Bruce Fields <bfields>

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index df74919c81c0..305374d4fb98 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1095,7 +1095,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
                goto err_noclose;
        }
 
-       if (svc_sock_reclen(svsk) < 8) {
+       if (svsk->sk_datalen < 8) {
                svsk->sk_datalen = 0;
                goto err_delete; /* client is nuts. */
        }

So my two (conflicting) thoughts are:
 * it is 'perfectly legal' and was appropriate for -stable kernel, so makes sense we ought to fix libtirpc similarly
 * customer is no longer pushing for the fix, and the case itself was closed last week as the customer may have abandoned the request, so this may not be important.  At the least, the urgency is lower now (adjusting priority accordingly)

Comment 26 Steve Whitehouse 2018-07-03 10:40:40 UTC
What have multiple sends got to do with this? The issue appears to be on the receive side. Are the socket watermarks set here? Using poll seems like the usual solution to reading from a socket, and we should not block in the case that poll says there is data to read. That said, there should be no issue if reads are set to be non-blocking too.

Steve D, can you update this bug since it is on the 7.6 RPL, and we need to figure out what its current status should be.

Comment 27 Steve Dickson 2018-07-10 16:57:32 UTC
With the following rpms

nfs-utils-1.3.0-0.55.el7.x86_64
libtirpc-0.2.4-0.10.1.el7.x86_64

I was no longer able to reproduce this problem
Could someone please verify this finding?

Comment 28 Zhi Li 2018-07-11 10:18:55 UTC
(In reply to Steve Dickson from comment #27)
> With the following rpms
> 
> nfs-utils-1.3.0-0.55.el7.x86_64
> libtirpc-0.2.4-0.10.1.el7.x86_64
> 
> I was no longer able to reproduce this problem
> Could someone please verify this finding?


It seems that the problem can still be reproduced with the rpms specified. Here are the test results:

[root@xxx]# rpm -q nfs-utils
nfs-utils-1.3.0-0.55.el7.x86_64
[root@xxx]# rpm -q libtirpc
libtirpc-0.2.4-0.10.el7.x86_64

[root@xxx]# ls
test_short_rpc-1  test_short_rpc-1.c

[root@xxx]# ./test_short_rpc-1 localhost 4
1: sent 4 bytes... sent 60 bytes...  received 32 bytes
2: sent 4 bytes... sent 60 bytes...  received 0 bytes
3: caught SIGPIPE

Comment 29 Steve Whitehouse 2018-07-19 10:30:40 UTC
SteveD, can you give a status update here? Since this is on the RPL for 7.6 we need to figure out a plan fairly quickly.

Comment 30 Steve Dickson 2018-07-20 16:28:24 UTC
(In reply to Steve Whitehouse from comment #29)
> SteveD, can you give a status update here? Since this is on the RPL for 7.6
> we need to figure out a plan fairly quickly.

I'm just concerned about ripping out that code... I guess 
I would rather have the customer change the app to make
it work than change a library that affects all apps.

Comment 31 Steve Whitehouse 2018-08-02 13:47:59 UTC
The question is what is wrong here? If the library is assuming anything at all about the way in which the incoming data is broken up into receive calls from the socket, then it is incorrect. So where is the code that had the incorrect assumptions about the data that's being received?

Comment 32 Steve Dickson 2018-08-02 15:54:52 UTC
(In reply to Steve Whitehouse from comment #31)
> The question is what is wrong here? If the library is assuming anything at
> all about the way in which the incoming data is broken up into receive calls
> from the socket, then it is incorrect. So where is the code that had the
> incorrect assumptions about the data that's being received?

I really don't know what logic was behind making the read on
an accepted tcp connection non-blocking... but that code
has been there since the library was introduced, in '07.

This is the first time any app had a problem with that
logic... now that the customer has closed the ticket and
has moved on... So I see no reason to tinker with
mainstream code that only one app has an issue
with...

Comment 33 Steve Whitehouse 2018-08-03 13:05:34 UTC
I'm still not sure that I understand what's happening here. In comment #24 you mentioned multiple sends for an RPC, yet that should be irrelevant to the receiving end, assuming that the receiving end is written correctly.

Also, whether the socket is blocking or non-blocking (comment #32) is also not relevant either, in that in both cases you still have a stream socket which can provide the data in any sized chunks that it wants to. So in all cases you need the receiving end to parse the incoming stream to figure out where the message boundaries are. If it is not doing that, then it is a bug that we need to fix.

I'm still not clear if that receiving code is actually in the libtirpc or not? Can you confirm? That is what I was trying to get at in comment #31

Comment 34 Steve Dickson 2018-08-07 12:34:47 UTC
(In reply to Steve Whitehouse from comment #33)
> I'm still not sure that I understand what's happening here. In comment #24
> you mentioned multiple sends for an RPC, yet that should be irrelevant to
> the receiving end, assuming that the receiving end is written correctly.
> 
> Also, whether the socket is blocking or non-blocking (comment #32) is also
> not relevant either, in that in both cases you still have a stream socket
> which can provide the data in any sized chunks that it wants to. So in all
> cases you need the receiving end to parse the incoming stream to figure out
> where the message boundaries are. If it is not doing that, then it is a bug
> that we need to fix.
It is a corner case since 99% of the apps that use this code
use UDP (locally), in which there are no message boundaries, which is
the reason we have not seen this issue in the 11 years
libtirpc has been out there. 

In my opinion, mucking around in the critical path for that 1%
is not worth the aggravation. If more apps run into this
problem, which I doubt, then we should take another look.
 
> I'm still not clear if that receiving code is actually in the libtirpc or
> not? Can you confirm? That is what I was trying to get at in comment #31
The code is in read_vc(), which is used to read stream sockets.

Comment 36 Ian Kent 2018-10-19 01:47:25 UTC
Created attachment 1495528 [details]
Patch - test_short_rpc.c false EOF

It seems that, for this case, test_short_rpc.c also suffers
from false EOFs resulting in a zero length read after write.

Comment 39 Steve Whitehouse 2018-10-21 07:59:54 UTC
Hi - can you explain a bit more about how that patch works. Polling for 200ms intervals in a loop doesn't look like non-blocking to me, or is this intended to wait for a whole RPC and only "non-block" between messages?

Comment 41 Ian Kent 2018-10-21 14:00:31 UTC
(In reply to Steve Whitehouse from comment #39)
> Hi - can you explain a bit more about how that patch works. Polling for
> 200ms intervals in a loop doesn't look like non-blocking to me, or is this
> intended to wait for a whole RPC and only "non-block" between messages?

I hope that my reply to SteveD answered most of your questions.

The poll() is in a loop but I don't expect the non-blocking
case (or the blocking case for that matter) to traverse it
more than once. Either there really isn't any data to be had,
an error occurs, or we get some data and return appropriately.

I could leave the poll() loop in place and put a single short
timeout poll() in the non-blocking conditional branch.

But personally I don't think the while part of the loop will
get exercised.

Comment 43 Steve Whitehouse 2018-10-21 15:55:59 UTC
Well what I was wondering is why the poll loop is required at all.... if you want to read, that can be done in a non-blocking way and you can find out then whether there is any data to read. That seems much simpler to me.

Also, why the 200ms timeout... that is not what I'd normally expect from something that is non-blocking - I'd expect it to return immediately, either with data or without.

Perhaps I've not understood quite what is expected of this code, but it looks suspicious to me, and I suspect it could be simplified depending on quite what the expected result is intended to be.

Comment 44 Ian Kent 2018-10-22 00:47:05 UTC
(In reply to Steve Whitehouse from comment #43)
> Well what I was wondering is why the poll loop is required at all.... if you
> want to read, that can be done in a non-blocking way and you can find out
> then whether there is any data to read. That seems much simpler to me.

There are two reasons for the loop, one sensible and one not so
much.

First there's the need to handle EINTR premature returns which
are rare.

Second I seem to recall reading that poll(2) can sometimes see
false returns (other than EINTR) but I can't find the literature
where I saw that now so maybe I've imagined it and maybe it
isn't a problem.

I don't know why we are seeing an EOF when we expect data on
the socket.

I do understand why non-segmented reads don't see this problem.

The listening socket, the so-called rendezvous socket, will
accept() the incoming connection and create an RPC transport
to receive requests on the socket. After the create it returns
all the way to the outermost service loop and calls poll()
(there are still some questions about the topmost loop code
in rpcbind but I'll not go there now as I haven't studied
that code enough to comment).

All this takes a while so poll() returns with data ready and
the first read is done. If the request is sent in one
chunk all is good but if it is split into two or more reads
the second read is done immediately and finds the data is not
yet available. This causes a fail return (XPRT_DIED) which is
the false EOF (on the socket not the transport).

This is why I think a small delay is needed to handle this
case.

But as I have said, maybe I need to spend more time on it,
there may be a different way to deal with the problem.

All that's really needed is for the false EOF receive to
return XPRT_IDLE rather than XPRT_DIED. But telling the
difference between a real socket EOF and a false one is
the problem.

> 
> Also, why the 200ms timeout... that is not what I'd normally expect from
> something that is non-blocking - I'd expect it to return immediately, either
> with data or without.

Well, yes you are right, I expect that too, but that's not
what happens for the problem case.

If I haven't described it sufficiently well I can try again?

As for the 200ms, it's arbitrary and due to the way the loop
is written it should rarely wait that long, except for real
error conditions.

I'm pretty sure the timeout can be made shorter, I'll check
that. Most likely all that's needed is a context switch so
it may be able to be made very short indeed.
 
> 
> Perhaps I've not understood quite what is expected of this code, but it
> looks suspicious to me, and I suspect it could be simplified depending on
> quite what the expected result is intended to be.

The intent of the code is to accumulate multiple partial
receives for a given request so in that respect this is
a bug, possibly introduced by behavioural changes to
socket IO over time.

Perhaps it could be simplified, depending on how the error
case is identified, if we constrain it to read_vc() where
it happens I think the code is about as simple as it can
be in handling the needed poll() return cases.

Of course I'm open to suggestions.

And perhaps we will get some interesting alternative
suggestions from upstream too and maybe my understanding
of the code is not as complete as I think, ;)

Ian

Comment 45 Ian Kent 2018-10-22 02:48:26 UTC
Created attachment 1496267 [details]
Patch - fix fragmented read EAGAIN

This patch was my first attempt and also resolves the reported
problem (assuming the reproducer is also fixed).

It's simpler but has some problems.

Looking further I see that SOCK_DGRAM sockets use a different
code path so we are dealing only with SOCK_STREAM sockets that
are blocking or non-blocking.

Also expectdata will always be FALSE for blocking sockets and
EAGAIN won't be returned by them either due to the poll()
used for them.

The conditional return of XPRT_MOREREQS results in an
immediate retry always but returning XPRT_IDLE for EAGAIN
will cause an exit to the top level service loop where
poll() is called and also resolves the problem.

But this patch still has the problem of not properly
identifying a false EOF on the socket.

If a client sends part of a request and then does nothing
it will remain connected and this could be done with many
sockets or if it continually sends zero bytes (I think)
rpcbind can be made to perform a tight loop; both are DoS attacks.

Another problem with this approach is that it is a little
fragile in that it depends on expectdata to continue to be
set as it is now for calls to __xdrrec_getrec() which could
change over time with other development efforts.

This is why I went with the previous patch.

Ian

Comment 46 Ian Kent 2018-10-22 02:59:23 UTC
Created attachment 1496268 [details]
Patch - fix fragmented read EAGAIN (with conditional expectdata check)

Sorry, the previous EAGAIN patch didn't have the conditional
return I spoke about in the discussion of the patch.

This is the patch I was talking about.

I've tested both.

Nevertheless both EAGAIN patches have the problems I
described.

Comment 47 Ian Kent 2018-10-22 03:12:45 UTC
Finally I tried a couple of different timeout values for the
poll() of the original patch I posted.

A timeout setting of 20ms works fine, I didn't go any lower.

As I said I suspect it's the context switch the poll() sleep
will cause that probably fixes this since the reproducer is
using the local loopback interface.

So, in a sense, it is a slightly different additional problem
in itself.

Comment 48 Ian Kent 2018-10-22 03:18:21 UTC
(In reply to Ian Kent from comment #47)
> Finally I tried a couple of different timeout values for the
> poll() of the original patch I posted.
> 
> A timeout setting of 20ms works fine, I didn't go any lower.
> 
> As I said I suspect it's the context switch the poll() sleep
> will cause that probably fixes this since the reproducer is
> using the local loopback interface.
> 
> So, in a sense, it is a slightly different additional problem
> in itself.

Another closing thought.

How long should the timeout be to identify a real false
socket EOF over a remote network connection, maybe longer
than 20ms; maybe it isn't needed in that case?

Comment 49 Steve Whitehouse 2018-10-22 15:32:50 UTC
My concern has been just for the stream sockets (i.e. tcp) here. Datagram works in a very different way, and looks more sensible to me.

As per earlier in this bug, how the requests are sent is irrelevant. Since these are stream sockets, it is perfectly valid for a read to return any number of bytes from the stream greater than zero. If it is a non-blocking socket, then zero is ok too.

I'm not aware of any cases when poll gives the wrong answer, however the whole poll thing seems like a red herring to me. If you want to make this robust, which seems like a good plan, then we should deal with the DoS case of servers sending partial RPCs and then stopping I think. It is not easy though, and a timeout might be one appropriate solution, but really it is much easier to just add the timeout to the read call in the first place by setting the RCVTIMEO. No need to poll at all.

It might also be appropriate to set the low water mark on the socket too, probably to 4 bytes initially, in order to read the length of the RPC and then to whatever the length is (minus what has already been received). That will reduce the number of reads in case the server sends one byte at a time.

That will save a lot of that looping, which is not really getting us anywhere beyond being confusing I think. We should only need one loop to read in the RPC completely.

It would be good to try and get a clear statement of what the various conditions are that need to be handled in this code, since that might help to clarify things and gives something to test the code against.
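
For reference, the two socket options mentioned above would be set roughly like this (sketch only, not from any posted patch; as the next comment notes, neither option helps once the socket is non-blocking):

#include <sys/socket.h>
#include <sys/time.h>

/* Sketch: cap how long a blocking recv() may wait, and ask the kernel not
 * to satisfy a blocking read with fewer than 4 bytes (the record mark). */
static int set_rcv_limits(int sock)
{
    struct timeval tv = { .tv_sec = 0, .tv_usec = 200000 };  /* 200ms, the timeout discussed above */
    int lowat = 4;

    if (setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;
    return setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));
}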

Comment 50 Ian Kent 2018-10-23 02:17:46 UTC
(In reply to Steve Whitehouse from comment #49)
> My concern has been just for the stream sockets (i.e. tcp) here. Datagram
> works in a very different way, and looks more sensible to me.

Right, as I mentioned, after looking again the code we are looking
at is only concerned with stream sockets.

> 
> As per earlier in this bug, how the requests are sent is irrelevant. Since
> these are stream sockets, it is perfectly valid for a read to return any
> number of bytes from the stream greater than zero. If it is a non-blocking
> socket, then zero is ok too.

That's right of course and is essentially a statement of the
problem, zero byte reads on a non-blocking socket are not handled
properly.

> 
> I'm not aware of any cases when poll gives the wrong answer, however the
> whole poll thing seems like a red herring to me. If you want to make this
> robust, which seems like a good plan, then we should deal with the DoS case
> of servers sending partial RPCs and then stopping I think. It is not easy
> though, and a timeout might be one appropriate solution, but really it is
> much easier to just add the timeout to the read call in the first place by
> setting the RCVTIMEO. No need to poll at all.

Again that's a perfectly sensible view.

But RCVTIMEO socket option doesn't apply to non-blocking sockets.

And it's the library caller that sets the socket non-blocking so
the library probably shouldn't change that.

> 
> It might also be appropriate to set the low water mark on the socket too,
> probably to 4 bytes initially, in order to read the length of the RPC and
> then to whatever the length is (minus what has already been received). That
> will reduce the number of reads in case the server sends one byte at a time.

Any non-zero value would do because the request accumulation
code will handle it, just not zero length reads.

But it appears RCVLOWAT also can't be used with non-blocking
sockets.

> 
> That will save a lot of that looping, which is not really getting us
> anywhere beyond being confusing I think. We should only need one loop to
> read in the RPC completely.

Not sure I understand your concern.

If a request is sent in multiple writes there's no way to
know the delay between each so a loop is going to be needed,
at least to some extent. Ideally the outermost read loop.

I suspect you're talking about minimising the number of times
the loop needs to be traversed, which is definitely what we
want. 

But if it's just the loop in read_vc() then that's more a
construct to handle EINTR and can be implemented without
the while. If it's the use of poll() at all on a non-blocking
socket then it's much harder to handle the zero length reads.

> 
> It would be good to try and get a clear statement of what the various
> conditions are that need to be handled in this code, since that might help
> to clarify things and gives something to test the code against.

Sure, there are several fields in the connection structure
that are used to check progress and such that I haven't paid
much attention to, I'll spend a bit of time looking at them
and have a go at describing this.

Ian

Comment 51 Steve Whitehouse 2018-10-23 06:53:17 UTC
If the problem is that you've been passed a non-blocking socket, but you need to make a call that blocks (even for just a set period of time) then you can set MSG_WAITALL. The advantage of the MSG_DONTWAIT/MSG_WAITALL flags is that you can use them for a single call and it doesn't change the overall blocking status of the socket, so will not affect any other uses of the same socket.
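
A sketch of that per-call usage (illustrative only, not proposed library code; the helper name is made up):

#include <sys/types.h>
#include <sys/socket.h>

/* Sketch: read the 4-byte record mark.  MSG_DONTWAIT makes just this
 * recv() non-blocking; MSG_WAITALL makes a blocking recv() wait for the
 * full count.  Neither flag changes the socket's O_NONBLOCK state. */
static ssize_t recv_record_mark(int sock, unsigned char *buf)
{
    ssize_t n = recv(sock, buf, 4, MSG_DONTWAIT);
    if (n > 0 && n < 4) {
        /* Partial mark: wait, on this call only, for the remaining bytes. */
        ssize_t rest = recv(sock, buf + n, (size_t)(4 - n), MSG_WAITALL);
        if (rest < 0)
            return rest;
        n += rest;
    }
    return n;   /* bytes of the mark read, 0 on EOF, -1 with errno (e.g. EAGAIN) */
}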

Comment 52 Ian Kent 2018-10-23 10:50:09 UTC
(In reply to Steve Whitehouse from comment #51)
> If the problem is that you've been passed a non-blocking socket, but you
> need to make a call that blocks (even for just a set period of time) then
> you can set MSG_WAITALL. The advantage of the MSG_DONTWAIT/MSG_WAITALL flags
> is that you can use them for a single call and it doesn't change the overall
> blocking status of the socket, so will not affect any other uses of the same
> socket.

I can't see how that can help.

Don't forget this is rpcbind, it's a single threaded RPC server so
non-blocking sockets must be used and you must not block on socket
reads for an extended length of time or request throughput will
stop periodically while waiting for the input.

The problem is only that a zero length read return causes XPRT_DIED
to be set in the RPC transport status which closes the connection.

The assumption that data will be available on the socket is only
valid on first entry to the reading functions but the case we have
is where the request is split over multiple writes from the client.
In that case a zero length read doesn't necessarily mean it's a
dead socket.

This assumption is plain wrong given the code tries hard to handle
the multiple records of multiple requests. It's the record oriented
IO handling that makes the code so ugly!

Comment 53 Steve Whitehouse 2018-10-23 13:19:47 UTC
So if it is supposed to be properly non-blocking, then lets make it such. In that case a poll with a timeout of 200ms is a bug. We should just call read (of some sort) and deal with the return status in the right way.

I'm still not quite sure that we are understanding each other, and perhaps I'm missing something here...

Multiple requests from the client don't make any sense. We are talking about a stream socket, and there is no such thing as a message boundary with a stream socket - it is just a stream of bytes. You have no way to know at the receiver how many requests the client took to send that stream of bytes. How the client sends the message doesn't matter, except in that it might take a long time to do so, and thus take up a lot of local resources, if it sends many such slow, or never ending requests. As such RPC prepends a length to the start of each RPC request so we know how long it is.

We know the minimum amount of data per request which is: 4 bytes plus sizeof RPC message struct. So we can set that as our initial read size from the socket. Once we have that four bytes we can update the target size accordingly.

A highly simplified version, just for the purposes of illustration, should look like this:


#define BUFSZ 4096
char buffer[BUFSZ];
unsigned off = 0;

int rcv_rpc(int sk)
{
        int target = (off >= 4) ? ntohl(*(__be32 *)buffer) - off : (4 + sizeof(struct rpcmsg));
        int ret;

        ret = recv(sk, buffer + off, target, MSG_DONTWAIT);

        if (ret > 0)
                off += ret;
        else if (ret < 0) {
                switch(errno) {
                case EAGAIN:
                        /* nothing to read */
                        break;
                case EINTR:
                        /* interrupted and nothing to read */
                        break;
                default:
                        perror("recv");
                }
        }
        return ret;
}


Obviously you'll need better buffer management to make sure that it is not overflowed, and I assume that the buffer will be attached to the connection somehow in the library. The assumption is that the upper layer calls this when it knows that the socket has data ready, but it is equally ok to call on spec and it will not block even if the socket is not ready.

This will allow the splitting of different messages into different buffers. It will not read beyond the end of an rpc message in any one read due to the way in which the target is calculated. Bearing in mind the way in which the rpc messages are encapsulated in tcp, that is about as good as we can make it I think.

Does that make what I'm getting at a bit clearer? That poll rather sticks out as a red flag that something is not right. I have to rush off now again, so will have to catch up later on :-)

Comment 54 Ian Kent 2018-10-24 00:41:31 UTC
(In reply to Steve Whitehouse from comment #53)
> So if it is supposed to be properly non-blocking, then lets make it such. In
> that case a poll with a timeout of 200ms is a bug. We should just call read
> (of some sort) and deal with the return status in the right way.

LOL, the second patch I posted above does just that.

Unfortunately I didn't add a description to it so it's not
clear what it's about.

I have confirmed that this call path only affects stream
socket reads but I still need to spend a little more thought
on it.

I'll do that and see where it leads.

If using poll() is a bug then we should be discussing the
second patch and forget about the poll() altogether.

> 
> I'm still not quite sure that we are understanding each other, and perhaps
> I'm missing something here...

Yes, there is some sort of misunderstanding here, not sure 
exactly what it is.

> 
> Multiple requests from the client don't make any sense. We are talking
> about a stream socket, and there is no such thing as a message boundary with
> a stream socket - it is just a stream of bytes. You have no way to know at
> the receiver how many requests the client took to send that stream of bytes.
> How the client sends the message doesn't matter, except in that it might
> take a long time to do so, and thus take up a lot of local resources, if it
> sends many such slow, or never ending requests. As such RPC prepends a
> length to the start of each RPC request so we know how long it is.

Sure, and the existing code works hard to manage that, I'm not
concerned about it but the record oriented structure of requests
does make the code, well, ugly.

The only problem with it (AFAICS) is the boolean that indicates
the header has been received (and hence the length of the body
is now known) is not set in the connection structure (that tracks
the progress) when it has been successfully received which also
causes problems for the case of this bug.

> 
> We know the minimum amount of data per request which is: 4 bytes plus sizeof
> RPC message struct. So we can set that as our initial read size from the
> socket. Once we have that four bytes we can update the target size
> accordingly.

My point then is it isn't the ->read_vc() function that needs work.

I think all that you are saying, by the look of the code fragment,
is that it needs to be ok for the function to return 0 which it
already does anyway.

The problem is the caller of the ->read_vc() function doesn't
properly handle this case, as you can see from the second patch
I posted for discussion.

snip ...

> 
> Does that make what I'm getting at a bit clearer? That poll rather sticks
> out as a red flag that something is not right. I have to rush off now again,
> so will have to catch up later on :-)

I think so but hopefully looking at my second patch will help
with the discussion.

Personally I don't really care if poll() is used, it just seemed
like a simple way to identify a false EOF as the libtirpc code
incorrectly assumes a zero length read on a non-blocking socket
"is" an EOF and kills the connection.

Probably what we should be discussing is how to identify real
dead, but still open, sockets so they don't accumulate.

I was thinking that it might be enough to record the number
of zero length reads in the connection tracking struct, and
return XPRT_DIED when some sensible limit is exceeded, and
a non fatal status otherwise.
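
As a schematic of that idea (the names are hypothetical; the real libtirpc connection structure and status handling differ):

enum xprt_stat { XPRT_DIED, XPRT_MOREREQS, XPRT_IDLE };  /* mirrors the status values discussed */

struct conn_track {              /* stand-in for the per-connection tracking struct */
    unsigned int zero_reads;     /* consecutive zero-length reads seen */
};

#define ZERO_READ_LIMIT 8        /* arbitrary illustrative limit */

static enum xprt_stat after_zero_read(struct conn_track *ct)
{
    if (++ct->zero_reads > ZERO_READ_LIMIT)
        return XPRT_DIED;        /* give up: treat the EOF as real and drop the client */
    return XPRT_IDLE;            /* otherwise a non-fatal status: back to the top-level poll() */
}

static void after_data_read(struct conn_track *ct)
{
    ct->zero_reads = 0;          /* real progress resets the counter */
}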

Comment 55 Ian Kent 2018-10-24 02:47:40 UTC
(In reply to Ian Kent from comment #54)
> (In reply to Steve Whitehouse from comment #53)
> > So if it is supposed to be properly non-blocking, then lets make it such. In
> > that case a poll with a timeout of 200ms is a bug. We should just call read
> > (of some sort) and deal with the return status in the right way.
> 
> LOL, the second patch I posted above does just that.
> 
> Unfortunately I didn't add a description to it so it's not
> clear what it's about.
> 
> I have confirmed that this call path only affects stream
> socket reads but I still need to spend a little more thought
> on it.
> 
> I'll do that and see where it leads.
> 
> If using poll() is a bug then we should be discussing the
> second patch and forget about the poll() altogether.

I posted two subsequent patches that don't use poll() so
lets talk about them.

In order to understand the question posed by the difference
between these two patches we need to understand the effect
of the return values at certain points of the request read
(something that was asked earlier).

rpcbind uses its own svc_run() function; it implements what
we'll call the "top most loop". Essentially it poll()s all
its active socket descriptors for activity and calls (in
our case) svc_getreq_poll() from libtirpc.

Despite the function name, svc_getreq_poll() doesn't call
poll(); it implements the request read loop, which we'll
call the "read loop".

svc_getreq_poll() loops calling a macro SVC_RECV().

This ends up calling svc_vc_recv(), we'll skip describing
the rendezvous client transport accept() and transport
setup as it doesn't directly relate to the problem.

It's svc_vc_recv() that calls __xdrrec_getrec(), the
function where the change to the return value is needed
to handle zero length reads.

Everything in this call path returns a boolean.

The read loop uses the boolean to identify if the request
has been fully received and if so processes it and continues.

The read loop termination condition is based on the value
of the connection struct status which can have three values,
XPRT_DIED, XPRT_IDLE or XPRT_MOREREQS.

XPRT_MOREREQS means continue the read loop, XPRT_IDLE means
I'm done for now (return to the top most loop and wait for
more activity), and lastly XPRT_DIED means terminate this
socket connection, it's broken.

The non-blocking read case calls __xdrrec_getrec() with its
expectdata parameter set to true which causes a zero length
read on a non-blocking socket to return XPRT_DIED so the
connection is terminated, the cause of the problem in this
bug.

Now we can understand the difference between these two patches.

The first returns XPRT_MOREREQS for zero length reads causing
an immediate read retry in the read loop while the second patch
causes a return to the top most loop to utilize the poll() to
wait for more data on the connection.

In both patches the problem is identifying errant clients that
are taking up resources and need to be terminated since XPRT_DIED
is no longer returned.

That's why I was thinking about tracking the number of zero
length reads on a connection to identify if a connection
should be terminated, that limit being different depending
on which of the above approaches are used.

That still leaves the problem of accumulation of idle sockets
but that can be done for any connection by opening it, sending
at least one valid request (to construct the transport that
adds a descriptor to listen on) and then doing nothing more.
So maybe this isn't such an immediate concern.

There may be a simpler way to do this but nothing comes to
mind, suggestions welcome though.

Comment 56 Ian Kent 2018-10-24 05:30:21 UTC
(In reply to Ian Kent from comment #36)
> Created attachment 1495528 [details]
> Patch - test_short_rpc.c false EOF
> 
> It seems that, for this case, test_short_rpc.c also suffers
> from false EOFs resulting in a zero length read after write.

I thought this was a bit strange and it is.

My claim is rubbish, the reproducer works fine as long as
libtirpc is fixed, any of the three patches I posted allows
the reproducer to function as it should and wait for the
reply.

Comment 57 Ian Kent 2018-10-25 02:33:23 UTC
(In reply to Steve Whitehouse from comment #53)
> So if it is supposed to be properly non-blocking, then lets make it such. In
> that case a poll with a timeout of 200ms is a bug. We should just call read
> (of some sort) and deal with the return status in the right way.
> 
> I'm still not quite sure that we are understanding each other, and perhaps
> I'm missing something here...

Well, I'm glad you persisted with this because I was wrong about
this (partly anyway).

The patch I'm about to post makes libtirpc behave exactly as you
describe, as it should, and handles the EOF detection of a dead
socket. Hopefully it is more like what you expected.

Implicit assumptions, I'm caught by these all too often, and by
definition you can't be aware of when you make them.

Sorry, for the waste of time trying to get through to me.

Comment 58 Ian Kent 2018-10-25 02:37:30 UTC
Created attachment 1497259 [details]
Patch - fix EOF detection on non-blocking socket

Well, EWOULDBLOCK is not returned but is allowed to be returned
so it's added to the check.
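
In other words (sketch only; the actual change is the attached patch), the non-blocking read path treats both errno values as "no data yet" rather than as an EOF:

#include <errno.h>

/* Sketch: after a failed read() on a non-blocking socket, EAGAIN or
 * EWOULDBLOCK means "try again later", not end-of-file. */
static int read_would_block(void)
{
    return errno == EAGAIN || errno == EWOULDBLOCK;
}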

Comment 59 Steve Whitehouse 2018-10-25 08:47:32 UTC
I've not followed all the logic through, but that does look like it is doing the right thing now. Thanks for taking another look.

Comment 60 Ian Kent 2018-10-25 10:14:09 UTC
(In reply to Steve Whitehouse from comment #59)
> I've not followed all the logic through, but that does look like it is doing
> the right thing now. Thanks for taking another look.

Yeah, I think so too.

It could be better but would mean changing the calling semantics
of ->read_vc() and would increase the potential for side effects
a lot more.

This patch does its best to minimise the calling semantic change
and restricts the change to non-blocking sockets only which reduces
the scope for side effects a lot so checking for them is not as
difficult.

I still need to check for side effects one more time before posting
it to the upstream list.

Thanks for your help with this.

Comment 61 Steve Dickson 2018-10-26 13:22:10 UTC
(In reply to Ian Kent from comment #58)
> Created attachment 1497259 [details]
> Patch - fix EOF detection on non-blocking socket
> 
> Well, EWOULDBLOCK is not returned but is allowed to be returned
> so it's added to the check.

Would you mind posting this to the upstream list 
   Libtirpc-devel Mailing List <libtirpc-devel.net>

tia,

Comment 62 Ian Kent 2018-10-27 01:19:46 UTC
(In reply to Steve Dickson from comment #61)
> (In reply to Ian Kent from comment #58)
> > Created attachment 1497259 [details]
> > Patch - fix EOF detection on non-blocking socket
> > 
> > Well, EWOULDBLOCK is not returned but is allowed to be returned
> > so it's added to the check.
> 
> Would you mind posting this to the upstream list 
>    Libtirpc-devel Mailing List <libtirpc-devel.net>

Yes, I plan to, just haven't got around to doing a final check
for possible side effects.

Comment 63 Steve Dickson 2018-11-08 18:35:27 UTC
Upstream commit:

commit 3a17941dbbfcc8e9fcf08004992615a774443a0d 
Author: Ian Kent <raven>
Date:   Thu Nov 8 13:26:57 2018 -0500

    Fix EOF detection on non-blocking socket

Comment 71 Zhi Li 2018-12-25 03:33:49 UTC
Moving to VERIFIED according to comment #70.

Comment 75 errata-xmlrpc 2019-08-06 12:40:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2061

