Bug 869365 - RHEL6 rpcbind is "swallowing" broadcast RPC replies
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rpcbind
Version: 18
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 781880
Depends On: 864056
Blocks:
Reported: 2012-10-23 17:18 UTC by Steve Dickson
Modified: 2018-11-30 21:13 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 864056
Environment:
Last Closed: 2012-12-20 15:53:20 UTC
Type: Bug
Embargoed:



Description Steve Dickson 2012-10-23 17:18:13 UTC
+++ This bug was initially created as a clone of Bug #864056 +++

We're using RHEL5 systems as network boot servers for Solaris/SPARC machines, via the "classical" (feel free to say last-millennium) rarp/bootparams booting. This relies on:
* rarpd
* bootparamd

While attempting to migrate this to RHEL6, we found that although rarpd/rpc.bootparamd are no longer delivered by default, the RHEL5 binaries work just fine, except for a peculiarity in the interaction between RHEL6's TI-RPC rpcbind and rpc.bootparamd.

What happens is that the SPARC systems send broadcast bootparams WHOAMI requests:

14:13:55.550691 IP (tos 0x0, ttl 255, id 1, offset 0, flags [DF], proto UDP (17), length 104)
    cwtestora.biff > 255.255.255.255.sunrpc: [udp sum ok] UDP, length 76

On the Linux side, these are received by rpcbind (this can be seen by starting it in debugging mode and/or by strace'ing rpcbind) and relayed to rpc.bootparamd via their local socket. In debug mode, rpc.bootparamd logs:

bootparamd: whoami got question for 172.24.40.55
This is host cwtestora
cl_addr = 172.24.40.55
if_name = lo, if_addr = 7f000001 , if_mask = ff000000
if_name = eth0, if_addr = ac182824 , if_mask = ffffffc0
Source address 172.24.40.36
Read 28 bytes (0)
Received reply from 172.24.40.62
Read 28 bytes (1)
Received reply from 172.24.40.61
Timed out in recvfrom() (1)
Timed out in recvfrom() (2)
Routers ac18283d (0)
Routers ac18283e (1)
rt_addr = 172.24.40.61
Returning cwtestora   (none)    172.24.40.61
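
For readers of the log above: the if_addr, if_mask and Routers values are printed in hex. A minimal sketch for decoding them, assuming the most significant byte is printed first (which matches "Source address 172.24.40.36" for ac182824). The helper names are made up for illustration and are not part of rpc.bootparamd.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a 32-bit IPv4 value as printed in the debug log (for example
 * 0xac182824) into dotted-quad notation. buf must hold at least 16 bytes. */
const char *dotted_quad(uint32_t addr, char *buf, size_t len)
{
    assert(buf != NULL && len >= 16);
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)((addr >> 24) & 0xff), (unsigned)((addr >> 16) & 0xff),
             (unsigned)((addr >> 8) & 0xff), (unsigned)(addr & 0xff));
    return buf;
}

/* Return 1 if the hex value decodes to the expected dotted-quad string. */
int addr_is(uint32_t addr, const char *expected)
{
    char buf[16];
    return strcmp(dotted_quad(addr, buf, sizeof buf), expected) == 0;
}

/* Return 1 if two addresses fall into the same network under the mask. */
int same_subnet(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) == (b & mask);
}
```

With these, ac182824 decodes to 172.24.40.36, the mask ffffffc0 to 255.255.255.192, and the two routers ac18283d / ac18283e to 172.24.40.61 / 172.24.40.62, all on the same /26 as the client 172.24.40.55.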

But the answer is never sent back to the wire by rpcbind.
One of my colleagues found the following:

==============================================================================
The response is swallowed by the TI-RPC library. In the source of libtirpc, there's this fragment in src/svc_dg.c:

        if (xdr_replymsg(xdrs, msg) &&
            (!has_args || (xprt->xp_auth &&
             SVCAUTH_WRAP(xprt->xp_auth, xdrs, xdr_results, xdr_location)))) {
               ... send the reply over the wire ...
        }
        ... else do nothing ...

I traced it in GDB: marshalling the reply succeeds, but xprt->xp_auth is NULL. If I understand the code correctly, rpcbind sends the reply to WHOAMI on the same ti-rpc connection handle that the original query was received on. However, this is done asynchronously:

- ti-rpc parses the message
- xp_auth is initialized based on the message contents (AUTH_SYS in this case, but it is not really important)
  - rpcbind callback is activated to process the CALLIT command
    - the rpcbproc_callit_com() callback sends the request to bootparamd, and returns immediately
- ti-rpc resets xp_auth to NULL

Later:

- rpcbind receives a reply from bootparamd
- ti-rpc is called to forward the reply to the original caller
- however nothing sets the xp_auth field, so the code quoted above silently drops the message on the floor
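
The sequence above can be modelled in a few lines. This is only a simplified sketch under my own assumptions: the mock_* types and functions are hypothetical stand-ins, not the libtirpc or rpcbind code, and the model assumes the reply always carries results (the has_args case in the fragment quoted above).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the libtirpc types. */
struct mock_auth { int flavor; };
struct mock_xprt {
    struct mock_auth *xp_auth;  /* only valid while a request is dispatched */
    int replies_sent;           /* counts replies that reached the wire */
};

struct mock_auth mock_auth_none = { 0 };

/* Same shape as the quoted check: with results to wrap and no xp_auth,
 * nothing is sent and no error is reported. */
int mock_sendreply(struct mock_xprt *xprt)
{
    assert(xprt != NULL);
    if (xprt->xp_auth == NULL)
        return 0;               /* reply silently dropped */
    xprt->replies_sent++;
    return 1;
}

/* Request path: xp_auth is set for the duration of the callback, which
 * forwards the request and returns at once; then xp_auth is reset. */
void mock_dispatch_callit(struct mock_xprt *xprt, struct mock_auth *auth)
{
    xprt->xp_auth = auth;
    /* the CALLIT callback would forward the request here */
    xprt->xp_auth = NULL;
}

/* Deferred reply path, with or without setting xp_auth first. */
int mock_handle_reply(struct mock_xprt *xprt, int set_auth_first)
{
    int sent;
    if (set_auth_first)
        xprt->xp_auth = &mock_auth_none;
    sent = mock_sendreply(xprt);
    xprt->xp_auth = NULL;
    return sent;
}
```

In this model, mock_handle_reply(&x, 0) after a dispatch returns 0 and sends nothing, while mock_handle_reply(&x, 1) returns 1, which is the behaviour difference seen in GDB.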

The following patch to rpcbind fixes the issue, and rpcbind does send the reply to the WHOAMI request.

--- rpcbind-0.2.0/src/rpcb_svc_com.c.orig       2012-09-12 08:52:45.028057360 -0400
+++ rpcbind-0.2.0/src/rpcb_svc_com.c    2012-09-12 08:57:25.554959759 -0400
@@ -1227,6 +1227,8 @@
        return;
 }
 
+extern SVCAUTH svc_auth_none;
+
 static void
 handle_reply(int fd, SVCXPRT *xprt)
 {
@@ -1293,7 +1295,10 @@
        a.rmt_localvers = fi->versnum;
 
        xprt_set_caller(xprt, fi);
+       xprt->xp_auth = &svc_auth_none;
        svc_sendreply(xprt, (xdrproc_t) xdr_rmtcall_result, (char *) &a);
+       SVCAUTH_DESTROY(xprt->xp_auth);
+       xprt->xp_auth = NULL;
 done:
        if (buffer)
                free(buffer);
==============================================================================

We've tested this by running the RHEL5 rpc.bootparamd on RHEL6 with rpcbind modified as per the above. This results in a successful reply to the above WHOAMI request and subsequently a successful network boot of the SPARC system using a RHEL6 boot server.

We can provide more diagnostic logs (tcpdump packet logs, strace logs of rpc.bootparamd / rpcbind) for the above if you require.

Would it be possible to get the above change into a fix for rpcbind in RHEL6? This issue stops us from rolling out RHEL6 on bootservers.

P.S.
I did some more testing, and while the patch mentioned above demonstrates where the bug is, it is not a complete fix. Even though the patch makes rpcbind work in our tests and therefore it is better than nothing, the response to the CALLIT RPC function goes out on the wrong socket: it uses the socket allocated to communicate between rpcbind and rpc.bootparamd, instead of the socket where the original request from the remote client was received. As a consequence, the response packet has the wrong source port: it is sent using a random port in the 0-1023 range instead of the standard sunrpc port (111).
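
The source-port effect can be shown in isolation with plain UDP sockets on loopback. This is a minimal sketch, not rpcbind code: all function names are made up for illustration, and kernel-chosen ports stand in for port 111 and for the forwarding socket's port.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Create a UDP socket bound to 127.0.0.1 on a kernel-chosen port.
 * Returns the descriptor, or -1 on error. */
int bound_udp_socket(void)
{
    struct sockaddr_in sa;
    struct timeval tv = { 2, 0 };   /* do not block forever in recvfrom() */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;                /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Local port of a bound socket in host byte order, or -1 on error. */
int local_port(int fd)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof sa;
    assert(fd >= 0);
    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return -1;
    return ntohs(sa.sin_port);
}

/* Send one datagram from from_fd to the port of to_fd and return the
 * source port the receiver observes, or -1 on error. */
int observed_source_port(int from_fd, int to_fd)
{
    struct sockaddr_in dst, src;
    socklen_t len = sizeof src;
    char buf[8];
    int port = local_port(to_fd);
    if (port < 0)
        return -1;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    dst.sin_port = htons((unsigned short)port);
    if (sendto(from_fd, "reply", 5, 0,
               (struct sockaddr *)&dst, sizeof dst) != 5)
        return -1;
    if (recvfrom(to_fd, buf, sizeof buf, 0,
                 (struct sockaddr *)&src, &len) < 0)
        return -1;
    return ntohs(src.sin_port);
}
```

If one socket plays the role of the listening socket and another the forwarding socket, a reply sent through the forwarding socket arrives at the client carrying the forwarding socket's port, not the listener's. A client that checks the source port of the reply would therefore see a mismatch.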

If you could figure out how to make the response go out on the right socket/port, that would be great; if not, we would still like to see the simple patch applied.

--- Additional comment from jiali on 2012-10-08 22:30:09 EDT ---

Set qa_ack+. For the reproducer, refer to rpc.bootparamd.

Comment 1 Fedora Update System 2012-10-23 17:36:19 UTC
rpcbind-0.2.0-20.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/rpcbind-0.2.0-20.fc18

Comment 2 Fedora Update System 2012-10-23 17:42:26 UTC
rpcbind-0.2.0-19.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/FEDORA-2012-16150/rpcbind-0.2.0-19.fc17

Comment 3 Fedora Update System 2012-10-23 19:44:41 UTC
Package rpcbind-0.2.0-20.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing rpcbind-0.2.0-20.fc18'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-16728/rpcbind-0.2.0-20.fc18
then log in and leave karma (feedback).

Comment 4 Honza Horak 2012-11-05 09:05:34 UTC
*** Bug 781880 has been marked as a duplicate of this bug. ***

Comment 5 Fedora Update System 2012-12-20 15:53:22 UTC
rpcbind-0.2.0-20.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

