Bug 228893

Summary: RPC clnt_create call on x86_64 passes back NULL pointer with error code of "Success"
Product: Red Hat Enterprise Linux 4 Reporter: Jeff Layton <jlayton>
Component: nfs-utilsAssignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0CC: coughlan, staubach, steved
Target Milestone: ---Keywords: Regression
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0217 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-05-01 23:21:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
strace -s 1024 of rpc.gssd
none
nfs-utils-lib patch -- change all u_int32_t definitions in rpc_msg.h to u_long
none
patch2: alternate patch -- remove extraneous header files and clean up functions
none
a patch that eliminates the clnt_create
none
This patch removes the inclusion of any glibc routines from librpcsecgss
none
specfile patch for patch 148480
none
patch to make gssd include auth_gss.h none

Description Jeff Layton 2007-02-15 19:40:13 UTC
Opening this as a glibc problem, but it may turn out to be an issue with
nfs-utils or some related library. I could use some help tracking down the cause
if so, however. The problem is this:

When a client tries to do a krb5 or krb5i mount, rpc.gssd eventually calls
clnt_create like this:

        printerr(2, "creating %s client for server %s\n", clp->protocol,
                        clp->servername);
        if ((rpc_clnt = clnt_create(clp->servername, clp->prog, clp->vers,
                                        clp->protocol)) == NULL) {
                snprintf(rpc_errmsg, sizeof(rpc_errmsg),
                         "WARNING: can't create rpc_clnt for server "
                         "%s for user with uid %d", clp->servername, uid);
                printerr(0, "%s\n", clnt_spcreateerror(rpc_errmsg));
                goto out_fail;
        }

On i386, this seems to work fine and the call returns a valid pointer. On
x86_64, however, this clnt_create call consistently returns a NULL pointer but
with an error message of "Success". The error and debug messages look like this:

creating tcp client for server host.foo.redhat.com
WARNING: can't create rpc_clnt for server host.foo.redhat.com for user with uid
0: RPC: Success

This makes the krb5 context creation and the mount fail.

I attempted to write a small reproducer that just does the clnt_create call
passing in the same values as the call above. That consistently worked and
passed back a valid pointer. So my guess is that this issue seems to be related
to something else happening prior to this call getting made.

Still, I'm having some real trouble tracking down why I'm getting a NULL pointer
here. Since it works on i386, I'm guessing that the problem is related to data
type sizes, but I'm not clear on whether this is a bug in glibc or something
else we're doing earlier on...

Comment 1 Jakub Jelinek 2007-02-15 22:24:00 UTC
Can you attach strace -s 1024 output of the testcase? Can you under debugger
hit breakpoint on the snprintf or printerr and print rpc_createerr ?

Comment 2 Jeff Layton 2007-02-15 22:34:50 UTC
Created attachment 148155 [details]
strace -s 1024 of rpc.gssd

Here's the strace of rpc.gssd. I think the clnt_create happens around line 799.
All the system calls seem to be OK as far as I can tell.

Comment 3 Jeff Layton 2007-02-15 22:40:58 UTC
Output from gdb session. Let me know if this isn't what you needed:

(gdb) b gssd_proc.c:487
Breakpoint 1 at 0x403731: file gssd_proc.c, line 487.
(gdb) run -f -vvvvvv
Starting program: /usr/sbin/rpc.gssd -f -vvvvvv
Using keytab file '/etc/krb5.keytab'
Processing keytab entry for principal
'nfs/warthog.gsslab.rdu.redhat.com'
We will use this entry (nfs/warthog.gsslab.rdu.redhat.com)
Using (machine) credentials cache: 'FILE:/tmp/krb5cc_machine_EXAMPLE.COM'
processing client list

Program received signal SIG37, Real-time event 37.
0x0000003597abd96f in poll () from /lib64/tls/libc.so.6
(gdb) cont
Continuing.
processing client list
handling krb5 upcall
Using keytab file '/etc/krb5.keytab'
INFO: Credentials in CC 'FILE:/tmp/krb5cc_machine_EXAMPLE.COM' are good until
1171665540
using FILE:/tmp/krb5cc_machine_EXAMPLE.COM as credentials cache for machine creds
creating context using euid 0 (save_uid 0)
creating tcp client for server dhcp231-99.rdu.redhat.com

Breakpoint 1, create_auth_rpc_client (clp=0x50d1b0, auth_return=0x7fbffff708,
uid=0, authtype=Variable "authtype" is not available.
) at gssd_proc.c:487
487                     snprintf(rpc_errmsg, sizeof(rpc_errmsg),
(gdb) p rpc_createerr
$1 = 0



Comment 4 Jakub Jelinek 2007-02-15 22:59:26 UTC
Guess x/10wx rpc_createerr instead of p rpc_createerr would be needed.
Anyway, it might be best if you could install glibc-debuginfo and
just step through clnttcp_create to see where the call fails.
From the strace it seems everything up to __connect succeeded,
but then the socket was closed and NULL returned, so this would mean probably
that:
  call_msg.rm_xid = _create_xid ();
  call_msg.rm_direction = CALL;
  call_msg.rm_call.cb_rpcvers = RPC_MSG_VERSION;
  call_msg.rm_call.cb_prog = prog;
  call_msg.rm_call.cb_vers = vers;

  /*
   * pre-serialize the static part of the call msg and stash it away
   */
  INTUSE(xdrmem_create) (&(ct->ct_xdrs), ct->ct_mcall, MCALL_MSG_SIZE,
                         XDR_ENCODE);
  if (!INTUSE(xdr_callhdr) (&(ct->ct_xdrs), &call_msg))
    {
      if (ct->ct_closeit)
        {
          (void) __close (*sockp);
        }
      goto fooy;
    }

failed.  If you get to the xdr_callhdr call, can you print in gdb call_msg
structure?  Also, step into it and if it really fails see which field couldn't
be encoded?

Comment 5 Jeff Layton 2007-02-16 01:37:29 UTC
Ok, I think I see what may be happening:

It looks like we're getting into xdr_callhdr, which eventually does this in the
big if statement:

INTUSE(xdr_u_long) (xdrs, &(cmsg->rm_call.cb_prog))

...in xdr_u_long, it does this:

    case XDR_ENCODE:
      if (sizeof (uint32_t) != sizeof (u_long)
          && (uint32_t) *ulp != *ulp)
        return FALSE;

...from gdb in xdr_callhdr:

(gdb) p cmsg->ru.RM_cmb.cb_prog
$7 = 230175057546

230175057546 == 359780A68A

So the second condition in the if statement in xdr_u_long fails and returns
FALSE. I need to look over the code a little more closely and see where the
cb_prog value is actually coming from.


Comment 6 Jeff Layton 2007-02-16 13:30:32 UTC
It looks like the call_msg struct getting passed into xdr_callhdr is being
corrupted somehow, but only in the xdr_callhdr frame. Here's some excerpts from gdb:

200             if (! xdr_callhdr(&(ct->ct_xdrs), &call_msg)) {
(gdb) p call_msg
$13 = {rm_xid = 1172209978, rm_direction = CALL, ru = {RM_cmb = {
      cb_rpcvers = 2, cb_prog = 100003, cb_vers = 4, cb_proc = 53, cb_cred = {
        oa_flavor = -1750220896, 
        oa_base = 0x197c32da0 <Address 0x197c32da0 out of bounds>, 
        oa_length = 3221221920}, cb_verf = {oa_flavor = 5306656, 
        oa_base = 0x511310 "", oa_length = 2505562416}}, RM_rmb = {
      rp_stat = 2, ru = {RP_ar = {ar_verf = {oa_flavor = 4, 
            oa_base = 0x3597adbfa0
"\205�t�\203�\001L\211�t\0021�H\213\\$\b\203�\001H\2113w\035H\213=n\036\025",
oa_length = 2546150816}, 
          ar_stat = 3221221920, ru = {AR_versions = {low = 5306656, high = 0}, 
            AR_results = {where = 0x50f920 "tcp", proc = 0x511310}}}, RP_dr = {
          rj_stat = 4, ru = {RJ_versions = {low = 53, high = 2544746400}, 
            RJ_why = 53}}}}}}
(gdb) step

Breakpoint 2, xdr_callhdr (xdrs=0x50fc88, cmsg=0x7fbffff1a0) at rpc_prot.c:163
163     {
(gdb) p * cmsg
$14 = {rm_xid = 1172209978, rm_direction = 2, ru = {RM_cmb = {
      cb_rpcvers = 227633266692, cb_prog = 230178013088, cb_vers = 6841118112, 
      cb_proc = 548682068512, cb_cred = {oa_flavor = 5306656, 
        oa_base = 0x511310 "", oa_length = 2505562416}, cb_verf = {
        oa_flavor = 5306656, 
        oa_base = 0x45d5f28d <Address 0x45d5f28d out of bounds>, 
        oa_length = 747838}}, RM_rmb = {rp_stat = 4, ru = {RP_ar = {ar_verf = {
            oa_flavor = -1750220896, 
            oa_base = 0x197c32da0 <Address 0x197c32da0 out of bounds>, 
            oa_length = 3221221920}, ar_stat = 5306656, ru = {AR_versions = {
              low = 5313296, high = 182894188848}, AR_results = {
              where = 0x511310 "", 
              proc = 0x2a9557d930 <_nss_files_getprotobyname_r>}}}, RP_dr = {
          rj_stat = 2544746400, ru = {RJ_versions = {low = 6841118112, 
              high = 548682068512}, RJ_why = 2546150816}}}}}}

(gdb) bt
#0  xdr_callhdr (xdrs=0x50fc88, cmsg=0x7fbffff1a0) at rpc_prot.c:166
#1  0x0000003e7a909ce5 in clnttcp_create (raddr=Variable "raddr" is not available.
) at clnt_tcp.c:200
#2  0x0000003e7a9091d8 in clnt_create (hostname=Variable "hostname" is not
available.
) at clnt_generic.c:99
#3  0x00000000004035c3 in create_auth_rpc_client (clp=0x50e0f0, 
    auth_return=0x7fbffff708, uid=0, authtype=Variable "authtype" is not available.
) at gssd_proc.c:485
#4  0x00000000004038f1 in handle_krb5_upcall (clp=0x50e0f0) at gssd_proc.c:569
#5  0x00000000004027d6 in gssd_run () at gssd_main_loop.c:76
#6  0x000000000040259b in main (argc=Variable "argc" is not available.
) at gssd.c:141
#7  0x0000003597a1c3fb in __libc_start_main (main=0x4023c0 <main>, argc=3, 
    ubp_av=0x7fbffff9c8, init=0x405450 <__libc_csu_init>, fini=Variable "fini"
is not available.
)
    at ../sysdeps/generic/libc-start.c:209
#8  0x00000000004022fa in _start ()
#9  0x0000007fbffff9b8 in ?? ()
#10 0x000000000000001c in ?? ()
#11 0x0000000000000003 in ?? ()
#12 0x0000007fbffffbbc in ?? ()
#13 0x0000007fbffffbcf in ?? ()
#14 0x0000007fbffffbd2 in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) p cmsg
$20 = (struct rpc_msg *) 0x7fbffff1a0
(gdb) f 1
#1  0x0000003e7a909ce5 in clnttcp_create (raddr=Variable "raddr" is not available.
) at clnt_tcp.c:200
200             if (! xdr_callhdr(&(ct->ct_xdrs), &call_msg)) {
(gdb) p &call_msg
$21 = (struct rpc_msg *) 0x7fbffff1a0

...so everything looks fine until we enter xdr_callhdr, at which point the cmsg
struct looks garbled (at least everything in it starting at rm_direction). It
looks like there is some confusion between the two frames about the size of the
elements in this struct:

(gdb) f 0
#0  xdr_callhdr (xdrs=0x50fc88, cmsg=0x7fbffff1a0) at rpc_prot.c:166
166       cmsg->rm_call.cb_rpcvers = RPC_MSG_VERSION;
(gdb) p sizeof (cmsg->ru.RM_cmb.cb_rpcvers)
$25 = 8
(gdb) f 1
#1  0x0000003e7a909ce5 in clnttcp_create (raddr=Variable "raddr" is not available.
) at clnt_tcp.c:200
200             if (! xdr_callhdr(&(ct->ct_xdrs), &call_msg)) {
(gdb) p sizeof (call_msg.ru.RM_cmb.cb_rpcvers)
$26 = 4

...maybe some problem with includes?


Comment 7 Jeff Layton 2007-02-16 14:03:28 UTC
I probably should also post what versions I'm using:

nfs-utils-1.0.6-77.EL4
nfs-utils-lib-1.0.6-5
glibc-2.3.4-2.25

...though I've also tested with glibc-2.3.4-2.36 and the U4 version of nfs-utils.



Comment 8 Jeff Layton 2007-02-16 14:49:00 UTC
clnttcp_create is called recursively when we query the portmapper. When we do
the portmapper call, however, there seems to be agreement between clnttcp_create
and xdr_callhdr about the sizes of these struct members:

(gdb) bt
#0  xdr_callhdr (xdrs=0x50fc38, cmsg=0x7fbffff060) at rpc_prot.c:163
#1  0x0000003597ae5fb3 in clnttcp_create (raddr=Variable "raddr" is not available.
) at clnt_tcp.c:205
#2  0x0000003597ae7c61 in *__GI_pmap_getport (address=0x7fbffff260, 
    program=100003, version=4, protocol=6) at pm_getport.c:113
#3  0x0000003e7a909db6 in clnttcp_create (raddr=0x7fbffff260, prog=100003, 
    vers=4, sockp=0x7fbffff24c, sendsz=0, recvsz=0) at clnt_tcp.c:145
#4  0x0000003e7a9091d8 in clnt_create (hostname=Variable "hostname" is not
available.
) at clnt_generic.c:99
#5  0x00000000004035c3 in create_auth_rpc_client (clp=0x50e0f0, 
    auth_return=0x7fbffff708, uid=0, authtype=Variable "authtype" is not available.
) at gssd_proc.c:485
#6  0x00000000004038f1 in handle_krb5_upcall (clp=0x50e0f0) at gssd_proc.c:569
#7  0x00000000004027d6 in gssd_run () at gssd_main_loop.c:76
#8  0x000000000040259b in main (argc=Variable "argc" is not available.
) at gssd.c:141
#9  0x0000003597a1c3fb in __libc_start_main (main=0x4023c0 <main>, argc=3, 
    ubp_av=0x7fbffff9c8, init=0x405450 <__libc_csu_init>, fini=Variable "fini"
is not available.
)
(gdb) f 0
#0  xdr_callhdr (xdrs=0x50fc38, cmsg=0x7fbffff060) at rpc_prot.c:163
163     {
(gdb) p sizeof (cmsg->ru.RM_cmb.cb_rpcvers)
$11 = 8
(gdb) f 1
#1  0x0000003597ae5fb3 in clnttcp_create (raddr=Variable "raddr" is not available.
) at clnt_tcp.c:205
205       if (!INTUSE(xdr_callhdr) (&(ct->ct_xdrs), &call_msg))
(gdb) p sizeof (call_msg.ru.RM_cmb.cb_rpcvers)
$12 = 8

...so whatever this is doesn't seem to be affecting the client created for the
portmapper conversation.


Comment 9 Jeff Layton 2007-02-16 19:35:06 UTC
My current thinking on this is that this is an nfs-utils bug. I see the
following header in the nfs-utils package:

support/rpc/include/rpc/rpc_msg.h

...and that header defines this:

struct call_body {
        u_int32_t cb_rpcvers;   /* must be equal to two */
        u_int32_t cb_prog;
        u_int32_t cb_vers;
        u_int32_t cb_proc;
        struct opaque_auth cb_cred;
        struct opaque_auth cb_verf; /* protocol specific - provided by client */
};

...but the glibc headers seem to define these as u_long instead of u_int32_t.
There are some other, similar issues in here as well.

I tried deleting this file from the source tree and it seemed like the package
picked up the glibc header file and built, but I still ran into the same issue.
Still, I'm thinking that the problem is likely the headers that are in the
nfs-utils tarball.

Jakub, thanks for the help. I'll ping you if it comes back to looking like a
glibc problem.



Comment 10 Jeff Layton 2007-02-17 02:25:39 UTC
The -65.EL package works fine even on a 4.5 install, but the -69 package does
not. So something broke between those two releases. Unfortunately, I can't seem
to get the -65 package to build on 4.5. So it's not yet clear whether the
problem is related to something in the nfs-utils package itself or maybe some
header that got pulled in during the build process.

Trying to determine that now...


Comment 11 Jeff Layton 2007-02-17 02:46:44 UTC
Got -65 to build on my 4.5 machine (pilot error before) and the mount still
worked afterward. I'm taking that as confirmation that the problem is something
in the nfs-utils package and not anything to do glibc.

I also built a -67 package and it worked, so the problem seems to lie somewhere
in the changes between -67 and -69.


Comment 12 Jeff Layton 2007-02-17 03:01:03 UTC
-68 is also broken. So the issue is the delta between -67 and -68. That seems to
be where libnfsidmap, librpcsecgss and libgssapi were broken out into their own
packages and the corresponding changes were made to the build process for nfs-utils.

Comment 13 Jeff Layton 2007-02-18 22:57:36 UTC
I think the problem may be in librpcsecgss. It looks like that package has a set
of headers similar to what's included in the nfs-utils SRPM, and it also looks
like it has some global symbols that conflict with libc that might be a factor here.

0000000000000000 T clnt_create
0000000000000000 T clnttcp_create

I'll probably try to do a quick fix tomorrow to see if I can confirm whether
that's the problem and then see what might be the best way to fix it.


Comment 14 Jeff Layton 2007-02-19 13:20:03 UTC
Created attachment 148317 [details]
nfs-utils-lib patch -- change all u_int32_t definitions in rpc_msg.h to u_long

That is indeed the problem. This patch seems to correct the issue, but the
whole situation with duplicate functions and the way the headers are in
nfs-utils-lib is a bit of a mess. There are number of symbols that are provided
by both libc and librpcsecgss, and their headers don't necessarily match.

So this patch seems to fix the immediate problem, but I'm not clear on whether
it'll break anything else, or (more likely) if there are similar bugs lurking
in nfs-utils/nfs-utils-lib due to the fact that it uses what seems to be older
duplicate headers.

Comment 15 RHEL Program Management 2007-02-19 13:24:10 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 16 Jeff Layton 2007-02-19 15:06:16 UTC
Created attachment 148322 [details]
patch2: alternate patch -- remove extraneous header files and clean up functions

This patch + a specfile patch to remove extraneous header files might be a
better way to go. This cleans up function definitions and fixes up the
Makefile.am to not include header files that are bad duplicates of glibc
headers.

There's a later patch that also renames the functions to make sure we don't
collide with glibc function names. We should consider that if we want to go
this route.

Alternately, it may be better to just rebase the lib on RHEL5's version (0.14).

Comment 17 Jeff Layton 2007-02-19 15:07:36 UTC
Steve, what do you think is the right approach here?

Comment 22 Jeff Layton 2007-02-20 18:06:01 UTC
Steve pointed out that fixing this by changing the headers may be problematic
for any 3rd party apps that are linking to librpcsecgss.

OTOH, we only started providing that shared lib in U4, so there probably aren't
a lot of apps that link to it, and those apps probably also link in glibc and
may be having similar issues to the problem that nfs-utils is having.


Comment 23 Steve Dickson 2007-02-20 22:07:21 UTC
Created attachment 148447 [details]
a patch that eliminates the clnt_create

Comment 25 Steve Dickson 2007-02-21 14:09:02 UTC
Created attachment 148480 [details]
This patch removes the inclusion of any glibc routines from librpcsecgss

Comment 26 Jeff Layton 2007-02-21 15:24:35 UTC
Created attachment 148490 [details]
specfile patch for patch 148480

Thanks, Steve -- that looks good to me. rpcgssd seems to start up and work
correctly with it, and rpc.svcgssd seems to start up (haven't tested it for
functionality). That seems better than my earlier approach which left the
duplicate code in place, but removed the headers.

This specfile patch should make it build correctly, I think.

Comment 28 Jay Turner 2007-02-21 18:43:28 UTC
QE ack for 4.5 . . . will need testing from IBM,


Comment 29 Jeff Layton 2007-02-21 18:50:33 UTC
Created attachment 148525 [details]
patch to make gssd include auth_gss.h

Now that we're #including libc's rpc.h header, we're not pulling in the
definitions in auth_gss.h. This patch makes gssd_proc.c include it directly and
allows nfs-utils to build against the patched nfs-utils-lib.

Comment 31 Jeff Layton 2007-02-23 08:03:05 UTC
In the end, we've decided to go with the patch in comment #14. It doesn't fix
all the potential problems with this library and its headers, but it does fix
the immediate issue with a minimal set of changes.

I've done some cursory testing and this cures the problem where x86_64 is unable
to mount a krb5 mount, but haven't tested much beyond that.


Comment 40 Red Hat Bugzilla 2007-05-01 23:21:25 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0217.html