Opening this as a glibc problem, but it may turn out to be an issue with nfs-utils or some related library. I could use some help tracking down the cause if so, however. The problem is this: When a client tries to do a krb5 or krb5i mount, rpc.gssd eventually calls clnt_create like this: printerr(2, "creating %s client for server %s\n", clp->protocol, clp->servername); if ((rpc_clnt = clnt_create(clp->servername, clp->prog, clp->vers, clp->protocol)) == NULL) { snprintf(rpc_errmsg, sizeof(rpc_errmsg), "WARNING: can't create rpc_clnt for server " "%s for user with uid %d", clp->servername, uid); printerr(0, "%s\n", clnt_spcreateerror(rpc_errmsg)); goto out_fail; } On i386, this seems to work fine and the call returns a valid pointer. On x86_64, however, this clnt_create call consistently returns a NULL pointer but with an error message of "Success". The error and debug messages look like this: creating tcp client for server host.foo.redhat.com WARNING: can't create rpc_clnt for server host.foo.redhat.com for user with uid 0: RPC: Success This makes the krb5 context creation and the mount fail. I attempted to write a small reproducer that just does the clnt_create call passing in the same values as the call above. That consistently worked and passed back a valid pointer. So my guess is that this issue seems to be related to something else happening prior to this call getting made. Still, I'm having some real trouble tracking down why I'm getting a NULL pointer here. Since it works on i386, I'm guessing that the problem is related to data type sizes, but I'm not clear on whether this is a bug in glibc or something else we're doing earlier on...
Can you attach strace -s 1024 output of the testcase? Can you under debugger hit breakpoint on the snprintf or printerr and print rpc_createerr ?
Created attachment 148155 [details] strace -s 1024 of rpc.gssd Here's the strace of rpc.gssd. I think the clnt_create happens around line 799. All the system calls seem to be OK as far as I can tell.
Output from gdb session. Let me know if this isn't what you needed: (gdb) b gssd_proc.c:487 Breakpoint 1 at 0x403731: file gssd_proc.c, line 487. (gdb) run -f -vvvvvv Starting program: /usr/sbin/rpc.gssd -f -vvvvvv Using keytab file '/etc/krb5.keytab' Processing keytab entry for principal 'nfs/warthog.gsslab.rdu.redhat.com' We will use this entry (nfs/warthog.gsslab.rdu.redhat.com) Using (machine) credentials cache: 'FILE:/tmp/krb5cc_machine_EXAMPLE.COM' processing client list Program received signal SIG37, Real-time event 37. 0x0000003597abd96f in poll () from /lib64/tls/libc.so.6 (gdb) cont Continuing. processing client list handling krb5 upcall Using keytab file '/etc/krb5.keytab' INFO: Credentials in CC 'FILE:/tmp/krb5cc_machine_EXAMPLE.COM' are good until 1171665540 using FILE:/tmp/krb5cc_machine_EXAMPLE.COM as credentials cache for machine creds creating context using euid 0 (save_uid 0) creating tcp client for server dhcp231-99.rdu.redhat.com Breakpoint 1, create_auth_rpc_client (clp=0x50d1b0, auth_return=0x7fbffff708, uid=0, authtype=Variable "authtype" is not available. ) at gssd_proc.c:487 487 snprintf(rpc_errmsg, sizeof(rpc_errmsg), (gdb) p rpc_createerr $1 = 0
Guess x/10wx rpc_createerr instead of p rpc_createerr would be needed. Anyway, it might be best if you could install glibc-debuginfo and just step through clnttcp_create to see where the call fails. From the strace it seems everything up to __connect succeeded, but then the socket was closed and NULL returned, so this would mean probably that: call_msg.rm_xid = _create_xid (); call_msg.rm_direction = CALL; call_msg.rm_call.cb_rpcvers = RPC_MSG_VERSION; call_msg.rm_call.cb_prog = prog; call_msg.rm_call.cb_vers = vers; /* * pre-serialize the static part of the call msg and stash it away */ INTUSE(xdrmem_create) (&(ct->ct_xdrs), ct->ct_mcall, MCALL_MSG_SIZE, XDR_ENCODE); if (!INTUSE(xdr_callhdr) (&(ct->ct_xdrs), &call_msg)) { if (ct->ct_closeit) { (void) __close (*sockp); } goto fooy; } failed. If you get to the xdr_callhdr call, can you print in gdb call_msg structure? Also, step into it and if it really fails see which field couldn't be encoded?
Ok, I think I see what may be happening: It looks like we're getting into xdr_callhdr, which eventually does this in the big if statement: INTUSE(xdr_u_long) (xdrs, &(cmsg->rm_call.cb_prog)) ...in xdr_u_long, it does this: case XDR_ENCODE: if (sizeof (uint32_t) != sizeof (u_long) && (uint32_t) *ulp != *ulp) return FALSE; ...from gdb in xdr_callhdr: (gdb) p cmsg->ru.RM_cmb.cb_prog $7 = 230175057546 230175057546 == 359780A68A So the second condition in the if statement in xdr_u_long fails and returns FALSE. I need to look over the code a little more closely and see where the cb_prog value is actually coming from.
It looks like the call_msg struct getting passed into xdr_callhdr is being corrupted somehow, but only in the xdr_callhdr frame. Here's some excerpts from gdb: 200 if (! xdr_callhdr(&(ct->ct_xdrs), &call_msg)) { (gdb) p call_msg $13 = {rm_xid = 1172209978, rm_direction = CALL, ru = {RM_cmb = { cb_rpcvers = 2, cb_prog = 100003, cb_vers = 4, cb_proc = 53, cb_cred = { oa_flavor = -1750220896, oa_base = 0x197c32da0 <Address 0x197c32da0 out of bounds>, oa_length = 3221221920}, cb_verf = {oa_flavor = 5306656, oa_base = 0x511310 "", oa_length = 2505562416}}, RM_rmb = { rp_stat = 2, ru = {RP_ar = {ar_verf = {oa_flavor = 4, oa_base = 0x3597adbfa0 "\205�t�\203�\001L\211�t\0021�H\213\\$\b\203�\001H\2113w\035H\213=n\036\025", oa_length = 2546150816}, ar_stat = 3221221920, ru = {AR_versions = {low = 5306656, high = 0}, AR_results = {where = 0x50f920 "tcp", proc = 0x511310}}}, RP_dr = { rj_stat = 4, ru = {RJ_versions = {low = 53, high = 2544746400}, RJ_why = 53}}}}}} (gdb) step Breakpoint 2, xdr_callhdr (xdrs=0x50fc88, cmsg=0x7fbffff1a0) at rpc_prot.c:163 163 { (gdb) p * cmsg $14 = {rm_xid = 1172209978, rm_direction = 2, ru = {RM_cmb = { cb_rpcvers = 227633266692, cb_prog = 230178013088, cb_vers = 6841118112, cb_proc = 548682068512, cb_cred = {oa_flavor = 5306656, oa_base = 0x511310 "", oa_length = 2505562416}, cb_verf = { oa_flavor = 5306656, oa_base = 0x45d5f28d <Address 0x45d5f28d out of bounds>, oa_length = 747838}}, RM_rmb = {rp_stat = 4, ru = {RP_ar = {ar_verf = { oa_flavor = -1750220896, oa_base = 0x197c32da0 <Address 0x197c32da0 out of bounds>, oa_length = 3221221920}, ar_stat = 5306656, ru = {AR_versions = { low = 5313296, high = 182894188848}, AR_results = { where = 0x511310 "", proc = 0x2a9557d930 <_nss_files_getprotobyname_r>}}}, RP_dr = { rj_stat = 2544746400, ru = {RJ_versions = {low = 6841118112, high = 548682068512}, RJ_why = 2546150816}}}}}} (gdb) bt #0 xdr_callhdr (xdrs=0x50fc88, cmsg=0x7fbffff1a0) at rpc_prot.c:166 #1 0x0000003e7a909ce5 in clnttcp_create (raddr=Variable "raddr" is not available. ) at clnt_tcp.c:200 #2 0x0000003e7a9091d8 in clnt_create (hostname=Variable "hostname" is not available. ) at clnt_generic.c:99 #3 0x00000000004035c3 in create_auth_rpc_client (clp=0x50e0f0, auth_return=0x7fbffff708, uid=0, authtype=Variable "authtype" is not available. ) at gssd_proc.c:485 #4 0x00000000004038f1 in handle_krb5_upcall (clp=0x50e0f0) at gssd_proc.c:569 #5 0x00000000004027d6 in gssd_run () at gssd_main_loop.c:76 #6 0x000000000040259b in main (argc=Variable "argc" is not available. ) at gssd.c:141 #7 0x0000003597a1c3fb in __libc_start_main (main=0x4023c0 <main>, argc=3, ubp_av=0x7fbffff9c8, init=0x405450 <__libc_csu_init>, fini=Variable "fini" is not available. ) at ../sysdeps/generic/libc-start.c:209 #8 0x00000000004022fa in _start () #9 0x0000007fbffff9b8 in ?? () #10 0x000000000000001c in ?? () #11 0x0000000000000003 in ?? () #12 0x0000007fbffffbbc in ?? () #13 0x0000007fbffffbcf in ?? () #14 0x0000007fbffffbd2 in ?? () #15 0x0000000000000000 in ?? () (gdb) p cmsg $20 = (struct rpc_msg *) 0x7fbffff1a0 (gdb) f 1 #1 0x0000003e7a909ce5 in clnttcp_create (raddr=Variable "raddr" is not available. ) at clnt_tcp.c:200 200 if (! xdr_callhdr(&(ct->ct_xdrs), &call_msg)) { (gdb) p &call_msg $21 = (struct rpc_msg *) 0x7fbffff1a0 ...so everything looks fine until we enter xdr_callhdr, at which point the cmsg struct looks garbled (at least everything in it starting at rm_direction). It looks like there is some confusion between the two frames about the size of the elements in this struct: (gdb) f 0 #0 xdr_callhdr (xdrs=0x50fc88, cmsg=0x7fbffff1a0) at rpc_prot.c:166 166 cmsg->rm_call.cb_rpcvers = RPC_MSG_VERSION; (gdb) p sizeof (cmsg->ru.RM_cmb.cb_rpcvers) $25 = 8 (gdb) f 1 #1 0x0000003e7a909ce5 in clnttcp_create (raddr=Variable "raddr" is not available. ) at clnt_tcp.c:200 200 if (! xdr_callhdr(&(ct->ct_xdrs), &call_msg)) { (gdb) p sizeof (call_msg.ru.RM_cmb.cb_rpcvers) $26 = 4 ...maybe some problem with includes?
I probably should also post what versions I'm using: nfs-utils-1.0.6-77.EL4 nfs-utils-lib-1.0.6-5 glibc-2.3.4-2.25 ...though I've also tested with glibc-2.3.4-2.36 and the U4 version of nfs-utils.
clnttcp_create is called recursively when we query the portmapper. When we do the portmapper call, however, there seems to be agreement between clnttcp_create and xdr_callhdr about the sizes of these struct members: (gdb) bt #0 xdr_callhdr (xdrs=0x50fc38, cmsg=0x7fbffff060) at rpc_prot.c:163 #1 0x0000003597ae5fb3 in clnttcp_create (raddr=Variable "raddr" is not available. ) at clnt_tcp.c:205 #2 0x0000003597ae7c61 in *__GI_pmap_getport (address=0x7fbffff260, program=100003, version=4, protocol=6) at pm_getport.c:113 #3 0x0000003e7a909db6 in clnttcp_create (raddr=0x7fbffff260, prog=100003, vers=4, sockp=0x7fbffff24c, sendsz=0, recvsz=0) at clnt_tcp.c:145 #4 0x0000003e7a9091d8 in clnt_create (hostname=Variable "hostname" is not available. ) at clnt_generic.c:99 #5 0x00000000004035c3 in create_auth_rpc_client (clp=0x50e0f0, auth_return=0x7fbffff708, uid=0, authtype=Variable "authtype" is not available. ) at gssd_proc.c:485 #6 0x00000000004038f1 in handle_krb5_upcall (clp=0x50e0f0) at gssd_proc.c:569 #7 0x00000000004027d6 in gssd_run () at gssd_main_loop.c:76 #8 0x000000000040259b in main (argc=Variable "argc" is not available. ) at gssd.c:141 #9 0x0000003597a1c3fb in __libc_start_main (main=0x4023c0 <main>, argc=3, ubp_av=0x7fbffff9c8, init=0x405450 <__libc_csu_init>, fini=Variable "fini" is not available. ) (gdb) f 0 #0 xdr_callhdr (xdrs=0x50fc38, cmsg=0x7fbffff060) at rpc_prot.c:163 163 { (gdb) p sizeof (cmsg->ru.RM_cmb.cb_rpcvers) $11 = 8 (gdb) f 1 #1 0x0000003597ae5fb3 in clnttcp_create (raddr=Variable "raddr" is not available. ) at clnt_tcp.c:205 205 if (!INTUSE(xdr_callhdr) (&(ct->ct_xdrs), &call_msg)) (gdb) p sizeof (call_msg.ru.RM_cmb.cb_rpcvers) $12 = 8 ...so whatever this is doesn't seem to be affecting the client created for the portmapper conversation.
My current thinking on this is that this is an nfs-utils bug. I see the following header in the nfs-utils package: support/rpc/include/rpc/rpc_msg.h ...and that header defines this: struct call_body { u_int32_t cb_rpcvers; /* must be equal to two */ u_int32_t cb_prog; u_int32_t cb_vers; u_int32_t cb_proc; struct opaque_auth cb_cred; struct opaque_auth cb_verf; /* protocol specific - provided by client */ }; ...but the glibc headers seem to define these as u_long instead of u_int32_t. There are some other, similar issues in here as well. I tried deleting this file from the source tree and it seemed like the package picked up the glibc header file and built, but I still ran into the same issue. Still, I'm thinking that the problem is likely the headers that are in the nfs-utils tarball. Jakub, thanks for the help. I'll ping you if it comes back to looking like a glibc problem.
The -65.EL package works fine even on a 4.5 install, but the -69 package does not. So something broke between those two releases. Unfortunately, I can't seem to get the -65 package to build on 4.5. So it's not yet clear whether the problem is related to something in the nfs-utils package itself or maybe some header that got pulled in during the build process. Trying to determine that now...
Got -65 to build on my 4.5 machine (pilot error before) and the mount still worked afterward. I'm taking that as confirmation that the problem is something in the nfs-utils package and not anything to do glibc. I also built a -67 package and it worked, so the problem seems to lie somewhere in the changes between -67 and -69.
-68 is also broken. So the issue is the delta between -67 and -68. That seems to be where libnfsidmap, librpcsecgss and libgssapi were broken out into their own packages and the corresponding changes were made to the build process for nfs-utils.
I think the problem may be in librpcsecgss. It looks like that package has a set of headers similar to what's included in the nfs-utils SRPM, and it also looks like it has some global symbols that conflict with libc that might be a factor here. 0000000000000000 T clnt_create 0000000000000000 T clnttcp_create I'll probably try to do a quick fix tomorrow to see if I can confirm whether that's the problem and then see what might be the best way to fix it.
Created attachment 148317 [details] nfs-utils-lib patch -- change all u_int32_t definitions in rpc_msg.h to u_long That is indeed the problem. This patch seems to correct the issue, but the whole situation with duplicate functions and the way the headers are in nfs-utils-lib is a bit of a mess. There are number of symbols that are provided by both libc and librpcsecgss, and their headers don't necessarily match. So this patch seems to fix the immediate problem, but I'm not clear on whether it'll break anything else, or (more likely) if there are similar bugs lurking in nfs-utils/nfs-utils-lib due to the fact that it uses what seems to be older duplicate headers.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Created attachment 148322 [details] patch2: alternate patch -- remove extraneous header files and clean up functions This patch + a specfile patch to remove extraneous header files might be a better way to go. This cleans up function definitions and fixes up the Makefile.am to not include header files that are bad duplicates of glibc headers. There's a later patch that also renames the functions to make sure we don't collide with glibc function names. We should consider that if we want to go this route. Alternately, it may be better to just rebase the lib on RHEL5's version (0.14).
Steve, what do you think is the right approach here?
Steve pointed out that fixing this by changing the headers may be problematic for any 3rd party apps that are linking to librpcsecgss. OTOH, we only started providing that shared lib in U4, so there probably aren't a lot of apps that link to it, and those apps probably also link in glibc and may be having similar issues to the problem that nfs-utils is having.
Created attachment 148447 [details] a patch that eliminates the clnt_create
Created attachment 148480 [details] This patch removes the inclusion of any glibc routines from librpcsecgss
Created attachment 148490 [details] specfile patch for patch 148480 Thanks, Steve -- that looks good to me. rpcgssd seems to start up and work correctly with it, and rpc.svcgssd seems to start up (haven't tested it for functionality). That seems better than my earlier approach which left the duplicate code in place, but removed the headers. This specfile patch should make it build correctly, I think.
QE ack for 4.5 . . . will need testing from IBM,
Created attachment 148525 [details] patch to make gssd include auth_gss.h Now that we're #including libc's rpc.h header, we're not pulling in the definitions in auth_gss.h. This patch makes gssd_proc.c include it directly and allows nfs-utils to build against the patched nfs-utils-lib.
In the end, we've decided to go with the patch in comment #14. It doesn't fix all the potential problems with this library and its headers, but it does fix the immediate issue with a minimal set of changes. I've done some cursory testing and this cures the problem where x86_64 is unable to mount a krb5 mount, but haven't tested much beyond that.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0217.html