Bug 562807

Summary: secure nfs mount sec=krb5 fails in Fedora 12
Product: [Fedora] Fedora Reporter: Michael Young <m.a.young>
Component: libtirpc Assignee: Steve Dickson <steved>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 12 CC: chuck.lever, jlayton, k.georgiou, steved
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: libtirpc-0.2.1-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 619792 Environment:
Last Closed: 2010-05-19 11:50:02 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 619792    
Attachments:
Description                                                            Flags
rpc.gssd log from failed attempt                                       none
patch -- have gssd print rpc_createerr when auth_gss creation fails    none
rpc.gssd log with libtirpc debugging turned on                         none
patch -- allow larger ticket sizes with auth_gss                       none

Description Michael Young 2010-02-08 13:04:22 UTC
If I try an NFS mount of a directory with sec=krb5 in Fedora 12, it fails. I have tried this with the current nfs-utils version (1.2.1-4.f12) and the original release version (1.2.0-18.f12). If, however, I downgrade to the most recent Fedora 11 version (1.2.0-6.f11), it works. I have packet traces, and comparing a working version with a failing one, the difference seems to be in the first packet after the keys have been negotiated. In the working version this is an NFS v3 NULL call with a GSS token attached. In the failing version there is a malformed NFS v3 NULL call packet, which matches the working one up to and including the GSS token length but ends where the GSS token should start.
Since the length is the same, I would guess that rpc.gssd is generating a token, but failing to pass it to the mount process.

Comment 1 Michael Young 2010-02-08 15:58:20 UTC
Going back through the f12 versions (that are still available from koji) 1.2.0-1.f12 works, 1.2.0-5.f12 doesn't.

Comment 2 Michael Young 2010-02-24 16:06:59 UTC
I have done a bit of experimenting and the problem seems to have been introduced when --enable-tirpc was made the default. If I add --disable-tirpc to the spec file of nfs-utils-1.2.1-5.fc12 and build and install an RPM then it works, however the nfs-utils-1.2.1-5.fc12 RPM from updates-testing doesn't.

Comment 3 Jeff Layton 2010-03-02 00:08:30 UTC
I'll test this as soon as I'm able. One question -- are any messages logged to syslog during these mount attempts?

Even better might be to run gssd in the foreground in debug mode and see whether it prints out anything suspicious:

# service rpcgssd stop
# rpc.gssd -f -vvvvv

...attempt the mount in another shell, then kill gssd and copy the output to a file. That might help point out where the problem is.

Comment 4 Jeff Layton 2010-03-02 02:04:43 UTC
So far, this works for me. Client and server are both f12, both using nfs-utils-1.2.1-4.fc12. You'll probably need debug output from gssd to understand what's happening here.

Comment 5 Michael Young 2010-03-02 10:16:01 UTC
Created attachment 397288
rpc.gssd log from failed attempt

Here are the logs (slightly anonymized). I had already looked at this but I didn't think they were very informative.

Comment 6 Jeff Layton 2010-03-02 12:13:12 UTC
This is failing:

        auth = authgss_create_default(rpc_clnt, clp->servicename, &sec);
        if (!auth) {
                /* Our caller should print appropriate message */
                printerr(2, "WARNING: Failed to create %s context for "
                            "user with uid %d for server %s\n",
                        (authtype == AUTHTYPE_KRB5 ? "krb5":"spkm3"),

...though it's not clear to me why it's failing for you and not me. I'll have to look and see what sort of logging we can get out of libtirpc to diagnose this.

Comment 7 Jeff Layton 2010-03-02 13:37:47 UTC
Created attachment 397327
patch -- have gssd print rpc_createerr when auth_gss creation fails

Here's an initial patch that might help point us in the right direction. Tested for compilation only. You'll want to apply this patch to the nfs-utils sources and rebuild gssd (or maybe just build a new package with the patch).

Then, run gssd in foreground debug mode again and reattempt the mount. With luck, we'll get a bit more info when that error message prints. If that doesn't help then we may need to rebuild libtirpc with -DDEBUG and see whether that gives us more info.

Comment 8 Michael Young 2010-03-02 14:35:08 UTC
It returns RPC: Success which still isn't very helpful. Actually that doesn't surprise me as we know why the remote end rejects the call, it receives a malformed packet. The question is why libtirpc is malforming the packet by not attaching the GSS token.

Comment 9 Michael Young 2010-03-02 17:52:26 UTC
Created attachment 397384
rpc.gssd log with libtirpc debugging turned on

This is the log with libtirpc debugging turned on. I have not had a chance to analyze it much yet.

Comment 10 Jeff Layton 2010-03-02 21:02:54 UTC
rpcsec_gss: in authgss_marshal()
rpcsec_gss: xdr_rpc_gss_cred: encode success (v 1, proc 1, seq 0, svc 1, ctx (nil):0)
rpcsec_gss: xdr_rpc_gss_init_args: encode failure (token 0x1992e30:1221)

I have a hunch that I know what this is...

From your logs it looks like you're using AD as a KDC. This is fine, but one thing about AD is that it puts extra authorization info into krb5 tickets (the PAC -- privilege attribute certificate). These tickets can grow to be quite large (on the order of 64k).

xdr_rpc_gss_init_args does this:

        xdr_stat = xdr_bytes(xdrs, (char **)&p->value,
                              (u_int *)&p->length, MAX_NETOBJ_SZ);

...and...

#define MAX_NETOBJ_SZ 1024

I suspect that the tickets from your AD server are larger than 1k and that's causing this to fail. What might be interesting is to increase this value, rebuild tirpc, and see if that works around the problem. A real fix will probably mean inlining the bytes, but we'll need to go over this code carefully to be sure.

Here's what I'd do:

Try a mount, let it fail
stat /tmp/krb5cc_machine_MDS.AD.DUR.AC.UK

...then increase MAX_NETOBJ_SZ to something bigger than the size of the credcache.

I haven't surveyed this code fully, so I don't know whether a really big MAX_NETOBJ_SZ is ok, but it's worth a shot.
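
The suspected failure mode can be sketched in a few lines of C. This is a simplified stand-in, not libtirpc code: the `sketch_*` names and the buffer type are hypothetical. It assumes that the counted-bytes encoder writes the 32-bit length word before it checks the length against the maximum, as Sun-derived xdr_bytes implementations commonly do. That ordering is not confirmed in this bug, but it would be consistent with both the "encode failure (token ...:1221)" log line and the packet from the description that ends right after the token length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an XDR encode stream; not a libtirpc type. */
struct sketch_buf {
    unsigned char *data; /* destination buffer */
    size_t cap;          /* total capacity in bytes */
    size_t pos;          /* bytes written so far (always <= cap) */
};

/* Append a 32-bit big-endian word. Returns 1 on success, 0 if out of room. */
static int sketch_put_u32(struct sketch_buf *b, uint32_t v)
{
    assert(b != NULL && b->data != NULL);
    if (b->cap - b->pos < 4)
        return 0;
    b->data[b->pos++] = (unsigned char)(v >> 24);
    b->data[b->pos++] = (unsigned char)(v >> 16);
    b->data[b->pos++] = (unsigned char)(v >> 8);
    b->data[b->pos++] = (unsigned char)v;
    return 1;
}

/*
 * Encode a counted opaque: length word, payload, zero padding up to a
 * multiple of 4. ASSUMPTION: the length word is emitted before the
 * bound check, so a too-large payload leaves a dangling length behind.
 */
static int sketch_put_bytes(struct sketch_buf *b, const void *src,
                            uint32_t len, uint32_t maxsize)
{
    size_t padded = ((size_t)len + 3u) & ~(size_t)3u;

    if (!sketch_put_u32(b, len))
        return 0;
    if (len > maxsize)
        return 0; /* the length word is already in the buffer */
    if (b->cap - b->pos < padded)
        return 0;
    memcpy(b->data + b->pos, src, len);
    memset(b->data + b->pos + len, 0, padded - len);
    b->pos += padded;
    return 1;
}
```

Under these assumptions, calling `sketch_put_bytes` with a 1221-byte token and a bound of 1024 returns 0 with only the 4-byte length word written, while a bound of 1280 returns 1 and leaves 4 + 1224 bytes in the buffer (1221 rounded up to a multiple of 4).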

Comment 11 Chuck Lever 2010-03-02 21:10:21 UTC
(In reply to comment #10)
> Here's what I'd do:
> 
> Try a mount, let it fail
> stat /tmp/krb5cc_machine_MDS.AD.DUR.AC.UK
> 
> ...then increase MAX_NETOBJ_SZ to something bigger than the size of the
> credcache.
> 
> I haven't surveyed this code fully, so I don't know whether a really big
> MAX_NETOBJ_SZ is ok, but it's worth a shot.    

I haven't looked at this code, but do note that a netobj is a well-known XDR type which is never larger than 1024, so I don't think the value of that constant should be changed.  If the argument being marshalled can be larger than 1024, the use of MAX_NETOBJ_SZ for the maximum size of that particular argument is not appropriate.
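
One way to read this point is that each kind of argument gets its own bound, and MAX_NETOBJ_SZ stays at 1024 for real netobjs. The sketch below only illustrates that idea; it is not the patch attached to this bug. `SKETCH_MAX_GSS_TOKEN_SZ`, the enum, and both functions are hypothetical names, and the 64k figure is simply the rough upper size mentioned in comment 10.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NETOBJ_SZ 1024u            /* value quoted earlier in this bug */
#define SKETCH_MAX_GSS_TOKEN_SZ 65536u /* hypothetical; "on the order of 64k" */

/* Hypothetical tag for the kind of counted-bytes argument being encoded. */
enum sketch_arg_kind { SKETCH_ARG_NETOBJ, SKETCH_ARG_GSS_TOKEN };

/* Pick the bound by argument kind instead of enlarging the netobj limit. */
static uint32_t sketch_max_len(enum sketch_arg_kind kind)
{
    assert(kind == SKETCH_ARG_NETOBJ || kind == SKETCH_ARG_GSS_TOKEN);
    return kind == SKETCH_ARG_GSS_TOKEN ? SKETCH_MAX_GSS_TOKEN_SZ
                                        : MAX_NETOBJ_SZ;
}

/* Returns 1 if a payload of len bytes is within the bound for its kind. */
static int sketch_len_ok(enum sketch_arg_kind kind, uint32_t len)
{
    return len <= sketch_max_len(kind);
}
```

With these definitions, the 1221-byte token from the log is accepted as a GSS token but would still be rejected if treated as a netobj.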

Comment 12 Michael Young 2010-03-03 10:11:29 UTC
I tried increasing MAX_NETOBJ_SZ in two steps. Firstly we know from the logs how big the packet that fails was (1221 bytes) so I increased MAX_NETOBJ_SZ to 1280. That allowed me to mount the filesystem but not to access it. This is because the user tickets seem to be a bit bigger. Thus I increased it further to 1536 and I was then able to access the files. For reference /tmp/krb5cc_machine_MDS.AD.DUR.AC.UK is 2325 bytes and the user krb5cc file 2599 bytes, somewhat larger than the packets actually sent because they contain a krbtgt ticket as well as the ticket for the file server.

Comment 13 Jeff Layton 2010-03-03 12:01:32 UTC
Ok, that's good news. Yep, I knew that we'd have more than one ticket there, but figured you wouldn't need larger than that.

Regarding Chuck's comment -- I'm not planning to propose that as a fix. It was simply a way to check to see whether the problem is what I think it is.

From what I can tell, librpcsecgss inlines the service ticket rather than copying in the bytes, but I need to look over this code more closely and see what the proper fix should be.

Comment 14 Jeff Layton 2010-03-03 19:00:09 UTC
Changing this to a libtirpc bug since that's where the problem seems to be.

Comment 15 Jeff Layton 2010-03-03 20:53:52 UTC
Created attachment 397659
patch -- allow larger ticket sizes with auth_gss

Here's an initial (untested) patch that I think will fix this issue the correct way. It also "backports" a number of other fixes that went into librpcsecgss. Please test this patch if you're able and let me know if it fixes the problem.

Chuck, any comments?

Comment 16 Chuck Lever 2010-03-03 21:15:51 UTC
(In reply to comment #15)
> Chuck, any comments?    

I don't have any immediate objections, but you should have Kevin Coffman review this fix.

Comment 17 Jeff Layton 2010-03-03 21:28:36 UTC
Good idea. If it tests out ok, I'll cc him when I send it out to the list.

Comment 18 Michael Young 2010-03-04 10:47:06 UTC
(In reply to comment #15)
> Created an attachment (id=397659)
> patch -- allow larger ticket sizes with auth_gss
> 
> Here's an initial (untested) patch that I think will fix this issue the correct
> way. It also "backports" a number of other fixes that went into librpcsecgss.
> Please test this patch if you're able and let me know if it fixes the problem.

Yes, with the patch it builds and works for me. I can mount the filesystem and view and write to files and directories within it.

Comment 19 Jeff Layton 2010-03-08 19:28:02 UTC
The patch has been pushed to mainline libtirpc. Reassigning to steved so he can work out how to release the fix.