Bug 1283341
Summary: | cannot mount RHEL7 NFS server with nfsvers=4.1,sec=krb5 but nfsvers=4.0,sec=krb5 works | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | James Ralston <ralston> |
Component: | kernel | Assignee: | J. Bruce Fields <bfields> |
kernel sub component: | NFS | QA Contact: | JianHong Yin <jiyin> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | unspecified | | |
Priority: | unspecified | CC: | bfields, chunwang, cmaiolin, eguan, fs-qe, gkulkarn, jiyin, kfiresmith, ralston, smayhew, ssorce, steved, troels, yoyang |
Version: | 7.1 | | |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | kernel-3.10.0-337.el7 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-11-03 14:15:03 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Attachments: | | | |
Description
James Ralston
2015-11-18 18:43:03 UTC
(In reply to James Ralston from comment #0)
> However, if I specify NFS version 4.1, the mount command hangs:
>
> $ mount -t nfs -o nfsvers=4.1,sec=krb5 server.example.org:/ /mnt
>
> Running tcpdump reveals this interaction between the client and the server:
>
> C: EXCHANGE_ID
> S: EXCHANGE_ID: NFS4_OK (0)
> C: CREATE_SESSION
> S: CREATE_SESSION: NFS4ERR_WRONG_CRED (10082)

Looking at fs/nfsd/nfs4state.c:nfsd4_exchange_id(): this means mach_creds_match failed, so I think either:

- EXCHANGE_ID used SP4_MACH_CRED protection, but CREATE_SESSION is using only RPC_GSS_SVC_NONE (not integrity or privacy), or
- CREATE_SESSION and EXCHANGE_ID came from different krb5 principals.

The former seems more likely, and can be checked by using wireshark to look at the EXCHANGE_ID arguments (to see if SP4_MACH_CRED is set) and to look at the rpc header of the two calls (look in the cred and see what the gss service is set to).

> (...I can't find any official Red Hat documentation that claims RHEL7
> supports NFSv4.2 servers.)

That should probably be fixed. I expect 4.2 to work at this point.

Greetings Red Hat folks, I have created a customer support case for this bug: https://access.redhat.com/support/cases/#/case/01541617.

The RPC headers of the EXCHANGE_ID and CREATE_SESSION calls are identical:

Frame 18: 410 bytes on wire (3280 bits), 410 bytes captured (3280 bits)
Remote Procedure Call, Type:Call XID:0x84489b7f
    Credentials
        Flavor: RPCSEC_GSS (6)
        Length: 28
        GSS Version: 1
        GSS Procedure: RPCSEC_GSS_DATA (0)
        GSS Sequence Number: 1
        GSS Service: rpcsec_gss_svc_integrity (2)
        GSS Context
            GSS Context Length: 8
            GSS Context: 0100000000000000
            [Created in frame: 13]
Network File System
    GSS Data, Ops(1): EXCHANGE_ID
        Operations (count: 1): EXCHANGE_ID
            Opcode: EXCHANGE_ID (42)
                eia_state_protect: SP4_MACH_CRED (1)

Frame 21: 338 bytes on wire (2704 bits), 338 bytes captured (2704 bits)
Remote Procedure Call, Type:Call XID:0x85489b7f
    Credentials
        Flavor: RPCSEC_GSS (6)
        Length: 28
        GSS Version: 1
        GSS Procedure: RPCSEC_GSS_DATA (0)
        GSS Sequence Number: 2
        GSS Service: rpcsec_gss_svc_integrity (2)
        GSS Context
            GSS Context Length: 8
            GSS Context: 0100000000000000
            [Created in frame: 13]
Network File System
    GSS Data, Ops(1): CREATE_SESSION
        Operations (count: 1): CREATE_SESSION
            Opcode: CREATE_SESSION (43)

The only place where "SP4_MACH_CRED" appears in the capture is in the EXCHANGE_ID calls (the "eia_state_protect: SP4_MACH_CRED (1)" line); it does not appear in the CREATE_SESSION calls.

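The comparison described in this comment can be expressed as a small check. The following is a minimal C sketch, not kernel or Wireshark code: the struct `gss_cred_hdr` and the helpers `integrity_protected` and `same_context` are hypothetical names that only model the credential fields shown in the capture, and the service numbers are the ones RFC 2203 assigns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* RPCSEC_GSS service levels, numbered as in RFC 2203. */
enum gss_service {
    GSS_SVC_NONE      = 1,
    GSS_SVC_INTEGRITY = 2,
    GSS_SVC_PRIVACY   = 3
};

/* Hypothetical summary of the credential fields the dissector prints. */
struct gss_cred_hdr {
    uint32_t         version;   /* "GSS Version" */
    uint32_t         seq;       /* "GSS Sequence Number" */
    enum gss_service service;   /* "GSS Service" */
    uint8_t          ctx[8];    /* "GSS Context" handle bytes */
    size_t           ctx_len;   /* "GSS Context Length" */
};

/* True when the call is protected at integrity level or better. */
static bool integrity_protected(const struct gss_cred_hdr *h)
{
    assert(h != NULL);
    return h->service == GSS_SVC_INTEGRITY || h->service == GSS_SVC_PRIVACY;
}

/* True when both calls were sent under the same GSS context handle. */
static bool same_context(const struct gss_cred_hdr *a,
                         const struct gss_cred_hdr *b)
{
    assert(a != NULL && b != NULL);
    return a->ctx_len == b->ctx_len &&
           memcmp(a->ctx, b->ctx, a->ctx_len) == 0;
}
```

Filled with the values from frames 18 and 21 (service 2, context 0100000000000000), both helpers return true for the two calls, which matches the observation that the service level and context handle do not differ between EXCHANGE_ID and CREATE_SESSION.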
I'm not sure how the EXCHANGE_ID and CREATE_SESSION calls would be using different krb5 principals, but since we create the /etc/krb5.keytab file by joining hosts to AD via Samba "net ads join", the hosts have multiple SPNs:

$ klist -k -e
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   2 host/nfsclient.example.org.ORG (des-cbc-crc)
   2 host/nfsclient.example.org.ORG (des-cbc-md5)
   2 host/nfsclient.example.org.ORG (aes128-cts-hmac-sha1-96)
   2 host/nfsclient.example.org.ORG (aes256-cts-hmac-sha1-96)
   2 host/nfsclient.example.org.ORG (arcfour-hmac)
   2 host/NFSCLIENT.ORG (des-cbc-crc)
   2 host/NFSCLIENT.ORG (des-cbc-md5)
   2 host/NFSCLIENT.ORG (aes128-cts-hmac-sha1-96)
   2 host/NFSCLIENT.ORG (aes256-cts-hmac-sha1-96)
   2 host/NFSCLIENT.ORG (arcfour-hmac)
   2 NFSCLIENT$@AD.EXAMPLE.ORG (des-cbc-crc)
   2 NFSCLIENT$@AD.EXAMPLE.ORG (des-cbc-md5)
   2 NFSCLIENT$@AD.EXAMPLE.ORG (aes128-cts-hmac-sha1-96)
   2 NFSCLIENT$@AD.EXAMPLE.ORG (aes256-cts-hmac-sha1-96)
   2 NFSCLIENT$@AD.EXAMPLE.ORG (arcfour-hmac)
   2 nfs/nfsclient.example.org.ORG (des-cbc-crc)
   2 nfs/nfsclient.example.org.ORG (des-cbc-md5)
   2 nfs/nfsclient.example.org.ORG (aes128-cts-hmac-sha1-96)
   2 nfs/nfsclient.example.org.ORG (aes256-cts-hmac-sha1-96)
   2 nfs/nfsclient.example.org.ORG (arcfour-hmac)
   2 nfs/NFSCLIENT.ORG (des-cbc-crc)
   2 nfs/NFSCLIENT.ORG (des-cbc-md5)
   2 nfs/NFSCLIENT.ORG (aes128-cts-hmac-sha1-96)
   2 nfs/NFSCLIENT.ORG (aes256-cts-hmac-sha1-96)
   2 nfs/NFSCLIENT.ORG (arcfour-hmac)

If it would be helpful, I can provide the raw tcpdump capture (preferably in a more private channel, like the support case we opened).

(In reply to James Ralston from comment #4)
> The RPC headers of the EXCHANGE_ID and CREATE_SESSION calls are
> identical:

Thanks, yes, the client behavior looks correct there, and there's nothing there that I would expect to make mach_creds_match() return false.

> The only place where "SP4_MACH_CRED" appears in the capture is in the
> EXCHANGE_ID calls (the "eia_state_protect: SP4_MACH_CRED (1)" line); it does
> not appear in the CREATE_SESSION calls.

That makes sense, thanks.

> I'm not sure how the EXCHANGE_ID and CREATE_SESSION calls would be using
> different krb5 principals,

The context handle is the same for both, so the principals can't be different.

> If it would be helpful, I can provide the raw tcpdump capture (preferably in
> a more private channel, like the support case we opened).

I think the information above is exactly what was needed, thanks!

So the most likely explanation to me is that cr_principal is NULL here. That should have been set by code in gssp_accept_sec_context_upcall that handles the reply from gss-proxy, but perhaps something's going wrong there.

I'll have to think about how to debug that. Worst case, perhaps we can get you a test kernel with some printk()s there.

If it would make it easier, since I'm a Fedora packager, if you attach a patch that adds the debugging you want, I can easily roll a local kernel RPM with the patch and install it on our NFS server.

Created attachment 1096838 [details]
debugging printk's

(In reply to James Ralston from comment #6)
> If it would make it easier, since I'm a Fedora packager, if you attach a
> patch that adds the debugging you want, I can easily roll a local kernel RPM
> with the patch and install it on our NFS server.

Oh, that would make it very easy, thanks; attached.

Unless I'm overlooking something obvious (always possible), gss-proxy is either failing to pass down the principal name, or there's something unusual about the name. This should help confirm that.

These are just ordinary printk's, so should go to the system logs unconditionally, but it may be useful to also turn on some more debugging, probably "rpcdebug -m nfsd -s proc" and "rpcdebug -m rpc -s auth".

Wow, kernel builds take a long time nowadays.
(It's been years since I've had cause to re-roll the kernel RPM.)

OK. I believe the problem is that gss-proxy is failing to pass down the principal name:

Nov 19 18:02:27 nfsserver kernel: mach_creds_match failure: ffff88007b666158 has no principal

I'll attach the log output from an NFSv4.0 mount request (which succeeds), and an NFSv4.1 mount request (which kept failing/retrying until I interrupted it). I've obfuscated the hostnames and our realm name, but have not otherwise modified the output.

Created attachment 1096953 [details]
log messages on NFS server when processing NFSv4.0 mount request (which succeeds)
Created attachment 1096954 [details]
log messages on NFS server when processing NFSv4.1 mount request (which fails)
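The "has no principal" log line earlier in this thread is consistent with a machine-credential check that rejects any request whose principal is unset. Below is a simplified userspace model of such a check in C, assuming it behaves as the comments here describe (fail when the request lacks integrity protection, when a principal is missing, or when the principals differ). The types `model_client` and `model_cred` and the function `model_mach_creds_match` are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the credential attached to one request. */
struct model_cred {
    const char *principal;  /* NULL models "cr_principal unset" */
    bool        integrity;  /* request used integrity or privacy */
};

/* Hypothetical stand-in for the state recorded at EXCHANGE_ID time. */
struct model_client {
    bool        mach_cred;  /* client asked for SP4_MACH_CRED */
    const char *principal;  /* principal seen on EXCHANGE_ID */
};

/* Returns true when the request may act on the client's state. */
static bool model_mach_creds_match(const struct model_client *cl,
                                   const struct model_cred *cr)
{
    assert(cl != NULL && cr != NULL);
    if (!cl->mach_cred)
        return true;    /* no machine-credential protection requested */
    if (!cr->integrity)
        return false;   /* protection requires integrity or privacy */
    if (cl->principal == NULL || cr->principal == NULL)
        return false;   /* the "has no principal" case */
    return strcmp(cl->principal, cr->principal) == 0;
}
```

In this model, a CREATE_SESSION that arrives with integrity protection but with no stored principal is refused, which is the behavior the NFSv4.1 client sees as NFS4ERR_WRONG_CRED.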
One additional thing: even though rpc.svcgssd is deprecated in favor of gssproxy, one of the things I did try was having the NFS server use rpc.svcgssd instead of gssproxy.

Although I don't have a tcpdump from that attempt, the NFSv4.1 mount attempt hung on the client, the same as when using gssproxy on the server.

If you think it would be useful, now that we're running a kernel on the NFS server with more debugging, I could try using rpc.svcgssd again.

Oh, I see:

Nov 19 18:02:27 nfsserver kernel: name not a service principal
Nov 19 18:02:27 nfsserver kernel: ffff88004e8dbca8->cr_principal unset:
Nov 19 18:02:27 nfsserver kernel: found_creds = 1, name = NFSCLIENT$@AD.EXAMPLE.ORG

and the code here is:

    c = strchr(data->creds.cr_principal, '@');
    if (c) {
        *c = '\0';

        /* change service-hostname delimiter */
        c = strchr(data->creds.cr_principal, '/');
        if (c)
            *c = '@';
    }
    if (!c) {
        printk("name not a service principal\n");
        ...

So gss-proxy is giving us a principal name, but when we can't find a "/" in it we give up and set cr_principal to NULL. Upstream code is the same.

Ugh. I think this was needed for the 4.0 callback case, where the server needs to be able to initiate a client back to the client. (Not necessary any more in the 4.1 case.) Cc'ing Simo in case he remembers any details there. I can't remember if we've run across this before.

Anyway actually I think there's a simple fix--just store the unprocessed name in cr_principal, and remove the ugly logic above from the gss-proxy code and instead put it in fs/nfsd/nfs4callback.c:setup_callback_client() in the 4.0 case which is the place it's needed, and then the 4.1 code will no longer have to care about this 4.0 quirk.

(In reply to James Ralston from comment #11)
> One additional thing: even though rpc.svcgssd is deprecated in favor of
> gssproxy, one of the things I did try was having the NFS server use
> rpc.svcgssd instead of gssproxy.
>
> Although I don't have a tcpdump from that attempt, the NFSv4.1 mount attempt
> hung on the client, the same as when using gssproxy on the server.
>
> If you think it would be useful, now that we're running a kernel on the NFS
> server with more debugging, I could try using rpc.svcgssd again.

I suspect fixing that would require modifying the kernel and svcgssd and their interface, so it's probably not worth it--we should just tell people they need gss-proxy in that case. svcgssd's already unreliable in the AD case due to some unfortunate size limits.

(In reply to J. Bruce Fields from comment #12)
> Anyway actually I think there's a simple fix--just store the unprocessed
> name in cr_principal, and remove the ugly logic above from the gss-proxy
> code and instead put it in fs/nfsd/nfs4callback.c:setup_callback_client() in
> the 4.0 case which is the place it's needed, and then the 4.1 code will no
> longer have to care about this 4.0 quirk.

That seems reasonable to me. If you provide the patches, I'll build local kernel/gssproxy packages with them and give them a whirl.

(The only hosts where we'll need to use these custom kernel/gssproxy patches are the NFS servers, which is only one or two hosts. While we try hard to avoid spinning our own RPMs unless absolutely necessary, I think this is one of those cases where it's worth it, because we're not going to have a working Linux NFSv4.1 server (with sec=krb5 and Microsoft AD) without it.)

Besides, we'll only have to maintain the local kernel/gssproxy spins until 7.3 or 7.4 (in the worst case). We can live with that.

(In reply to J. Bruce Fields from comment #13)
> I suspect fixing that would require modifying the kernel and svcgssd and
> their interface, so it's probably not worth it--we should just tell people
> they need gss-proxy in that case. svcgssd's already unreliable in the AD
> case due to some unfortunate size limits.

Well, I was trying to isolate whether the problem was with the kernel, or with gssproxy.
If rpc.svcgssd had worked, then I could have pinned down the problem to either gssproxy, or an interaction between the kernel and gssproxy. But since rpc.svcgssd has issues with the AD case, when it failed as well, I couldn't know whether it was because the bug was in the kernel instead of gssproxy, or because rpc.svcgssd was choking on our AD credentials.

Since rpc.svcgssd is legacy code at this point, I'd much rather spend cycles working on the kernel/gssproxy interaction.

Created attachment 1097471 [details]
test patch to fix principal used in mach_cred comparisons

(In reply to James Ralston from comment #14)
> (In reply to J. Bruce Fields from comment #12)
> > Anyway actually I think there's a simple fix--just store the unprocessed
> > name in cr_principal, and remove the ugly logic above from the gss-proxy
> > code and instead put it in fs/nfsd/nfs4callback.c:setup_callback_client() in
> > the 4.0 case which is the place it's needed, and then the 4.1 code will no
> > longer have to care about this 4.0 quirk.

OK, that doesn't quite work as cr_principal is used elsewhere. And turns out there's a few other minor fixes needed here. Attached is a first draft. It passes my regression tests, but I haven't tried to work out a test to cover your case.

> That seems reasonable to me. If you provide the patches, I'll build local
> kernel/gssproxy packages with them and give them a whirl.

Thanks! You could try the attached, or if you wait I'll likely have an updated version Monday or Tuesday. Any testing results welcomed.

> (The only hosts where we'll need to use these custom kernel/gssproxy patches
> are the NFS servers, which is only one or two hosts. While we try hard to
> avoid spinning our own RPMs unless absolutely necessary, I think this is one
> of those cases where it's worth it, because we're not going to have a
> working Linux NFSv4.1 server (with sec=krb5 and Microsoft AD) without it.)
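The principal-name rewriting quoted from comment #12 can be tried outside the kernel. The following is a minimal userspace C sketch of the same two `strchr` steps, assuming the snippet quoted in the thread is representative; the function name `to_service_hostname` and the sample principal `nfs/nfsclient.example.org@AD.EXAMPLE.ORG` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Rewrite "service/host@REALM" in place to "service@host".
 * Returns false when the name has no '@' or no '/', for example an
 * AD machine account of the form "HOST$@REALM". As in the quoted
 * snippet, the realm has already been cut off by the time a missing
 * '/' is detected.
 */
static bool to_service_hostname(char *name)
{
    char *c;

    assert(name != NULL);
    c = strchr(name, '@');
    if (c == NULL)
        return false;
    *c = '\0';                  /* drop the realm */

    c = strchr(name, '/');      /* service-hostname delimiter */
    if (c == NULL)
        return false;           /* "name not a service principal" */
    *c = '@';
    return true;
}
```

A service principal such as the hypothetical one above is converted to `nfs@nfsclient.example.org`, while the machine-account name from the log (`NFSCLIENT$@AD.EXAMPLE.ORG`) contains no "/" and is rejected. In the thread, that rejection is what left cr_principal unset for the NFSv4.1 mount.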
We'll work on getting this into RHEL once we've got it right; we definitely want the krb5/AD/4.1 (or 4.2) combination to work out of the box.

Success! With the patch (attachment ID 1097471) applied to the NFS server, krb5/AD/4.1 and krb5/AD/4.2 mounts succeed now on clients.

Once mounted, users automatically acquire the nfs/nfsserver.example.org.ORG credentials when they walk into the mountpoint, and permissions and ACLs appear to be correctly enforced.

We'll be happy to test any revisions to the patch. In the meantime, please let me know if there are specific activities you want us to test.

Thanks!

(In reply to James Ralston from comment #17)
> Success! With the patch (attachment ID 1097471) applied to the NFS server,
> krb5/AD/4.1 and krb5/AD/4.2 mounts succeed now on clients.

Great, thanks for testing!

Upstream posting:

http://lkml.kernel.org/r/1448385497-23737-1-git-send-email-bfields@redhat.com

Somebody ran across this problem in Fedora; mind if I open it to the public?

(In reply to J. Bruce Fields from comment #20)
> Great, thanks for testing!

No problem; I'm glad we were able to help track down the bug.

> Upstream posting:
>
> http://lkml.kernel.org/r/1448385497-23737-1-git-send-email-bfields@redhat.com

Looks good!

BTW, I'm pretty sure I know the answer to this already, but there's no hope for RHEL6, is there? Fixing the bug for svcgssd case wouldn't really help, as svcgssd's limitations make AD KRB5 intractable, and even if the gssproxy kernel code could be backported to the 2.6.32 kernel series, gssproxy itself wants a more recent version of krb5 than what RHEL6 provides...

(In reply to J. Bruce Fields from comment #21)
> Somebody ran across this problem in Fedora; mind if I open it to the public?

No objections. (I redacted the log snippets explicitly for that possibility.)

(In reply to James Ralston from comment #23)
> BTW, I'm pretty sure I know the answer to this already, but there's no hope
> for RHEL6, is there?
> Fixing the bug for svcgssd case wouldn't really help,
> as svcgssd's limitations make AD KRB5 intractable, and even if the gssproxy
> kernel code could be backported to the 2.6.32 kernel series, gssproxy itself
> wants a more recent version of krb5 than what RHEL6 provides...

Yes. Also, we don't recommend server-side 4.1 in RHEL6.

(In reply to J. Bruce Fields from comment #20)
> Upstream posting:
>
> http://lkml.kernel.org/r/1448385497-23737-1-git-send-email-bfields@redhat.com

The upstream diffs appear to be corrupted. See here:

http://marc.info/?l=linux-nfs&m=144838550825481&q=raw

Look at line 15:

@@ -55,6 +55,7 @@ svc_authenticate(struct svc_rqst *rqstp, __be32 *authp)

Maybe the MARC archiver corrupted it, but I could've sworn I've pulled patches successfully from their "raw" version of posts before.

Regardless, any chance you could attach your final patch to this bug? Because I need to rebuild locally for 327.3.1, and I'd like to do it as closely as possible to what the upstream fix will be.

Thanks!

(In reply to James Ralston from comment #29)
> (In reply to J. Bruce Fields from comment #20)
> > Upstream posting:
> >
> > http://lkml.kernel.org/r/1448385497-23737-1-git-send-email-bfields@redhat.com
>
> The upstream diffs appear to be corrupted. See here:
>
> http://marc.info/?l=linux-nfs&m=144838550825481&q=raw
>
> Look at line 15:
>
> @@ -55,6 +55,7 @@ svc_authenticate(struct svc_rqst *rqstp, __be32 *authp)

What's wrong? Looks OK to me.

> Regardless, any chance you could attach your final patch to this bug?
> Because I need to rebuild locally for 327.3.1, and I'd like to do it as
> closely as possible to what the upstream fix will be.

Would git be OK? What you want is at the current tip of my for-4.5 branch at git://linux-nfs.org/~bfields/linux.git (414ca017a54d "nfsd4: fix gss-proxy 4.1 mounts for some AD principals" and the preceding four patches).

(In reply to J. Bruce Fields from comment #30)
> What's wrong? Looks OK to me.

Hmmm...
that's the line "git apply" was barking about, but you're right; excluding the git-style diff (which the RHEL7 patch should be able to handle), I don't see what's wrong with it.

> Would git be OK? What you want is at the current tip of my for-4.5 branch
> at git://linux-nfs.org/~bfields/linux.git (414ca017a54d "nfsd4: fix
> gss-proxy 4.1 mounts for some AD principals" and the preceding four patches).

Ah; even better: now I don't have to figure out where my screen-scrape is going south. Thanks.

BTW, is there a timeframe for getting this patch into the RHEL7 kernel yet? (If it's going to have to wait for a point release, I'm hoping it can at least hit 7.3, instead of having to wait for 7.4...)

(In reply to James Ralston from comment #31)
> (If it's going to have to wait for a point release, I'm hoping it can at
> least hit 7.3

Looks like 7.3, yes.

Patch(es) available on kernel-3.10.0-337.el7

Please, is there a probable date for the errata to be available?

(In reply to James Ralston from comment #0)

Hi, James,

I am working on reproducing this problem, but unfortunately it may be hard for me to configure a Kerberos realm exactly like yours. To dig deeper, I would appreciate it if you could describe your configuration:

1. For this Kerberos realm, the version of Windows Server and the Domain Functional Level of the AD DC
2. The method used to integrate the RHEL hosts with AD (samba+winbind, adcli+sssd, or something else)

Thanks.

1. Server 2008 R2 Enterprise, domain function level Server 2008 R2

2. samba to do the join (via "net ads join"), sssd for integration:

    access_provider = ad
    auth_provider = ad
    chpass_provider = ad
    id_provider = ad

(In reply to James Ralston from comment #39)

Thanks James, I was working hard on performing this job with Server 2012R2 and realmd, but ever failed on mounting. I will switch to the same OS and Functional Level as yours to try to reproduce it. Thanks for your information, which is of vital importance.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2574.html