Description of problem:
If a RHEL 5.5 server is set up as a kerberized NFS (sec=krb5) server in conjunction with Active Directory, users with large GSSAPI tokens cannot access the file shares, and whenever they attempt to, they hang the NFS daemons for 30 seconds.
Note that there are two problems with this:
- Users in lots of security groups cannot access the file server;
- A malicious user can set up their user ID to satisfy the conditions, and launch a denial of service attack on the file server.
When a client accesses an NFS share that uses krb5 for the first time, it makes an NFS NULL call to the server to establish a GSSAPI security context. If the GSSAPI token in this NULL call (the part past the 64-byte RPC header) is 2052 bytes or larger, the NFS daemons on the server go into a busy-wait loop until they time out (I observed 3 daemons consuming CPU). On my dual-CPU test machine, a single call consumes 50% of total CPU for 30 seconds (the NFS processes were using 100% of one of the CPUs). With a GSSAPI token of 2044 bytes or fewer, the NFS NULL call succeeds and establishes a valid GSSAPI security context (I did not set up a test for 2048 bytes). This happens with both NFSv3 and NFSv4 (the issue appears to be in net/sunrpc in the kernel tree).
Using MIT or Heimdal servers, you won't get these large GSSAPI tokens in the initial NFS NULL RPC call. With Active Directory, however, the ticket contains an AuthorizationData field with additional information, including an array of the groups the user is in. For my test user, it took 113 group memberships to push the GSSAPI token to 2052 bytes.
It is not uncommon in large organizations for users to be in enough groups to trip this condition (the count includes all nested groups, not just top-level groups).
Version-Release number of selected component (if applicable):
2.6.18-194.8.1.el5 #1 SMP Wed Jun 23 10:58:38 EDT 2010 i686 i686 i386 GNU/Linux
I ran a yum update to bring all components up to date just before submitting this bug.
How reproducible:
Using a user in lots of Active Directory groups, attempt to access a previously-mounted NFS share that uses krb5 for security (either v3 or v4).
Steps to Reproduce:
1. In Active Directory, configure a user with ~115 group memberships. The key is to make the GSSAPI token 2052 bytes or larger (it is unknown whether this also happens at 2048).
2. Set up a RHEL5.5 system that shares NFS with krb5 security, using an Active Directory master Kerberos server.
3. Grant access to the file share to the user created in step 1
4. Export the filesystems (note that I tried a variety of combinations of secure/root_squash, and they didn't make a difference for this bug)
contents of /etc/exports:
(/home is bind mounted to /export/NFS4)
5. On a client machine, mount the NFS share with sec=krb5, using either of the following (-t nfs or -t nfs4):
    mount -t nfs -o sec=krb5 server.uiuc.edu:/home /mnt
    mount -t nfs4 -o sec=krb5 server.uiuc.edu:/ /mnt
6. kinit as that user on a client machine (I used a Fedora 11 client).
7. List the shared directory (e.g., "ls /mnt").
Actual results:
CPU will spike to ~50% on the server for 30 seconds, and the client will appear to hang. After 30 seconds, the client will return with "permission denied", and CPU will return to normal on the server.
Expected results:
A listing of the directory /mnt. This is what happens if the GSSAPI token is 2044 bytes or smaller (the behavior between 2045 and 2051 bytes is unknown).
Please contact me for more information. I have done a lot of debugging and tracing on this, since it is a blocking issue for implementing a new fileserver in our environment.
I have traced it to the server handling of the call, rather than the formation of the call on the client side.
I don't know the *exact* location of the problem, but I've narrowed it down to a couple of calls in net/sunrpc and net/sunrpc/auth_gss (using rpcdebug, tcpdump/wireshark, a debug version of librpcsecgss on the client, and a lot of code inspection). I just don't have the expertise to trace the RPC calls through the kernel any deeper.
This bug exists in the newest mainline kernel (2.6.35) as well.
When sunrpc formats the upcall to the user-space daemon (svcgssd), it does so in ASCII. This is done in a function called qword_addhex() in net/sunrpc/cache.c. The buffer passed in to qword_addhex() is PAGE_SIZE bytes, which on my kernel is 4096. Each byte encoded takes two bytes in ASCII, so a 4096-byte buffer can hold no more than 2048 encoded bytes. With a token that does not fit, it tries to encode the upcall, fails, and returns a -1 value.
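To make the size limit concrete, here is a minimal Python sketch of hex-encoding fields into a fixed-size buffer. It is a simplified model, not the kernel code: the name encode_upcall, the "\x" prefix, the 3 bytes of per-field overhead, and the two-field layout are assumptions made for illustration. Only the 4096-byte buffer and the cost of two ASCII bytes per encoded byte come from the description above.

```python
PAGE_SIZE = 4096      # upcall buffer size reported above
FIELD_OVERHEAD = 3    # assumption: a "\x" prefix plus a one-byte separator


def encode_upcall(fields, buf_size=PAGE_SIZE):
    """Hex-encode each field into one line of at most buf_size bytes.

    Each field costs 2 ASCII bytes per input byte plus FIELD_OVERHEAD.
    Returns the encoded string, or -1 if the fields do not fit.
    """
    remaining = buf_size
    parts = []
    for field in fields:
        need = 2 * len(field) + FIELD_OVERHEAD
        if need > remaining:
            return -1  # encoding failed: buffer too small
        parts.append("\\x" + field.hex() + " ")
        remaining -= need
    return "".join(parts)


# Hypothetical layout: an empty context handle followed by the token.
small = encode_upcall([b"", bytes(2044)])
large = encode_upcall([b"", bytes(2052)])
print("2044-byte token:", "fits" if small != -1 else "does not fit")
print("2052-byte token:", "fits" if large != -1 else "does not fit")
```

Under these assumptions the 2044-byte token fits and the 2052-byte token does not, which is consistent with what I observed on the wire. The exact cutoff depends on the real per-field overhead.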
As a result, each nfsd process retries the RPC request (re-checking the cache) in a tight loop, without pause, until it times out.
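The cost of retrying without pause can be illustrated with a toy simulation. This is not the kernel's logic and not a proposed fix; the function names, the 1 ms cost per attempt, and the 999 ms pause are all hypothetical. Only the 30-second timeout comes from the observations above. Time is simulated with integer milliseconds, so the script finishes immediately.

```python
def serve(lookup, timeout_ms=30_000, attempt_ms=1, pause_ms=0):
    """Retry lookup() until it succeeds or timeout_ms of simulated time elapse.

    Returns (succeeded, attempts, cpu_ms).
    """
    now_ms = 0
    cpu_ms = 0
    attempts = 0
    while now_ms < timeout_ms:
        attempts += 1
        now_ms += attempt_ms
        cpu_ms += attempt_ms   # each attempt burns CPU time
        if lookup():
            return True, attempts, cpu_ms
        now_ms += pause_ms     # a pause costs wall-clock time, not CPU
    return False, attempts, cpu_ms


def always_fails():
    # Stands in for an upcall that can never be encoded.
    return False


no_pause = serve(always_fails)
with_pause = serve(always_fails, pause_ms=999)
print("no pause:  ", no_pause)
print("with pause:", with_pause)
```

In this model, the loop without a pause spends the entire 30 seconds on CPU, while the paused variant makes 30 attempts and is otherwise idle. Either way the request still fails at the timeout, because the lookup can never succeed.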
The net effect is that the current system silently fails (and hangs the NFS server for 30 seconds) if the GSSAPI token is 2048 bytes or larger.
Note that this is one of the things that should be fixed by the switch to gss-proxy; the kernel code for it will probably land in 3.10.
So the general workaround I've used for this against AD is to set userAccountControl such that the PAC is not included in the ticket (NO_AUTH_DATA_REQUIRED), thus reducing the size and keeping it below this threshold. In general, that seems like a perfectly acceptable workaround.
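For reference, the workaround amounts to setting one bit in the account's userAccountControl value. The Python sketch below only does the bit arithmetic; the helper names and the starting value (a plain normal account, 0x200) are illustrative assumptions, and writing the new value back to Active Directory (for example over LDAP) is not shown.

```python
UF_NO_AUTH_DATA_REQUIRED = 0x2000000  # userAccountControl bit: omit the PAC
UF_NORMAL_ACCOUNT = 0x200             # example starting value (assumption)


def without_pac(user_account_control: int) -> int:
    """Return the value with NO_AUTH_DATA_REQUIRED set, other bits unchanged."""
    return user_account_control | UF_NO_AUTH_DATA_REQUIRED


def pac_disabled(user_account_control: int) -> bool:
    """True if NO_AUTH_DATA_REQUIRED is already set."""
    return bool(user_account_control & UF_NO_AUTH_DATA_REQUIRED)


new_value = without_pac(UF_NORMAL_ACCOUNT)
print(new_value, hex(new_value), pac_disabled(new_value))
```

Using a bitwise OR leaves every other flag on the account untouched, so the change can be reverted by clearing the same bit.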
But I'm not clear what the solution is if other services on the same machine require the PAC to be present to function properly. Samba and winbind seem keen to have the PAC, and misbehave somewhat when it's not present, but I can't believe that NFS and Samba together on a fileserver authenticated against AD is an unusual situation. Any suggestions?
Will this solution landing in 3.10 actually have any bearing on it being backported to RHEL 5 or 6 (or even 7)?
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).