Bug 618841 - Denial of service attack in Kerberized NFS (v3 and v4) [NEEDINFO]
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: low
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: nfs-maint
QA Contact: Red Hat Kernel QE team
Reported: 2010-07-27 16:46 EDT by Jonathan Manton
Modified: 2014-06-02 09:17 EDT (History)
Doc Type: Bug Fix
Last Closed: 2014-06-02 09:17:19 EDT
pm-rhel: needinfo? (jmanton)


Description Jonathan Manton 2010-07-27 16:46:33 EDT
Description of problem:
If a RHEL 5.5 server is set up as a Kerberized NFS (sec=krb5) server in conjunction with Active Directory, users with large GSSAPI tokens cannot access the file shares, and each access attempt hangs the NFS daemons for 30 seconds.

Note that there are two problems with this:
- Users in lots of security groups cannot access the file server;
- A malicious user can set up their user ID to satisfy the conditions, and launch a denial of service attack on the file server.

When a client accesses an NFS share that uses krb5 for the first time, an NFS NULL call is made to the server to establish a GSSAPI security context.  If the GSSAPI token in this NULL call (past the 64-byte RPC header portion) is 2052 bytes or larger, the NFS daemons on the server go into a busy-wait loop until they time out (I observed 3 daemons consuming CPU).  On my dual-CPU test machine, a single call consumes 50% CPU for 30 seconds (the NFS processes were using 100% of one of the CPUs).  With a GSSAPI token of 2044 bytes or less, the NFS NULL call succeeds and establishes a valid GSSAPI security context (I did not set up a test for 2048 bytes).  This happens in both NFSv3 and v4 (the issue appears to be in net/sunrpc in the kernel tree).

Using MIT or Heimdal servers, you won't get these large GSSAPI tokens in the initial NFS NULL RPC call.  But with Active Directory, the ticket contains an AuthorizationData field with additional information, including an array of the groups the user is in.  For my test user, it took 113 group memberships to push the GSSAPI token to 2052 bytes.

It is not uncommon in large organizations for users to be in enough groups to trip this condition (the count includes all nested groups, not just top-level groups).


Version-Release number of selected component (if applicable):

2.6.18-194.8.1.el5 #1 SMP Wed Jun 23 10:58:38 EDT 2010 i686 i686 i386 GNU/Linux

I ran a yum update to bring all components up to date just before submitting this bug.

How reproducible:

Using a user in lots of Active Directory groups, attempt to access a previously-mounted NFS share that uses krb5 for security (either v3 or v4).


Steps to Reproduce:
1.  In Active Directory, configure a user with ~115 group memberships.  The key is to set it up so that the GSSAPI token is 2052 bytes or larger (unknown if this will also happen at 2048).

2.  Set up a RHEL5.5 system that shares NFS with krb5 security, using an Active Directory master Kerberos server.  

3.  Grant access to the file share to the user created in step 1

4.  Export the filesystems (note that I tried a variety of combinations of secure/root_squash, and they didn't make a difference for this bug)

contents of /etc/exports:
/export/NFS4 gss/krb5(insecure,rw,fsid=0,no_root_squash)

/home gss/krb5(rw,fsid=0)
/home client.uiuc.edu(rw,root_squash)

(/home is bind mounted to /export/NFS4)

5.  On a client machine, mount the NFS share with sec=krb5
NFS3:
mount -t nfs -o sec=krb5 server.uiuc.edu:/home /mnt

NFS4:
mount -t nfs4 -o sec=krb5 server.uiuc.edu:/ /mnt

6.  kinit to that user on a client machine (I used a Fedora 11 client).

7.  Do a list of the shared directory (e.g., "ls /mnt")
  
Actual results:

CPU usage on the server spikes to ~50% for 30 seconds, and the client appears to hang.  After 30 seconds, the client returns "permission denied", and CPU usage on the server returns to normal.


Expected results:

A listing of the directory /mnt.  This is what happens if the GSSAPI token is 2044 bytes or smaller (unknown what the behavior is between 2045 and 2051 bytes).


Additional info:

Please contact me for more information.  I have done a lot of debugging and tracing on this, since it is a blocking issue for implementing a new fileserver in our environment.  

I have traced it to the server handling of the call, rather than the formation of the call on the client side.

I don't know the *exact* location of the problem, but I've narrowed it down to a couple of calls in net/sunrpc and net/sunrpc/auth_gss (using rpcdebug, tcpdump/wireshark, a debug version of librpcsecgss on the client, and a lot of code inspection).  I just don't have the expertise to trace the RPC calls through the kernel any deeper.
Comment 1 Jonathan Manton 2010-08-03 18:06:28 EDT
This bug exists in the newest mainline kernel (2.6.35) as well.

When sunrpc formats the upcall to the user-space daemon (svcgssd), it does so in ASCII, using a function called qword_addhex() in net/sunrpc/cache.c.  The buffer passed to qword_addhex() is PAGE_SIZE bytes, which on my kernel is 4096.  Each encoded byte takes two bytes in ASCII, so a token of 2048 bytes or more needs at least 4096 ASCII bytes, leaving no room in the buffer for the rest of the upcall.  The encoding therefore fails, and qword_addhex() reports the failure with a -1 value.

What ends up happening is that each nfsd process continuously tries to process the RPC request (checking the cache), without pause.

Ultimately what this means is that the current system silently fails (and hangs the NFS server for 30 seconds) if the GSSAPI token is 2048 bytes or larger.
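
The size arithmetic above can be sketched in user space.  This is only an illustrative model, not the kernel code: encode_hex_field() and largest_token_that_fits() are hypothetical helpers, and the one-byte separator_len is an assumed framing cost chosen so that the model matches the "2048 bytes or larger" failure described in this comment.

```python
PAGE_SIZE = 4096  # upcall buffer size reported in this comment


def encode_hex_field(data, space_left, separator_len=1):
    """Model hex-encoding one field into a fixed-size buffer.

    Each input byte costs two ASCII characters.  separator_len is an
    assumed per-field framing cost (not taken from the kernel source).
    Returns the space left afterwards, or -1 if the field does not fit.
    """
    needed = 2 * len(data) + separator_len
    if needed > space_left:
        return -1
    return space_left - needed


def largest_token_that_fits(buffer_size=PAGE_SIZE, separator_len=1):
    """Largest token size (in bytes) that this model can still encode."""
    return (buffer_size - separator_len) // 2


if __name__ == "__main__":
    # Token sizes mentioned in this report, plus the boundary values.
    for size in (2044, 2047, 2048, 2052):
        print(size, encode_hex_field(bytes(size), PAGE_SIZE))
```

Under these assumptions, a 2044-byte token fits and a 2052-byte token does not, which is consistent with the behavior observed in the description.  The real upcall may carry additional fields, so the exact cutoff on a live server could be slightly lower.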
Comment 2 J. Bruce Fields 2013-02-27 13:21:11 EST
Note this is one of the things that should be fixed by the switch to gss-proxy, kernel code for which will probably land in 3.10.
Comment 3 John Hodrien 2013-03-07 04:38:57 EST
So the general workaround I've used for this against AD is to set userAccountControl such that the PAC is not included in the ticket (NO_AUTH_DATA_REQUIRED), thus reducing the size and keeping it below this threshold.  In general, that seems like a perfectly acceptable workaround.

But I'm not clear what the solution is if other services on the same machine require the PAC to be present to function properly.  Samba and winbind seem keen to have the PAC, and misbehave somewhat when it's not present, but I can't believe that NFS and Samba together on a fileserver authenticated against AD is an unusual situation.  Any suggestions?

Will this solution landing in 3.10 actually have any bearing on it being backported to RHEL 5 or 6 (or even 7)?
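
For reference, the userAccountControl workaround described in this comment amounts to setting one bit in the attribute.  A minimal sketch of the bit arithmetic, assuming the documented Active Directory flag value 0x02000000 for NO_AUTH_DATA_REQUIRED; the helper names without_pac() and pac_disabled() are hypothetical, and actually writing the new value back to the directory (and choosing which account object to change) is outside the scope of this sketch.

```python
# Active Directory userAccountControl bit for NO_AUTH_DATA_REQUIRED.
UF_NO_AUTH_DATA_REQUIRED = 0x02000000


def without_pac(user_account_control):
    """Return the attribute value with NO_AUTH_DATA_REQUIRED set.

    All other bits are preserved, so calling this twice is harmless.
    """
    return user_account_control | UF_NO_AUTH_DATA_REQUIRED


def pac_disabled(user_account_control):
    """Report whether NO_AUTH_DATA_REQUIRED is already set."""
    return bool(user_account_control & UF_NO_AUTH_DATA_REQUIRED)


if __name__ == "__main__":
    current = 0x200  # example starting value: a plain NORMAL_ACCOUNT
    updated = without_pac(current)
    print(hex(current), "->", hex(updated), pac_disabled(updated))
```

Because the flag is combined with a bitwise OR, existing settings on the account are left untouched.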
Comment 4 RHEL Product and Program Management 2014-03-07 07:47:20 EST
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Comment 5 RHEL Product and Program Management 2014-06-02 09:17:19 EDT
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
