Bug 565507

Summary: libtirpc not handling large numbers of supplemental groups correctly
Product: [Fedora] Fedora Reporter: Peter Engel <peter.engel>
Component: libtirpcAssignee: Steve Dickson <steved>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: low    
Version: 12CC: chuck.lever, fdanapfe, helge.deller, jlayton, rwheeler, steved
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: libtirpc-0.2.1-1.fc12 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 575909 (view as bug list) Environment:
Last Closed: 2010-04-03 04:47:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 575909    
Attachments:
Description Flags
strace /sbin/umount.nfs
none
patch 1 -- handle >NGRPS groups correctly
none
patch 2 -- handle more error cases
none
patch 3 -- address Chuck's concerns none

Description Peter Engel 2010-02-15 14:31:15 UTC
Description of problem:
/sbin/umount.nfs terminates (Program terminated with signal 6, Aborted.)
Partition is 100% full after many cores written. Every 2 sec a core file of app 1MB is written probably by automounter

This morning the disk was full. Further Analysis found /sbin/umount.nfs as terminating program. No idea why this started 3 days ago (oldest cores found). Maybe the package was updated?

Version-Release number of selected component (if applicable):
[root@server /]# rpm -qf /sbin/umount.nfs
nfs-utils-1.2.1-4.fc12.i686

How reproducible:
Always


Steps to Reproduce:
1. Manually type /sbin/umount.nfs <...>

or 

1. Wait until disk is full (probably by automounter)

  
Actual results:
Program terminates. Disk is full.

Expected results:
Program works. Disk not full.



Additional info:
[root@server /]# file core.15246 
core.15246: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from '/sbin/umount.nfs <...>'

(gdb) core core.15246
Missing separate debuginfo for the main executable file
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/79/a4a1c416246d09ff344aab5ee8564de47d5c81
Core was generated by `/sbin/umount.nfs /net/sapmnt.appl_sw'.
Program terminated with signal 6, Aborted.
#0  0x0032b416 in __kernel_vsyscall ()

Comment 1 Jeff Layton 2010-02-15 15:01:55 UTC
> Program terminated with signal 6, Aborted.

SIGABRT? I'm not even sure what can cause that to get issued...

I'm not seeing this on my i686 machine with nfs-utils-1.2.1-4.fc12.i686.

# umount.nfs /mnt/test ; echo $?
0

...I've also straced it and didn't see any problems. Perhaps you can strace your umount.nfs command so we can see what it's doing when it catches this signal?

Comment 2 Peter Engel 2010-02-15 15:27:22 UTC
Created attachment 394338 [details]
strace /sbin/umount.nfs

Please find the output of  strace /sbin/umount.nfs on my machine

Comment 3 Jeff Layton 2010-02-15 15:59:15 UTC
cc'ing Chuck Lever...

Ok, I think I see what's going on. libtirpc does this in authunix_create_default:

        if ((len = getgroups(NGRPS, gids)) < 0)
                abort();

...and that's causing the issue. I assume that the user in this case is a member of >16 groups?

I suppose we'll have to come up with an appropriate way to deal with this possibility. Since this is just for a umount call, perhaps we should just do a setgroups call before any RPC's to zero out the supplemental group list?

Chuck, thoughts?

Comment 4 Chuck Lever 2010-02-15 16:35:30 UTC
(In reply to comment #3)
> cc'ing Chuck Lever...
> 
> Ok, I think I see what's going on. libtirpc does this in
> authunix_create_default:
> 
>         if ((len = getgroups(NGRPS, gids)) < 0)
>                 abort();
> 
> ...and that's causing the issue. I assume that the user in this case is a
> member of >16 groups?
> 
> I suppose we'll have to come up with an appropriate way to deal with this
> possibility. Since this is just for a umount call, perhaps we should just do a
> setgroups call before any RPC's to zero out the supplemental group list?
> 
> Chuck, thoughts?    

Assuming there are more than 16 gids, why not fix the library?  I'm willing to bet money there are other RPC applications that will hit this issue.

According to the getgroups(2) man page, you should be able to call getgroups(2) with a zero size to determine how large the gid buffer should be.  Let the buffer size be min(NGRPS, getgroups result), and malloc the buffer.  This appears to be what glibc's authunix_create_default(3) does.

Comment 5 Peter Engel 2010-02-15 16:48:21 UTC
root belongs to 17 groups on my machine ...
[root@server /]# groups | wc
      1      17     100

Comment 6 Jeff Layton 2010-02-15 18:15:24 UTC
Created attachment 394367 [details]
patch 1 -- handle >NGRPS groups correctly

I think this patch will fix it. Tested for compilation only so far. Peter would you be able to test it and let me know if it corrects your issue?

Comment 7 Jeff Layton 2010-02-15 18:24:08 UTC
Created attachment 394370 [details]
patch 2 -- handle more error cases

Oops...attached an earlier version of the patch by mistake, this one should be correct.

Comment 8 Chuck Lever 2010-02-15 18:30:22 UTC
(In reply to comment #7)
> Created an attachment (id=394370) [details]
> patch 2 -- handle more error cases
> 
> Oops...attached an earlier version of the patch by mistake, this one should be
> correct.    

That's the idea, but there are some problems.

authunix_create() may not handle a gids array smaller than NGRPS.  I would use
the min() before the mem_alloc() to be certain the gids array is always large
enough.

The glibc code I looked at suggested that the number of groups can change
between the two getgroups(2) calls.  Maybe you need to account for that by
looping until you succeed in populating the gids array.

And, I would get rid of that abort(3) altogether.  Just return an error.  If
you really want to know that it failed, maybe it could print a message on
stderr?

Peter, it's still worth trying this patch out to see if we are on the right
track.  It applies to libtirpc, not umount.nfs.

Comment 9 Jeff Layton 2010-02-15 19:42:42 UTC
Created attachment 394390 [details]
patch 3 -- address Chuck's concerns

Third patch.

This should address Chuck's concerns. I looked over the lower functions and it
appears that they should handle a buffer smaller than NGRPS without any problem,
but these are tiny allocations so it doesn't hurt to be a little paranoid.

I'm a little leery of the possibility of an infinite loop in this patch if we get continual EINVALs for some reason. I wouldn't mind considering an upper bound on the number of retries we'll do, but maybe it's not anything to be concerned about since glibc seems to do the same thing.

I'm using warnx() here since authunix_create also uses it. I'll note that most of those calls in that function are wrapped in "#ifndef _KERNEL". I didn't do that here though since it wasn't clear to me how that define gets set.

Comment 10 Chuck Lever 2010-02-15 20:18:42 UTC
(In reply to comment #9)
> Created an attachment (id=394390) [details]
> patch 3 -- address Chuck's concerns
> 
> Third patch.
> 
> This should address Chuck's concerns. I looked over the lower functions and it
> appears that they should handle a buffer smaller than NGRPS without any
> problem,
> but these are tiny allocations so it doesn't hurt to be a little paranoid.

My assumption was incorrect.  The gids array always had NGRPS elements, but len, which is passed to authunix_create(), could be less than NGRPS.  I think the len argument is actually used on the wire, so we would end up putting extra gids in the credential in that case.

I would check the credentials on the wire in three cases (len = 0, len < NGRPS, len > NGRPS) to verify that we understand this correctly.  The first two cases can be checked with the existing libtirpc for comparison.

> I'm a little leery of the possibility of an infinite loop in this patch if we
> get continual EINVALs for some reason. I wouldn't mind considering an upper
> bound on the number of retries we'll do, but maybe it's not anything to be
> concerned about since glibc seems to do the same thing.

Agreed.

> I'm using warnx() here since authunix_create also uses it. I'll note that most
> of those calls in that function are wrapped in "#ifndef _KERNEL". I didn't do
> that here though since it wasn't clear to me how that define gets set.    

Looks OK.  I note many instances of warnx() call sites without the _KERNEL wrapping.

Comment 11 Jeff Layton 2010-02-16 13:52:51 UTC
Ok, tested out the patch this morning and it seems to do the right thing. You can easily reproduce this problem using showmount too.

I tested both the greater than and less than 16 groups case. getgroups seems to always return the primary group as part of the list, so it's a little difficult to test the case where there are 0 groups, but I think what we have will do the right thing in that case as well.

I'll send the patch out to the list a little later today. Chuck, any thoughts on whether we should guard against infinite retries in this function?

Comment 12 Chuck Lever 2010-02-16 17:52:19 UTC
(In reply to comment #11)
> Ok, tested out the patch this morning and it seems to do the right thing. You
> can easily reproduce this problem using showmount too.
> 
> I tested both the greater than and less than 16 groups case. getgroups seems to
> always return the primary group as part of the list, so it's a little difficult
> to test the case where there are 0 groups, but I think what we have will do the
> right thing in that case as well.
> 
> I'll send the patch out to the list a little later today. Chuck, any thoughts
> on whether we should guard against infinite retries in this function?    

I wouldn't worry about the infinite retries case if glibc's authunix_create_default() doesn't care.  Perhaps you could consult a glibc or security expert.  I don't know that area well enough to say how the number of groups could change between those two calls.  I'm guessing it's probably OS-dependent behavior, so it would pay to ask someone who knows.

But, I'm not clear why we need to be concerned about the zero groups case, as glibc doesn't appear to care about that at all.  When would getgroups(2) return zero, if ever?  How is your array allocation logic not going to handle zero groups?

And again, I now think the paranoia bit is unneeded.  If you want to keep it, though, the paranoia comment doesn't explain why the extra complexity is necessary, and it really should do that.

Lastly, I think it's worth adding a comment in front of the "if (len > NGRPS)" statement, as that is the crux of this bug fix.  Something like "RPC doesn't support more than NGRPS supplemental groups on the wire, so truncate the gids array passed to authunix_create()."

I'm interested in clearing up the comments because the next time any one is back here may likely be a long time in the future.  It's worth it to leave good sign posts.

Comment 13 Jeff Layton 2010-02-16 18:31:54 UTC
(In reply to comment #12)
> I wouldn't worry about the infinite retries case if glibc's
> authunix_create_default() doesn't care.  Perhaps you could consult a glibc or
> security expert.  I don't know that area well enough to say how the number of
> groups could change between those two calls.  I'm guessing it's probably
> OS-dependent behavior, so it would pay to ask someone who knows.
> 

Ok, I'm not going to sweat it too much. I'll leave it in there since it looks reasonable. getgroups is a syscall and I think it's plausible that something can happen to the process asynchronously to change the number of groups.

> But, I'm not clear why we need to be concerned about the zero groups case, as
> glibc doesn't appear to care about that at all.  When would getgroups(2) return
> zero, if ever?  How is your array allocation logic not going to handle zero
> groups?
> 

I guess you're talking about the "goto no_groups" ?

We do have to handle that case specially. After we allocate the buffer, we do this:

        len = getgroups(len, gids);

...now if len == 0 and the number of groups changes between the two getgroups calls, then we'll end up getting back the number of groups with an unpopulated gids list.

Regardless of whether the current Linux kernel ever returns a 0 length list, I think we ought to code defensively here and assume that it can occur.

> And again, I now think the paranoia bit is unneeded.  If you want to keep it,
> though, the paranoia comment doesn't explain why the extra complexity is
> necessary, and it really should do that.
> 

Ok, I'll drop that part. It seemed unnecessary to me too.

> Lastly, I think it's worth adding a comment in front of the "if (len > NGRPS)"
> statement, as that is the crux of this bug fix.  Something like "RPC doesn't
> support more than NGRPS supplemental groups on the wire, so truncate the gids
> array passed to authunix_create()."
> 

Ok, I'll add that.

> I'm interested in clearing up the comments because the next time any one is
> back here may likely be a long time in the future.  It's worth it to leave good
> sign posts.    

Agreed.

Comment 14 Jeff Layton 2010-02-16 18:36:21 UTC
...also I'm a little leery of calling malloc(0) -- the manpages seem to indicate that the behavior there is poorly defined (returning NULL In some cases and a bogus pointer in others).

Comment 15 Jeff Layton 2010-03-01 18:45:17 UTC
Actually, we ended up fixing this in nfs-utils:

http://marc.info/?l=linux-nfs&m=126663900618686&w=2

Steve just committed it to the upstream nfs-utils repo this morning. We still want to fix libtirpc not to abort(), but that's lower priority now that nfs-utils has its own fix.

The objection to the original patch was that we shouldn't pick a subset of groups and should instead just return NULL if there are too many. I have a patch that should fix this in libtirpc, but it needs a bit of testing...

Comment 16 Jeff Layton 2010-03-08 18:53:35 UTC
The other part of the fix for this has now been pushed into libtirpc as well. That's the change to make it not use abort() on errors.

With those changes, both problems should be fixed upstream.

Comment 17 Fedora Update System 2010-03-22 16:44:30 UTC
libtirpc-0.2.1-1.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/libtirpc-0.2.1-1.fc12

Comment 18 Fedora Update System 2010-03-23 23:22:35 UTC
libtirpc-0.2.1-1.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update libtirpc'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/libtirpc-0.2.1-1.fc12

Comment 19 Fedora Update System 2010-04-03 04:47:44 UTC
libtirpc-0.2.1-1.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.