Bug 575909 - libtirpc not handling large numbers of supplemental groups correctly
Summary: libtirpc not handling large numbers of supplemental groups correctly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: libtirpc
Version: 13
Hardware: i686
OS: Linux
low
high
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 565507
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-03-22 17:14 UTC by Steve Dickson
Modified: 2010-04-09 03:53 UTC (History)
7 users (show)

Fixed In Version: libtirpc-0.2.1-2.fc13
Doc Type: Bug Fix
Doc Text:
Clone Of: 565507
Environment:
Last Closed: 2010-04-09 03:53:39 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)

Description Steve Dickson 2010-03-22 17:14:29 UTC
+++ This bug was initially created as a clone of Bug #565507 +++

Description of problem:
/sbin/umount.nfs terminates (Program terminated with signal 6, Aborted.)
Partition is 100% full after many cores written. Every 2 sec a core file of app 1MB is written probably by automounter

This morning the disk was full. Further Analysis found /sbin/umount.nfs as terminating program. No idea why this started 3 days ago (oldest cores found). Maybe the package was updated?

Version-Release number of selected component (if applicable):
[root@server /]# rpm -qf /sbin/umount.nfs
nfs-utils-1.2.1-4.fc12.i686

How reproducible:
Always


Steps to Reproduce:
1. Manually type /sbin/umount.nfs <...>

or 

1. Wait until disk is full (probably by automounter)

  
Actual results:
Program terminates. Disk is full.

Expected results:
Program works. Disk not full.



Additional info:
[root@server /]# file core.15246 
core.15246: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from '/sbin/umount.nfs <...>'

(gdb) core core.15246
Missing separate debuginfo for the main executable file
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/79/a4a1c416246d09ff344aab5ee8564de47d5c81
Core was generated by `/sbin/umount.nfs /net/sapmnt.appl_sw'.
Program terminated with signal 6, Aborted.
#0  0x0032b416 in __kernel_vsyscall ()

--- Additional comment from jlayton on 2010-02-15 10:01:55 EST ---

> Program terminated with signal 6, Aborted.

SIGABRT? I'm not even sure what can cause that to get issued...

I'm not seeing this on my i686 machine with nfs-utils-1.2.1-4.fc12.i686.

# umount.nfs /mnt/test ; echo $?
0

...I've also straced it and didn't see any problems. Perhaps you can strace your umount.nfs command so we can see what it's doing when it catches this signal?

--- Additional comment from peter.engel on 2010-02-15 10:27:22 EST ---

Created an attachment (id=394338)
strace /sbin/umount.nfs

Please find the output of  strace /sbin/umount.nfs on my machine

--- Additional comment from jlayton on 2010-02-15 10:59:15 EST ---

cc'ing Chuck Lever...

Ok, I think I see what's going on. libtirpc does this in authunix_create_default:

        if ((len = getgroups(NGRPS, gids)) < 0)
                abort();

...and that's causing the issue. I assume that the user in this case is a member of >16 groups?

I suppose we'll have to come up with an appropriate way to deal with this possibility. Since this is just for a umount call, perhaps we should just do a setgroups call before any RPC's to zero out the supplemental group list?

Chuck, thoughts?

--- Additional comment from chuck.lever on 2010-02-15 11:35:30 EST ---

(In reply to comment #3)
> cc'ing Chuck Lever...
> 
> Ok, I think I see what's going on. libtirpc does this in
> authunix_create_default:
> 
>         if ((len = getgroups(NGRPS, gids)) < 0)
>                 abort();
> 
> ...and that's causing the issue. I assume that the user in this case is a
> member of >16 groups?
> 
> I suppose we'll have to come up with an appropriate way to deal with this
> possibility. Since this is just for a umount call, perhaps we should just do a
> setgroups call before any RPC's to zero out the supplemental group list?
> 
> Chuck, thoughts?    

Assuming there are more than 16 gids, why not fix the library?  I'm willing to bet money there are other RPC applications that will hit this issue.

According to the getgroups(2) man page, you should be able to call getgroups(2) with a zero size to determine how large the gid buffer should be.  Let the buffer size be min(NGRPS, getgroups result), and malloc the buffer.  This appears to be what glibc's authunix_create_default(3) does.

--- Additional comment from peter.engel on 2010-02-15 11:48:21 EST ---

root belongs to 17 groups on my machine ...
[root@server /]# groups | wc
      1      17     100

--- Additional comment from jlayton on 2010-02-15 13:15:24 EST ---

Created an attachment (id=394367)
patch 1 -- handle >NGRPS groups correctly

I think this patch will fix it. Tested for compilation only so far. Peter would you be able to test it and let me know if it corrects your issue?

--- Additional comment from jlayton on 2010-02-15 13:24:08 EST ---

Created an attachment (id=394370)
patch 2 -- handle more error cases

Oops...attached an earlier version of the patch by mistake, this one should be correct.

--- Additional comment from chuck.lever on 2010-02-15 13:30:22 EST ---

(In reply to comment #7)
> Created an attachment (id=394370) [details]
> patch 2 -- handle more error cases
> 
> Oops...attached an earlier version of the patch by mistake, this one should be
> correct.    

That's the idea, but there are some problems.

authunix_create() may not handle a gids array smaller than NGRPS.  I would use
the min() before the mem_alloc() to be certain the gids array is always large
enough.

The glibc code I looked at suggested that the number of groups can change
between the two getgroups(2) calls.  Maybe you need to account for that by
looping until you succeed in populating the gids array.

And, I would get rid of that abort(3) altogether.  Just return an error.  If
you really want to know that it failed, maybe it could print a message on
stderr?

Peter, it's still worth trying this patch out to see if we are on the right
track.  It applies to libtirpc, not umount.nfs.

--- Additional comment from jlayton on 2010-02-15 14:42:42 EST ---

Created an attachment (id=394390)
patch 3 -- address Chuck's concerns

Third patch.

This should address Chuck's concerns. I looked over the lower functions and it
appears that they should handle a buffer smaller than NGRPS without any problem,
but these are tiny allocations so it doesn't hurt to be a little paranoid.

I'm a little leery of the possibility of an infinite loop in this patch if we get continual EINVALs for some reason. I wouldn't mind considering an upper bound on the number of retries we'll do, but maybe it's not anything to be concerned about since glibc seems to do the same thing.

I'm using warnx() here since authunix_create also uses it. I'll note that most of those calls in that function are wrapped in "#ifndef _KERNEL". I didn't do that here though since it wasn't clear to me how that define gets set.

--- Additional comment from chuck.lever on 2010-02-15 15:18:42 EST ---

(In reply to comment #9)
> Created an attachment (id=394390) [details]
> patch 3 -- address Chuck's concerns
> 
> Third patch.
> 
> This should address Chuck's concerns. I looked over the lower functions and it
> appears that they should handle a buffer smaller than NGRPS without any
> problem,
> but these are tiny allocations so it doesn't hurt to be a little paranoid.

My assumption was incorrect.  The gids array always had NGRPS elements, but len, which is passed to authunix_create(), could be less than NGRPS.  I think the len argument is actually used on the wire, so we would end up putting extra gids in the credential in that case.

I would check the credentials on the wire in three cases (len = 0, len < NGRPS, len > NGRPS) to verify that we understand this correctly.  The first two cases can be checked with the existing libtirpc for comparison.

> I'm a little leery of the possibility of an infinite loop in this patch if we
> get continual EINVALs for some reason. I wouldn't mind considering an upper
> bound on the number of retries we'll do, but maybe it's not anything to be
> concerned about since glibc seems to do the same thing.

Agreed.

> I'm using warnx() here since authunix_create also uses it. I'll note that most
> of those calls in that function are wrapped in "#ifndef _KERNEL". I didn't do
> that here though since it wasn't clear to me how that define gets set.    

Looks OK.  I note many instances of warnx() call sites without the _KERNEL wrapping.

--- Additional comment from jlayton on 2010-02-16 08:52:51 EST ---

Ok, tested out the patch this morning and it seems to do the right thing. You can easily reproduce this problem using showmount too.

I tested both the greater than and less than 16 groups case. getgroups seems to always return the primary group as part of the list, so it's a little difficult to test the case where there are 0 groups, but I think what we have will do the right thing in that case as well.

I'll send the patch out to the list a little later today. Chuck, any thoughts on whether we should guard against infinite retries in this function?

--- Additional comment from chuck.lever on 2010-02-16 12:52:19 EST ---

(In reply to comment #11)
> Ok, tested out the patch this morning and it seems to do the right thing. You
> can easily reproduce this problem using showmount too.
> 
> I tested both the greater than and less than 16 groups case. getgroups seems to
> always return the primary group as part of the list, so it's a little difficult
> to test the case where there are 0 groups, but I think what we have will do the
> right thing in that case as well.
> 
> I'll send the patch out to the list a little later today. Chuck, any thoughts
> on whether we should guard against infinite retries in this function?    

I wouldn't worry about the infinite retries case if glibc's authunix_create_default() doesn't care.  Perhaps you could consult a glibc or security expert.  I don't know that area well enough to say how the number of groups could change between those two calls.  I'm guessing it's probably OS-dependent behavior, so it would pay to ask someone who knows.

But, I'm not clear why we need to be concerned about the zero groups case, as glibc doesn't appear to care about that at all.  When would getgroups(2) return zero, if ever?  How is your array allocation logic not going to handle zero groups?

And again, I now think the paranoia bit is unneeded.  If you want to keep it, though, the paranoia comment doesn't explain why the extra complexity is necessary, and it really should do that.

Lastly, I think it's worth adding a comment in front of the "if (len > NGRPS)" statement, as that is the crux of this bug fix.  Something like "RPC doesn't support more than NGRPS supplemental groups on the wire, so truncate the gids array passed to authunix_create()."

I'm interested in clearing up the comments because the next time any one is back here may likely be a long time in the future.  It's worth it to leave good sign posts.

--- Additional comment from jlayton on 2010-02-16 13:31:54 EST ---

(In reply to comment #12)
> I wouldn't worry about the infinite retries case if glibc's
> authunix_create_default() doesn't care.  Perhaps you could consult a glibc or
> security expert.  I don't know that area well enough to say how the number of
> groups could change between those two calls.  I'm guessing it's probably
> OS-dependent behavior, so it would pay to ask someone who knows.
> 

Ok, I'm not going to sweat it too much. I'll leave it in there since it looks reasonable. getgroups is a syscall and I think it's plausible that something can happen to the process asynchronously to change the number of groups.

> But, I'm not clear why we need to be concerned about the zero groups case, as
> glibc doesn't appear to care about that at all.  When would getgroups(2) return
> zero, if ever?  How is your array allocation logic not going to handle zero
> groups?
> 

I guess you're talking about the "goto no_groups" ?

We do have to handle that case specially. After we allocate the buffer, we do this:

        len = getgroups(len, gids);

...now if len == 0 and the number of groups changes between the two getgroups calls, then we'll end up getting back the number of groups with an unpopulated gids list.

Regardless of whether the current Linux kernel ever returns a 0 length list, I think we ought to code defensively here and assume that it can occur.

> And again, I now think the paranoia bit is unneeded.  If you want to keep it,
> though, the paranoia comment doesn't explain why the extra complexity is
> necessary, and it really should do that.
> 

Ok, I'll drop that part. It seemed unnecessary to me too.

> Lastly, I think it's worth adding a comment in front of the "if (len > NGRPS)"
> statement, as that is the crux of this bug fix.  Something like "RPC doesn't
> support more than NGRPS supplemental groups on the wire, so truncate the gids
> array passed to authunix_create()."
> 

Ok, I'll add that.

> I'm interested in clearing up the comments because the next time any one is
> back here may likely be a long time in the future.  It's worth it to leave good
> sign posts.    

Agreed.

--- Additional comment from jlayton on 2010-02-16 13:36:21 EST ---

...also I'm a little leery of calling malloc(0) -- the manpages seem to indicate that the behavior there is poorly defined (returning NULL In some cases and a bogus pointer in others).

--- Additional comment from jlayton on 2010-03-01 13:45:17 EST ---

Actually, we ended up fixing this in nfs-utils:

http://marc.info/?l=linux-nfs&m=126663900618686&w=2

Steve just committed it to the upstream nfs-utils repo this morning. We still want to fix libtirpc not to abort(), but that's lower priority now that nfs-utils has its own fix.

The objection to the original patch was that we shouldn't pick a subset of groups and should instead just return NULL if there are too many. I have a patch that should fix this in libtirpc, but it needs a bit of testing...

--- Additional comment from jlayton on 2010-03-08 13:53:35 EST ---

The other part of the fix for this has now been pushed into libtirpc as well. That's the change to make it not use abort() on errors.

With those changes, both problems should be fixed upstream.

--- Additional comment from updates on 2010-03-22 12:44:30 EDT ---

libtirpc-0.2.1-1.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/libtirpc-0.2.1-1.fc12

Comment 1 Fedora Update System 2010-03-22 17:15:58 UTC
libtirpc-0.2.1-2.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/libtirpc-0.2.1-2.fc13

Comment 2 Fedora Update System 2010-03-23 23:38:07 UTC
libtirpc-0.2.1-2.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update libtirpc'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/libtirpc-0.2.1-2.fc13

Comment 3 Fedora Update System 2010-04-09 03:53:34 UTC
libtirpc-0.2.1-2.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.